Benjamin Hindman
Andrew Konwinski
Matei Zaharia
Ali Ghodsi
Anthony D. Joseph
Randy H. Katz
Scott Shenker
Ion Stoica
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
icy such as fair sharing, but frameworks decide which resources to accept and which computations to run on them. We have found that resource offers are flexible enough to let frameworks achieve a variety of placement goals, including data locality. In addition, resource offers are simple and efficient to implement, allowing Mesos to be highly scalable and robust to failures.

Mesos's flexible fine-grained sharing model also has other advantages. First, Mesos can support a wide variety of applications beyond data-intensive computing frameworks, such as client-facing services (e.g., web servers) and compute-intensive parallel applications (e.g., MPI), allowing organizations to consolidate workloads and increase utilization. Although not all of these applications run tasks that are fine-grained in time, Mesos can share nodes in space between short and long tasks.

Second, even organizations that only use one framework can leverage Mesos to run multiple isolated instances of that framework in the same cluster, or multiple versions of the framework. For example, an organization using Hadoop might want to run separate instances of Hadoop for production and experimental jobs, or to test a new Hadoop version alongside its existing one.

Third, by providing a platform for resource sharing across frameworks, Mesos allows framework developers to build specialized frameworks targeted at particular problem domains rather than one-size-fits-all abstractions. Frameworks can therefore evolve faster and provide better support for each problem domain.

We have implemented Mesos in 10,000 lines of C++. The system scales to 50,000 nodes and uses Apache ZooKeeper [4] for fault tolerance. To evaluate Mesos, we have ported three cluster computing systems to run over it: Hadoop, MPI, and the Torque batch scheduler. To validate our hypothesis that specialized frameworks provide value over general ones, we have also built a new framework on top of Mesos called Spark, optimized for iterative jobs where a dataset is reused in many parallel operations. Spark can outperform Hadoop by 10x in iterative machine learning workloads. Finally, to evaluate the applicability of Mesos to client-facing workloads, we have built an elastic Apache web server farm framework.

This paper is organized as follows. Section 2 details the data center environment that Mesos is designed for. Section 3 presents the architecture of Mesos. Section 4 analyzes our distributed scheduling model and characterizes the environments it works well in. We present our implementation of Mesos in Section 5 and evaluate it in Section 6. Section 7 surveys related work. We conclude with a discussion in Section 8.

2 Target Environment

Facebook loads logs from its web services into a 1200-node Hadoop cluster, where they are used for applications such as business intelligence, spam detection, and ad optimization. In addition to "production" jobs that run periodically, the cluster is used for many experimental jobs, ranging from multi-hour machine learning computations to 1-2 minute ad-hoc queries submitted interactively through an SQL interface to Hadoop called Hive [3]. Most jobs are short (the median being 84s long), and the jobs are composed of fine-grained map and reduce tasks (the median task being 23s), as shown in Figure 1.

[Figure 1: CDF of job and task durations in Facebook's Hadoop data warehouse (data from [48]).]

To meet the performance requirements of these jobs, Facebook uses a fair scheduler for Hadoop that takes advantage of the fine-grained nature of the workload to make scheduling decisions at the level of map and reduce tasks and to optimize data locality [48]. Unfortunately, this means that the cluster can only run Hadoop jobs. If a user wishes to write a new ad targeting algorithm in MPI instead of MapReduce, perhaps because MPI is more efficient for this job's communication pattern, then the user must set up a separate MPI cluster and import terabytes of data into it. Mesos aims to enable fine-grained sharing between multiple cluster computing frameworks, while giving these frameworks enough control to achieve placement goals such as data locality.

In addition to sharing clusters between "back-end" applications such as Hadoop and MPI, Mesos also enables organizations to share resources between "front-end" workloads, such as web servers, and back-end workloads. This is attractive because front-end applications have variable load patterns (e.g., diurnal cycles and spikes), so there is an opportunity to scale them down during periods of low load and use free resources to speed up back-end workloads. We recognize that there are obstacles beyond node scheduling to collocating front-end and back-end workloads at very large scales, such as the difficulty of isolating network traffic [30]. In Mesos, we focus on defining a resource access
describe elements of our architecture that let resource offers work robustly and efficiently in a distributed system.

3.3 Resource Allocation

Mesos delegates allocation decisions to a pluggable allocation module, so that organizations can tailor allocation to their needs. In normal operation, Mesos takes advantage of the fine-grained nature of tasks to only reallocate resources when tasks finish. This usually happens frequently enough to let new frameworks start within a fraction of the average task length: for example, if a framework's share is 10% of the cluster, it needs to wait on average 10% of the mean task length to receive its share. Therefore, the allocation module only needs to decide which frameworks to offer free resources to, and how many resources to offer to them. However, our architecture also supports long tasks by allowing allocation modules to specifically designate a set of resources on each node for use by long tasks. Finally, we give allocation modules the power to revoke (kill) tasks if resources are not becoming free quickly enough.

In this section, we start by describing one allocation policy that we have developed, which performs fair sharing between frameworks using a definition of fairness for multiple resources called Dominant Resource Fairness. We then explain Mesos's mechanisms for supporting long tasks and revocation.

3.3.1 Dominant Resource Fairness (DRF)

Because Mesos manages multiple resources (CPU, memory, network bandwidth, etc.) on each slave, a natural question is what constitutes a fair allocation when different frameworks prioritize resources differently. We carried out a thorough study of desirable fairness properties and possible definitions of fairness, which is detailed in [29]. We concluded that most other definitions of fairness have undesirable properties. For example, Competitive Equilibrium from Equal Incomes (CEEI) [44], the preferred fairness metric in micro-economics [39], has the disadvantage that the freeing up of resources might punish an existing user's allocation. Similarly, other notions of fairness violated other desirable properties, such as envy-freedom, possibly leading to users gaming the system by hoarding resources that they do not need.

To this end, we designed a fairness policy called dominant resource fairness (DRF), which attempts to equalize each framework's fractional share of its dominant resource, which is the resource that it has the largest fractional share of. For example, if a cluster contains 100 CPUs and 100 GB of RAM, and framework F1 needs 4 CPUs and 1 GB RAM per task while F2 needs 1 CPU and 8 GB RAM per task, then DRF gives F1 20 tasks (80 CPUs and 20 GB) and gives F2 10 tasks (10 CPUs and 80 GB). This makes F1's share of CPU equal to F2's share of RAM, while fully utilizing one resource (RAM).

DRF is a natural generalization of max/min fairness [21]. DRF satisfies the above-mentioned properties and performs scheduling in O(log n) time for n frameworks. We believe that DRF is novel, but due to space constraints, we refer the reader to [29] for details.

3.3.2 Supporting Long Tasks

Apart from fine-grained workloads consisting of short tasks, Mesos also aims to support frameworks with longer tasks, such as web services and MPI programs. This is accomplished by sharing nodes in space between long and short tasks: for example, a 4-core node could be running MPI on two cores, while also running MapReduce tasks that access local data. If long tasks are placed arbitrarily throughout the cluster, however, some nodes may become filled with them, preventing other frameworks from accessing local data. To address this problem, Mesos allows allocation modules to bound the total resources on each node that can run long tasks. The amount of long task resources still available on the node is reported to frameworks in resource offers. When a framework launches a task, it marks it as either long or short. Short tasks can use any resources, but long tasks can only use up to the amount specified in the offer.

Of course, a framework may launch a long task without marking it as such. In this case, Mesos will eventually revoke it, as we discuss next.

3.3.3 Revocation

As described earlier, in an environment with fine-grained tasks, Mesos can reallocate resources quickly by simply waiting for tasks to finish. However, if a cluster becomes filled by long tasks, e.g., due to a buggy job or a greedy framework, Mesos can also revoke (kill) tasks. Before killing a task, Mesos gives its framework a grace period to clean it up. Mesos asks the respective executor to kill the task, but kills the entire executor and all its tasks if it does not respond to the request. We leave it up to the allocation module to implement the policy for revoking tasks, but describe two related mechanisms here.

First, while killing a task has a low impact on many frameworks (e.g., MapReduce or stateless web servers), it is harmful for frameworks with interdependent tasks (e.g., MPI). We allow these frameworks to avoid being killed by letting allocation modules expose a guaranteed allocation to each framework – a quantity of resources that the framework may hold without losing tasks. Frameworks read their guaranteed allocations through an API call. Allocation modules are responsible for ensuring that the guaranteed allocations they provide can all be met concurrently. For now, we have kept the semantics of guaranteed allocations simple: if a framework is below its guaranteed allocation, none of its tasks
should be killed, and if it is above, any of its tasks may be killed. However, if this model is found to be too simple, it is also possible to let frameworks specify priorities for their tasks, so that the allocation module can try to kill only low-priority tasks.

Second, to decide when to trigger revocation, allocation modules must know which frameworks would use more resources if they were offered them. Frameworks indicate their interest in offers through an API call.

3.4 Isolation

Mesos provides performance isolation between framework executors running on the same slave by leveraging existing OS isolation mechanisms. Since these mechanisms are platform-dependent, we support multiple isolation mechanisms through pluggable isolation modules.

In our current implementation, we use operating system container technologies, specifically Linux containers [10] and Solaris projects [16], to achieve isolation. These technologies can limit the CPU, physical memory, virtual memory, network bandwidth, and (in new Linux kernels) I/O bandwidth usage of a process tree. In addition, they support dynamic reconfiguration of a container's resource limits, which is necessary for Mesos to be able to add and remove resources from an executor as it starts and finishes tasks. In the future, it would also be attractive to use virtual machines as containers. However, we have not yet done this because current VM technologies add non-negligible overhead in data-intensive workloads and have limited support for dynamic reconfiguration.

3.5 Making Resource Offers Scalable and Robust

Because task scheduling in Mesos is a distributed process in which the master and framework schedulers communicate, it needs to be efficient and robust to failures. Mesos includes three mechanisms to help with this goal.

First, because some frameworks will always reject certain resources, Mesos lets them short-circuit the rejection process and avoid communication by providing filters to the master. We support two types of filters: "only offer nodes from list L" and "only offer nodes with at least R resources free". A resource that fails a filter is treated exactly like a rejected resource. By default, any resources rejected during an offer have a temporary 5-second filter placed on them, to minimize the programming burden on developers who do not wish to manually set filters.

Second, because a framework may take time to respond to an offer, Mesos counts resources offered to a framework towards its share of the cluster for the purpose of allocation. This is a strong incentive for frameworks to respond to offers quickly and to filter out resources that they cannot use, so that they can get offers for more suitable resources faster.

Third, if a framework has not responded to an offer for a sufficiently long time, Mesos rescinds the offer and re-offers the resources to other frameworks.

We also note that even without the use of filters, Mesos can make tens of thousands of resource offers per second, because the scheduling algorithm it must perform (fair sharing) is highly efficient.

3.5.1 API Summary

Table 1 summarizes the Mesos API. The only function that we have not yet explained is the kill_task function that a scheduler may call to kill one of its tasks. This is useful for frameworks that implement backup tasks [25].

Scheduler Callbacks                    Scheduler Actions
resource_offer(offer_id, offers)       reply_to_offer(offer_id, tasks, needs_more_offers)
offer_rescinded(offer_id)              request_offers()
status_update(task_id, status)         set_filters(filters)
slave_lost(slave_id)                   get_safe_share()
                                       kill_task(task_id)

Executor Callbacks                     Executor Actions
launch_task(task_descriptor)           send_status(task_id, status)
kill_task(task_id)

Table 1: Mesos API functions for schedulers and executors. The "callback" columns list functions that frameworks must implement, while "actions" are operations that they can invoke.

3.6 Fault Tolerance

Since the master is a centerpiece of our architecture, we have made it fault-tolerant by pushing state to slaves and schedulers, making the master's state soft state. At runtime, multiple Mesos masters run simultaneously, but only one master is the leader. The other masters act as hot standbys ready to take over if the current leader fails. ZooKeeper [4] is used to implement leader election among masters. Schedulers and slaves also find out about the current leader through ZooKeeper. Upon the failure of the master, the slaves and schedulers connect to the newly elected master and help restore its state. We also ensure that messages sent to a failed master are resent to the new one through a thin communication layer that uses sequence numbers and retransmissions.

Aside from handling master failures, Mesos reports task, slave, and executor failures to frameworks' schedulers. Frameworks can then react to failures using policies of their choice.

Finally, to deal with scheduler failures, Mesos can be extended to allow a framework to register multiple schedulers such that if one fails, another is notified by the Mesos master and takes over. Frameworks must use their own mechanisms to share state between their schedulers.
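As a concrete check of the DRF example in Section 3.3.1, the allocation can be reproduced by progressive filling: repeatedly launch one task for the framework with the smallest dominant share until no further task fits. The sketch below is illustrative Python under our own naming, not the Mesos allocation module:

```python
# Progressive-filling sketch of Dominant Resource Fairness (illustrative;
# not the Mesos implementation). Repeatedly launch one task for the
# framework with the smallest dominant share until no next task fits.
def drf_allocate(capacity, demands):
    usage = {f: [0.0] * len(capacity) for f in demands}
    tasks = {f: 0 for f in demands}
    while True:
        totals = [sum(usage[f][i] for f in demands) for i in range(len(capacity))]
        # frameworks whose next task still fits in the remaining capacity
        fits = [f for f in demands
                if all(t + d <= c for t, d, c in zip(totals, demands[f], capacity))]
        if not fits:
            return tasks
        # dominant share = largest fractional share across all resources
        f = min(fits, key=lambda g: max(u / c for u, c in zip(usage[g], capacity)))
        usage[f] = [u + d for u, d in zip(usage[f], demands[f])]
        tasks[f] += 1

# Example from Section 3.3.1: 100 CPUs and 100 GB RAM;
# F1 tasks need <4 CPUs, 1 GB>, F2 tasks need <1 CPU, 8 GB>.
print(drf_allocate([100, 100], {"F1": [4, 1], "F2": [1, 8]}))
```

Under these demands the sketch yields 20 tasks for F1 and 10 for F2, matching the allocation in the text: F1's CPU share and F2's RAM share are both 80%, and RAM is fully utilized.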
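To make the scheduler side of the API in Table 1 concrete, here is a hypothetical framework-scheduler skeleton. This is an illustrative Python sketch only: the real Mesos API is a C++ interface, and the `Offer` class and its resource fields are invented here for the example.

```python
# Hypothetical stand-in for a resource offer (fields are illustrative).
class Offer:
    def __init__(self, offer_id, node, cpus, mem_gb):
        self.offer_id, self.node = offer_id, node
        self.cpus, self.mem_gb = cpus, mem_gb

class MyScheduler:
    TASK_CPUS, TASK_MEM_GB = 1, 2

    def __init__(self, preferred_nodes):
        self.preferred_nodes = set(preferred_nodes)
        self.next_task_id = 0

    # Callback named after resource_offer in Table 1 (invoked by the master).
    def resource_offer(self, offer_id, offers):
        tasks = []
        for o in offers:
            # Accept resources only on preferred nodes with room for a task;
            # everything else is implicitly rejected, and rejected resources
            # get the default 5-second filter described in Section 3.5.
            if (o.node in self.preferred_nodes
                    and o.cpus >= self.TASK_CPUS
                    and o.mem_gb >= self.TASK_MEM_GB):
                tasks.append({"task_id": self.next_task_id, "node": o.node,
                              "cpus": self.TASK_CPUS, "mem_gb": self.TASK_MEM_GB})
                self.next_task_id += 1
        # Models reply_to_offer(offer_id, tasks, needs_more_offers).
        return tasks, True

    def status_update(self, task_id, status):
        pass  # e.g., resubmit the task if it was lost

sched = MyScheduler(preferred_nodes={"node1", "node3"})
tasks, more = sched.resource_offer(7, [Offer(7, "node1", 4, 8),
                                       Offer(7, "node2", 4, 8)])
```

In this run the scheduler launches one task on node1 and declines the node2 resources, illustrating how placement preferences (here, data locality on preferred nodes) are expressed purely by accepting or rejecting offers.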
4 Mesos Behavior

In this section, we study Mesos's behavior for different workloads. In short, we find that Mesos performs very well when frameworks can scale up and down elastically, task durations are homogeneous, and frameworks prefer all nodes equally. When different frameworks prefer different nodes, we show that Mesos can emulate a centralized scheduler that uses fair sharing across frameworks. In addition, we show that Mesos can handle heterogeneous task durations without impacting the performance of frameworks with short tasks. We also discuss how frameworks are incentivized to improve their performance under Mesos, and argue that these incentives also improve overall cluster utilization. Finally, we conclude this section with some limitations of Mesos's distributed scheduling model.

4.1 Definitions, Metrics and Assumptions

In our discussion, we consider three metrics:

• Framework ramp-up time: time it takes a new framework to achieve its allocation (e.g., fair share);

• Job completion time: time it takes a job to complete, assuming one job per framework;

• System utilization: total cluster utilization.

We characterize workloads along four attributes:

• Scale up: Frameworks can elastically increase their allocation to take advantage of free resources.

• Scale down: Frameworks can relinquish resources without significantly impacting their performance.

• Minimum allocation: Frameworks require a certain minimum number of slots before they can start using their slots.

• Task distribution: The distribution of the task durations. We consider both homogeneous and heterogeneous distributions.

We also differentiate between two types of resources: required and preferred. We say that a resource is required if a framework must acquire it in order to run. For example, we consider a graphical processing unit (GPU) a required resource if there are frameworks that must use a GPU to run. Likewise, we consider a machine with a public IP address a required resource if there are frameworks that need to be externally accessible. In contrast, we say that a resource is preferred if a framework will perform "better" using it, but can also run using other resources. For example, a framework may prefer using a node that locally stores its data, but it can remotely access the data from other nodes if it must.

We assume that all required resources are explicitly allocated, i.e., every type of required resource is an element in the resource vector in resource offers. Furthermore, provided a framework never uses more than its guaranteed allocation of required resources, explicitly accounting for these resources ensures that frameworks will not get deadlocked waiting for them to become available.

For simplicity, in the remainder of this section we assume that all tasks run on identical slices of machines, which we call slots. Also, unless otherwise specified, we assume that each framework runs a single job.

Next, we use this simple model to estimate the framework ramp-up time, the job completion time, and the job resource utilization. We emphasize that our goal here is not to develop a detailed and exact model of the system, but to provide a coarse understanding of the system behavior.

4.2 Homogeneous Tasks

In the rest of this section we consider a task duration distribution with mean Ts. For simplicity, we present results for two distributions: constant task sizes and exponentially distributed task sizes. We consider a cluster with n slots and a framework, f, that is entitled to k slots. We assume the framework runs a job which requires βkTs computation time. Thus, assuming the framework has k slots, it takes the job βTs time to finish. When computing the completion time of a job we assume that the last tasks of the job running on the framework's k slots finish at the same time. Thus, we relax the assumptions about the duration distribution for the tasks of framework f. This relaxation does not impact any of the other metrics, i.e., ramp-up time and utilization.

We consider two types of frameworks: elastic and rigid. An elastic framework can scale its resources up and down, i.e., it can start using slots as soon as it acquires them, and can release slots as soon as its tasks finish. In contrast, a rigid framework can start running its jobs only after it has allocated all its slots, i.e., a rigid framework is a framework where the minimum allocation is equal to its full allocation.

Table 2 summarizes the job completion times and the utilization for the two types of frameworks and for the two types of task length distributions. We discuss each case next.

4.2.1 Elastic Frameworks

An elastic framework can opportunistically use any slot offered by Mesos, and can relinquish slots without significantly impacting the performance of its jobs. We assume there are exactly k slots in the system that framework f prefers, and that f waits for these slots to become available to reach its allocation.

Framework ramp-up time  If task durations are constant, it will take framework f at most Ts time to acquire k slots. This is simply because during a Ts interval, every slot will become available, which will enable Mesos
to offer the framework all its k preferred slots.

                   Elastic Framework                       Rigid Framework
                 Constant dist.  Exponential dist.    Constant dist.  Exponential dist.
Completion time  (1/2 + β)Ts     (1 + β)Ts            (1 + β)Ts       (ln k + β)Ts
Utilization      1               1                    β/(1/2 + β)     β/(ln k − 1 + β)

Table 2: The job completion time and utilization for both elastic and rigid frameworks, and for both constant task durations and task durations that follow an exponential distribution. The framework starts with zero slots. k represents the number of slots the framework is entitled to under the given allocation, and βTs represents the time it takes a job to complete assuming the framework gets all k slots at once.

If the duration distribution is exponential, the expected ramp-up time is Ts ln k. The framework needs to wait on average Ts/k to acquire the first slot from the set of its k preferred slots, Ts/(k−1) to acquire the second slot from the remaining k−1 slots in the set, and Ts to acquire the last slot. Thus, the ramp-up time of f is

  Ts × (1 + 1/2 + ... + 1/k) ≈ Ts ln k.    (1)

Job completion time  Recall that βTs is the completion time of the job in an ideal scenario in which the framework acquires all its k slots instantaneously. If task durations are constant, the completion time is on average (1/2 + β)Ts. To show this, assume the starting and the ending times of the tasks are uniformly distributed, i.e., during the ramp-up phase, f acquires one slot every Ts/k on average. Thus, the framework's job can use roughly Ts k/2 computation time during the first Ts interval. Once the framework acquires its k slots, it will take the job (βkTs − Ts k/2)/k = (β − 1/2)Ts time to complete. As a result the job completion time is Ts + (β − 1/2)Ts = (1/2 + β)Ts (see Table 2).

In the case of the exponential distribution, the expected completion time of the job is Ts(1 + β) (see Table 2). Consider the ideal scenario in which the framework acquires all k slots instantaneously. Next, we compute how much computation time the job "loses" during the ramp-up phase compared to this ideal scenario. While the framework waits Ts/k to acquire the first slot, in the ideal scenario the job would have already used each of the k slots for a total of k × Ts/k = Ts time. Similarly, while the framework waits Ts/(k−1) to acquire the second slot, in the ideal scenario the job would have used the k−1 slots (still to be allocated) for another (k−1) × Ts/(k−1) = Ts time. In general, the framework loses Ts computation time while waiting to acquire each slot, and a total of kTs computation time during the entire ramp-up phase. To account for this loss, the framework needs to use all k slots for an additional Ts time, which increases the expected job completion time by Ts to (1 + β)Ts.

System utilization  As long as frameworks can scale up and down and there is enough demand in the system, the cluster will be fully utilized.

4.2.2 Rigid Frameworks

Some frameworks may not be able to start running jobs unless they reach a minimum allocation. One example is MPI, where all tasks must start a computation in sync. In this section we consider the worst case where the minimum allocation constraint equals the framework's full allocation, i.e., k slots.

Job completion time  While in this case the ramp-up time remains unchanged, the job completion time will change because the framework cannot use any slot before reaching its full allocation. If the task duration distribution is constant the completion time is simply Ts(1 + β), as the framework doesn't use any slot during the first Ts interval, i.e., until it acquires all k slots. If the distribution is exponential, the completion time becomes Ts(ln k + β) as it takes the framework Ts ln k to ramp up (see Eq. 1).

System utilization  Wasting allocated slots also has a negative impact on utilization. If the task duration is constant, and the framework acquires a slot every Ts/k on average, the framework will waste roughly Ts k/2 computation time during the ramp-up phase. Once it acquires all slots, the framework will use βkTs to complete its job. The utilization achieved by the framework in this case is then βkTs/(kTs/2 + βkTs) ≈ β/(1/2 + β).

If the task distribution is exponential, the expected computation time wasted by the framework is Ts(k ln(k−1) − (k−1)). The framework acquires the first slot after waiting Ts/k, the second slot after waiting Ts/(k−1), and the last slot after waiting Ts time. Since the framework does not use a slot before acquiring all of them, it follows that the first acquired slot is idle for ∑_{i=1}^{k−1} Ts/i, the second slot is idle for ∑_{i=1}^{k−2} Ts/i, and the next-to-last slot is idle for Ts time. As a result, the expected computation time wasted by the framework during the ramp-up phase is

  Ts × ∑_{i=1}^{k−1} i/(k−i)
    = Ts × ∑_{i=1}^{k−1} (k/(k−i) − 1)
    = Ts × k × ∑_{i=1}^{k−1} 1/(k−i) − Ts × (k−1)
    ≈ Ts × k × ln(k−1) − Ts × (k−1)
    = Ts × (k ln(k−1) − (k−1)).

Assuming k ≫ 1, the utilization achieved by the framework is βkTs/((k ln(k−1) − (k−1))Ts + βkTs) ≈ β/(ln k − 1 + β) (see Table 2).
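The exponential-distribution ramp-up result (Eq. 1) can be checked numerically. By memorylessness, the time until each of the k preferred slots has freed up once is the maximum of k independent exponentials with mean Ts, whose expectation is exactly Ts(1 + 1/2 + ... + 1/k). The sketch below (illustrative Python; function and parameter names are ours) compares a Monte Carlo estimate against the exact harmonic sum and the ln k approximation used in the text:

```python
import math
import random

# Monte Carlo estimate of expected framework ramp-up time when the k
# preferred slots run tasks with exponentially distributed durations of
# mean Ts (sketch only; not part of Mesos).
def mean_ramp_up(k, Ts, runs=20000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        # time until every one of the k slots has become free once
        total += max(rng.expovariate(1.0 / Ts) for _ in range(k))
    return total / runs

k, Ts = 50, 10.0
harmonic = sum(1.0 / i for i in range(1, k + 1))
print(mean_ramp_up(k, Ts))   # empirical mean ramp-up time
print(Ts * harmonic)         # exact expectation Ts * (1 + 1/2 + ... + 1/k)
print(Ts * math.log(k))      # the Ts ln k approximation from Eq. 1
```

For k = 50 the empirical mean tracks the harmonic-sum expectation closely; the ln k form is a coarser approximation, consistent with its use for the asymptotic results in Table 2.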
4.2.3 Placement Preferences

So far, we have assumed that frameworks have no slot preferences. In practice, different frameworks prefer different nodes and their preferences may change over time. In this section, we consider the case where frameworks have different preferred slots.

The natural question is how well Mesos will work in this case when compared to a centralized scheduler that has full information about framework preferences. We consider two cases: (a) there exists a system configuration in which each framework gets all its preferred slots and achieves its full allocation, and (b) there is no such configuration, i.e., the demand for preferred slots exceeds the supply.

In the first case, it is easy to see that, irrespective of the initial configuration, the system will converge to the state where each framework allocates its preferred slots after at most one Ts interval. This is simply because during a Ts interval all slots become available, and as a result each framework will be offered its preferred slots.

In the second case, there is no configuration in which all frameworks can satisfy their preferences. The key question in this case is how one should allocate the preferred slots across the frameworks demanding them. In particular, assume there are x slots preferred by m frameworks, where framework i requests ri such slots, and ∑_{i=1}^{m} ri > x. While many allocation policies are possible, here we consider the weighted fair allocation policy where the weight associated with a framework is its intended allocation, si. In other words, assuming that each framework has enough demand, framework i will get x × si/(∑_{i=1}^{m} si).

As an example, consider 30 slots that are preferred by two frameworks with intended allocations s1 = 200 slots and s2 = 100 slots, respectively. Assume that framework 1 requests r1 = 40 of the preferred slots, while framework 2 requests r2 = 20 of these slots. Then, the preferred slots are allocated according to weighted fair sharing, i.e., framework 1 receives x × s1/(s1 + s2) = 20 slots, and framework 2 receives the rest of 10 slots. If the demand of framework 1 is less than its share, e.g., r1 = 15, then framework 1 receives 15 slots (which satisfies its demand), and framework 2 receives the rest of 15 slots.

The challenge with Mesos is that the scheduler does not know the preferences of each framework. Fortunately, it turns out that there is an easy way to achieve the fair allocation of the preferred slots described above: simply offer slots to frameworks proportionally to their intended allocations. In particular, when a slot becomes available, Mesos offers that slot to framework i with probability si/(∑_{i=1}^{n} si), where n is the total number of frameworks in the system. Note that this scheme is similar to lottery scheduling [45]. Furthermore, note that since each framework i receives roughly si slots during a time interval Ts, the analysis of the ramp-up and completion times in Section 4.2 still holds.

There are also allocation policies other than fair sharing that can be implemented without knowledge of framework preferences. For example, a strict priority scheme (e.g., where framework 1 must always get priority over framework 2) can be implemented by always offering resources to the high-priority framework first.

4.3 Heterogeneous Tasks

So far we have assumed that frameworks have homogeneous task duration distributions. In this section, we discuss heterogeneous tasks, in particular, tasks that are short and long, where the mean duration of the long tasks is significantly longer than the mean of the short tasks. We show that by sharing nodes in space as discussed in Section 3.3.2, Mesos can accommodate long tasks without impacting the performance of short tasks.

A heterogeneous workload can hurt frameworks with short tasks by increasing the time it takes such frameworks to reach their allocation and to acquire preferred slots (relative to the frameworks' total job durations). In particular, a node preferred by a framework with short tasks may become fully filled by long tasks, greatly increasing a framework's waiting time.

To address this problem, Mesos differentiates between short and long slots, and bounds the number of long slots on each node. This ensures there are enough short tasks on each node whose slots become available with high frequency, giving frameworks better opportunities to quickly acquire a slot on one of their preferred nodes. In addition, Mesos implements a revocation mechanism that does not differentiate between long and short tasks once a framework exceeds its allocation. This makes sure that the excess slots in the system (i.e., the slots allocated by frameworks beyond their intended allocations) are all treated as short slots. As a result, a framework under its intended allocation can acquire slots as fast as in a system where all tasks are short. At the same time, Mesos ensures that frameworks running their tasks on long slots and not exceeding their guaranteed allocations won't have their tasks revoked.

4.4 Framework Incentives

Mesos implements a decentralized scheduling approach, where each framework decides which offers to accept or reject. As with any decentralized system, it is important to understand the incentives of various entities in the system. In this section, we discuss the incentives of a framework to improve the response times of its jobs.
8
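The weighted fair sharing arithmetic in Section 4.3 can be checked with a short sketch. The function below is our own illustration of progressive filling with demand caps, not Mesos code: frameworks whose demand is below their weighted share are satisfied exactly, and the surplus is split among the rest in proportion to their intended allocations.

```python
def weighted_fair(x, demands, intents):
    """Split x preferred slots by weighted fair sharing with demand caps."""
    alloc = [0.0] * len(demands)
    active = list(range(len(demands)))
    remaining = float(x)
    while active and remaining > 1e-9:
        total = sum(intents[i] for i in active)
        share = {i: remaining * intents[i] / total for i in active}
        # Frameworks whose leftover demand fits inside their share are
        # satisfied and leave the active set; their surplus is recycled.
        done = [i for i in active if demands[i] - alloc[i] <= share[i] + 1e-9]
        if not done:
            for i in active:
                alloc[i] += share[i]
            break
        for i in done:
            remaining -= demands[i] - alloc[i]
            alloc[i] = demands[i]
            active.remove(i)
    return alloc

# The 30-slot example from Section 4.3: s1 = 200, s2 = 100.
print(weighted_fair(30, [40, 20], [200, 100]))  # [20.0, 10.0]
print(weighted_fair(30, [15, 20], [200, 100]))  # [15.0, 15.0]
```

Offering each freed slot to framework i with probability si/(s1 + ... + sn) yields these allocations in expectation without the master knowing the frameworks' preferences.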
Short tasks: A framework is incentivized to use short tasks for two reasons. First, it will be able to allocate any slots; in contrast, frameworks with long tasks are restricted to a subset of slots. Second, using small tasks minimizes the wasted work if the framework loses a task, either due to revocation or simply due to failures.

No minimum allocation: The ability of a framework to use resources as soon as it allocates them, instead of waiting to reach a given minimum allocation, allows the framework to start (and complete) its jobs earlier. Note that the lack of a minimum allocation constraint implies the ability of the framework to scale up, while the reverse is not true, i.e., a framework may have both a minimum allocation requirement and the ability to allocate and use resources beyond this minimum allocation.

Scale down: The ability to scale down allows a framework to opportunistically grab available resources, as it can later release them with little negative impact.

Do not accept unknown resources: Frameworks are incentivized not to accept resources that they cannot use, because most allocation policies will account for all the resources that a framework owns when deciding which framework to offer resources to next.

We note that these incentives are all well aligned with our goal of improving utilization. When frameworks use short tasks, Mesos can reallocate resources quickly between them, reducing the wasted work due to revocation. If frameworks have no minimum allocation and can scale up and down, they will opportunistically utilize all the resources they can obtain. Finally, if frameworks do not accept resources that they do not understand, they will leave them for frameworks that do.

4.5 Limitations of Distributed Scheduling
Although we have shown that distributed scheduling works well for a range of workloads relevant to current cluster environments, like any decentralized approach, it can perform worse than a centralized scheduler. We have identified three limitations of the distributed model:

Fragmentation: When tasks have heterogeneous resource demands, a distributed collection of frameworks may not be able to optimize bin packing as well as a centralized scheduler. There is another possible bad outcome if allocation modules reallocate resources in a naive manner: when a cluster is filled by tasks with small resource requirements, a framework f with large resource requirements may starve, because whenever a small task finishes, f cannot accept the resources freed up by it, but other frameworks can. To accommodate frameworks with large per-task resource requirements, allocation modules can support a minimum offer size on each slave, and abstain from offering resources on that slave until this minimum amount is free.
Note that the wasted space due to both suboptimal bin packing and fragmentation is bounded by the ratio between the largest task size and the node size. Therefore, clusters running "larger" nodes (e.g., multicore nodes) and "smaller" tasks within those nodes (e.g., having a cap on task resources) will be able to achieve high utilization even with a distributed scheduling model.

Interdependent framework constraints: It is possible to construct scenarios where, because of esoteric interdependencies between frameworks' performance, only a single global allocation of the cluster resources performs well. We argue that such scenarios are rare in practice. In the model discussed in this section, where frameworks only have preferences over placement, we showed that allocations approximate those of optimal schedulers.

Framework complexity: Using resource offers may make framework scheduling more complex. We argue, however, that this difficulty is not onerous. First, whether using Mesos or a centralized scheduler, frameworks need to know their preferences; in a centralized scheduler, the framework would need to express them to the scheduler, whereas in Mesos, it needs to use them to decide which offers to accept. Second, many scheduling policies for existing frameworks are online algorithms, because frameworks cannot predict task times and must be able to handle node failures and slow nodes. These policies are easily implemented using the resource offer mechanism.

5 Implementation
We have implemented Mesos in about 10,000 lines of C++. The system runs on Linux, Solaris, and Mac OS X. Mesos applications can be programmed in C, C++, Java, Ruby, and Python. We use SWIG [15] to generate interface bindings for the latter three languages.
To reduce the complexity of our implementation, we use a C++ library called libprocess [8] that provides an actor-based programming model using efficient asynchronous I/O mechanisms (epoll, kqueue, etc.). We also leverage Apache ZooKeeper [4] to perform leader election, as described in Section 3.6. Finally, our current frameworks use HDFS [2] to share data.
Our implementation can use Linux containers [10] or Solaris projects [16] to isolate applications. We currently isolate CPU cores and memory.³
We have implemented five frameworks on top of Mesos. First, we have ported three existing cluster systems to Mesos: Hadoop [2], the Torque resource scheduler [42], and the MPICH2 implementation of MPI [22].

³ Support for network and I/O isolation was recently added to the Linux kernel [9] and we plan to extend our implementation to isolate these resources too.
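To give a flavor of what programming against resource offers looks like, here is a minimal, self-contained sketch of a framework scheduler. The class and method names are our own illustration, not the actual Mesos API (which differs across the language bindings); real offers also carry slave IDs, attributes, and executor information.

```python
from collections import namedtuple

# Toy stand-ins for the objects a scheduler driver would pass in.
Offer = namedtuple("Offer", ["offer_id", "node", "cpus", "mem_mb"])
Task = namedtuple("Task", ["name", "cpus", "mem_mb"])

class QueueScheduler:
    """Accepts offers only while it has queued work, mirroring the
    incentive to leave unneeded resources to other frameworks."""

    def __init__(self, task_cpus=1, task_mem_mb=512):
        self.task_cpus = task_cpus
        self.task_mem_mb = task_mem_mb
        self.queue = []  # names of pending tasks

    def submit(self, name):
        self.queue.append(name)

    def resource_offers(self, offers):
        """Return (offer_id, [tasks]) decisions; an empty task list
        rejects the offer so Mesos can re-offer it elsewhere."""
        decisions = []
        for offer in offers:
            tasks = []
            cpus, mem = offer.cpus, offer.mem_mb
            while (self.queue and cpus >= self.task_cpus
                   and mem >= self.task_mem_mb):
                tasks.append(Task(self.queue.pop(0),
                                  self.task_cpus, self.task_mem_mb))
                cpus -= self.task_cpus
                mem -= self.task_mem_mb
            decisions.append((offer.offer_id, tasks))
        return decisions

sched = QueueScheduler()
sched.submit("t1"); sched.submit("t2"); sched.submit("t3")
decisions = sched.resource_offers([Offer(1, "node1", 2, 2048),
                                   Offer(2, "node2", 4, 4096)])
# Offer 1 fits two tasks; offer 2 receives the remaining one.
```

Once the queue drains, subsequent offers are rejected outright, which is exactly the behavior the Torque wrapper in Section 5.2 relies on.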
None of these ports required changing the frameworks' APIs, so all of them can run unmodified user programs. In addition, we built a specialized framework called Spark, which we discuss in Section 5.3. Finally, to show the applicability of running front-end frameworks on Mesos, we have developed a framework that manages an elastic web server farm, which we discuss in Section 5.4.

5.1 Hadoop Port
Porting Hadoop to run on Mesos required relatively few modifications, because Hadoop concepts such as map and reduce tasks correspond cleanly to Mesos abstractions. In addition, the Hadoop "master", known as the JobTracker, and the Hadoop "slaves", known as TaskTrackers, naturally fit into the Mesos model as a framework scheduler and executor.
We first modified the JobTracker to schedule MapReduce tasks using resource offers. Normally, the JobTracker schedules tasks in response to heartbeat messages sent by TaskTrackers every few seconds, reporting the number of free slots in which "map" and "reduce" tasks can be run. The JobTracker then assigns (preferably data-local) map and reduce tasks to TaskTrackers. To port Hadoop to Mesos, we reuse the heartbeat mechanism as described, but dynamically change the number of possible slots on the TaskTrackers. When the JobTracker receives a resource offer, it decides if it wants to run any map or reduce tasks on the slaves included in the offer, using delay scheduling [48]. If it does, it creates a new Mesos task for the slave that signals the TaskTracker (i.e., the Mesos executor) to increase its number of total slots. The next time the TaskTracker sends a heartbeat, the JobTracker can assign the runnable map or reduce task to the new empty slot. When a TaskTracker finishes running a map or reduce task, the TaskTracker decrements its slot count (so the JobTracker doesn't keep sending it tasks) and reports the Mesos task as finished, allowing other frameworks to use those resources.
We also needed to change how map output data is served to reduce tasks. Hadoop normally writes map output files to the local filesystem, then serves these to reduce tasks using an HTTP server included in the TaskTracker. However, the TaskTracker within Mesos runs as an executor, which may be terminated if it is not running tasks, which would make map output files unavailable to reduce tasks. We solved this problem by providing a shared file server on each node in the cluster to serve local files. Such a service is useful beyond Hadoop, to other frameworks that write data locally on each node.
In total, we added 1,100 lines of new code to Hadoop.

5.2 Torque and MPI Ports
We have ported the Torque cluster resource manager to run as a framework on Mesos. The framework consists of a Mesos scheduler "wrapper" and a Mesos executor "wrapper", written in 360 lines of Python, that invoke different components of Torque as appropriate. In addition, we had to modify 3 lines of Torque source code in order to allow it to elastically scale up and down on Mesos depending on the jobs in its queue.
The scheduler wrapper first configures and launches a Torque server and then periodically monitors the server's job queue. While the queue is empty, the scheduler wrapper refuses all resource offers it receives. Once a job gets added to Torque's queue (using the standard qsub command), the scheduler wrapper informs the Mesos master that it can receive new offers. As long as there are jobs in Torque's queue, the scheduler wrapper accepts as many offers as needed to satisfy the constraints of its jobs. When the executor wrapper is launched, it starts a Torque backend daemon that registers with the Torque server. When enough Torque backend daemons have registered, the Torque server will launch the first job in the queue. In the event that Torque is running an MPI job, the executor wrapper will also launch the necessary MPI daemon processes.
Because jobs that run on Torque are typically not resilient to failures, Torque never accepts resources beyond its guaranteed allocation, to avoid having its tasks revoked. The scheduler wrapper will accept up to its guaranteed allocation whenever it can take advantage of those resources, such as when running multiple jobs simultaneously.
In addition to the Torque framework, we also created a Mesos "wrapper" framework, written in about 200 lines of Python code, for running MPI jobs directly on Mesos.

5.3 Spark Framework
To show the value of simple but specialized frameworks, we built Spark, a new framework for iterative jobs that was motivated by discussions with machine learning researchers at our institution.
One iterative algorithm used frequently in machine learning is logistic regression [11]. An implementation of logistic regression in Hadoop must run each iteration as a separate MapReduce job, because each iteration depends on values computed in the previous round. In this case, every iteration must re-read the input file from disk into memory. In Dryad, the whole job can be expressed as a data flow DAG as shown in Figure 4a, but the data must still be reloaded from disk into memory at each iteration. Reusing the data in memory between iterations in Dryad would require cyclic data flow.
Spark's execution is shown in Figure 4b. Spark uses the long-lived nature of Mesos executors to cache a slice of the data set in memory at each executor, and then run multiple iterations on this cached data. This caching is
Figure 4: Data flow of a logistic regression job in Dryad vs. Spark. Solid lines show data flow within the framework. Dashed lines show reads from a distributed file system. Spark reuses processes across iterations, only loading data once.

Figure 5: Cluster utilization over time on a 50-node cluster with three Hadoop frameworks each running back-to-back MapReduce jobs, plus one Torque framework running a mix of 8-node and 24-node HPL MPI benchmark jobs.
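To make the iteration pattern in Figure 4b concrete, here is a small, self-contained sketch of logistic regression over an in-memory dataset. It is our illustration of the cached-data pattern, not Spark's actual API; in Spark, each long-lived executor would hold its partition of the points in memory across iterations instead of the single local list used here.

```python
import math

# A toy "cached slice" of labeled points (x, y) with y in {-1, +1}.
cached_points = [((0.0, 1.0), 1), ((1.0, 2.0), 1),
                 ((0.0, -1.0), -1), ((-1.0, -2.0), -1)]

def logistic_regression(points, iterations=100, step=0.5):
    """Batch gradient descent; each pass reuses the in-memory points
    instead of re-reading the input from disk as a Hadoop job would."""
    w = [0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0, 0.0]
        for x, y in points:
            # Gradient of the logistic loss log(1 + exp(-y * w.x)).
            dot = w[0] * x[0] + w[1] * x[1]
            scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
            grad[0] += scale * x[0]
            grad[1] += scale * x[1]
        w = [w[0] - step * grad[0], w[1] - step * grad[1]]
    return w

w = logistic_regression(cached_points)
# w now separates the two classes: w.x > 0 exactly for the y = +1 points.
```

The point of the pattern is that only the weight vector changes between iterations; the (potentially large) dataset is loaded once and reused.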
achieved in a fault-tolerant manner: if a node is lost, Spark remembers how to recompute its slice of the data.
Spark leverages Scala to provide a language-integrated syntax similar to DryadLINQ [47]: users invoke parallel operations by applying a function on a special "distributed dataset" object, and the body of the function is captured as a closure to run as a set of tasks in Mesos. Spark then schedules these tasks to run on executors that already have the appropriate data cached, using delay scheduling. By building on top of Mesos, Spark's implementation required only 1,300 lines of code.
Due to lack of space, we have limited our discussion of Spark in this paper and refer the reader to [49] for details.

5.4 Elastic Web Server Farm
We built an elastic web farm framework that takes advantage of Mesos to scale up and down based on external load. Similar to the Torque framework, the web farm framework uses a scheduler "wrapper" and an executor "wrapper". The scheduler wrapper launches an haproxy [6] load balancer and periodically monitors its web request statistics to decide when to launch or tear down servers. Its only scheduling constraint is that it will launch at most one Apache instance per machine, and it then sets a filter to stop receiving further offers for that machine. The wrappers are implemented in 250 lines of Python.

6 Evaluation
We evaluated Mesos by performing a series of experiments using Amazon's EC2 environment.

6.1 Macrobenchmark
To evaluate the primary goal of Mesos, which is enabling multiple diverse frameworks to efficiently share the same cluster, we ran a macrobenchmark consisting of a mix of fine- and coarse-grained frameworks: three Hadoop frameworks and a single Torque framework. We submitted a handful of different-sized jobs to each of these four frameworks. Within the Hadoop frameworks, we ran back-to-back WordCount MapReduce jobs, and within the Torque framework we submitted jobs running the High-Performance LINPACK [19] benchmark (HPL).
This experiment was performed using 50 EC2 instances, each with 4 CPU cores and 15 GB RAM. Figure 5 shows node utilization over time. The guaranteed allocation for the Torque framework was 48 cores (1/4 of the cluster). Eight MPI jobs were launched, each using 8 nodes, beginning around time 80s. The Torque framework launches tasks that use only one core. Six of the HPL jobs were able to run as soon as they were submitted, bringing the Torque framework up to its guaranteed allocation of 48 cores. Thereafter, two of them were placed in the job queue and were launched as soon as the first and second HPL jobs finished (around times 315s and 345s). The Hadoop instances' tasks fill in any available remaining slots, keeping cluster utilization at 100%.

Application       | Duration Alone | Duration on Mesos
MPI LINPACK       | 50.9s          | 51.8s
Hadoop WordCount  | 159.9s         | 166.2s
Table 3: Overhead of MPI and Hadoop benchmarks on Mesos.

Figure 6: Data locality and average job running times for 16 Hadoop instances on a 93-node cluster using static partitioning, Mesos, or Mesos with delay scheduling.

6.2 Overhead
To measure the overhead Mesos imposes on existing cluster computing frameworks, we ran two benchmarks using MPI (not with Torque) and Hadoop. These ex-
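The relative slowdowns implied by Table 3 can be computed directly; the arithmetic below is ours, using only the durations reported in the table.

```python
# (duration alone, duration on Mesos), in seconds, from Table 3.
durations = {
    "MPI LINPACK": (50.9, 51.8),
    "Hadoop WordCount": (159.9, 166.2),
}

# Overhead = (time on Mesos - time alone) / time alone.
for app, (alone, mesos) in durations.items():
    overhead = (mesos - alone) / alone
    print(f"{app}: {overhead:.1%} overhead")
# MPI LINPACK: 1.8% overhead
# Hadoop WordCount: 3.9% overhead
```

That is, Mesos adds roughly 2% to the MPI job and roughly 4% to the Hadoop job.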
6.6 Mesos Scalability
...within the cluster that runs a single 10 second task. Note that each slave reports 8 CPU cores and 6 GB RAM, and each framework runs tasks that consume 4 CPU cores and 3 GB RAM. Figure 9 shows the average of 5 runtimes for launching the single framework after reaching steady state.
Our initial Mesos implementation (labeled "reactive" in Figure 9) failed to scale beyond 15,000 slaves because it attempted to allocate resources immediately after every task finished. While the allocator code is fairly simple, it dominated the master's execution time, causing latencies to increase. To amortize allocation costs, we modified Mesos to perform allocations in batch intervals. To demonstrate how well this implementation scaled, we ran the same experiment as before, but also tried variations with average task lengths of 10 and 30 seconds (labeled "10 s" and "30 s" in Figure 9). As the graph shows, with a 30 second average task length Mesos imposes less than 1 second of additional overhead on frameworks. Unfortunately, the EC2 virtualized environment limited scalability beyond 50,000 slaves: at 50,000 slaves the master was processing 100,000 packets per second (in + out), which has been shown to be the current achievable limit in EC2 [14].

6.7 Fault Tolerance
To evaluate recovery from master failures, we conducted an experiment identical to the scalability experiment, except that we used 20 second task lengths, and two Mesos masters connected to a 5-node ZooKeeper quorum using a default tick timer set to 2 seconds. We ran the experiment using 62 EC2 instances, each with 4 CPU cores and 15 GB RAM. We synchronized the two masters' clocks using NTP and measured the mean time to recovery (MTTR) after killing the active master. The MTTR is the time for all of the slaves and frameworks to connect to the second master. Figure 10 shows the average MTTR with a 95% confidence interval for different cluster sizes. Although not shown, we experimented with different sizes of ZooKeeper quorums (between 1 and 11). The results consistently showed that the MTTR is always below 15 seconds, and typically 7-8 seconds.

Figure 10: Mean time for the slaves and applications to reconnect to a secondary Mesos master upon master failure. The plot shows 95% confidence intervals.

7 Related Work
HPC and Grid schedulers. The high performance computing (HPC) community has long been managing clusters [42, 51, 28, 20]. Their target environment typically consists of specialized hardware, such as Infiniband, SANs, and parallel filesystems. Thus jobs do not need to be scheduled local to their data. Furthermore, each job is tightly coupled, often using barriers or message passing. Thus, each job is monolithic, rather than composed of smaller fine-grained tasks. Consequently, a job does not dynamically grow or shrink its resource demands across machines during its lifetime. Moreover, fault tolerance is achieved through checkpointing, rather than by recomputing fine-grained tasks. For these reasons, HPC schedulers use centralized scheduling, and require jobs to declare their required resources at submission time. Jobs are then given coarse-grained allocations of the cluster. Unlike the Mesos approach, this does
not allow frameworks to locally access data distributed over the cluster. Furthermore, jobs cannot grow and shrink dynamically as their allocations change. In addition to supporting fine-grained sharing, Mesos can run HPC schedulers, such as Torque, as frameworks, which can then schedule HPC workloads appropriately.
Grid computing has mostly focused on the problem of making diverse virtual organizations share geographically distributed and separately administered resources in a secure and interoperable way. Mesos could well be used within a virtual organization that is part of a larger grid that, for example, runs Globus Toolkit.

Public and Private Clouds. Virtual machine clouds, such as Amazon EC2 [1] and Eucalyptus [40], share common goals with Mesos, such as isolating frameworks while providing a low-level abstraction (VMs). However, they differ from Mesos in several important ways. First, these systems share resources in a coarse-grained manner, where a user may hold onto a virtual machine for an arbitrary amount of time. This makes fine-grained sharing of data difficult. Second, these systems generally do not let applications specify placement needs beyond the size of virtual machine they require. In contrast, Mesos allows frameworks to be highly selective about which resources they acquire through resource offers.

Quincy. Quincy [34] is a fair scheduler for Dryad. It uses a centralized scheduling algorithm based on min-cost flow optimization and takes into account both fairness and locality constraints. Resources are reassigned by killing tasks (similar to Mesos's resource revocation) when the output of the min-cost flow algorithm changes. Quincy assumes that tasks have identical resource requirements, and the authors note that Quincy's min-cost flow formulation of the scheduling problem is difficult to extend to tasks with multi-dimensional resource demands. In contrast, Mesos aims to support multiple cluster computing frameworks, including frameworks with scheduling preferences other than data locality. Mesos uses a decentralized two-level scheduling approach that lets frameworks decide where they run, and it supports heterogeneous task resource requirements. We have shown that frameworks can still achieve near-perfect data locality using delay scheduling.

Specification Language Approach. Some systems, such as Condor and Clustera [43, 26], go to great lengths to match users and jobs to available resources. Clustera provides multi-user support, and uses a heuristic incorporating user priorities, data locality, and starvation to match work to idle nodes. However, Clustera requires each job to explicitly list its data locality needs, and does not support placement constraints other than data locality. Condor uses ClassAds [41] to match node properties to job needs. This approach of using a resource specification language always has the problem that certain preferences might not currently be possible to express. For example, delay scheduling is hard to express with the current languages. In contrast, the two-level scheduling and resource offer architecture in Mesos gives jobs control in deciding where and when to run tasks.

8 Conclusion
We have presented Mesos, a resource management substrate that allows diverse parallel applications to efficiently share a cluster. Mesos is built around two contributions: a fine-grained sharing model where applications divide work into smaller tasks, and a distributed scheduling mechanism called resource offers that lets applications choose which resources to run on. Together, these contributions let Mesos achieve high utilization, respond rapidly to workload changes, and cater to applications with diverse placement preferences, while remaining simple and scalable. We have shown that existing cluster applications can effectively share resources with Mesos, that new specialized cluster computing frameworks, such as Spark, can provide major performance gains, and that Mesos's simple architecture enables the system to be fault tolerant and to scale to 50,000 nodes.
Mesos is inspired by work on microkernels [18], exokernels [27] and hypervisors [32] in the OS community, and by the success of the narrow-waist IP model [23] in computer networks. Like a microkernel or hypervisor, Mesos is a stable, minimal core that isolates applications sharing a cluster. Like an exokernel, Mesos aims to give applications maximal control over their execution. Finally, like IP, Mesos encourages diversity and innovation in cluster computing by providing a "narrow waist" API that lets diverse applications coexist efficiently.
In future work, we intend to focus on three areas. First, we plan to further analyze the resource offer model to characterize the environments it works well in and determine whether any extensions can improve its efficiency while retaining its flexibility. Second, we plan to use Mesos as a springboard to experiment with cluster programming models. Lastly, we intend to build a stack of higher-level programming abstractions on Mesos to allow developers to quickly write scalable, fault-tolerant applications.

References
[1] Amazon EC2. http://aws.amazon.com/ec2/.
[2] Apache Hadoop. http://hadoop.apache.org.
[3] Apache Hive. http://hadoop.apache.org/hive/.
[4] Apache ZooKeeper. http://hadoop.apache.org/zookeeper/.
[5] Hadoop and Hive at Facebook. http://www.slideshare.net/dzhou/facebook-hadoop-data-applications.
[6] HAProxy Homepage. http://haproxy.1wt.eu/.
[7] Hive – A Petabyte Scale Data Warehouse using Hadoop. http://www.facebook.com/note.php?note_id=89508453919.
[8] LibProcess Homepage. http://www.eecs.berkeley.edu/~benh/libprocess.
[9] Linux 2.6.33 release notes. http://kernelnewbies.org/Linux_2_6_33.
[10] Linux containers (LXC) overview document. http://lxc.sourceforge.net/lxc.html.
[11] Logistic Regression Wikipedia Page. http://en.wikipedia.org/wiki/Logistic_regression.
[12] Personal communication with Dhruba Borthakur from the Facebook data infrastructure team.
[13] Personal communication with Khaled Elmeleegy from Yahoo! Research.
[14] RightScale webpage. http://blog.rightscale.com/2010/04/01/benchmarking-load-balancers-in-the-cloud.
[15] Simplified wrapper interface generator. http://www.swig.org.
[16] Solaris Resource Management. http://docs.sun.com/app/docs/doc/817-1592.
[17] Twister. http://www.iterativemapreduce.org.
[18] M. J. Accetta, R. V. Baron, W. J. Bolosky, D. B. Golub, R. F. Rashid, A. Tevanian, and M. Young. Mach: A new kernel foundation for UNIX development. In USENIX Summer, pages 93–113, 1986.
[19] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen. LAPACK: a portable linear algebra library for high-performance computers. In Supercomputing '90: Proceedings of the 1990 Conference on Supercomputing, pages 2–11, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press.
[20] D. Angulo, M. Bone, C. Grubbs, and G. von Laszewski. Workflow management through Cobalt. In International Workshop on Grid Computing Environments–Open Access (2006), Tallahassee, FL, November 2006.
[21] D. Bertsekas and R. Gallager. Data Networks. Prentice Hall, second edition, 1992.
[22] A. Bouteiller, F. Cappello, T. Herault, G. Krawezik, P. Lemarinier, and F. Magniette. MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 25, Washington, DC, USA, 2003. IEEE Computer Society.
[23] D. D. Clark. The design philosophy of the DARPA internet protocols. In SIGCOMM, pages 106–114, 1988.
[24] T. Condie, N. Conway, P. Alvaro, and J. M. Hellerstein. MapReduce online. In 7th Symposium on Networked Systems Design and Implementation. USENIX, May 2010.
[25] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
[26] D. J. DeWitt, E. Paulson, E. Robinson, J. F. Naughton, J. Royalty, S. Shankar, and A. Krioukov. Clustera: an integrated computation and data management system. PVLDB, 1(1):28–41, 2008.
[27] D. R. Engler, M. F. Kaashoek, and J. O'Toole. Exokernel: An operating system architecture for application-level resource management. In SOSP, pages 251–266, 1995.
[28] W. Gentzsch. Sun Grid Engine: towards creating a compute power grid. pages 35–36, 2001.
[29] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of heterogeneous resources in datacenters. Technical Report UCB/EECS-2010-55, EECS Department, University of California, Berkeley, May 2010.
[30] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: research problems in data center networks. SIGCOMM Comput. Commun. Rev., 39(1):68–73, 2009.
[31] B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. Comet: Batched stream processing in data intensive distributed computing. Technical Report MSR-TR-2009-180, Microsoft Research, December 2009.
[32] E. C. Hendricks and T. C. Hartmann. Evolution of a virtual machine subsystem. IBM Systems Journal, 18(1):111–142, 1979.
[33] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys '07, pages 59–72, 2007.
[34] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In SOSP 2009, November 2009.
[35] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta. On availability of intermediate data in cloud computations. In HotOS '09: Proceedings of the Twelfth Workshop on Hot Topics in Operating Systems. USENIX, May 2009.
[36] D. Logothetis, C. Olston, B. Reed, K. Webb, and K. Yocum. Programming bulk-incremental dataflows. Technical report, University of California San Diego, http://cseweb.ucsd.edu/~dlogothetis/docs/bips_techreport.pdf, 2009.
[37] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In PODC '09, pages 6–6, New York, NY, USA, 2009. ACM.
[38] D. Mosberger and T. Jin. httperf – a tool for measuring web server performance. SIGMETRICS Perform. Eval. Rev., 26(3):31–37, 1998.
[39] H. Moulin. Fair Division and Collective Welfare. The MIT Press, 2004.
[40] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The Eucalyptus open-source cloud-computing system. In Proceedings of Cloud Computing and Its Applications [Online], Chicago, Illinois, October 2008.
[41] R. Raman, M. Solomon, M. Livny, and A. Roy. The ClassAds language. pages 255–270, 2004.
[42] G. Staples. TORQUE resource manager. In SC, page 8, 2006.
[43] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 17(2-4):323–356, 2005.
[44] H. Varian. Equity, envy, and efficiency. Journal of Economic Theory, 9(1):63–91, 1974.
[45] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: flexible proportional-share resource management. In OSDI '94: Proceedings of the 1st USENIX Conference on Operating Systems Design and Implementation, pages 1+, Berkeley, CA, USA, 1994. USENIX Association.
[46] Y. Yu, P. K. Gunda, and M. Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In SOSP '09: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 247–260, New York, NY, USA, 2009. ACM.
[47] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, pages 1–14, 2008.
[48] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys 2010, April 2010.
[49] M. Zaharia, N. M. M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. Technical Report UCB/EECS-2010-53, EECS Department, University of California, Berkeley, May 2010.
[50] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In Proc. OSDI 2008, December 2008.
[51] S. Zhou. LSF: Load sharing in large-scale heterogeneous distributed systems. In Workshop on Cluster Computing, Tallahassee, FL, December 1992.