
Mesos: A Platform for Fine-Grained Resource Sharing

in the Data Center

Benjamin Hindman
Andrew Konwinski
Matei Zaharia
Ali Ghodsi
Anthony D. Joseph
Randy H. Katz
Scott Shenker
Ion Stoica

Electrical Engineering and Computer Sciences


University of California at Berkeley

Technical Report No. UCB/EECS-2010-87


http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-87.html

May 26, 2010


Copyright © 2010, by the author(s).
All rights reserved.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Benjamin Hindman, Andy Konwinski, Matei Zaharia,


Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
University of California, Berkeley

Abstract

We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI.¹ Sharing improves cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. To support the sophisticated schedulers of today's frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. Our experimental results show that Mesos can achieve near-optimal locality when sharing the cluster among diverse frameworks, can scale up to 50,000 nodes, and is resilient to node failures.

¹ We will continue to improve this Technical Report. The latest version is available at http://mesos.berkeley.edu/tr-mesos.pdf.

1 Introduction

Clusters of commodity servers have become a major computing platform, powering both large Internet services and a growing number of data-intensive scientific applications. Driven by these applications, researchers and practitioners have been developing a diverse array of cluster computing frameworks to simplify programming the cluster. Prominent examples include MapReduce [25], Dryad [33], MapReduce Online [24] (which supports streaming jobs), Pregel [37] (a specialized framework for graph computations), and others [36, 17, 31]. It seems clear that new cluster computing frameworks will continue to emerge, and that no framework will be optimal for all applications. Therefore, organizations will want to run multiple frameworks in the same cluster, picking the best one for each application. Sharing a cluster between frameworks is desirable for two reasons: it improves utilization through statistical multiplexing, and it lets applications share datasets that may be too expensive to replicate. In this paper, we explore the problem of efficiently sharing commodity clusters among diverse cluster computing frameworks.

An important feature of commodity clusters is that data is distributed throughout the cluster and stored on the same nodes that run computations. In these environments, reading data remotely is expensive, so it is important to schedule computations near their data. Consequently, sharing the cluster requires a fine-grained scheduling model, where applications take turns running computations on each node. Existing cluster computing frameworks, such as Hadoop and Dryad, already implement fine-grained sharing across their jobs by multiplexing at the level of tasks within a job [34, 48]. However, because these frameworks are developed independently, there is no way to perform fine-grained sharing across different frameworks, making it difficult to share clusters and data efficiently between frameworks.

In this paper, we propose Mesos, a thin resource management substrate that enables efficient fine-grained sharing across diverse cluster computing frameworks. Mesos gives diverse frameworks a common interface for running fine-grained tasks in a cluster.

The main challenge Mesos must address is that different frameworks have different scheduling needs. Each framework has scheduling preferences specific to its programming model, based on the dependencies between its tasks, the location of its data, its communication pattern, and domain-specific optimizations. Furthermore, frameworks' scheduling policies are rapidly evolving [46, 48, 50, 35]. Therefore, designing a generic scheduler to support all current and future frameworks is hard, if not infeasible. Instead, Mesos takes a different approach: giving frameworks control over their scheduling.

Mesos lets frameworks choose which resources to use through a distributed scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, based on an organizational
policy such as fair sharing, but frameworks decide which resources to accept and which computations to run on them. We have found that resource offers are flexible enough to let frameworks achieve a variety of placement goals, including data locality. In addition, resource offers are simple and efficient to implement, allowing Mesos to be highly scalable and robust to failures.

Mesos's flexible fine-grained sharing model also has other advantages. First, Mesos can support a wide variety of applications beyond data-intensive computing frameworks, such as client-facing services (e.g., web servers) and compute-intensive parallel applications (e.g., MPI), allowing organizations to consolidate workloads and increase utilization. Although not all of these applications run tasks that are fine-grained in time, Mesos can share nodes in space between short and long tasks.

Second, even organizations that only use one framework can leverage Mesos to run multiple isolated instances of that framework in the same cluster, or multiple versions of the framework. For example, an organization using Hadoop might want to run separate instances of Hadoop for production and experimental jobs, or to test a new Hadoop version alongside its existing one.

Third, by providing a platform for resource sharing across frameworks, Mesos allows framework developers to build specialized frameworks targeted at particular problem domains rather than one-size-fits-all abstractions. Frameworks can therefore evolve faster and provide better support for each problem domain.

We have implemented Mesos in 10,000 lines of C++. The system scales to 50,000 nodes and uses Apache ZooKeeper [4] for fault tolerance. To evaluate Mesos, we have ported three cluster computing systems to run over it: Hadoop, MPI, and the Torque batch scheduler. To validate our hypothesis that specialized frameworks provide value over general ones, we have also built a new framework on top of Mesos called Spark, optimized for iterative jobs where a dataset is reused in many parallel operations. Spark can outperform Hadoop by 10x in iterative machine learning workloads. Finally, to evaluate the applicability of Mesos to client-facing workloads, we have built an elastic Apache web server farm framework.

This paper is organized as follows. Section 2 details the data center environment that Mesos is designed for. Section 3 presents the architecture of Mesos. Section 4 analyzes our distributed scheduling model and characterizes the environments it works well in. We present our implementation of Mesos in Section 5 and evaluate it in Section 6. Section 7 surveys related work. We conclude with a discussion in Section 8.

2 Target Environment

As an example of a workload we aim to support, consider the Hadoop data warehouse at Facebook [5, 7]. Facebook loads logs from its web services into a 1200-node Hadoop cluster, where they are used for applications such as business intelligence, spam detection, and ad optimization. In addition to "production" jobs that run periodically, the cluster is used for many experimental jobs, ranging from multi-hour machine learning computations to 1-2 minute ad-hoc queries submitted interactively through an SQL interface to Hadoop called Hive [3]. Most jobs are short (the median being 84s long), and the jobs are composed of fine-grained map and reduce tasks (the median task being 23s), as shown in Figure 1.

[Figure 1: CDF of job and task durations in Facebook's Hadoop data warehouse (data from [48]).]

To meet the performance requirements of these jobs, Facebook uses a fair scheduler for Hadoop that takes advantage of the fine-grained nature of the workload to make scheduling decisions at the level of map and reduce tasks and to optimize data locality [48]. Unfortunately, this means that the cluster can only run Hadoop jobs. If a user wishes to write a new ad targeting algorithm in MPI instead of MapReduce, perhaps because MPI is more efficient for this job's communication pattern, then the user must set up a separate MPI cluster and import terabytes of data into it.² Mesos aims to enable fine-grained sharing between multiple cluster computing frameworks, while giving these frameworks enough control to achieve placement goals such as data locality.

² This problem is not hypothetical; our contacts at Yahoo! and Facebook report that users want to run MPI and MapReduce Online [13, 12].

In addition to sharing clusters between "back-end" applications such as Hadoop and MPI, Mesos also enables organizations to share resources between "front-end" workloads, such as web servers, and back-end workloads. This is attractive because front-end applications have variable load patterns (e.g., diurnal cycles and spikes), so there is an opportunity to scale them down during periods of low load and use free resources to speed up back-end workloads. We recognize that there are obstacles beyond node scheduling to collocating front-end and back-end workloads at very large scales, such as the difficulty of isolating network traffic [30]. In Mesos, we focus on defining a resource access
[Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI).]

[Figure 3: Resource offer example.]
interface that allows these applications to coexist, and we hope to leverage isolation solutions developed by others.

3 Architecture

We begin our description of Mesos by presenting our design philosophy. We then describe the components of Mesos, our resource allocation mechanisms, and how Mesos achieves isolation, scalability, and fault tolerance.

3.1 Design Philosophy

Recall that Mesos aims to provide a stable and scalable core that diverse frameworks can run over to share cluster resources. Because cluster frameworks are both highly diverse and rapidly evolving, our overriding design philosophy has been to define a minimal interface that enables efficient resource sharing, and otherwise push control to the frameworks. Pushing control to the frameworks has two benefits. First, it allows frameworks to implement diverse approaches to various problems in the cluster (e.g., dealing with faults), and to evolve these solutions independently. Second, it keeps Mesos simple and minimizes the rate of change required of the system, which makes it easier to make Mesos scalable and robust.

Although Mesos provides a low-level interface, we expect higher-level libraries implementing common functionality (such as fault tolerance) to be built on top of it. These libraries would be analogous to library OSes in the exokernel [27]. Putting this functionality in libraries rather than in Mesos allows Mesos to remain small and flexible, and lets the libraries evolve independently.

3.2 Overview

Figure 2 shows the components of Mesos. The system consists of a master process that manages slave daemons running on each cluster node. We use ZooKeeper [4] to make the master fault tolerant, as we shall describe in Section 3.6. Frameworks running on Mesos consist of two components: a scheduler that registers with the master to be offered resources, and an executor process that is launched on slave nodes to run the framework's tasks.

The main role of the Mesos master is to offer available resources on slaves to framework schedulers through resource offers. Each resource offer contains a list of free resources on multiple slaves. Multiple offers describing disjoint resource sets can be outstanding at a time. A pluggable allocation module in the master determines how many resources to offer to each framework. Frameworks' schedulers select which of the offered resources to use, and describe tasks to launch on those resources.

Figure 3 shows an example of how a framework gets scheduled to run a task. In step (1), slave 1 reports to the master that it has 4 CPUs and 4 GB of memory free. The master then invokes the allocation policy module, which tells it that framework 1 should be offered all available resources. In step (2), the master sends a resource offer describing what is available on slave 1 to framework 1. In step (3), the framework's scheduler replies to the master with information about two tasks to run on the slave, using ⟨2 CPUs, 1 GB RAM⟩ for the first task and ⟨1 CPU, 2 GB RAM⟩ for the second task. Finally, in step (4), the master sends the tasks to the slave, which allocates appropriate resources to the framework's executor, which in turn launches the two tasks (depicted with dotted-line borders in the figure). Because 1 CPU and 1 GB of RAM are still unallocated, the allocation module may now offer them to framework 2. In addition, this resource offer process repeats when tasks finish and new resources become free.
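To make this offer/reply cycle concrete, the following minimal sketch models the four steps above in Python. The class and method names are illustrative assumptions (the real Mesos API is implemented in C++ and summarized in Table 1), and the allocation policy is hard-coded to "offer everything to framework 1", as in the example.

    # Minimal in-process sketch of the Figure 3 offer/reply cycle.
    # Class and method names are illustrative, not the real Mesos API.

    class Offer:
        def __init__(self, offer_id, slave, cpus, mem_gb):
            self.offer_id, self.slave = offer_id, slave
            self.cpus, self.mem_gb = cpus, mem_gb

    class Framework1Scheduler:
        def resource_offer(self, offer):
            # Step (3): describe two tasks that fit inside the offered resources.
            return [
                {"task_id": "task1", "slave": offer.slave, "cpus": 2, "mem_gb": 1},
                {"task_id": "task2", "slave": offer.slave, "cpus": 1, "mem_gb": 2},
            ]

    class Master:
        def __init__(self, frameworks):
            self.frameworks = frameworks
            self.free = {"slave1": {"cpus": 4, "mem_gb": 4}}   # step (1): slave report

        def run_offer_cycle(self, slave):
            res = self.free[slave]
            # Step (2): the allocation policy picks framework 1; offer all free resources.
            fw = self.frameworks[0]
            tasks = fw.resource_offer(Offer("offer-1", slave, res["cpus"], res["mem_gb"]))
            # Step (4): launch the tasks and deduct their resources from the slave.
            for t in tasks:
                res["cpus"] -= t["cpus"]
                res["mem_gb"] -= t["mem_gb"]
            return tasks, res

    master = Master([Framework1Scheduler()])
    launched, leftover = master.run_offer_cycle("slave1")
    print(launched)    # the two tasks from step (3)
    print(leftover)    # {'cpus': 1, 'mem_gb': 1}: could now be offered to framework 2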
The key aspect of Mesos that lets frameworks achieve placement goals is the fact that they can reject resources. In particular, we have found that a simple policy called delay scheduling [48], in which frameworks wait for a limited time to acquire preferred nodes, yields nearly optimal data locality. We report these results in Section 6.3.

In this section, we describe how Mesos performs two major roles: allocation (performed by allocation modules in the master) and isolation (performed by the slaves). We include a discussion of one allocation policy we have developed for fairly sharing multiple resources. We then
describe elements of our architecture that let resource offers work robustly and efficiently in a distributed system.

3.3 Resource Allocation

Mesos delegates allocation decisions to a pluggable allocation module, so that organizations can tailor allocation to their needs. In normal operation, Mesos takes advantage of the fine-grained nature of tasks to only reallocate resources when tasks finish. This usually happens frequently enough to let new frameworks start within a fraction of the average task length: for example, if a framework's share is 10% of the cluster, it needs to wait on average 10% of the mean task length to receive its share. Therefore, the allocation module only needs to decide which frameworks to offer free resources to, and how many resources to offer to them. However, our architecture also supports long tasks by allowing allocation modules to specifically designate a set of resources on each node for use by long tasks. Finally, we give allocation modules the power to revoke (kill) tasks if resources are not becoming free quickly enough.

In this section, we start by describing one allocation policy that we have developed, which performs fair sharing between frameworks using a definition of fairness for multiple resources called Dominant Resource Fairness. We then explain Mesos's mechanisms for supporting long tasks and revocation.

3.3.1 Dominant Resource Fairness (DRF)

Because Mesos manages multiple resources (CPU, memory, network bandwidth, etc.) on each slave, a natural question is what constitutes a fair allocation when different frameworks prioritize resources differently. We carried out a thorough study of desirable fairness properties and possible definitions of fairness, which is detailed in [29]. We concluded that most other definitions of fairness have undesirable properties. For example, Competitive Equilibrium from Equal Incomes (CEEI) [44], the preferred fairness metric in micro-economics [39], has the disadvantage that the freeing up of resources might punish an existing user's allocation. Similarly, other notions of fairness violated other desirable properties, such as envy-freedom, possibly leading to users gaming the system by hoarding resources that they do not need.

To this end, we designed a fairness policy called dominant resource fairness (DRF), which attempts to equalize each framework's fractional share of its dominant resource, which is the resource that it has the largest fractional share of. For example, if a cluster contains 100 CPUs and 100 GB of RAM, and framework F1 needs 4 CPUs and 1 GB RAM per task while F2 needs 1 CPU and 8 GB RAM per task, then DRF gives F1 20 tasks (80 CPUs and 20 GB) and gives F2 10 tasks (10 CPUs and 80 GB). This makes F1's share of CPU equal to F2's share of RAM, while fully utilizing one resource (RAM).

DRF is a natural generalization of max-min fairness [21]. DRF satisfies the above-mentioned properties and performs scheduling in O(log n) time for n frameworks. We believe that DRF is novel, but due to space constraints, we refer the reader to [29] for details.
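One way to see how DRF arrives at the 20-task/10-task split in this example is to hand out tasks greedily, always to the framework whose dominant share is currently smallest. The sketch below does exactly that; it is an illustrative helper, not the actual allocation module (which is described in [29]).

    # Sketch: greedy dominant-resource-fair allocation for the example in the text
    # (100 CPUs, 100 GB RAM; F1 needs <4 CPU, 1 GB> per task, F2 needs <1 CPU, 8 GB>).
    # Illustrative only; the real DRF allocator is described in [29].

    def drf_allocate(total, demands):
        used = {f: {r: 0.0 for r in total} for f in demands}
        tasks = {f: 0 for f in demands}
        while True:
            # Pick the framework with the smallest dominant share that can still fit a task.
            candidates = []
            for f, need in demands.items():
                if all(used[f][r] + need[r] <= total[r] for r in total):
                    dom_share = max(used[f][r] / total[r] for r in total)
                    candidates.append((dom_share, f))
            if not candidates:
                return tasks, used
            _, f = min(candidates)
            for r in total:
                used[f][r] += demands[f][r]
            tasks[f] += 1

    total = {"cpus": 100, "mem_gb": 100}
    demands = {"F1": {"cpus": 4, "mem_gb": 1}, "F2": {"cpus": 1, "mem_gb": 8}}
    print(drf_allocate(total, demands))
    # F1 gets 20 tasks (80 CPUs, 20 GB) and F2 gets 10 tasks (10 CPUs, 80 GB):
    # each ends up with a dominant share of 0.8, matching the allocation in the text.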
3.3.2 Supporting Long Tasks

Apart from fine-grained workloads consisting of short tasks, Mesos also aims to support frameworks with longer tasks, such as web services and MPI programs. This is accomplished by sharing nodes in space between long and short tasks: for example, a 4-core node could be running MPI on two cores, while also running MapReduce tasks that access local data. If long tasks are placed arbitrarily throughout the cluster, however, some nodes may become filled with them, preventing other frameworks from accessing local data. To address this problem, Mesos allows allocation modules to bound the total resources on each node that can run long tasks. The amount of long task resources still available on the node is reported to frameworks in resource offers. When a framework launches a task, it marks it as either long or short. Short tasks can use any resources, but long tasks can only use up to the amount specified in the offer.
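As a rough illustration of this bounding, an allocation module might advertise a separate long-task budget in each offer and check launches against it. The field names and the specific cap below are hypothetical; they only mirror the behavior just described.

    # Sketch: per-node cap on resources usable by long tasks (names are illustrative).

    NODE_CPUS = 4
    LONG_TASK_CPU_CAP = 2          # e.g., at most 2 of 4 cores may run long tasks

    class NodeState:
        def __init__(self):
            self.free_cpus = NODE_CPUS
            self.long_cpus_in_use = 0

        def offer(self):
            # Offers report both total free resources and how much of the
            # long-task budget is still available on this node.
            return {"cpus": self.free_cpus,
                    "long_task_cpus": LONG_TASK_CPU_CAP - self.long_cpus_in_use}

        def launch(self, cpus, is_long):
            if cpus > self.free_cpus:
                return False
            if is_long and self.long_cpus_in_use + cpus > LONG_TASK_CPU_CAP:
                return False               # long tasks may not exceed their cap
            self.free_cpus -= cpus
            if is_long:
                self.long_cpus_in_use += cpus
            return True

    node = NodeState()
    print(node.offer())                    # {'cpus': 4, 'long_task_cpus': 2}
    print(node.launch(2, is_long=True))    # True: fills the long-task budget
    print(node.launch(1, is_long=True))    # False: would exceed the cap
    print(node.launch(1, is_long=False))   # True: short tasks can use any resources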
Of course, a framework may launch a long task without marking it as such. In this case, Mesos will eventually revoke it, as we discuss next.

3.3.3 Revocation

As described earlier, in an environment with fine-grained tasks, Mesos can reallocate resources quickly by simply waiting for tasks to finish. However, if a cluster becomes filled by long tasks, e.g., due to a buggy job or a greedy framework, Mesos can also revoke (kill) tasks. Before killing a task, Mesos gives its framework a grace period to clean it up. Mesos asks the respective executor to kill the task, but kills the entire executor and all its tasks if it does not respond to the request. We leave it up to the allocation module to implement the policy for revoking tasks, but describe two related mechanisms here.

First, while killing a task has a low impact on many frameworks (e.g., MapReduce or stateless web servers), it is harmful for frameworks with interdependent tasks (e.g., MPI). We allow these frameworks to avoid being killed by letting allocation modules expose a guaranteed allocation to each framework, a quantity of resources that the framework may hold without losing tasks. Frameworks read their guaranteed allocations through an API call. Allocation modules are responsible for ensuring that the guaranteed allocations they provide can all be met concurrently. For now, we have kept the semantics of guaranteed allocations simple: if a framework is below its guaranteed allocation, none of its tasks
should be killed, and if it is above, any of its tasks may be killed. However, if this model is found to be too simple, it is also possible to let frameworks specify priorities for their tasks, so that the allocation module can try to kill only low-priority tasks.

Second, to decide when to trigger revocation, allocation modules must know which frameworks would use more resources if they were offered them. Frameworks indicate their interest in offers through an API call.
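Under the simple semantics above, the revocation check reduces to: a framework at or below its guaranteed allocation keeps all of its tasks, and anything beyond the guarantee may be killed. A minimal sketch (hypothetical names, picking victims arbitrarily):

    # Sketch of the simple guaranteed-allocation semantics described above.
    # A framework at or below its guarantee keeps all tasks; above it, the
    # allocation module may pick victims (here: arbitrary excess tasks).

    def pick_revocation_victims(running_tasks, guaranteed_cpus):
        """running_tasks: list of (task_id, cpus) for one framework."""
        in_use = sum(cpus for _, cpus in running_tasks)
        victims = []
        for task_id, cpus in running_tasks:
            if in_use <= guaranteed_cpus:
                break                      # back within the guarantee; stop killing
            victims.append(task_id)
            in_use -= cpus
        return victims

    tasks = [("t1", 2), ("t2", 2), ("t3", 2)]
    print(pick_revocation_victims(tasks, guaranteed_cpus=4))   # ['t1']: 6 - 2 = 4 <= 4
    print(pick_revocation_victims(tasks, guaranteed_cpus=8))   # []: below its guarantee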

3.4 Isolation

Mesos provides performance isolation between framework executors running on the same slave by leveraging existing OS isolation mechanisms. Since these mechanisms are platform-dependent, we support multiple isolation mechanisms through pluggable isolation modules. In our current implementation, we use operating system container technologies, specifically Linux containers [10] and Solaris projects [16], to achieve isolation. These technologies can limit the CPU, physical memory, virtual memory, network bandwidth, and (in new Linux kernels) IO bandwidth usage of a process tree. In addition, they support dynamic reconfiguration of a container's resource limits, which is necessary for Mesos to be able to add and remove resources from an executor as it starts and finishes tasks. In the future, it would also be attractive to use virtual machines as containers. However, we have not yet done this because current VM technologies add non-negligible overhead in data-intensive workloads and have limited support for dynamic reconfiguration.

3.5 Making Resource Offers Scalable and Robust

Because task scheduling in Mesos is a distributed process in which the master and framework schedulers communicate, it needs to be efficient and robust to failures. Mesos includes three mechanisms to help with this goal.

First, because some frameworks will always reject certain resources, Mesos lets them short-circuit the rejection process and avoid communication by providing filters to the master. We support two types of filters: "only offer nodes from list L" and "only offer nodes with at least R resources free". A resource that fails a filter is treated exactly like a rejected resource. By default, any resources rejected during an offer have a temporary 5-second filter placed on them, to minimize the programming burden on developers who do not wish to manually set filters.

Second, because a framework may take time to respond to an offer, Mesos counts resources offered to a framework towards its share of the cluster for the purpose of allocation. This is a strong incentive for frameworks to respond to offers quickly and to filter out resources that they cannot use, so that they can get offers for more suitable resources faster.

Third, if a framework has not responded to an offer for a sufficiently long time, Mesos rescinds the offer and re-offers the resources to other frameworks.

We also note that even without the use of filters, Mesos can make tens of thousands of resource offers per second, because the scheduling algorithm it must perform (fair sharing) is highly efficient.

  Scheduler callbacks: resource_offer(offer_id, offers); offer_rescinded(offer_id); status_update(task_id, status); slave_lost(slave_id)
  Scheduler actions: reply_to_offer(offer_id, tasks, needs_more_offers); request_offers(); set_filters(filters); get_safe_share(); kill_task(task_id)
  Executor callbacks: launch_task(task_descriptor); kill_task(task_id)
  Executor actions: send_status(task_id, status)

Table 1: Mesos API functions for schedulers and executors. The "callback" entries list functions that frameworks must implement, while "actions" are operations that they can invoke.

3.5.1 API Summary

Table 1 summarizes the Mesos API. The only function that we have not yet explained is the kill_task function that a scheduler may call to kill one of its tasks. This is useful for frameworks that implement backup tasks [25].
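Putting the scheduler half of Table 1 together, a skeleton framework scheduler might look like the sketch below. Only the callback and action names come from Table 1; the driver object, argument shapes, and filter format are illustrative assumptions rather than the real language bindings.

    # Skeleton framework scheduler built around the callback/action names in
    # Table 1. The driver object, data shapes, and filter format are assumptions.

    class MyScheduler:
        def __init__(self, preferred_nodes):
            self.preferred_nodes = set(preferred_nodes)

        def resource_offer(self, driver, offer_id, offers):
            tasks = []
            for slot in offers:              # assume each entry: {"slave", "cpus", "mem_mb"}
                if slot["slave"] in self.preferred_nodes and slot["cpus"] >= 1:
                    tasks.append({"task_id": "task-" + slot["slave"],
                                  "slave": slot["slave"], "cpus": 1, "mem_mb": 1024})
            # Launch tasks on the resources we want; everything else counts as rejected.
            driver.reply_to_offer(offer_id, tasks, needs_more_offers=True)
            # "Only offer nodes from list L": avoid being re-offered nodes we never use.
            driver.set_filters({"only_nodes": sorted(self.preferred_nodes)})

        def offer_rescinded(self, driver, offer_id):
            pass                             # the master timed the offer out

        def status_update(self, driver, task_id, status):
            print(task_id, status)

        def slave_lost(self, driver, slave_id):
            self.preferred_nodes.discard(slave_id)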
3.6 Fault Tolerance

Since the master is a centerpiece of our architecture, we have made it fault-tolerant by pushing state to slaves and schedulers, making the master's state soft state. At runtime, multiple Mesos masters run simultaneously, but only one master is the leader. The other masters act as hot standbys ready to take over if the current leader fails. ZooKeeper [4] is used to implement leader election among masters. Schedulers and slaves also find out about the current leader through ZooKeeper. Upon the failure of the master, the slaves and schedulers connect to the newly elected master and help restore its state. We also ensure that messages sent to a failed master are resent to the new one through a thin communication layer that uses sequence numbers and retransmissions.

Aside from handling master failures, Mesos reports task, slave, and executor failures to frameworks' schedulers. Frameworks can then react to failures using policies of their choice.

Finally, to deal with scheduler failures, Mesos can be extended to allow a framework to register multiple schedulers such that if one fails, another is notified by the Mesos master and takes over. Frameworks must use their own mechanisms to share state between their schedulers.
4 Mesos Behavior

In this section, we study Mesos's behavior for different workloads. In short, we find that Mesos performs very well when frameworks can scale up and down elastically, task durations are homogeneous, and frameworks prefer all nodes equally. When different frameworks prefer different nodes, we show that Mesos can emulate a centralized scheduler that uses fair sharing across frameworks. In addition, we show that Mesos can handle heterogeneous task durations without impacting the performance of frameworks with short tasks. We also discuss how frameworks are incentivized to improve their performance under Mesos, and argue that these incentives also improve overall cluster utilization. Finally, we conclude this section with some limitations of Mesos's distributed scheduling model.

4.1 Definitions, Metrics and Assumptions

In our discussion, we consider three metrics:

• Framework ramp-up time: time it takes a new framework to achieve its allocation (e.g., fair share);

• Job completion time: time it takes a job to complete, assuming one job per framework;

• System utilization: total cluster utilization.

We characterize workloads along four attributes:

• Scale up: Frameworks can elastically increase their allocation to take advantage of free resources.

• Scale down: Frameworks can relinquish resources without significantly impacting their performance.

• Minimum allocation: Frameworks require a certain minimum number of slots before they can start using their slots.

• Task distribution: The distribution of the task durations. We consider both homogeneous and heterogeneous distributions.

We also differentiate between two types of resources: required and preferred. We say that a resource is required if a framework must acquire it in order to run. For example, we consider a graphical processing unit (GPU) a required resource if there are frameworks that must use a GPU to run. Likewise, we consider a machine with a public IP address a required resource if there are frameworks that need to be externally accessible. In contrast, we say that a resource is preferred if a framework will perform "better" using it, but can also run using other resources. For example, a framework may prefer using a node that locally stores its data, but it can remotely access the data from other nodes if it must.

We assume that all required resources are explicitly allocated, i.e., every type of required resource is an element in the resource vector in resource offers. Furthermore, provided a framework never uses more than its guaranteed allocation of required resources, explicitly accounting for these resources ensures that frameworks will not get deadlocked waiting for them to become available.

For simplicity, in the remainder of this section we assume that all tasks run on identical slices of machines, which we call slots. Also, unless otherwise specified, we assume that each framework runs a single job.

Next, we use this simple model to estimate the framework ramp-up time, the job completion time, and the job resource utilization. We emphasize that our goal here is not to develop a detailed and exact model of the system, but to provide a coarse understanding of the system behavior.

4.2 Homogeneous Tasks

In the rest of this section we consider a task duration distribution with mean Ts. For simplicity, we present results for two distributions: constant task sizes and exponentially distributed task sizes. We consider a cluster with n slots and a framework, f, that is entitled to k slots. We assume the framework runs a job which requires βkTs computation time. Thus, assuming the framework has k slots, it takes the job βTs time to finish. When computing the completion time of a job we assume that the last tasks of the job running on the framework's k slots finish at the same time. Thus, we relax the assumptions about the duration distribution for the tasks of framework f. This relaxation does not impact any of the other metrics, i.e., ramp-up time and utilization.

We consider two types of frameworks: elastic and rigid. An elastic framework can scale its resources up and down, i.e., it can start using slots as soon as it acquires them, and can release slots as soon as its tasks finish. In contrast, a rigid framework can start running its jobs only after it has allocated all its slots, i.e., a rigid framework is a framework where the minimum allocation is equal to its full allocation.

Table 2 summarizes the job completion times and the utilization for the two types of frameworks and for the two types of task length distributions. We discuss each case next.

4.2.1 Elastic Frameworks

An elastic framework can opportunistically use any slot offered by Mesos, and can relinquish slots without significantly impacting the performance of its jobs. We assume there are exactly k slots in the system that framework f prefers, and that f waits for these slots to become available to reach its allocation.

Framework ramp-up time If task durations are constant, it will take framework f at most Ts time to acquire k slots. This is simply because during a Ts interval, every slot will become available, which will enable Mesos
                     Elastic framework                        Rigid framework
                     Constant dist.    Exponential dist.      Constant dist.    Exponential dist.
  Completion time    (1/2 + β)Ts       (1 + β)Ts              (1 + β)Ts         (ln k + β)Ts
  Utilization        1                 1                      β/(1/2 + β)       β/(ln k − 1 + β)

Table 2: The job completion time and utilization for both elastic and rigid frameworks, and for both constant task durations and task durations that follow an exponential distribution. The framework starts with zero slots. k represents the number of slots the framework is entitled to under the given allocation, and βTs represents the time it takes a job to complete assuming the framework gets all k slots at once.

to offer the framework all its k preferred slots.

If the duration distribution is exponential, the expected ramp-up time is Ts ln k. The framework needs to wait on average Ts/k to acquire the first slot from the set of its k preferred slots, Ts/(k−1) to acquire the second slot from the remaining k−1 slots in the set, and Ts to acquire the last slot. Thus, the ramp-up time of f is

  Ts × (1 + 1/2 + · · · + 1/k) ≈ Ts ln k.    (1)

Job completion time Recall that βTs is the completion time of the job in an ideal scenario in which the framework acquires all its k slots instantaneously. If task durations are constant, the completion time is on average (1/2 + β)Ts. To show this, assume the starting and the ending times of the tasks are uniformly distributed, i.e., during the ramp-up phase, f acquires one slot every Ts/k on average. Thus, the framework's job can use roughly Ts·k/2 computation time during the first Ts interval. Once the framework acquires its k slots, it will take the job (βkTs − Ts·k/2)/k = (β − 1/2)Ts time to complete. As a result the job completion time is Ts + (β − 1/2)Ts = (1/2 + β)Ts (see Table 2).

In the case of the exponential distribution, the expected completion time of the job is Ts(1 + β) (see Table 2). Consider the ideal scenario in which the framework acquires all k slots instantaneously. Next, we compute how much computation time the job "loses" during the ramp-up phase compared to this ideal scenario. While the framework waits Ts/k to acquire the first slot, in the ideal scenario the job would have already used each of the k slots for a total of k × Ts/k = Ts time. Similarly, while the framework waits Ts/(k−1) to acquire the second slot, in the ideal scenario the job would have used the k−1 slots (still to be allocated) for another (k−1) × Ts/(k−1) = Ts time. In general, the framework loses Ts computation time while waiting to acquire each slot, and a total of kTs computation time during the entire ramp-up phase. To account for this loss, the framework needs to use all k slots for an additional Ts time, which increases the expected job completion time by Ts to (1 + β)Ts.

System utilization As long as frameworks can scale up and down and there is enough demand in the system, the cluster will be fully utilized.

4.2.2 Rigid Frameworks

Some frameworks may not be able to start running jobs unless they reach a minimum allocation. One example is MPI, where all tasks must start a computation in sync. In this section we consider the worst case where the minimum allocation constraint equals the framework's full allocation, i.e., k slots.

Job completion time While in this case the ramp-up time remains unchanged, the job completion time will change because the framework cannot use any slot before reaching its full allocation. If the task duration distribution is constant, the completion time is simply Ts(1 + β), as the framework doesn't use any slot during the first Ts interval, i.e., until it acquires all k slots. If the distribution is exponential, the completion time becomes Ts(ln k + β), as it takes the framework Ts ln k to ramp up (see Eq. 1).

System utilization Wasting allocated slots also has a negative impact on the utilization. If the task durations are constant, and the framework acquires a slot every Ts/k on average, the framework will waste roughly Ts·k/2 computation time during the ramp-up phase. Once it acquires all slots, the framework will use βkTs to complete its job. The utilization achieved by the framework in this case is then βkTs/(kTs/2 + βkTs) ≈ β/(1/2 + β).

If the task distribution is exponential, the expected computation time wasted by the framework is Ts(k ln(k−1) − (k−1)). The framework acquires the first slot after waiting Ts/k, the second slot after waiting Ts/(k−1), and the last slot after waiting Ts time. Since the framework does not use a slot before acquiring all of them, it follows that the first acquired slot is idle for ∑_{i=1}^{k−1} Ts/i, the second slot is idle for ∑_{i=1}^{k−2} Ts/i, and the next-to-last slot is idle for Ts time. As a result, the expected computation time wasted by the framework during the ramp-up phase is

  Ts × ∑_{i=1}^{k−1} i/(k−i)
    = Ts × ∑_{i=1}^{k−1} (k/(k−i) − 1)
    = Ts × k × ∑_{i=1}^{k−1} 1/(k−i) − Ts × (k−1)
    ≈ Ts × k × ln(k−1) − Ts × (k−1)
    = Ts × (k ln(k−1) − (k−1)).
Assuming k ≫ 1, the utilization achieved by the framework is βkTs/((k ln(k−1) − (k−1))Ts + βkTs) ≈ β/(ln k − 1 + β) (see Table 2).
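The Table 2 entries are straightforward to sanity-check numerically. The Monte Carlo sketch below follows the model of this section (Ts = 1, k preferred slots, each slot freeing once per task interval, and work divisible so that all held slots finish together); it is a check of the formulas, not a simulation of a real cluster.

    # Monte Carlo sanity check of the ramp-up and completion times behind Table 2.
    # The model follows this section (Ts = 1, k preferred slots, each slot becomes
    # free once per task interval, work divisible so held slots finish together).
    import random

    def slot_free_times(k, exponential):
        if exponential:
            return [random.expovariate(1.0) for _ in range(k)]   # mean Ts = 1
        return [random.random() for _ in range(k)]               # constant tasks, uniform phase

    def trial(k, beta, exponential):
        frees = slot_free_times(k, exponential)
        ramp_up = max(frees)
        # Elastic: beta*k*Ts of work spread so that all k slots finish together:
        #   sum_i (T - free_i) = beta*k  =>  T = beta + mean(free_i)
        elastic = beta + sum(frees) / k
        # Rigid: no slot does useful work before all k slots are held.
        rigid = ramp_up + beta
        return ramp_up, elastic, rigid

    k, beta, n = 100, 2.0, 2000
    for exp_dist in (False, True):
        r, e, g = (sum(v) / n for v in zip(*(trial(k, beta, exp_dist) for _ in range(n))))
        print("exponential" if exp_dist else "constant   ",
              "ramp-up %.2f  elastic completion %.2f  rigid completion %.2f" % (r, e, g))

    # Expected from Table 2 with Ts = 1, beta = 2, k = 100:
    #   constant:    ramp-up <= 1; elastic (1/2 + beta) = 2.5; rigid (1 + beta) = 3
    #   exponential: ramp-up is the harmonic number H_k (which Eq. 1 approximates
    #                as ln k, about 5.2 here); elastic (1 + beta) = 3; rigid H_k + beta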
4.2.3 Placement Preferences

So far, we have assumed that frameworks have no slot preferences. In practice, different frameworks prefer different nodes and their preferences may change over time. In this section, we consider the case where frameworks have different preferred slots.

The natural question is how well Mesos will work in this case when compared to a centralized scheduler that has full information about framework preferences. We consider two cases: (a) there exists a system configuration in which each framework gets all its preferred slots and achieves its full allocation, and (b) there is no such configuration, i.e., the demand for preferred slots exceeds the supply.

In the first case, it is easy to see that, irrespective of the initial configuration, the system will converge to the state where each framework allocates its preferred slots after at most one Ts interval. This is simple because during a Ts interval all slots become available, and as a result each framework will be offered its preferred slots.

In the second case, there is no configuration in which all frameworks can satisfy their preferences. The key question in this case is how one should allocate the preferred slots across the frameworks demanding them. In particular, assume there are x slots preferred by m frameworks, where framework i requests ri such slots, and ∑_{i=1}^{m} ri > x. While many allocation policies are possible, here we consider the weighted fair allocation policy where the weight associated with a framework is its intended allocation, si. In other words, assuming that each framework has enough demand, framework i will get x × si/(∑_{j=1}^{m} sj).

As an example, consider 30 slots that are preferred by two frameworks with intended allocations s1 = 200 slots and s2 = 100 slots, respectively. Assume that framework 1 requests r1 = 40 of the preferred slots, while framework 2 requests r2 = 20 of these slots. Then, the preferred slots are allocated according to weighted fair sharing, i.e., framework 1 receives x × s1/(s1 + s2) = 20 slots, and framework 2 receives the remaining 10 slots. If the demand of framework 1 is less than its share, e.g., r1 = 15, then framework 1 receives 15 slots (which satisfies its demand), and framework 2 receives the remaining 15 slots.

The challenge with Mesos is that the scheduler does not know the preferences of each framework. Fortunately, it turns out that there is an easy way to achieve the fair allocation of the preferred slots described above: simply offer slots to frameworks proportionally to their intended allocations. In particular, when a slot becomes available, Mesos offers that slot to framework i with probability si/(∑_{j=1}^{n} sj), where n is the total number of frameworks in the system. Note that this scheme is similar to lottery scheduling [45]. Furthermore, note that since each framework i receives roughly si slots during a time interval Ts, the analysis of the ramp-up and completion times in Section 4.2 still holds.

There are also allocation policies other than fair sharing that can be implemented without knowledge of framework preferences. For example, a strict priority scheme (e.g., where framework 1 must always get priority over framework 2) can be implemented by always offering resources to the high-priority framework first.
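The proportional-offer rule above is just a lottery over intended allocations. A short sketch using the s1 = 200, s2 = 100 example (the helper name is hypothetical):

    # Sketch: offer each freed slot to framework i with probability s_i / sum(s_j),
    # as described above; a lottery over intended allocations. Illustrative only.
    import random
    from collections import Counter

    intended = {"framework1": 200, "framework2": 100}

    def pick_framework(intended):
        names, weights = zip(*intended.items())
        return random.choices(names, weights=weights, k=1)[0]

    # Over many freed slots the offers split roughly 2:1, matching the weighted
    # fair shares (20 vs. 10) of the 30 preferred slots in the example above.
    print(Counter(pick_framework(intended) for _ in range(30000)))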
4.3 Heterogeneous Tasks

So far we have assumed that frameworks have homogeneous task duration distributions. In this section, we discuss heterogeneous tasks, in particular, tasks that are short and long, where the mean duration of the long tasks is significantly longer than the mean of the short tasks. We show that by sharing nodes in space as discussed in Section 3.3.2, Mesos can accommodate long tasks without impacting the performance of short tasks.

A heterogeneous workload can hurt frameworks with short tasks by increasing the time it takes such frameworks to reach their allocation and to acquire preferred slots (relative to the frameworks' total job durations). In particular, a node preferred by a framework with short tasks may become fully filled by long tasks, greatly increasing a framework's waiting time.

To address this problem, Mesos differentiates between short and long slots, and bounds the number of long slots on each node. This ensures there are enough short tasks on each node whose slots become available with high frequency, giving frameworks better opportunities to quickly acquire a slot on one of their preferred nodes. In addition, Mesos implements a revocation mechanism that does not differentiate between long and short tasks once a framework exceeds its allocation. This makes sure that the excess slots in the system (i.e., the slots allocated by frameworks beyond their intended allocations) are all treated as short slots. As a result, a framework under its intended allocation can acquire slots as fast as in a system where all tasks are short. At the same time, Mesos ensures that frameworks running their tasks on long slots and not exceeding their guaranteed allocations won't have their tasks revoked.

4.4 Framework Incentives

Mesos implements a decentralized scheduling approach, where each framework decides which offers to accept or reject. As with any decentralized system, it is important to understand the incentives of various entities in the system. In this section, we discuss the incentives of a framework to improve the response times of its jobs.
Short tasks: A framework is incentivized to use short tasks for two reasons. First, it will be able to allocate any slots; in contrast, frameworks with long tasks are restricted to a subset of slots. Second, using small tasks minimizes the wasted work if the framework loses a task, either due to revocation or simply due to failures.

No minimum allocation: The ability of a framework to use resources as soon as it allocates them, instead of waiting to reach a given minimum allocation, would allow the framework to start (and complete) its jobs earlier. Note that the lack of a minimum allocation constraint implies the ability of the framework to scale up, while the reverse is not true, i.e., a framework may have both a minimum allocation requirement and the ability to allocate and use resources beyond this minimum allocation.

Scale down: The ability to scale down allows a framework to opportunistically grab available resources, as it can later release them with little negative impact.

Do not accept unknown resources: Frameworks are incentivized not to accept resources that they cannot use, because most allocation policies will account for all the resources that a framework owns when deciding which framework to offer resources to next.

We note that these incentives are all well aligned with our goal of improving utilization. When frameworks use short tasks, Mesos can reallocate resources quickly between them, reducing the need for wasted work due to revocation. If frameworks have no minimum allocation and can scale up and down, they will opportunistically utilize all the resources they can obtain. Finally, if frameworks do not accept resources that they do not understand, they will leave them for frameworks that do.

4.5 Limitations of Distributed Scheduling

Although we have shown that distributed scheduling works well in a range of workloads relevant to current cluster environments, like any decentralized approach, it can perform worse than a centralized scheduler. We have identified three limitations of the distributed model:

Fragmentation: When tasks have heterogeneous resource demands, a distributed collection of frameworks may not be able to optimize bin packing as well as a centralized scheduler.

There is another possible bad outcome if allocation modules reallocate resources in a naive manner: when a cluster is filled by tasks with small resource requirements, a framework f with large resource requirements may starve, because whenever a small task finishes, f cannot accept the resources freed up by it, but other frameworks can. To accommodate frameworks with large per-task resource requirements, allocation modules can support a minimum offer size on each slave, and abstain from offering resources on that slave until this minimum amount is free.

Note that the wasted space due to both suboptimal bin packing and fragmentation is bounded by the ratio between the largest task size and the node size. Therefore, clusters running "larger" nodes (e.g., multicore nodes) and "smaller" tasks within those nodes (e.g., having a cap on task resources) will be able to achieve high utilization even with a distributed scheduling model.

Interdependent framework constraints: It is possible to construct scenarios where, because of esoteric interdependencies between frameworks' performance, only a single global allocation of the cluster resources performs well. We argue such scenarios are rare in practice. In the model discussed in this section, where frameworks only have preferences over placement, we showed that allocations approximate those of optimal schedulers.

Framework complexity: Using resource offers may make framework scheduling more complex. We argue, however, that this difficulty is not in fact onerous. First, whether using Mesos or a centralized scheduler, frameworks need to know their preferences; in a centralized scheduler, the framework would need to express them to the scheduler, whereas in Mesos, it needs to use them to decide which offers to accept. Second, many scheduling policies for existing frameworks are online algorithms, because frameworks cannot predict task times and must be able to handle node failures and slow nodes. These policies are easily implemented using the resource offer mechanism.

5 Implementation

We have implemented Mesos in about 10,000 lines of C++. The system runs on Linux, Solaris and Mac OS X. Mesos applications can be programmed in C, C++, Java, Ruby and Python. We use SWIG [15] to generate interface bindings for the latter three languages.

To reduce the complexity of our implementation, we use a C++ library called libprocess [8] that provides an actor-based programming model using efficient asynchronous I/O mechanisms (epoll, kqueue, etc.). We also leverage Apache ZooKeeper [4] to perform leader election, as described in Section 3.6. Finally, our current frameworks use HDFS [2] to share data.

Our implementation can use Linux containers [10] or Solaris projects [16] to isolate applications. We currently isolate CPU cores and memory.³

³ Support for network and IO isolation was recently added to the Linux kernel [9] and we plan to extend our implementation to isolate these resources too.

We have implemented five frameworks on top of Mesos. First, we have ported three existing cluster systems to Mesos: Hadoop [2], the Torque resource
scheduler [42], and the MPICH2 implementation of MPI [22]. None of these ports required changing these frameworks' APIs, so all of them can run unmodified user programs. In addition, we built a specialized framework called Spark, which we discuss in Section 5.3. Finally, to show the applicability of running front-end frameworks on Mesos, we have developed a framework that manages an elastic web server farm, which we discuss in Section 5.4.

5.1 Hadoop Port

Porting Hadoop to run on Mesos required relatively few modifications, because Hadoop concepts such as map and reduce tasks correspond cleanly to Mesos abstractions. In addition, the Hadoop "master", known as the JobTracker, and Hadoop "slaves", known as TaskTrackers, naturally fit into the Mesos model as a framework scheduler and executor.

We first modified the JobTracker to schedule MapReduce tasks using resource offers. Normally, the JobTracker schedules tasks in response to heartbeat messages sent by TaskTrackers every few seconds, reporting the number of free slots in which "map" and "reduce" tasks can be run. The JobTracker then assigns (preferably data-local) map and reduce tasks to TaskTrackers. To port Hadoop to Mesos, we reuse the heartbeat mechanism as described, but dynamically change the number of possible slots on the TaskTrackers. When the JobTracker receives a resource offer, it decides if it wants to run any map or reduce tasks on the slaves included in the offer, using delay scheduling [48]. If it does, it creates a new Mesos task for the slave that signals the TaskTracker (i.e., the Mesos executor) to increase its number of total slots. The next time the TaskTracker sends a heartbeat, the JobTracker can assign the runnable map or reduce task to the new empty slot. When a TaskTracker finishes running a map or reduce task, the TaskTracker decrements its slot count (so the JobTracker doesn't keep sending it tasks) and reports the Mesos task as finished to allow other frameworks to use those resources.

We also needed to change how map output data is served to reduce tasks. Hadoop normally writes map output files to the local filesystem, then serves these to reduce tasks using an HTTP server included in the TaskTracker. However, the TaskTracker within Mesos runs as an executor, which may be terminated if it is not running tasks, which would make map output files unavailable to reduce tasks. We solved this problem by providing a shared file server on each node in the cluster to serve local files. Such a service is useful beyond Hadoop, to other frameworks that write data locally on each node.

In total, we added 1,100 lines of new code to Hadoop.
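In outline, the JobTracker-side logic is: for each offered slot, use delay scheduling to decide whether any job has (or is now willing to accept) work there, and if so launch a Mesos task whose only effect is to raise that TaskTracker's slot count. The sketch below is a simplified Python illustration of that decision; the actual port is Java code inside the JobTracker, and names such as has_local_task and last_local_launch are hypothetical.

    # Simplified sketch of the Hadoop port's offer handling. The real code lives
    # inside the Java JobTracker; job attributes used here are hypothetical.

    DELAY = 5.0    # seconds a job may wait for a data-local slot (delay scheduling)

    def handle_offer(offer, jobs, now):
        """offer: {"slave": host, "slots": n}. Returns Mesos tasks that add slots."""
        mesos_tasks = []
        for _ in range(offer["slots"]):
            for job in jobs:
                if job.has_local_task(offer["slave"]):
                    chosen = job                  # data-local work available here
                    break
                if now - job.last_local_launch > DELAY:
                    chosen = job                  # waited long enough; take a non-local slot
                    break
            else:
                continue                          # no job wants this slot; leave it unaccepted
            # Launching this Mesos task signals the TaskTracker (the Mesos executor)
            # to raise its slot count; the next heartbeat then pulls real work into it.
            mesos_tasks.append({"slave": offer["slave"], "action": "add_slot",
                                "job": chosen.job_id})
        return mesos_tasks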
5.2 Torque and MPI Ports

We have ported the Torque cluster resource manager to run as a framework on Mesos. The framework consists of a Mesos scheduler "wrapper" and a Mesos executor "wrapper", written in 360 lines of Python, that invoke different components of Torque as appropriate. In addition, we had to modify 3 lines of Torque source code in order to allow it to elastically scale up and down on Mesos depending on the jobs in its queue.

The scheduler wrapper first configures and launches a Torque server and then periodically monitors the server's job queue. While the queue is empty, the scheduler wrapper refuses all resource offers it receives. Once a job gets added to Torque's queue (using the standard qsub command), the scheduler wrapper informs the Mesos master that it can receive new offers. As long as there are jobs in Torque's queue, the scheduler wrapper accepts as many offers as needed to satisfy the constraints of its jobs. When the executor wrapper is launched, it starts a Torque backend daemon that registers with the Torque server. When enough Torque backend daemons have registered, the Torque server will launch the first job in the queue. In the event that Torque is running an MPI job, the executor wrapper will also launch the necessary MPI daemon processes.

Because jobs that run on Torque are typically not resilient to failures, Torque never accepts resources beyond its guaranteed allocation, to avoid having its tasks revoked. The scheduler wrapper will accept up to its guaranteed allocation whenever it can take advantage of those resources, such as by running multiple jobs simultaneously.

In addition to the Torque framework, we also created a Mesos "wrapper" framework, written in about 200 lines of Python code, for running MPI jobs directly on Mesos.
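The core of the scheduler wrapper's decision is small: refuse offers while Torque's queue is empty, and otherwise accept resources up to the guaranteed allocation, launching the executor wrapper on each accepted slot. A condensed sketch follows; the helper standing in for the call into Torque and the 48-core guarantee (taken from the experiment in Section 6.1) are assumptions.

    # Condensed sketch of the Torque scheduler wrapper's offer logic. The helper
    # torque_queue_length() stands in for the real call into Torque; only the
    # overall behavior follows the description above.

    GUARANTEED_CPUS = 48            # never grow past the guaranteed allocation

    class TorqueWrapperScheduler:
        def __init__(self):
            self.cpus_in_use = 0

        def resource_offer(self, driver, offer_id, offers):
            tasks = []
            want_more = torque_queue_length() > 0      # jobs waiting in the qsub queue?
            if want_more:
                for slot in offers:
                    if self.cpus_in_use + slot["cpus"] > GUARANTEED_CPUS:
                        break                           # stay within the guarantee
                    # Each accepted slot runs the executor wrapper, which starts a
                    # Torque backend daemon that registers with the Torque server.
                    tasks.append({"slave": slot["slave"], "cpus": slot["cpus"],
                                  "command": "launch_backend_daemon"})
                    self.cpus_in_use += slot["cpus"]
            # An empty task list refuses the offer; want_more tells the master
            # whether to keep sending offers while jobs remain queued.
            driver.reply_to_offer(offer_id, tasks, needs_more_offers=want_more)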
5.3 Spark Framework

To show the value of simple but specialized frameworks, we built Spark, a new framework for iterative jobs that was motivated by discussions with machine learning researchers at our institution.

One iterative algorithm used frequently in machine learning is logistic regression [11]. An implementation of logistic regression in Hadoop must run each iteration as a separate MapReduce job, because each iteration depends on values computed in the previous round. In this case, every iteration must re-read the input file from disk into memory. In Dryad, the whole job can be expressed as a data flow DAG as shown in Figure 4a, but the data must still be reloaded from disk into memory at each iteration. Reusing the data in memory between iterations in Dryad would require cyclic data flow.

Spark's execution is shown in Figure 4b. Spark uses the long-lived nature of Mesos executors to cache a slice of the data set in memory at each executor, and then run multiple iterations on this cached data. This caching is
achieved in a fault-tolerant manner: if a node is lost, Spark remembers how to recompute its slice of the data.

Spark leverages Scala to provide a language-integrated syntax similar to DryadLINQ [47]: users invoke parallel operations by applying a function on a special "distributed dataset" object, and the body of the function is captured as a closure to run as a set of tasks in Mesos. Spark then schedules these tasks to run on executors that already have the appropriate data cached, using delay scheduling. By building on top of Mesos, Spark's implementation only required 1300 lines of code.

Due to lack of space, we have limited our discussion of Spark in this paper and refer the reader to [49] for details.

[Figure 4: Data flow of a logistic regression job in Dryad vs. Spark. Solid lines show data flow within the framework. Dashed lines show reads from a distributed file system. Spark reuses processes across iterations, only loading data once.]
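The cache-and-iterate pattern in Figure 4b is what makes iterations after the first one cheap: each long-lived executor parses its slice of the input once and reuses the in-memory copy for every subsequent gradient step. The toy sketch below illustrates the pattern in Python with synthetic data; it is not the Spark API (which is Scala), just the shape of the computation.

    # Toy illustration of the cache-and-iterate pattern described above (not the
    # Spark API): parse each data slice once, keep it in an executor-local cache,
    # and run every gradient step against the cached copy.
    import math, random

    CACHE = {}                      # stands in for data cached inside a long-lived executor

    def load_partition(part_id):
        if part_id not in CACHE:    # "read + parse" cost is paid only on the first iteration
            random.seed(part_id)
            CACHE[part_id] = [([random.gauss(0, 1), random.gauss(0, 1)],
                               random.choice([-1.0, 1.0])) for _ in range(1000)]
        return CACHE[part_id]

    def partition_gradient(part_id, w):
        grad = [0.0, 0.0]
        for x, y in load_partition(part_id):
            s = (1.0 / (1.0 + math.exp(-y * sum(wi * xi for wi, xi in zip(w, x)))) - 1.0) * y
            grad[0] += s * x[0]
            grad[1] += s * x[1]
        return grad

    w = [0.0, 0.0]
    for iteration in range(10):                       # w is shipped to the tasks each round
        grads = [partition_gradient(p, w) for p in range(4)]
        w = [wi - 0.001 * sum(g[d] for g in grads) for d, wi in enumerate(w)]
    print(w)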
5.4 Elastic Web Server Farm

We built an elastic web farm framework that takes advantage of Mesos to scale up and down based on external load. Similar to the Torque framework, the web farm framework uses a scheduler "wrapper" and executor "wrapper". The scheduler wrapper launches an haproxy [6] load balancer and periodically monitors its web request statistics to decide when to launch or tear down servers. Its only scheduling constraint is that it will launch at most one Apache instance per machine, and then set a filter to stop receiving further offers for that machine. The wrappers are implemented in 250 lines of Python.
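The scaling decision itself reduces to comparing the observed sessions per second per server against a target (150 sessions/second per server in the experiment of Section 6.5) and either asking for one more Apache instance or tearing one down. In the sketch below, the helper is hypothetical and the lower teardown threshold is an assumed hysteresis rather than a value from the paper.

    # Sketch of the web farm wrapper's scaling decision. The per-server target
    # comes from the evaluation (Section 6.5); the teardown threshold is an
    # assumed hysteresis, and the function name is illustrative.

    TARGET_SESSIONS_PER_SERVER = 150

    def scaling_decision(total_sessions_per_sec, num_servers):
        per_server = total_sessions_per_sec / max(num_servers, 1)
        if per_server > TARGET_SESSIONS_PER_SERVER:
            return "launch"           # accept an offer on a new machine (one Apache per node)
        if num_servers > 1 and per_server < TARGET_SESSIONS_PER_SERVER / 2:
            return "teardown"         # release a server and its resources back to Mesos
        return "hold"

    print(scaling_decision(900, 5))   # 180 sessions/s per server -> "launch"
    print(scaling_decision(300, 5))   # 60 per server -> "teardown"
    print(scaling_decision(700, 5))   # 140 per server -> "hold"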
6 Evaluation

We evaluated Mesos by performing a series of experiments using Amazon's EC2 environment.

6.1 Macrobenchmark

To evaluate the primary goal of Mesos, which is enabling multiple diverse frameworks to efficiently share the same cluster, we ran a macrobenchmark consisting of a mix of fine- and coarse-grained frameworks: three Hadoop frameworks and a single Torque framework. We submitted a handful of different-sized jobs to each of these four frameworks. Within the Hadoop frameworks, we ran back-to-back WordCount MapReduce jobs, and within the Torque framework we submitted jobs running the High-Performance LINPACK [19] benchmark (HPL).

This experiment was performed using 50 EC2 instances, each with 4 CPU cores and 15 GB RAM. Figure 5 shows node utilization over time. The guaranteed allocation for the Torque framework was 48 cores (1/4 of the cluster). Eight MPI jobs were launched, each using 8 nodes, beginning around time 80s. The Torque framework launches tasks that use only one core. Six of the HPL jobs were able to run as soon as they were submitted, bringing the Torque framework up to its guaranteed allocation of 48 cores. Thereafter, two of them were placed in the job queue and were launched as soon as the first and second HPL jobs finished (around time 315s and 345s). The Hadoop instances' tasks fill in any available remaining slots, keeping cluster utilization at 100%.

[Figure 5: Cluster utilization over time on a 50-node cluster with three Hadoop frameworks each running back-to-back MapReduce jobs, plus one Torque framework running a mix of 8 node and 24 node HPL MPI benchmark jobs.]

6.2 Overhead

To measure the overhead Mesos imposes on existing cluster computing frameworks, we ran two benchmarks
using MPI (running directly on Mesos rather than through Torque) and Hadoop. These experiments were performed using 50 EC2 instances, each with 2 CPU cores and 6.5 GB RAM. We used the High-Performance LINPACK [19] benchmark for MPI and the WordCount workload for Hadoop. Table 3 shows average running times across three runs of MPI and Hadoop, both with and without Mesos. In both cases Mesos adds only a small overhead: less than 2% for LINPACK and about 4% for WordCount.

Application        Duration Alone   Duration on Mesos
MPI LINPACK        50.9s            51.8s
Hadoop WordCount   159.9s           166.2s

Table 3: Overhead of MPI and Hadoop benchmarks on Mesos.

6.3 Data Locality through Fine-Grained Sharing and Resource Offers
In this experiment, we demonstrate how Mesos’ resource offer mechanism enables frameworks to control their tasks’ placement, and in particular, data locality. We ran 16 instances of Hadoop using 93 EC2 instances, each with 4 CPU cores and 15 GB RAM. Each instance ran a map-only scan job that searched a 100 GB file spread throughout the cluster on a shared HDFS file system. Each job printed 1% of the input records as output. We tested four scenarios: giving each Hadoop instance its own 5-6 node static partition of the cluster (to emulate organizations that use coarse-grained sharing mechanisms), and all instances on Mesos using either no delay scheduling, 1s delay scheduling, or 5s delay scheduling.

Figure 6: Data locality and average job running times for 16 Hadoop instances on a 93-node cluster using static partitioning, Mesos, or Mesos with delay scheduling.
Figure 6 shows averaged measurements from all 16 Hadoop instances across 3 runs of each scenario. Using static partitioning yields very low (18%) data locality, because the Hadoop instances are forced to fetch data from nodes outside their partition. In contrast, running the Hadoop instances on Mesos improves data locality even without delay scheduling, because each Hadoop instance takes turns accessing all the nodes in the cluster. Adding a 1-second delay brings locality above 90%, and a 5-second delay achieves 95% locality, which is competitive with running Hadoop alone. As expected, the average performance of each Hadoop instance improves with locality: jobs run 1.7x faster in the 5s delay scenario than with static partitioning.
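Since the 1s and 5s scenarios hinge on delay scheduling [48], the sketch below illustrates the core idea in Python. The should_accept interface, the local_hosts set, and the wait-time bookkeeping are assumptions made for this illustration, not the Hadoop or Mesos implementation.

    # Minimal sketch of delay scheduling [48]: if an offered node does not hold
    # the task's input data, decline and wait up to `delay` seconds for a
    # data-local offer before falling back to running remotely.
    import time

    class DelayScheduler:
        def __init__(self, delay_seconds):
            self.delay = delay_seconds
            self.waiting_since = None   # when we first skipped a non-local offer

        def should_accept(self, offer_host, local_hosts):
            if offer_host in local_hosts:        # data-local offer: always take it
                self.waiting_since = None
                return True
            if self.waiting_since is None:
                self.waiting_since = time.time() # start the locality clock
            if time.time() - self.waiting_since >= self.delay:
                self.waiting_since = None        # waited long enough; run remotely
                return True
            return False                         # keep waiting for a local node

    # delay_seconds=0 accepts every offer (the "no delay scheduling" scenario);
    # delay_seconds=1 or 5 corresponds to the 1s and 5s scenarios in Figure 6.
    sched = DelayScheduler(delay_seconds=5)
    print(sched.should_accept("node7", local_hosts={"node3", "node9"}))  # False
    print(sched.should_accept("node3", local_hosts={"node3", "node9"}))  # True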

6.4 Benefit of Specialized Frameworks

We evaluated the benefit of running iterative jobs using the specialized Spark framework (Section 5.3) over the general-purpose Hadoop framework. We used a logistic regression job implemented in Hadoop by local machine learning researchers, and implemented a second version using Spark and Mesos. We ran each version separately on 20 EC2 instances, each with 4 CPU cores and 15 GB RAM. Each experiment used a 29 GB data file and varied the number of logistic regression iterations from 1 to 30 (see Figure 7).

Figure 7: Hadoop and Spark logistic regression running times.

With Hadoop, each iteration takes 127s on average, while with Spark the first iteration takes 174s but subsequent iterations take only about 6 seconds. The time to evaluate the function in each iteration is dominated by the time to read the input data from HDFS and parse it into floating-point numbers. Hadoop incurs this read/parsing cost on every iteration, while Spark reuses cached blocks of parsed data and incurs the cost only once (albeit using slower text-parsing routines), leading to a 10.7x speedup on 30 iterations (roughly 127s × 30 ≈ 3,800s for Hadoop versus 174s + 29 × 6s ≈ 350s for Spark).

6.5 Elastic Web Farm

To demonstrate an interactive framework dynamically scaling its resource usage, we ran an elastic web farm on Mesos. We used httperf [38] to generate increasing and then decreasing load on the web farm. As the average load on each server reaches 150 sessions/second, the elastic web farm framework signals to Mesos that it is willing to accept more resources and launches another Apache instance. We ran the experiments using 4 EC2 instances, each with 8 CPU cores and 6.5 GB RAM. Figure 8 shows that the web farm dynamically adapts the number of web servers to the offered load (sessions/second at the load balancer) to ensure that the load at each web server remains at or below 150 sessions/second. The brief drops in sessions per second at the load balancer were due to limitations in the current haproxy implementation, which required the framework to restart haproxy to increase or decrease the number of Apache servers.

Figure 8: The average session load on the load balancer, the average number of sessions across each web server, and the number of web servers running over time.

6.6 Mesos Scalability

To evaluate Mesos’ scalability, we emulated large clusters by using 99 Amazon EC2 instances, each with 8 CPU cores and 6 GB RAM. We used one EC2 instance for the master, with the remaining instances each running multiple slaves. During the experiment, each of 200 frameworks randomly distributed throughout the cluster continuously launches one task at a time on each slave from which it receives an offer. Each task sleeps for a period of time drawn from a normal distribution with a mean of 10 seconds and a standard deviation of 2, and then ends. We chose 10 seconds to help demonstrate Mesos’ scalability under high load; in practice, average task runtimes will be higher, yielding lower load on the master. Once the cluster reaches steady state (i.e., the 200 frameworks achieve their fair share and all cluster resources are in use), we launch a single framework from a random slave within the cluster that runs a single 10-second task. Note that each slave reports 8 CPU cores and 6 GB RAM, and each framework runs tasks that consume 4 CPU cores and 3 GB RAM. Figure 9 shows, averaged over 5 runs, the time to launch this single framework after reaching steady state.

Figure 9: Mesos master’s scalability versus number of slaves.

Our initial Mesos implementation (labeled “reactive” in Figure 9) failed to scale beyond 15,000 slaves because it attempted to allocate resources immediately after every task finished. While the allocator code is fairly simple, it dominated the master’s execution time, causing latencies to increase. To amortize allocation costs, we modified Mesos to perform allocations in batch intervals. To demonstrate how well this implementation scales, we ran the same experiment as before, but also tried variations with average task lengths of 10 and 30 seconds (labeled “10 s” and “30 s” in Figure 9). As the graph shows, with a 30-second average task length Mesos imposes less than 1 second of additional overhead on frameworks. Unfortunately, the EC2 virtualized environment limited scalability beyond 50,000 slaves: at 50,000 slaves the master was processing 100,000 packets per second (in+out), which has been shown to be the currently achievable limit in EC2 [14].
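The difference between the “reactive” and batched curves comes from decoupling allocation from individual task completions. The Python sketch below illustrates that batching idea; the Allocator class, its recompute_offers method, and the timer loop are assumptions for illustration, not Mesos’s actual allocator code.

    # Illustrative contrast between the two allocation strategies discussed
    # above; names here are stand-ins, not Mesos internals.
    import threading, time

    class Allocator:
        def __init__(self):
            self.dirty = False          # freed resources waiting to be reallocated
            self.lock = threading.Lock()

        def recompute_offers(self):
            # Placeholder for the (relatively expensive) computation that decides
            # which framework should be offered the freed resources.
            time.sleep(0.001)

        # "Reactive" scheme: allocate immediately after every task completion.
        # With tens of thousands of slaves this runs so often that it dominates
        # the master's execution time.
        def on_task_finished_reactive(self):
            self.recompute_offers()

        # Batched scheme: task completions only mark state dirty; a periodic
        # pass performs one allocation per interval, amortizing the cost.
        def on_task_finished_batched(self):
            with self.lock:
                self.dirty = True

        def allocation_loop(self, interval=1.0, rounds=3):
            for _ in range(rounds):
                time.sleep(interval)
                with self.lock:
                    run, self.dirty = self.dirty, False
                if run:
                    self.recompute_offers()

    alloc = Allocator()
    for _ in range(100):                 # 100 task completions in one interval...
        alloc.on_task_finished_batched()
    alloc.allocation_loop(interval=0.01, rounds=1)   # ...trigger a single pass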
6.7 Fault Tolerance

To evaluate recovery from master failures, we conducted an experiment identical to the scalability experiment, except that we used 20-second task lengths and two Mesos masters connected to a 5-node ZooKeeper quorum using the default tick timer of 2 seconds. We ran the experiment using 62 EC2 instances, each with 4 CPU cores and 15 GB RAM. We synchronized the two masters’ clocks using NTP and measured the mean time to recovery (MTTR) after killing the active master. The MTTR is the time for all of the slaves and frameworks to connect to the second master. Figure 10 shows the average MTTR with a 95% confidence interval for different cluster sizes. Although not shown, we also experimented with different ZooKeeper quorum sizes (between 1 and 11). The results consistently showed that the MTTR is always below 15 seconds, and typically 7-8 seconds.

Figure 10: Mean time for the slaves and applications to reconnect to a secondary Mesos master upon master failure. The plot shows 95% confidence intervals.

7 Related Work

HPC and Grid schedulers. The high-performance computing (HPC) community has long been managing clusters [42, 51, 28, 20]. Their target environment typically consists of specialized hardware, such as Infiniband, SANs, and parallel filesystems, so jobs do not need to be scheduled local to their data. Furthermore, each job is tightly coupled, often using barriers or message passing; each job is thus monolithic rather than composed of smaller fine-grained tasks. Consequently, a job does not dynamically grow or shrink its resource demands across machines during its lifetime. Moreover, fault tolerance is achieved through checkpointing rather than by recomputing fine-grained tasks. For these reasons, HPC schedulers use centralized scheduling and require jobs to declare their resource requirements at submission time. Jobs are then given coarse-grained allocations of the cluster.
Unlike the Mesos approach, this does not allow frameworks to locally access data distributed over the cluster. Furthermore, jobs cannot grow and shrink dynamically as their allocations change. In addition to supporting fine-grained sharing, Mesos can run HPC schedulers, such as Torque, as frameworks, which can then schedule HPC workloads appropriately.

Grid computing has mostly focused on the problem of making diverse virtual organizations share geographically distributed and separately administered resources in a secure and interoperable way. Mesos could well be used within a virtual organization that is part of a larger grid which, for example, runs the Globus Toolkit.

Public and Private Clouds. Virtual machine clouds, such as Amazon EC2 [1] and Eucalyptus [40], share common goals with Mesos, such as isolating frameworks while providing a low-level abstraction (VMs). However, they differ from Mesos in several important ways. First, these systems share resources in a coarse-grained manner, where a user may hold onto a virtual machine for an arbitrary amount of time; this makes fine-grained sharing of data difficult. Second, these systems generally do not let applications specify placement needs beyond the size of virtual machine they require. In contrast, Mesos allows frameworks to be highly selective about which resources they acquire through resource offers.

Quincy. Quincy [34] is a fair scheduler for Dryad. It uses a centralized scheduling algorithm based on min-cost flow optimization and takes into account both fairness and locality constraints. Resources are reassigned by killing tasks (similarly to Mesos’ resource killing) when the output of the min-cost flow algorithm changes. Quincy assumes that tasks have identical resource requirements, and the authors note that Quincy’s min-cost flow formulation of the scheduling problem is difficult to extend to tasks with multi-dimensional resource demands. In contrast, Mesos aims to support multiple cluster computing frameworks, including frameworks with scheduling preferences other than data locality. Mesos uses a decentralized two-level scheduling approach that lets frameworks decide where they run, and it supports heterogeneous task resource requirements. We have shown that frameworks can still achieve near-perfect data locality using delay scheduling.

Specification Language Approach. Some systems, such as Condor and Clustera [43, 26], go to great lengths to match users and jobs to available resources. Clustera provides multi-user support and uses a heuristic incorporating user priorities, data locality, and starvation to match work to idle nodes. However, Clustera requires each job to explicitly list its data locality needs, and it does not support placement constraints other than data locality. Condor uses ClassAds [41] to match node properties to job needs. This approach of using a resource specification language always has the problem that certain preferences may not currently be expressible; for example, delay scheduling is hard to express with the current languages. In contrast, the two-level scheduling and resource offer architecture in Mesos gives jobs control over deciding where and when to run tasks.

8 Conclusion

We have presented Mesos, a resource management substrate that allows diverse parallel applications to efficiently share a cluster. Mesos is built around two contributions: a fine-grained sharing model where applications divide work into smaller tasks, and a distributed scheduling mechanism called resource offers that lets applications choose which resources to run on. Together, these contributions let Mesos achieve high utilization, respond rapidly to workload changes, and cater to applications with diverse placement preferences, while remaining simple and scalable. We have shown that existing cluster applications can effectively share resources with Mesos, that new specialized cluster computing frameworks, such as Spark, can provide major performance gains, and that Mesos’s simple architecture enables the system to be fault tolerant and to scale to 50,000 nodes.

Mesos is inspired by work on microkernels [18], exokernels [27], and hypervisors [32] in the OS community, and by the success of the narrow-waist IP model [23] in computer networks. Like a microkernel or hypervisor, Mesos is a stable, minimal core that isolates applications sharing a cluster. Like an exokernel, Mesos aims to give applications maximal control over their execution. Finally, like IP, Mesos encourages diversity and innovation in cluster computing by providing a “narrow waist” API that lets diverse applications coexist efficiently.

In future work, we intend to focus on three areas. First, we plan to further analyze the resource offer model to characterize the environments in which it works well and to determine whether any extensions can improve its efficiency while retaining its flexibility. Second, we plan to use Mesos as a springboard to experiment with cluster programming models. Lastly, we intend to build a stack of higher-level programming abstractions on Mesos to allow developers to quickly write scalable, fault-tolerant applications.

References

[1] Amazon EC2. http://aws.amazon.com/ec2/.
[2] Apache Hadoop. http://hadoop.apache.org.
[3] Apache Hive. http://hadoop.apache.org/hive/.
[4] Apache ZooKeeper. http://hadoop.apache.org/zookeeper/.
[5] Hadoop and Hive at Facebook. http://www.slideshare.net/dzhou/facebook-hadoop-data-applications.
[6] HAProxy homepage. http://haproxy.1wt.eu/.
[7] Hive – A Petabyte Scale Data Warehouse using Hadoop. http://www.facebook.com/note.php?note_id=89508453919#/note.php?note_id=89508453919.
[8] LibProcess homepage. http://www.eecs.berkeley.edu/~benh/libprocess.
[9] Linux 2.6.33 release notes. http://kernelnewbies.org/Linux_2_6_33.
[10] Linux containers (LXC) overview document. http://lxc.sourceforge.net/lxc.html.
[11] Logistic regression Wikipedia page. http://en.wikipedia.org/wiki/Logistic_regression.
[12] Personal communication with Dhruba Borthakur from the Facebook data infrastructure team.
[13] Personal communication with Khaled Elmeleegy from Yahoo! Research.
[14] RightScale webpage. http://blog.rightscale.com/2010/04/01/benchmarking-load-balancers-in-the-cloud.
[15] Simplified wrapper interface generator. http://www.swig.org.
[16] Solaris Resource Management. http://docs.sun.com/app/docs/doc/817-1592.
[17] Twister. http://www.iterativemapreduce.org.
[18] M. J. Accetta, R. V. Baron, W. J. Bolosky, D. B. Golub, R. F. Rashid, A. Tevanian, and M. Young. Mach: A new kernel foundation for UNIX development. In USENIX Summer, pages 93–113, 1986.
[19] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen. LAPACK: a portable linear algebra library for high-performance computers. In Supercomputing ’90: Proceedings of the 1990 Conference on Supercomputing, pages 2–11, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press.
[20] D. Angulo, M. Bone, C. Grubbs, and G. von Laszewski. Workflow management through Cobalt. In International Workshop on Grid Computing Environments–Open Access (2006), Tallahassee, FL, November 2006.
[21] D. Bertsekas and R. Gallager. Data Networks. Prentice Hall, second edition, 1992.
[22] A. Bouteiller, F. Cappello, T. Herault, G. Krawezik, P. Lemarinier, and F. Magniette. MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In SC ’03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 25, Washington, DC, USA, 2003. IEEE Computer Society.
[23] D. D. Clark. The design philosophy of the DARPA Internet protocols. In SIGCOMM, pages 106–114, 1988.
[24] T. Condie, N. Conway, P. Alvaro, and J. M. Hellerstein. MapReduce Online. In 7th Symposium on Networked Systems Design and Implementation. USENIX, May 2010.
[25] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
[26] D. J. DeWitt, E. Paulson, E. Robinson, J. F. Naughton, J. Royalty, S. Shankar, and A. Krioukov. Clustera: an integrated computation and data management system. PVLDB, 1(1):28–41, 2008.
[27] D. R. Engler, M. F. Kaashoek, and J. O’Toole. Exokernel: An operating system architecture for application-level resource management. In SOSP, pages 251–266, 1995.
[28] W. Gentzsch. Sun Grid Engine: towards creating a compute power grid. pages 35–36, 2001.
[29] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of heterogeneous resources in datacenters. Technical Report UCB/EECS-2010-55, EECS Department, University of California, Berkeley, May 2010.
[30] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: research problems in data center networks. SIGCOMM Comput. Commun. Rev., 39(1):68–73, 2009.
[31] B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. Comet: Batched stream processing in data intensive distributed computing. Technical Report MSR-TR-2009-180, Microsoft Research, December 2009.
[32] E. C. Hendricks and T. C. Hartmann. Evolution of a virtual machine subsystem. IBM Systems Journal, 18(1):111–142, 1979.
[33] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys 2007, pages 59–72, 2007.
[34] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In SOSP 2009, November 2009.
[35] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta. On availability of intermediate data in cloud computations. In HotOS ’09: Proceedings of the Twelfth Workshop on Hot Topics in Operating Systems. USENIX, May 2009.
[36] D. Logothetis, C. Olston, B. Reed, K. Webb, and K. Yocum. Programming bulk-incremental dataflows. Technical report, University of California San Diego, http://cseweb.ucsd.edu/~dlogothetis/docs/bips_techreport.pdf, 2009.
[37] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In PODC ’09, pages 6–6, New York, NY, USA, 2009. ACM.
[38] D. Mosberger and T. Jin. httperf – a tool for measuring web server performance. SIGMETRICS Perform. Eval. Rev., 26(3):31–37, 1998.
[39] H. Moulin. Fair Division and Collective Welfare. The MIT Press, 2004.
[40] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The Eucalyptus open-source cloud-computing system. In Proceedings of Cloud Computing and Its Applications [Online], Chicago, Illinois, October 2008.
[41] R. Raman, M. Solomon, M. Livny, and A. Roy. The ClassAds language. pages 255–270, 2004.
[42] G. Staples. TORQUE resource manager. In SC, page 8, 2006.
[43] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 17(2-4):323–356, 2005.
[44] H. Varian. Equity, envy, and efficiency. Journal of Economic Theory, 9(1):63–91, 1974.
[45] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: flexible proportional-share resource management. In OSDI ’94: Proceedings of the 1st USENIX Conference on Operating Systems Design and Implementation, pages 1+, Berkeley, CA, USA, 1994. USENIX Association.
[46] Y. Yu, P. K. Gunda, and M. Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In SOSP ’09: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 247–260, New York, NY, USA, 2009. ACM.
[47] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, pages 1–14, 2008.
[48] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys 2010, April 2010.
[49] M. Zaharia, N. M. M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. Technical Report UCB/EECS-2010-53, EECS Department, University of California, Berkeley, May 2010.
[50] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In Proc. OSDI 2008, December 2008.
[51] S. Zhou. LSF: Load sharing in large-scale heterogeneous distributed systems. In Workshop on Cluster Computing, Tallahassee, FL, December 1992.
