
Apache Hadoop Storage Provisioning Using VMware vSphere Big Data Extensions

TECHNICAL WHITE PAPER


Table of Contents
Apache Hadoop Deployment on VMware vSphere Using vSphere Big Data Extensions
Local Storage and Shared Storage
    Basic vSphere Storage Concepts
    Using Local and Shared Storage for Hadoop
Storage Provisioning by BDE
    Datastore Management
    Cluster Specification of Storage
    Disk Placement and Storage Allocation
Storage Management After Cluster Deployment
    Allocation of Unused Datastore Storage
    Storage Failure and Recovery
        Disk Replacement and Node Data Disk Recovery
        Disk Replacement and Node Recovery
        Recoverable Disk Failure
Storage Configuration for Hadoop Outside of BDE
    Data Disk Resizing
    Utilization of Additional Disks
Conclusion


Apache Hadoop Deployment on VMware vSphere Using vSphere Big Data Extensions
The Apache Hadoop software library is a framework that enables the distributed processing of large data sets
across clusters of computers. It is designed to scale up from single servers to thousands of machines, with each
offering local computation and storage. Enterprises across many industries use Hadoop for big data analytics, to make better business decisions based on large data sets.
Serengeti is an open-source project initiated by VMware to automate deployment and management of Hadoop
clusters on virtualized environments such as VMware vSphere. Serengeti offers the following key benefits:
• Deploy a Hadoop cluster on vSphere in minutes via one command
• Employ a fully customizable configuration profile to specify compute, storage and network resources as well as node placement
• Provide better Hadoop manageability and usability, enabling fast and simple cluster scale-out and Hadoop tuning
• Enable separation of data and compute nodes without losing data locality
• Improve Hadoop cluster availability by leveraging VMware vSphere High Availability (vSphere HA), VMware vSphere Fault Tolerance (vSphere FT) and VMware vSphere vMotion
• Support multiple Hadoop distributions, including Apache Hadoop, Cloudera CDH, Pivotal HD, MapR, Hortonworks Data Platform (HDP) and Intel IDH
Through its sponsorship of Project Serengeti, VMware has been investing in making it easier for users to run big
data and Hadoop workloads. VMware has introduced Big Data Extensions (BDE) as a commercially supported
version of Project Serengeti designed for enterprises seeking VMware support. BDE enables customers to run
clustered, scale-out Hadoop applications through vSphere, delivering all the benefits of virtualization to Hadoop
users. BDE provides increased agility through an easy-to-use interface, elastic scaling through the separation of
compute and storage resources, and increased reliability and security by leveraging proven vSphere technology.
VMware has built BDE to support all major Hadoop distributions and associated Hadoop projects such as Pig,
Hive and HBase.
Serengeti automates deployment of a Hadoop cluster, hiding from the user the complex resource allocation and configuration tasks on a virtualized infrastructure. Of all the resources involved, storage raises the most questions for Hadoop users, due to performance, capacity and data-locality considerations. This white paper
examines how storage is allocated and configured for a Hadoop cluster deployed using Serengeti. It also offers
recommendations on how to administer storage in certain scenarios where manual intervention is necessary.


Local Storage and Shared Storage


Basic vSphere Storage Concepts
VMware ESXi provides host-level storage virtualization, which logically abstracts the physical storage layer
from virtual machines. An ESXi virtual machine uses one or more virtual disks to store its operating system (OS),
program files and other data. Each virtual disk is a large physical file, or a set of files, that resides on a VMware
vSphere VMFS datastore, a datastore based on some other technology such as Network File System (NFS) or
VMware Virtual SAN, or a raw disk. To access virtual disks, a virtual machine uses virtual SCSI controllers. From
the standpoint of the virtual machine, each virtual disk appears as if it were a SCSI drive connected to a SCSI
controller. The underlying physical storage for the virtual disk, whether accessed through parallel SCSI, iSCSI, network or Fibre Channel adapters on the ESXi host, is transparent to the guest OS and to applications running
on the virtual machine.
ESXi supports two types of storage: local storage and shared storage. Local storage maintains virtual machine
files on internal or directly attached external disks that are managed exclusively by that single host, whereas
shared storage maintains virtual machine files on disks or storage arrays shared among more than one host,
such as those connected through an IP-based or Fibre Channel network. Datastores are logical containers that
hide specifics of each storage device and provide a uniform model for storing virtual machines. For more details
about vSphere storage, refer to vSphere documentation.

Using Local and Shared Storage for Hadoop


When deploying a Hadoop cluster on vSphere, users can choose to use either local or shared storage for each
node. BDE reads the setting from the cluster configuration file and creates virtual disks for the nodes in specified
datastore(s) accordingly.
Local and shared storage offer distinctive benefits. Shared storage in a vSphere environment enables advanced
capabilities such as vSphere HA, vSphere FT, and vSphere vMotion in the vSphere cluster to protect Hadoop
nodes. Shared storage is typically provided by network-attached storage (NAS) or storage area network (SAN)
storage arrays, which not only offer high and scalable capacity but also add another layer of storage availability
through RAID, hardware redundancy and multipathing. On the other hand, local storage offers better I/O
performance by eliminating network overhead and latency. To improve performance and conserve bandwidth,
Hadoop is particularly designed for data locality, so data is processed on the same machine that stores it.
Data locality is preserved in a Hadoop deployment on vSphere when a slave node uses local storage: virtual disk I/Os from the slave node are directed to the local disks attached to the ESXi host and never traverse the network. vSphere virtualization also makes it possible to deploy multiple slave nodes on a single ESXi host without any of them losing data locality. When two virtual machines are deployed on the same ESXi host, communication between them travels over the virtual network that logically connects them within the host. This traffic is handled in ESXi host memory and never leaves the host, which enables data and compute to be separated for a Hadoop cluster without compromising data locality.
When separating data and compute nodes, users can set constraints to strictly associate compute nodes
with data nodes. When a user specifies TEMPFS as the storage type for the compute nodes, BDE installs an
NFS server on associated data nodes, installs an NFS client on compute nodes, and mounts data node disks
on compute nodes. BDE does not assign disks to compute nodes, and all temporary files generated during
Hadoop MapReduce jobs are saved on the NFS disks. Using NFS storage for compute nodes increases the
capacity of each compute node and returns storage resources when compute nodes stop.
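For illustration, a compute node group using this model might look like the following fragment, in the cluster specification format shown later in this paper. The group name, role and sizes here are hypothetical; the point is only the TEMPFS storage type:

"name": "compute",
"roles": [
  "hadoop_tasktracker"
],
"instanceNum": 8,
"cpuNum": 2,
"memCapacityMB": 2048,
"storage": {
  "type": "TEMPFS",
  "sizeGB": 25
}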


VMware recommends the following best practices for configuring storage for a Hadoop cluster deployed
on vSphere:
• Place the Hadoop master node (including NameNode and JobTracker) on shared storage to enable vSphere HA, vSphere FT and VMware vSphere Distributed Resource Scheduler (vSphere DRS) features. These features prevent the master node from being the single point of failure (SPOF) in the Hadoop cluster.
• Place the Hadoop data nodes on local storage for locality and performance. Follow similar best practices of storage provisioning (disk types, number of drives per node, no RAID, and so on) as for Hadoop deployment on physical infrastructure.
• If separating data and compute in the cluster, or deploying a compute-only cluster, place the compute nodes on local storage or use NFS in the form previously described.
• Place the Hadoop client nodes and other Hadoop ecosystem nodes on either local storage or shared storage.
• When using local storage, set the server RAID controller cache policy to write back instead of write through if a cache battery backup unit (BBU) module is installed. Initial I/Os to a disk formatted as either thin provision or thick provision lazy zeroed will result in disk zeroing on demand, leading to degraded performance until the entire disk has been zeroed. The write back cache mode helps eliminate this performance degradation. By default, BDE formats node data disks to the thick provision lazy zeroed format, so initial Hadoop performance might not be optimal unless the write back mode is applied on the RAID controller.

Storage Provisioning by BDE


Datastore Management
BDE enables users to specify datastores to be selected for Hadoop deployment. In the BDE CLI, the following is
the command syntax:
datastore add --name <storagepool name in BDE> --spec <datastore name in vSphere>
--type <LOCAL|SHARED>

Each such command defines a BDE storage pool of the given type to be used for cluster deployment. A pool can contain one or many vSphere datastores, and the datastore name can be specified with a wildcard pattern to include a set of datastores for cluster use. BDE currently does not check whether the datastore actually exists in VMware vCenter; use of a nonexistent datastore will cause cluster creation to fail. Two other commands, datastore delete and datastore list, are provided for deleting and listing BDE storage pools.
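For example, the following commands (the pool names are hypothetical; the datastore names match the example used later in this paper) define a local pool covering every datastore whose name begins with localDS, define a shared pool, and then list the pools:

datastore add --name localPool --spec localDS* --type LOCAL
datastore add --name sharedPool --spec sharedDS --type SHARED
datastore list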

Cluster Specification of Storage


When deploying a customized Hadoop cluster, users can specify a set of attributes related to storage for each
node group, including number of nodes, size of each node, and storage type. For instance, the following
specification instructs BDE to create four data nodes, each using 50GB of local storage:


"name": "data",
"roles": [
  "hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 2,
"memCapacityMB": 2048,
"storage": {
  "type": "LOCAL",
  "sizeGB": 50
}

The cluster specification can also be used to place system and data disks on separate datastores. In this
storage clause, data disks are placed on dsNames4Data datastores, and system disks are placed on
dsNames4System datastores:
"storage": {
  "type": "LOCAL",
  "sizeGB": 50,
  "dsNames4Data": ["DSLOCALSSD"],
  "dsNames4System": ["DSNDFS"]
}

The cluster create command uses the --dsNames parameter to specify the list of BDE storage pools to be used
for cluster creation. These storage pools must collectively meet the size and type requirements in the cluster
specification. Otherwise, cluster creation will fail.
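A minimal invocation sketch, assuming a cluster named myHadoop, a specification file saved at /home/serengeti/myspec.json, and the two hypothetical storage pools defined earlier:

cluster create --name myHadoop --specFile /home/serengeti/myspec.json --dsNames localPool,sharedPool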

Disk Placement and Storage Allocation


To illustrate how BDE creates virtual disks and allocates storage for a node according to the cluster specification,
a simple example is provided here. The same placement and allocation policy and algorithm apply to both local
and shared storage.
Suppose there is an ESXi cluster of four hosts, each with five locally attached 120GB disks. On each ESXi host, all five disks are formatted in VMFS to create datastores named localDS<0-4>_esx<0-3>. All these datastores, 20 in total, have been added into a BDE local storage pool to be used for cluster creation. The ESXi hosts also share a SAN storage array, from which a 100GB LUN is created and formatted in VMFS as a datastore named sharedDS. This datastore is added into a BDE shared storage pool for cluster use. A Hadoop cluster with one master node, four data nodes, eight compute nodes, and one client node must be created using the previously specified BDE local and shared storage pools. The data nodes will use local storage with 150GB each, and the compute nodes will use local storage with 25GB each, while the master node and client node will use shared storage with 20GB each. In this example, BDE by default will place one data node and two compute nodes per ESXi host and will place the master node and client node randomly on two of the ESXi hosts.
Every Hadoop node deployed by BDE will have three types of virtual disks: a fixed-size system disk, formatted in
multiple ext3 partitions, to install the guest OS and application; a swap disk of the same size as the virtual
memory; and one or more data disks, each formatted as a single ext4 partition, to store application data. The
number of data disks is dictated by the number of datastores specified for cluster creation on the host in the
corresponding BDE storage pool. The size of each disk equals the specified node size divided by the number of
data disks. BDE creates either thin provision or thick provision lazy zeroed virtual disks, depending on the node
and disk types. In this example, every node is assigned 4GB of virtual memory, so the swap disk size is also
roughly 4GB.
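In this example, each data node's specified 150GB is therefore divided across the five local datastores on its host, yielding five 30GB data disks, and each compute node's 25GB yields five 5GB data disks.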


Applying this disk placement and storage allocation policy to the example, Table 1 shows how storage is
configured on each of the cluster nodes.

NODE                         VIRTUAL DISK   USAGE    DATASTORE        SIZE    VDISK TYPE
Master                       /dev/sda       System   sharedDS         20GB    Thin provision
                             /dev/sdb       Swap     sharedDS         ~4GB    Thin provision
                             /dev/sdc       Data     sharedDS         20GB    Thin provision
Client                       /dev/sda       System   sharedDS         20GB    Thin provision
                             /dev/sdb       Swap     sharedDS         ~4GB    Thin provision
                             /dev/sdc       Data     sharedDS         20GB    Thin provision
Data                         /dev/sda       System   localDS0_esx#    20GB    Thin provision
(one per ESXi host)          /dev/sdb       Swap     localDS0_esx#    ~4GB    Thick provision lazy zeroed
                             /dev/sdc       Data     localDS0_esx#    30GB    Thick provision lazy zeroed
                             /dev/sdd       Data     localDS1_esx#    30GB    Thick provision lazy zeroed
                             /dev/sde       Data     localDS2_esx#    30GB    Thick provision lazy zeroed
                             /dev/sdf       Data     localDS3_esx#    30GB    Thick provision lazy zeroed
                             /dev/sdg       Data     localDS4_esx#    30GB    Thick provision lazy zeroed
Compute                      /dev/sda       System   localDS0_esx#    20GB    Thin provision
(two per ESXi host)          /dev/sdb       Swap     localDS0_esx#    ~4GB    Thick provision lazy zeroed
                             /dev/sdc       Data     localDS0_esx#    5GB     Thick provision lazy zeroed
                             /dev/sdd       Data     localDS1_esx#    5GB     Thick provision lazy zeroed
                             /dev/sde       Data     localDS2_esx#    5GB     Thick provision lazy zeroed
                             /dev/sdf       Data     localDS3_esx#    5GB     Thick provision lazy zeroed
                             /dev/sdg       Data     localDS4_esx#    5GB     Thick provision lazy zeroed

Table 1. Hadoop Cluster Node Storage Provisioning

As a result, the shared datastore is estimated to have 10GB of free space left. Of the five local disks on each ESXi,
the first disk has roughly 5GB of free space left and each of the other four has about 80GB.


Storage Management After Cluster Deployment


Allocation of Unused Datastore Storage
In a Hadoop cluster deployed by BDE, virtual disks of the cluster nodes cannot be resized through BDE to use
available free space in the underlying vSphere datastores, nor can BDE create additional disks for the nodes. The
unused datastore storage can be utilized in other ways:
• Scale out the existing Hadoop cluster to create more slave nodes
• Create another Hadoop cluster
• Allocate the storage to other applications
However, these methods of using available free space in the datastores, the last two in particular, will inevitably lead to disk contention with the existing Hadoop cluster when run concurrently, resulting in significant performance degradation. None of them is recommended unless applications and workloads can be scheduled carefully to avoid contention.
Therefore, it is very important to plan and size storage carefully at both the physical and virtual layers prior to
cluster deployment, taking into consideration the existing storage requirements as well as the prospects for
data growth.

Storage Failure and Recovery


Enterprise-class SAN and NAS storage rarely fails, due to the sophisticated set of high-availability capabilities
built into the arrays. However, individual commercial off-the-shelf hard disk drives have a much higher
probability of failure, particularly the lower-grade SATA drives that are often used for Hadoop deployment.
When the underlying storage fails, the vSphere datastore becomes unavailable, resulting in loss of access to
data among virtual machines using the datastore. If a virtual machine's system disk resides in the datastore,
the virtual machine is completely inaccessible. This section discusses the impact of a local disk failure on the
Hadoop cluster using the disk and how to recover from the failure.
There are a few different scenarios related to hard disk failure:
• The failed disk is used by Hadoop node(s) for data disks only. The disk is not recoverable and must be replaced.
• The failed disk is used by Hadoop node(s) for both system and data disks. The disk is not recoverable and must be replaced.
• The failed disk is recoverable, with its VMFS partition undamaged and its content uncorrupted.
Disk Replacement and Node Data Disk Recovery
If the datastore created from the failed HDD contains Hadoop data disks only, each node using the datastore
goes through the following sequence of states:
• The node loses one of its data disks. Consequently, the Hadoop Distributed File System (HDFS) blocks stored on the disk are missing from the node.
• The node remains up for a short period. BDE reports the node to be service ready.
• Hadoop reports the node to be alive in the cluster.
• Because a cluster deployed by BDE has its fault tolerance level set to zero by default, the node eventually stops the DataNode service, due to the loss of the disk.
• BDE detects the loss of the node and reports it.
• Hadoop detects that the node is not in service and adds the node to the deadNodes list.
• The cluster remains fully functional with the remaining nodes, although at a reduced capacity. There is no data loss because HDFS has replicas of the blocks elsewhere. Over time, HDFS will detect underreplicated blocks and replicate them automatically.


The node can resume service after removal of the inaccessible data disk in accordance with the procedure
described in VMware knowledge base article 1009854.
After power-on of the node, BDE reprovisions the node appropriately and updates relevant Hadoop
configuration files on the node to exclude the lost data disk. BDE then reports the node to be back in service.
Hadoop reports the node to be alive again. The cluster is fully functional with all nodes, although this particular
node has one fewer data disk.
After a new physical disk has replaced the failed one, the following procedure can be used to make it available to
the Hadoop cluster to recover each of the affected nodes with a recreated data disk:
1. Create a VMFS datastore on the new disk, as detailed in the BDE User's Guide.
2. Power off the node.
3. Add a virtual disk to the node.
a. Click Edit Settings.
b. Click Add in the virtual machine Properties window.
c. Select Hard Disk as the type of device to add.
d. Select Create a new virtual disk.
e. Specify the disk size to be exactly the same as the other data disks on the node, and choose
Thick Provision Lazy Zeroed as the provisioning type.
f. Select Specify a datastore or datastore cluster for the disk location, and browse to choose the
datastore created in step 1.

g. Place the disk on the same SCSI controller and target location as the previously removed disk.
h. Set the disk mode to Independent (Persistent).
4. Power on the node. BDE reprovisions the node appropriately and updates relevant Hadoop configuration
files on the node to include the newly provisioned data disk. BDE then reports the node to be back
in service.
It is recommended that an HDFS fsck be run after all affected nodes have been recovered. At this point, the
Hadoop cluster is fully functional with all nodes. Each node has the same number of data disks as initially
deployed by BDE. There should be no data loss throughout this entire failure and recovery process. Hadoop will
not try to balance data blocks across the newly replaced data disks but will likely place blocks of newly created
files on these data disks first.
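A basic consistency check can be run from any node with the Hadoop client configured; the root path / here simply checks the entire namespace:

# hadoop fsck /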
Disk Replacement and Node Recovery
If a Hadoop node has both its system disk and data disk in the datastore created from the failed HDD, the cluster
and node go into the following state:
• The node is completely dead, due to the loss of its system disk. BDE reports the node to be down.
• The cluster loses the node and places it on the deadNodes list.
• The cluster remains fully functional with the remaining nodes, although at a reduced capacity. There is no data loss because HDFS has replicas of the blocks elsewhere. Over time, HDFS will detect underreplicated blocks and replicate them automatically.
There is currently no way of recovering the node after a new physical disk has replaced the failed one. To
preserve the cluster size, users can run the following command to scale out the slave node group by one:
cluster resize --name <cluster name> --nodeGroup worker --instanceNum <slave # + 1>

BDE now maintains a seemingly larger Hadoop cluster but with the same number of active slave nodes
as before.
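For instance, if the example cluster described earlier (hypothetical name myHadoop) loses one of its four data nodes in the worker node group, the following command restores four active slave nodes:

cluster resize --name myHadoop --nodeGroup worker --instanceNum 5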


Recoverable Disk Failure


Some hard disk failures can be repaired physically or through utility software that corrects logical sector errors.
In other cases, the disk itself might be fine but there can be a problem with the cable or cable connection.
In these cases, the hard disk can be reused after the problem has been corrected. Normally the disk contains the
original VMFS partition and data. Therefore, the vSphere datastore is recovered when the disk has been made
available again. In turn, the virtual disk(s) created in the datastore become available to the Hadoop nodes. All of
this happens automatically within minutes after the disk has been made ready on the ESXi host. There is no specific action that must be taken in BDE or the Hadoop cluster. Any node that uses the datastore is back in
service after powering up. Depending on whether both system and data disks are impacted, during this failure
and recovery process the affected nodes might go into states described in the previous scenarios. Nevertheless,
except for rare and extreme cases where nodes are too concentrated on the failed disk, severely undermining
Hadoop functionality and availability, the Hadoop cluster remains functional.

Storage Configuration for Hadoop Outside of BDE


Data Disk Resizing
During cluster deployment, BDE calculates disk sizes based on cluster specification and availability of datastores.
After the cluster has been deployed, BDE manages node storage only in terms of maintaining the number of
disks as provisioned and presenting them for Hadoop to use. It does not keep track of other aspects, including
disk size. Therefore, even though BDE currently provides no function to enable disk resizing for the cluster, it
does not prohibit the disks from being resized manually.
System disks should not be resized. The following procedure can be used to resize a data disk:
1. Follow Hadoop best practices to move data blocks from the disk to other disks or nodes to maintain
HDFS consistency.
2. Power off the node.
3. Resize the disk in vSphere.
a. Click Edit Settings on the node.
b. Select the disk to be resized, and enter the new disk size for Provisioned Size.
c. Click OK to submit the change.
4. Power on the node.
5. Utilize the newly added disk space:
Option 1:
a. Install a partition management utility such as GParted in the node to expand the partition on the disk.
b. Run the resize2fs command to expand the ext4 file system on the partition.
Option 2:
a. Run the fdisk command to remove the existing partition.
b. Reboot the node. BDE will prepare the disk appropriately, including partition, file system, and
Hadoop directory structure creation.

Hadoop will recognize the new disk size automatically and will start using it.
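As a minimal sketch of Option 2, assuming the resized data disk appears in the guest as /dev/sdc:

# fdisk /dev/sdc
(at the fdisk prompt, enter d to delete the single existing partition, then w to write the table and exit)
# reboot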


Utilization of Additional Disks


Disk resizing is a way of scaling up storage on a node. Another possible way of scaling up storage is to add more
data disks to the node, which is much more complicated than disk resizing. The following procedure can be used
to manually add a data disk to a node:
1. Power off the node.
2. Follow the previously described procedure to add a new disk to the node.
a. Place the disk in the selected datastore.
b. Attach the disk to a SCSI controller and target location sequentially to existing disks.
3. Power on the node.
4. Run the sfdisk command to create a single partition on the new disk.
5. Run mkfs to create an ext4 file system using the partition.
6. Create a mount point under /mnt and add the mount point to the /etc/fstab file.
7. Mount the file system to the mount point.
8. Create the following directory structure in the file system:
# mkdir hadoop
# chown -R hdfs:hadoop hadoop
# cd hadoop
# mkdir hdfs mapred
# cd hdfs
# mkdir data name secondary
# chown hdfs:hadoop data name secondary
# chmod 700 name secondary
# cd ../mapred
# mkdir local
# chown mapred:hadoop local

9. Edit the /usr/lib/hadoop-1.0.1/conf/hdfs-site.xml file to add the new HDFS name and data locations for
the dfs.name.dir and dfs.data.dir properties respectively.
10. Edit the /usr/lib/hadoop-1.0.1/conf/mapred-site.xml file to add the new MapReduce local directory for
property mapred.local.dir.
11. Restart the hadoop-0.20-datanode and hadoop-0.20-tasktracker services on the node.
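As an illustration of steps 9 and 10, assuming the new file system is mounted at the hypothetical mount point /mnt/data5 and using the directory structure created in step 8, the new locations are appended to the existing comma-separated values (dfs.name.dir follows the same pattern):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.data.dir</name>
  <value>...existing directories...,/mnt/data5/hadoop/hdfs/data</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.local.dir</name>
  <value>...existing directories...,/mnt/data5/hadoop/mapred/local</value>
</property>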
The new data disk is now ready for use by the Hadoop cluster. When BDE restarts the cluster, or when the node
reboots, the new disk will be intact on the node. However, BDE will restore the hdfs-site.xml and mapred-site.xml
Hadoop configuration files for the node, based on the BDE cluster configuration database. Therefore, the new
data disk is not included in the configuration files for Hadoop to consume because BDE does not detect the disk.
To use the disk, steps 9-11 must be performed every time the node reboots or the cluster restarts.


Conclusion
An Apache Hadoop cluster deployed on VMware vSphere can leverage advanced vSphere HA, vSphere FT and
vSphere vMotion features for enhanced availability by using shared storage, while also preserving data locality
by using local storage for data nodes. Virtualization enables data and compute separation without
compromising data locality. Big Data Extensions simplifies and accelerates Hadoop deployment on vSphere while masking the complexity from the vSphere administrator.


VMware, Inc. 3401 Hillview Avenue Palo Alto CA 94304 USA Tel 877-486-9273 Fax 650-427-5001 www.vmware.com
Copyright © 2013 VMware, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws. VMware products are covered by one or more patents listed at http://www.vmware.com/go/patents. VMware is a registered trademark or trademark of VMware, Inc. in the United States and/or other jurisdictions. All other marks and names mentioned herein may be trademarks of their respective companies. Item No: VMW-WP-BIG-DATA-STOR-PROV-USLET-101
Docsource: OIC-13VM005.03
