Storage Provisioning Using VMware vSphere Big Data Extensions
TECHNICAL WHITE PAPER
Table of Contents
Apache Hadoop Deployment on VMware vSphere Using vSphere Big Data Extensions
Local Storage and Shared Storage
    Basic vSphere Storage Concepts
    Using Local and Shared Storage for Hadoop
Storage Provisioning by BDE
    Datastore Management
    Cluster Specification of Storage
    Disk Placement and Storage Allocation
Storage Management After Cluster Deployment
    Allocation of Unused Datastore Storage
    Storage Failure and Recovery
        Disk Replacement and Node Data Disk Recovery
        Disk Replacement and Node Recovery
        Recoverable Disk Failure
    Storage Configuration for Hadoop Outside of BDE
        Data Disk Resizing
        Utilization of Additional Disks
Conclusion
VMware recommends the following best practices for configuring storage for a Hadoop cluster deployed on vSphere:
• Place the Hadoop master node (including the NameNode and JobTracker) on shared storage to enable the vSphere HA, vSphere FT, and VMware vSphere Distributed Resource Scheduler (vSphere DRS) features. These features prevent the master node from being the single point of failure (SPOF) in the Hadoop cluster (see the specification sketch after this list).
• Place the Hadoop data nodes on local storage for data locality and performance. Follow storage provisioning best practices similar to those for Hadoop deployments on physical infrastructure (disk types, number of drives per node, no RAID, and so on).
• If data and compute are separated in the cluster, or a compute-only cluster is deployed, place the compute nodes on local storage or use NFS as previously described.
• Place the Hadoop client nodes and other Hadoop ecosystem nodes on either local storage or shared storage.
• When using local storage, set the server RAID controller cache policy to write back instead of write through if a cache battery backup unit (BBU) module is installed. Initial I/Os to a disk formatted as either thin provision or thick provision lazy zeroed trigger zeroing on demand, which degrades performance until the entire disk has been zeroed. The write-back cache mode helps eliminate this degradation. By default, BDE formats node data disks as thick provision lazy zeroed, so initial Hadoop performance might not be optimal unless write-back mode is enabled on the RAID controller.
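As a hedged illustration of the first recommendation, a master node group in a BDE cluster specification can request shared storage. The haFlag attribute (used here to request vSphere HA protection) and the role names are assumptions based on the BDE specification format rather than details taken from this paper:
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": [ "hadoop_namenode", "hadoop_jobtracker" ],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 4096,
      "storage": { "type": "SHARED", "sizeGB": 50 },
      "haFlag": "on"
    }
  ]
}
A data node group would instead use "type": "LOCAL", as shown in the example later in this section.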
The BDE datastore add command defines a storage pool, and its type, to be used for cluster deployment. A pool can contain one or many vSphere datastores, and the datastore name can be specified with a wildcard pattern to include a set of datastores for cluster use. BDE currently does not check whether the datastore actually exists in VMware vCenter; use of a nonexistent datastore will cause cluster creation to fail. Two other commands, datastore delete and datastore list, are provided for deleting and listing BDE storage pools.
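The corresponding Serengeti CLI usage can be sketched as follows; the pool names and the wildcard pattern are illustrative only, and the exact option syntax may differ between BDE releases:
datastore add --name dsLocalPool --spec local-ds-* --type LOCAL
datastore add --name dsSharedPool --spec sharedDS --type SHARED
datastore list
datastore delete --name dsLocalPool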
In the cluster specification file, storage requirements are declared per node group. For example, the following node group requests four data nodes, each with 50GB of local storage:
{
  "name": "data",
  "roles": [
    "hadoop_datanode"
  ],
  "instanceNum": 4,
  "cpuNum": 2,
  "memCapacityMB": 2048,
  "storage": {
    "type": "LOCAL",
    "sizeGB": 50
  }
}
The cluster specification can also be used to place system and data disks on separate datastores. In the following storage clause, data disks are placed on the datastores listed in dsNames4Data, and system disks are placed on the datastores listed in dsNames4System:
"storage": {
  "type": "LOCAL",
  "sizeGB": 50,
  "dsNames4Data": ["DSLOCALSSD"],
  "dsNames4System": ["DSNDFS"]
}
The cluster create command uses the --dsNames parameter to specify the list of BDE storage pools to be used
for cluster creation. These storage pools must collectively meet the size and type requirements in the cluster
specification. Otherwise, cluster creation will fail.
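For example, assuming a specification file and the storage pools sketched above (the cluster name and the file path here are hypothetical):
cluster create --name myHadoopCluster --specFile /home/serengeti/clusterSpec.json --dsNames dsLocalPool,dsSharedPool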
Applying this disk placement and storage allocation policy to the example, Table 1 shows how storage is
configured on each of the cluster nodes.
NODE                          VIRTUAL DISK  USAGE   DATASTORE       SIZE   VDISK TYPE
Master                        /dev/sda      System  sharedDS        20GB   Thin provision
                              /dev/sdb      Swap    sharedDS        ~4GB   Thin provision
                              /dev/sdc      Data    sharedDS        20GB   Thin provision
Client                        /dev/sda      System  sharedDS        20GB   Thin provision
                              /dev/sdb      Swap    sharedDS        ~4GB   Thin provision
                              /dev/sdc      Data    sharedDS        20GB   Thin provision
Data (one per ESXi host)      /dev/sda      System  localDS0_esx#   20GB   Thin provision
                              /dev/sdb      Swap    localDS0_esx#   ~4GB   Thick provision lazy zeroed
                              /dev/sdc      Data    localDS0_esx#   30GB   Thick provision lazy zeroed
                              /dev/sdd      Data    localDS1_esx#   30GB   Thick provision lazy zeroed
                              /dev/sde      Data    localDS2_esx#   30GB   Thick provision lazy zeroed
                              /dev/sdf      Data    localDS3_esx#   30GB   Thick provision lazy zeroed
                              /dev/sdg      Data    localDS4_esx#   30GB   Thick provision lazy zeroed
Compute (two per ESXi host)   /dev/sda      System  localDS0_esx#   20GB   Thin provision
                              /dev/sdb      Swap    localDS0_esx#   ~4GB   Thick provision lazy zeroed
                              /dev/sdc      Data    localDS0_esx#   5GB    Thick provision lazy zeroed
                              /dev/sdd      Data    localDS1_esx#   5GB    Thick provision lazy zeroed
                              /dev/sde      Data    localDS2_esx#   5GB    Thick provision lazy zeroed
                              /dev/sdf      Data    localDS3_esx#   5GB    Thick provision lazy zeroed
                              /dev/sdg      Data    localDS4_esx#   5GB    Thick provision lazy zeroed

Table 1. Storage configuration of the cluster nodes.
As a result, the shared datastore is estimated to have 10GB of free space left. Of the five local disks on each ESXi host, the first has roughly 5GB of free space left and each of the other four has about 80GB.
The node can resume service after removal of the inaccessible data disk in accordance with the procedure
described in VMware knowledge base article 1009854.
After power-on of the node, BDE reprovisions the node appropriately and updates relevant Hadoop
configuration files on the node to exclude the lost data disk. BDE then reports the node to be back in service.
Hadoop reports the node to be alive again. The cluster is fully functional with all nodes, although this particular
node has one fewer data disk.
After a new physical disk has replaced the failed one, the following procedure can be used to make it available to the Hadoop cluster and to recover each affected node with a recreated data disk:
1. Create a VMFS datastore on the new disk, as detailed in the BDE User's Guide.
2. Power off the node.
3. Add a virtual disk to the node.
a. Click Edit Settings.
b. Click Add in the virtual machine Properties window.
c. Select Hard Disk as the type of device to add.
d. Select Create a new virtual disk.
e. Specify the disk size to be exactly the same as the other data disks on the node, and choose
Thick Provision Lazy Zeroed as the provisioning type.
f. Select Specify a datastore or datastore cluster for the disk location, and browse to choose the
datastore created in step 1.
g. Place the disk on the same SCSI controller and target location as the previously removed disk.
h. Set the disk mode to Independent and select Persistent.
4. Power on the node. BDE reprovisions the node appropriately and updates relevant Hadoop configuration
files on the node to include the newly provisioned data disk. BDE then reports the node to be back
in service.
It is recommended that an HDFS fsck be run after all affected nodes have been recovered. At this point, the
Hadoop cluster is fully functional with all nodes. Each node has the same number of data disks as initially
deployed by BDE. There should be no data loss throughout this entire failure and recovery process. Hadoop will
not try to balance data blocks across the newly replaced data disks but will likely place blocks of newly created
files on these data disks first.
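A basic consistency check on a Hadoop 1.x cluster can be run from any node with the Hadoop client configured, for example:
hadoop fsck /
Missing, corrupt, or under-replicated blocks are summarized at the end of the command output.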
Disk Replacement and Node Recovery
If a Hadoop node has both its system disk and data disk in the datastore created from the failed HDD, the cluster
and node go into the following state:
• The node is completely dead due to the loss of its system disk. BDE reports the node to be down.
• The cluster loses the node and places it on the deadNodes list.
The cluster remains fully functional with the remaining nodes, although at a reduced capacity. There is no data
loss because HDFS has replicas of the blocks elsewhere. Over time, HDFS will detect underreplicated blocks
and replicate them automatically.
There is currently no way of recovering the node after a new physical disk has replaced the failed one. To
preserve the cluster size, users can run the following command to scale out the slave node group by one:
cluster resize --name <cluster name> --nodeGroup worker --instanceNum <slave # + 1>
BDE now maintains a seemingly larger Hadoop cluster but with the same number of active slave nodes
as before.
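For example, if the cluster was originally deployed with ten worker nodes (the cluster name here is hypothetical):
cluster resize --name myHadoopCluster --nodeGroup worker --instanceNum 11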
Hadoop will recognize the new disk size automatically and will start using it.
# Assumes the working directory is the new disk's mount point (from the preceding steps)
mkdir hadoop
chown -R hdfs:hadoop hadoop
cd hadoop
mkdir hdfs mapred
cd hdfs
mkdir data name secondary
chown hdfs:hadoop data name secondary
chmod 700 name secondary
cd ../mapred
mkdir local
chown mapred:hadoop local
9. Edit the /usr/lib/hadoop-1.0.1/conf/hdfs-site.xml file to add the new HDFS name and data locations to the dfs.name.dir and dfs.data.dir properties, respectively (see the sketch after step 11).
10. Edit the /usr/lib/hadoop-1.0.1/conf/mapred-site.xml file to add the new MapReduce local directory to the mapred.local.dir property.
11. Restart the hadoop-0.20-datanode and hadoop-0.20-tasktracker services on the node.
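The following sketch illustrates the edits in steps 9 and 10, assuming the new disk is mounted at /mnt/new-disk and the existing Hadoop directories live under /mnt/data1; both paths are hypothetical, and the dfs.name.dir property follows the same comma-separated pattern in hdfs-site.xml:
<!-- /usr/lib/hadoop-1.0.1/conf/hdfs-site.xml (paths are examples only) -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/data1/hadoop/hdfs/data,/mnt/new-disk/hadoop/hdfs/data</value>
</property>

<!-- /usr/lib/hadoop-1.0.1/conf/mapred-site.xml (paths are examples only) -->
<property>
  <name>mapred.local.dir</name>
  <value>/mnt/data1/hadoop/mapred/local,/mnt/new-disk/hadoop/mapred/local</value>
</property>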
The new data disk is now ready for use by the Hadoop cluster. When BDE restarts the cluster, or when the node
reboots, the new disk will be intact on the node. However, BDE will restore the hdfs-site.xml and mapred-site.xml
Hadoop configuration files for the node, based on the BDE cluster configuration database. Therefore, the new
data disk is not included in the configuration files for Hadoop to consume because BDE does not detect the disk.
To use the disk, steps 9 through 11 must be performed every time the node reboots or the cluster restarts.
Conclusion
An Apache Hadoop cluster deployed on VMware vSphere can leverage vSphere HA, vSphere FT, and vSphere vMotion for enhanced availability by using shared storage, while preserving data locality by using local storage for data nodes. Virtualization enables data and compute to be separated without compromising data locality. Big Data Extensions simplifies and accelerates Hadoop deployment on vSphere and hides the underlying complexity from the vSphere administrator.