
Cloudera Hadoop RHEL/CentOS 6 Install Guide



Created on 15 February 2012


Dakini

This guide contains everything you need to get a basic Hadoop cluster up and
running. It is intended as a condensed and easy-to-understand supplement to
the official documentation, so lengthy descriptions are omitted. For the full
documentation, see Cloudera's Install Guide.

Whether you want to start with a basic two-node cluster, or add hundreds or even thousands of nodes, the concepts
here apply. Adding nodes can be done at any time without interrupting the cluster's workflow, so as long as you have
two machines, you're ready to begin installing.

About hardware:
Hadoop was designed to be used on commodity hardware, so you won't need anything special for this project. You'll
want 2 or more reasonably fast, modern servers. (I'm using 9 SuperMicro boxes I happened to have lying around.)
Mine have dual Xeon processors, 48G RAM, and 4 x 7200 RPM SATA disks in JBOD mode. JBOD is recommended over
RAID for Hadoop, since Hadoop has its own built-in redundancy which performs better with plain disks.

Unlike High-Availability clustering, an HPC cluster like Hadoop does not require any special fencing hardware. It
handles hardware failure by simply not giving jobs to broken/misbehaving nodes. If a node fails a certain number of
jobs, it's out.

In a Hadoop cluster, the only type of hardware failure that would cause any noticeable disruption is the possible
failure of the NameNode. This is why it's always good to have a Secondary NameNode on standby.

Hadoop Clusters - Core concepts


Here are a few core concepts that will help you understand what you're about to build. For a very small cluster (let's
say, 9 nodes or less), your cluster will consist of these types of nodes:
1. Head node. Runs the NameNode service and JobTracker service.
2. Worker nodes. All other nodes in the cluster will run DataNode and TaskTracker services.
A larger cluster is almost identical to this, but generally they use a separate machine for the JobTracker service. It's
also common to add a Secondary NameNode for redundancy. So in that scenario you'd have:
1. NameNode machine.
2. Secondary NameNode machine.
3. JobTracker machine.
4. Worker machines, each running DataNode + TaskTracker services.
A brief definition of these components:
NameNode: Stores all metadata for the HDFS filesystem.
DataNodes: Worker nodes that store and retrieve data when told to (by clients or the NameNode).
TaskTrackers: Run tasks and send progress reports to the JobTracker.
JobTracker: Coordinates all jobs and schedules them to run on TaskTracker nodes.
HDFS: Hadoop Distributed File System. An HDFS cluster consists of a NameNode + DataNodes. All Hadoop IO
happens through this. Built for storing very large files across many machines.

Hadoop Installation
Now that you have a little background on this software, we can begin installing. The first thing you'll need is Java
JDK 1.6 u8 or higher. You might also want to use a tool like clusterssh to ssh into all your nodes at once to perform
this installation.
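
If you do use clusterssh, it broadcasts whatever you type to every host at once, which makes the per-node steps below much quicker. A quick sketch (the hostnames are placeholders for your own nodes):

# one window per node, keystrokes go to all of them
cssh headnode worker01 worker02 worker03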

Install Java on each node


# grab the latest java version - probably not this one anymore ;)
wget http://download.oracle.com/otn-pub/java/jdk/7u1-b08/jdk-7u1-linux-x64.rpm
rpm -Uvh jdk-7u1-linux-x64.rpm
alternatives --install /usr/bin/java java /usr/java/latest/bin/java 1600
alternatives --auto java

Disable SELinux
setenforce 0
vim /etc/sysconfig/selinux
SELINUX=disabled
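
If you'd rather not open an editor, the same persistent change can be made with sed. This assumes the stock RHEL/CentOS 6 layout, where the real file is /etc/selinux/config:

# set SELINUX=disabled without opening an editor
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config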

Allow communication between nodes in IPtables


Either disable IPtables, allow all communication between nodes, or open Hadoop-specific ports.
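
The simplest approaches are to trust all traffic from the cluster's subnet, or to turn IPtables off entirely on cluster-facing hosts. A rough sketch, assuming your nodes live on 192.168.1.0/24 (adjust for your network):

# trust everything from the cluster subnet (example subnet - change it)
iptables -I INPUT -s 192.168.1.0/24 -j ACCEPT
service iptables save

# ...or just disable the firewall on these hosts
service iptables stop
chkconfig iptables off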

Set up the Cloudera yum repo


wget http://archive.cloudera.com/redhat/6/x86_64/cdh/cdh3-repository-1.0-1.noarch.rpm
yum --nogpgcheck localinstall cdh3-repository-1.0-1.noarch.rpm
rpm --import http://archive.cloudera.com/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

On the head node, install NameNode and JobTracker packages


yum -y install hadoop-0.20-namenode hadoop-0.20-jobtracker

On the worker nodes, install DataNode and TaskTracker packages


yum -y install hadoop-0.20-datanode hadoop-0.20-tasktracker

Use alternatives to set up your custom cluster config


Setting up your cluster like this will allow you to keep multiple cluster configurations handy, and makes switching
between them so simple!

cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.MyCluster
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.MyCluster 50
alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.MyCluster

Verify that it's using MyCluster config instead of the default

[root@nodes ~]# alternatives --display hadoop-0.20-conf


hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.MyCluster
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.MyCluster - priority 50
Current `best' version is /etc/hadoop-0.20/conf.MyCluster.

Set up Hadoop storage disks on each node


In a Hadoop cluster, it is ideal to have multiple plain-disk storage mounts on each node. In this example, I'm using 4
plain disks, formatted ext4.
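
If your disks aren't partitioned and formatted yet, it looks roughly like this. The device names below are only examples; run fdisk -l first and substitute your own devices, or you risk destroying data:

# check which disks you actually have before formatting anything
fdisk -l

# example only - format one partition per data disk as ext4
mkfs.ext4 /dev/sdb1
mkfs.ext4 /dev/sdc1
mkfs.ext4 /dev/sdd1
mkfs.ext4 /dev/sde1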

mkdir -p /mnt/hdfs/{1..4}

Add the new disks to /etc/fstab, ensuring that they're mounted with noatime. (This prevents reads from turning into
unnecessary writes, which is generally good for performance.)

vim /etc/fstab


# append the new disks


/dev/sdb1 /mnt/hdfs/1 ext4 noatime 0 0
/dev/sdc1 /mnt/hdfs/2 ext4 noatime 0 0
/dev/sdd1 /mnt/hdfs/3 ext4 noatime 0 0
/dev/sde1 /mnt/hdfs/4 ext4 noatime 0 0

Mount the new disks and create Hadoop directories


The directories we're creating here correspond with Hadoop configuration options we'll be setting
later: dfs.name.dir, dfs.data.dir, mapred.local.dir.

mount /mnt/hdfs/1
mount /mnt/hdfs/2
mount /mnt/hdfs/3
mount /mnt/hdfs/4

# create the namenode, datanode, and mapred dirs on each disk


for num in 1 2 3 4; do mkdir /mnt/hdfs/$num/{namenode,datanode,mapred}; done

Set directory permissions


This part is very important! You'll run into errors later if these dirs are owned by the wrong user.

# make sure everything is owned by hdfs:hadoop


chown -R hdfs:hadoop /mnt/hdfs/
# ...except for the mapred dirs
chown -R mapred:hadoop /mnt/hdfs/{1,2,3,4}/mapred

Hadoop Core Configuration


Now that the underlying directories and storage devices are set up, we're ready to configure Hadoop.
These configuration files are not node-specific, so you can write them once and copy them to all nodes. Change
these config examples so that HEADNODE is actually the name of your head node.
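
Once the three files below are written, a small loop can push the finished config directory out to every node. The hostnames here are placeholders; substitute your own machines:

# copy the MyCluster config dir to each node (placeholder hostnames)
for host in worker01 worker02 worker03; do
  scp -r /etc/hadoop-0.20/conf.MyCluster $host:/etc/hadoop-0.20/
done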
/etc/hadoop-0.20/conf.MyCluster/core-site.xml
This config file will tell Hadoop where to find the NameNode and its default file system.

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://HEADNODE:54310</value>
</property>

</configuration>

/etc/hadoop-0.20/conf.MyCluster/hdfs-site.xml
This is where we tell Hadoop to use the directories we created earlier. It specifies local storage on each node, used
by the DataNodes and NameNode services to store HDFS data.

<configuration>
<property>
<name>dfs.name.dir</name>
<value>/mnt/hdfs/1/namenode,/mnt/hdfs/2/namenode,/mnt/hdfs/3/namenode,/mnt/hdfs/4/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/mnt/hdfs/1/datanode,/mnt/hdfs/2/datanode,/mnt/hdfs/3/datanode,/mnt/hdfs/4/datanode</value>
</property>
</configuration>

/etc/hadoop-0.20/conf.MyCluster/mapred-site.xml
Specify the JobTracker here, along with all the local directories for writing map/reduce (job-related) data. This is
used by the TaskTracker service on all the worker nodes. Change HEADNODE to the name of your machine that
runs the JobTracker service. (In a small cluster, this machine is the same one that runs the NameNode service.)

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://HEADNODE:54311</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/mnt/hdfs/1/mapred,/mnt/hdfs/2/mapred,/mnt/hdfs/3/mapred,/mnt/hdfs/4/mapred</value>
</property>
</configuration>

Bringing the Cluster online


With the basic configuration finished above, the cluster is now ready to be brought online. First, we need to format
the NameNode to create an HDFS filesystem for our nodes to use as storage. This only needs to be done on the
NameNode.

Formatting the NameNode


[root@HEADNODE ~]# sudo -u hdfs hadoop namenode -format


11/12/07 04:55:59 INFO namenode.NameNode: STARTUP_MSG:


/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = HEADNODE/192.168.1.2
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2-cdh3u2
STARTUP_MSG: build = file:///tmp/topdir/BUILD/hadoop-0.20.2-cdh3u2 -r 95a824e4005b2a94fe1c11f1ef9db4c
************************************************************/
Format filesystem in /mnt/hdfs/1/namenode ? (Y or N) Y
Format filesystem in /mnt/hdfs/2/namenode ? (Y or N) Y
Format filesystem in /mnt/hdfs/3/namenode ? (Y or N) Y
Format filesystem in /mnt/hdfs/4/namenode ? (Y or N) Y
11/12/07 04:56:00 INFO util.GSet: VM type = 64-bit
11/12/07 04:56:00 INFO util.GSet: 2% max memory = 17.77875 MB
11/12/07 04:56:00 INFO util.GSet: capacity = 2^21 = 2097152 entries
11/12/07 04:56:00 INFO util.GSet: recommended=2097152, actual=2097152
11/12/07 04:56:00 INFO namenode.FSNamesystem: fsOwner=hdfs
11/12/07 04:56:00 INFO namenode.FSNamesystem: supergroup=supergroup
11/12/07 04:56:00 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/12/07 04:56:00 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=1000
11/12/07 04:56:00 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 mi
11/12/07 04:56:00 INFO common.Storage: Image file of size 110 saved in 0 seconds.
11/12/07 04:56:01 INFO common.Storage: Storage directory /mnt/hdfs/1/namenode has been successfully formatted.
11/12/07 04:56:01 INFO common.Storage: Storage directory /mnt/hdfs/2/namenode has been successfully formatted.
11/12/07 04:56:01 INFO common.Storage: Storage directory /mnt/hdfs/3/namenode has been successfully formatted.
11/12/07 04:56:01 INFO common.Storage: Storage directory /mnt/hdfs/4/namenode has been successfully formatted.
11/12/07 04:56:01 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at HEADNODE/192.168.1.2
************************************************************/

We can see from the output above that the NameNode has been successfully formatted. But nothing is running yet,
so let's start up our NameNode service on the head node. To start up services, first we'll need to fix permissions on
each node.

HDFS User Permissions


Set up permissions to allow the hdfs user to write log files here. Otherwise, the service may fail to start correctly.

chgrp hdfs /usr/lib/hadoop-0.20/


chmod g+rw /usr/lib/hadoop-0.20/

Start the NameNode service on the head node


service hadoop-0.20-namenode start

And on the worker nodes, start up the DataNode service.

service hadoop-0.20-datanode start

Give it a couple seconds to start up the HDFS filesystem. The nodes will connect, and the local storage of each
node will be added to the collective HDFS filesystem. Now we can create core directories.

First-time HDFS use: create core directories


# these paths are relative to the HDFS filesystem,
# so you can copy and paste this regardless of your physical directory layout
sudo -u hdfs hadoop fs -mkdir /mapred/system
sudo -u hdfs hadoop fs -chown mapred:hadoop /mapred/system
sudo -u hdfs hadoop dfs -mkdir /tmp
sudo -u hdfs hadoop dfs -chmod -R 1777 /tmp

Check that your nodes are all online and healthy


This will show a report overview for all nodes, and show you how much storage you have across the whole cluster.

sudo -u hdfs hadoop dfsadmin -report


sudo -u hdfs hadoop dfs -df

Start JobTracker and TaskTracker services


If all looks well, continue on to start the JobTracker service on the head node.

service hadoop-0.20-jobtracker start

And start the TaskTracker service on the worker nodes.

service hadoop-0.20-tasktracker start

You now have a fully-functional Hadoop cluster up and running! Check the cluster status on your local Hadoop
status pages:
http://localhost:50070
http://localhost:50030
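
As a final smoke test, you can run one of the example jobs that ships with the Hadoop packages. The jar path below is where the CDH3 hadoop-0.20 RPMs normally put it, so adjust if your layout differs:

# estimate pi with 4 map tasks of 1000 samples each - exercises HDFS and MapReduce
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 4 1000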

Comments (13)



Alan Reed 139 weeks ago


And start the TaskTracker service on the worker nodes.


service hadoop-0.20-jobtracker start
I think that this should be service hadoop-0.20-tasktracker start

Dak1n1 138 weeks ago


Fixed! Thanks.

kifal 132 weeks ago

Hi Chloe, Looks like the cloudera's installation guide link is broken !!



rashmi 124 weeks ago

Hi,
For apache hadoop-2.0.0-alpha installation on two linux machines, what should be values of fs.defaultFS and
dfs.name.dir and dfs.data.dir properties on both name nodes????
one machine hostname is rsi-nod-nsn1 and another one is rsi-nod-nsn2...
i want to make both as federated namenodes.. and both should be used as datanodes too..
i want to configure both federation and YARN.
what should be configuration changes for the same? i am not finding masters, mapred-site.xml, and hadoop-env.sh files in hadoopHome/etc/hadoop folder... how do i make changes for these files?
regards,
rashmi


dak1n1 120 weeks ago

hadoop-2.0.0-alpha ... that doesn't sound like Cloudera Hadoop. Apache Hadoop works differently and isn't
covered in this guide.
The core configuration options are listed above, so that covers 'fs.default.name' and 'dfs.data.dir'. (Though that

8 of 11

12/05/2014 02:17 PM

Cloudera Hadoop RHEL/CentOS 6 Install Guide - Dakini's Bliss

http://dak1n1.com/blog/9-hadoop-el6-install

was for version 0.22... it might be different in your version). I honestly don't use Hadoop anymore, so I don't
know.
But, the manual will show you all available configuration options, so that could be handy:
http://hadoop.apache.org/hdfs/docs/current/hdfs-d...
As far as the location of the configuration files, you can do:
rpm -qa |grep hadoop # find the package name
rpm -ql hadoop-package-name --configfiles
Here's an example of that, using 'httpd' as the package name:
[dakini@nibbana ~]$ sudo rpm -ql httpd --configfiles
/etc/httpd/conf.d/welcome.conf
/etc/httpd/conf/httpd.conf
/etc/httpd/conf/magic

Chris 96 weeks ago

Since I'm a raging Hadummy, is there a more detailed guide on how to partition each of the nodes? Following the
steps above leads to pain, suffering and errors. (Specifically: "special device /dev/sdd1 does not exist")

dak1n1 95 weeks ago

oh no, it's very dangerous to copy/paste commands like that from the internet, unless you fully understand
what they do. '/dev/sdb' refers to a disk that you don't have, which means you're trying to run a command that
works on someone else's hardware, but not on yours.
When partitioning and mounting, you have to look at your particular hardware, and adjust the commands
accordingly. Otherwise you might find yourself destroying your data!
'sudo fdisk -l' will show you all the disks you have. I suggest googling around for a partitioning guide, since it's
important to learn what this stuff does before attempting to run it.

dani 91 weeks ago

I have followed the steps in this blog.


For the worker node I did following
01. added 4 partitions and formatted them with ext4, and changed fstab
02. used following commands
for num in 1 2 3 4; do mkdir /mnt/hdfs/$num/{namenode,datanode,mapred}; done
chown -R hdfs:hadoop /mnt/hdfs/
chown -R mapred:hadoop /mnt/hdfs/{1,2,3,4}/mapred
03. configured following files as per this blog;
/etc/hadoop-0.20/conf.MyCluster/core-site.xml
/etc/hadoop-0.20/conf.MyCluster/hdfs-site.xml
/etc/hadoop-0.20/conf.MyCluster/mapred-site.xml


for the head node I did the following:


03. configured following files as per this blog;
/etc/hadoop-0.20/conf.MyCluster/core-site.xml
/etc/hadoop-0.20/conf.MyCluster/hdfs-site.xml
/etc/hadoop-0.20/conf.MyCluster/mapred-site.xml
04. Issued the following command at head node
sudo -u hdfs hadoop namenode -format
but I got following error
Please note am testing on AWS using two EC2 instances and route 53.
http://pastebin.com/Xp0gZ9tR
Highly appreciate your help on to get this up and running

preeth 88 weeks ago

alternatives --display hadoop-0.20-conf


for this cmd i am getting manual conf
how to solve

Chris 84 weeks ago

Nice guide, thanks - worked for me



Chris 81 weeks ago

Now I am getting some file permission issues. I installed as root but my /mnt/hdfs etc (I created with same
names as you) are
drwxr-xr-x 5 hdfs hadoop 4096 Apr 20 19:15 1
Would you recommend adding root to hadoop group and changing -R 775?

dak1n1 81 weeks ago


No, I wouldn't add root to any groups, because the hadoop services don't run as root. They run as 'hdfs' and
'mapred'.
See the permissions section above. It worked for me every time during my hadoop installs 1+ years ago.
# hdfs user must own the /mnt/hdfs directory
chown -R hdfs:hadoop /mnt/hdfs/
# mapred user must own the mapred directories
chown -R mapred:hadoop /mnt/hdfs/{1,2,3,4}/mapred


Though maybe you're talking about the HDFS filesystem itself having permissions issues. If that's the case,
read the section labeled "First-time HDFS use: create core directories".
Otherwise maybe check your logs for a more detailed error message and see the install manual for your
Hadoop version. It's possible that things may have changed, since this guide is a year old.

sarthak 51 weeks ago

my tasktracker is not running



