
Cloudera Hadoop RHEL/CentOS 6 Install Guide



Created on 15 February 2012


Dakini

This guide contains everything you need to get a basic Hadoop cluster up and
running. It is intended as a condensed and easy-to-understand supplement to
the official documentation, so lengthy descriptions are omitted. For the full
documentation, see Cloudera's Install Guide.

Whether you want to start with a basic two-node cluster, or add hundreds or even thousands of nodes, the concepts
here apply. Adding nodes can be done at any time without interrupting the cluster's workflow, so as long as you have
two machines, you're ready to begin installing.

About hardware:
Hadoop was designed to be used on commodity hardware, so you won't need anything special for this project. You'll
want 2 or more reasonably fast, modern servers. (I'm using 9 SuperMicro boxes I happened to have lying around.)
Mine have dual Xeon processors, 48G RAM, and 4 x 7200 RPM SATA disks in JBOD mode. JBOD is recommended over
RAID for Hadoop, since Hadoop has its own built-in redundancy which performs better with plain disks.

Unlike High-Availability clustering, an HPC cluster like Hadoop does not require any special fencing hardware. It
handles hardware failure by simply not giving jobs to broken/misbehaving nodes. If a node fails a certain number of
jobs, it's out.

In a Hadoop cluster, the only type of hardware failure that would cause any noticeable disruption is the possible
failure of the NameNode. This is why it's always good to have a Secondary NameNode on standby.

Hadoop Clusters - Core concepts


Here are a few core concepts that will help you understand what you're about to build. For a very small cluster (let's
say, 9 nodes or less), your cluster will consist of these types of nodes:
1. Head node. Runs the NameNode service and JobTracker service.
2. Worker nodes. All other nodes in the cluster will run DataNode and TaskTracker services.
A larger cluster is almost identical to this, but generally they use a separate machine for the JobTracker service. It's
also common to add a Secondary NameNode for redundancy. So in that scenario you'd have:
1. NameNode machine.
2. Secondary NameNode machine.
3. JobTracker machine.
4. Worker machines, each running DataNode + TaskTracker services.
A brief definition of these components:
NameNode: Stores all metadata for the HDFS filesystem.
DataNodes: Worker nodes that store and retrieve data when told to (by clients or the NameNode).
TaskTrackers: Run tasks and send progress reports to the JobTracker.
JobTracker: Coordinates all jobs and schedules them to run on TaskTracker nodes.
HDFS: Hadoop Distributed File System. An HDFS cluster consists of a NameNode + DataNodes. All Hadoop IO
happens through this. Built for storing very large files across many machines.

Hadoop Installation
Now that you have a little background on this software, we can begin installing. The first thing you'll need is Java
JDK 1.6 u8 or higher. You might also want to use a tool like clusterssh to ssh into all your nodes at once to perform
this installation.
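
If you do use clusterssh, it broadcasts whatever you type to every host at once, which makes the per-node steps below much quicker. A quick sketch (the hostnames are placeholders for your own nodes):

# one window per node, keystrokes go to all of them
cssh headnode worker01 worker02 worker03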

Install Java on each node


# grab the latest java version - probably not this one anymore ;)
wget http://download.oracle.com/otn-pub/java/jdk/7u1-b08/jdk-7u1-linux-x64.rpm
rpm -Uvh jdk-7u1-linux-x64.rpm
alternatives --install /usr/bin/java java /usr/java/latest/bin/java 1600
alternatives --auto java

Disable SELinux
setenforce 0
vim /etc/sysconfig/selinux
SELINUX=disabled
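
If you'd rather not open an editor, the same persistent change can be made with sed. This assumes the stock RHEL/CentOS 6 layout, where the real file is /etc/selinux/config:

# set SELINUX=disabled without opening an editor
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config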

Allow communication between nodes in IPtables


Either disable IPtables, allow all communication between nodes, or open Hadoop-specific ports.
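
The simplest approaches are to trust all traffic from the cluster's subnet, or to turn IPtables off entirely on cluster-facing hosts. A rough sketch, assuming your nodes live on 192.168.1.0/24 (adjust for your network):

# trust everything from the cluster subnet (example subnet - change it)
iptables -I INPUT -s 192.168.1.0/24 -j ACCEPT
service iptables save

# ...or just disable the firewall on these hosts
service iptables stop
chkconfig iptables off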

Set up the Cloudera yum repo


wget http://archive.cloudera.com/redhat/6/x86_64/cdh/cdh3-repository-1.0-1.noarch.rpm
yum --nogpgcheck localinstall cdh3-repository-1.0-1.noarch.rpm
rpm --import http://archive.cloudera.com/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

On the head node, install NameNode and JobTracker packages


yum -y install hadoop-0.20-namenode hadoop-0.20-jobtracker

On the worker nodes, install DataNode and TaskTracker packages


yum -y install hadoop-0.20-datanode hadoop-0.20-tasktracker

Use alternatives to set up your custom cluster config


Setting up your cluster like this will allow you to keep multiple cluster configurations handy, and makes switching
between them so simple!

cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.MyCluster
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.MyCluster 50
alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.MyCluster

Verify that it's using MyCluster config instead of the default

[root@nodes ~]# alternatives --display hadoop-0.20-conf


hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.MyCluster
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.MyCluster - priority 50
Current `best' version is /etc/hadoop-0.20/conf.MyCluster.

Set up Hadoop storage disks on each node


In a Hadoop cluster, it is ideal to have multiple plain-disk storage mounts on each node. In this example, I'm using 4
plain disks, formatted ext4.
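
If your disks aren't partitioned and formatted yet, it looks roughly like this. The device names below are only examples; run fdisk -l first and substitute your own devices, or you risk destroying data:

# check which disks you actually have before formatting anything
fdisk -l

# example only - format one partition per data disk as ext4
mkfs.ext4 /dev/sdb1
mkfs.ext4 /dev/sdc1
mkfs.ext4 /dev/sdd1
mkfs.ext4 /dev/sde1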

mkdir -p /mnt/hdfs/{1..4}

Add the new disks to /etc/fstab, ensuring that they're mounted with noatime. (This prevents reads from turning into
unnecessary writes, which is generally good for performance.)

vim /etc/fstab


# append the new disks


/dev/sdb1 /mnt/hdfs/1 ext4 noatime 0 0
/dev/sdc1 /mnt/hdfs/2 ext4 noatime 0 0
/dev/sdd1 /mnt/hdfs/3 ext4 noatime 0 0
/dev/sde1 /mnt/hdfs/4 ext4 noatime 0 0

Mount the new disks and create Hadoop directories


The directories we're creating here correspond with Hadoop configuration options we'll be setting
later: dfs.name.dir, dfs.data.dir, mapred.local.dir.

mount /mnt/hdfs/1
mount /mnt/hdfs/2
mount /mnt/hdfs/3
mount /mnt/hdfs/4

# create the namenode, datanode, and mapred dirs on each disk


for num in 1 2 3 4; do mkdir /mnt/hdfs/$num/{namenode,datanode,mapred}; done

Set directory permissions


This part is very important! You'll run into errors later if these dirs are owned by the wrong user.

# make sure everything is owned by hdfs:hadoop


chown -R hdfs:hadoop /mnt/hdfs/
# ...except for the mapred dirs
chown -R mapred:hadoop /mnt/hdfs/{1,2,3,4}/mapred

Hadoop Core Configuration


Now that the underlying directories and storage devices are set up, we're ready to configure Hadoop.
These configuration files are not node-specific, so you can write them once and copy them to all nodes. Change
these config examples so that HEADNODE is actually the name of your head node.
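
Once the three files below are written, a small loop can push the finished config directory out to every node. The hostnames here are placeholders; substitute your own machines:

# copy the MyCluster config dir to each node (placeholder hostnames)
for host in worker01 worker02 worker03; do
  scp -r /etc/hadoop-0.20/conf.MyCluster $host:/etc/hadoop-0.20/
done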
/etc/hadoop-0.20/conf.MyCluster/core-site.xml
This config file will tell Hadoop where to find the NameNode and its default file system.

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://HEADNODE:54310</value>
</property>

</configuration>

/etc/hadoop-0.20/conf.MyCluster/hdfs-site.xml
This is where we tell Hadoop to use the directories we created earlier. It specifies local storage on each node, used
by the DataNodes and NameNode services to store HDFS data.

<configuration>
<property>
<name>dfs.name.dir</name>
<value>/mnt/hdfs/1/namenode,/mnt/hdfs/2/namenode,/mnt/hdfs/3/namenode,/mnt/hdfs/4/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/mnt/hdfs/1/datanode,/mnt/hdfs/2/datanode,/mnt/hdfs/3/datanode,/mnt/hdfs/4/datanode</value>
</property>
</configuration>

/etc/hadoop-0.20/conf.MyCluster/mapred-site.xml
Specify the JobTracker here, along with all the local directories for writing map/reduce (job-related) data. This is
used by the TaskTracker service on all the worker nodes. Change HEADNODE to the name of your machine that
runs the JobTracker service. (In a small cluster, this machine is the same one that runs the NameNode service.)

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://HEADNODE:54311</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/mnt/hdfs/1/mapred,/mnt/hdfs/2/mapred,/mnt/hdfs/3/mapred,/mnt/hdfs/4/mapred</value>
</property>
</configuration>

Bringing the Cluster online


With the basic configuration finished above, the cluster is now ready to be brought online. First, we need to format
the NameNode to create an HDFS filesystem for our nodes to use as storage. This only needs to be done on the
NameNode.

Formatting the NameNode


[root@HEADNODE ~]# sudo -u hdfs hadoop namenode -format


11/12/07 04:55:59 INFO namenode.NameNode: STARTUP_MSG:


/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = HEADNODE/192.168.1.2
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2-cdh3u2
STARTUP_MSG: build = file:///tmp/topdir/BUILD/hadoop-0.20.2-cdh3u2 -r 95a824e4005b2a94fe1c11f1ef9db4c
************************************************************/
Format filesystem in /mnt/hdfs/1/namenode ? (Y or N) Y
Format filesystem in /mnt/hdfs/2/namenode ? (Y or N) Y
Format filesystem in /mnt/hdfs/3/namenode ? (Y or N) Y
Format filesystem in /mnt/hdfs/4/namenode ? (Y or N) Y
11/12/07 04:56:00 INFO util.GSet: VM type = 64-bit
11/12/07 04:56:00 INFO util.GSet: 2% max memory = 17.77875 MB
11/12/07 04:56:00 INFO util.GSet: capacity = 2^21 = 2097152 entries
11/12/07 04:56:00 INFO util.GSet: recommended=2097152, actual=2097152
11/12/07 04:56:00 INFO namenode.FSNamesystem: fsOwner=hdfs
11/12/07 04:56:00 INFO namenode.FSNamesystem: supergroup=supergroup
11/12/07 04:56:00 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/12/07 04:56:00 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=1000
11/12/07 04:56:00 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 mi
11/12/07 04:56:00 INFO common.Storage: Image file of size 110 saved in 0 seconds.
11/12/07 04:56:01 INFO common.Storage: Storage directory /mnt/hdfs/1/namenode has been successfully formatted.
11/12/07 04:56:01 INFO common.Storage: Storage directory /mnt/hdfs/2/namenode has been successfully formatted.
11/12/07 04:56:01 INFO common.Storage: Storage directory /mnt/hdfs/3/namenode has been successfully formatted.
11/12/07 04:56:01 INFO common.Storage: Storage directory /mnt/hdfs/4/namenode has been successfully formatted.
11/12/07 04:56:01 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at HEADNODE/192.168.1.2
************************************************************/

We can see from the output above that the NameNode has been successfully formatted. But nothing is running yet,
so let's start up our NameNode service on the head node. To start up services, first we'll need to fix permissions on
each node.

HDFS User Permissions


Set up permissions to allow the hdfs user to write log files here. Otherwise, the service may fail to start correctly.

chgrp hdfs /usr/lib/hadoop-0.20/


chmod g+rw /usr/lib/hadoop-0.20/

Start the NameNode service on the head node


service hadoop-0.20-namenode start

And on the worker nodes, start up the DataNode service.

service hadoop-0.20-datanode start

Give it a couple seconds to start up the HDFS filesystem. The nodes will connect, and the local storage of each
node will be added to the collective HDFS filesystem. Now we can create core directories.

First-time HDFS use: create core directories


# these paths are relative to the HDFS filesystem,
# so you can copy and paste this regardless of your physical directory layout
sudo -u hdfs hadoop fs -mkdir /mapred/system
sudo -u hdfs hadoop fs -chown mapred:hadoop /mapred/system
sudo -u hdfs hadoop dfs -mkdir /tmp
sudo -u hdfs hadoop dfs -chmod -R 1777 /tmp

Check that your nodes are all online and healthy


This will show a report overview for all nodes, and show you how much storage you have across the whole cluster.

sudo -u hdfs hadoop dfsadmin -report


sudo -u hdfs hadoop dfs -df

Start JobTracker and TaskTracker services


If all looks well, continue on to start the JobTracker service on the head node.

service hadoop-0.20-jobtracker start

And start the TaskTracker service on the worker nodes.

service hadoop-0.20-tasktracker start

You now have a fully-functional Hadoop cluster up and running! Check the cluster status on your local Hadoop
status pages:
http://localhost:50070
http://localhost:50030
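
As a final smoke test, you can run one of the example jobs that ships with the Hadoop packages. The jar path below is where the CDH3 hadoop-0.20 RPMs normally put it, so adjust if your layout differs:

# estimate pi with 4 map tasks of 1000 samples each - exercises HDFS and MapReduce
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 4 1000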

Comments (13)



Alan Reed 139 weeks ago


And start the TaskTracker service on the worker nodes.


service hadoop-0.20-jobtracker start
I think that this should be service hadoop-0.20-tasktracker start

Dak1n1 138 weeks ago


Fixed! Thanks.

kifal 132 weeks ago

Hi Chloe, Looks like the cloudera's installation guide link is broken !!



rashmi 124 weeks ago

Hi,
For apache hadoop-2.0.0-alpha installation on two linux machines, what should be values of fs.defaultFS and
dfs.name.dir and dfs.data.dir properties on both name nodes????
one machine hostname is rsi-nod-nsn1 and another one is rsi-nod-nsn2...
i want to make both as federated namenodes.. and both should be used as datanodes too..
i want to configure both federation and YARN.
what should be configuration changes for the same? i am not finding masters, mapred-site.xml, and hadoop-env.sh files in hadoopHome/etc/hadoop folder... how do i make changes for these files?
regards,
rashmi


dak1n1 120 weeks ago

hadoop-2.0.0-alpha ... that doesn't sound like Cloudera Hadoop. Apache Hadoop works differently and isn't
covered in this guide.
The core configuration options are listed above, so that covers 'fs.default.name' and 'dfs.data.dir'. (Though that

8 of 11

12/05/2014 02:17 PM

Cloudera Hadoop RHEL/CentOS 6 Install Guide - Dakini's Bliss

http://dak1n1.com/blog/9-hadoop-el6-install

was for version 0.22... it might be different in your version). I honestly don't use Hadoop anymore, so I don't
know.
But, the manual will show you all available configuration options, so that could be handy:
http://hadoop.apache.org/hdfs/docs/current/hdfs-d...
As far as the location of the configuration files, you can do:
rpm -qa |grep hadoop # find the package name
rpm -ql hadoop-package-name --configfiles
Here's an example of that, using 'httpd' as the package name:
[dakini@nibbana ~]$ sudo rpm -ql httpd --configfiles
/etc/httpd/conf.d/welcome.conf
/etc/httpd/conf/httpd.conf
/etc/httpd/conf/magic

Chris 96 weeks ago

Since I'm a raging Hadummy, is there a more detailed guide on how to partition each of the nodes? Following the
steps above leads to pain, suffering and errors. (Specifically: "special device /dev/sdd1 does not exist")

dak1n1 95 weeks ago

oh no, it's very dangerous to copy/paste commands like that from the internet, unless you fully understand
what they do. '/dev/sdb' refers to a disk that you don't have, which means you're trying to run a command that
works on someone else's hardware, but not on yours.
When partitioning and mounting, you have to look at your particular hardware, and adjust the commands
accordingly. Otherwise you might find yourself destroying your data!
'sudo fdisk -l' will show you all the disks you have. I suggest googling around for a partitioning guide, since it's
important to learn what this stuff does before attempting to run it.

dani 91 weeks ago

I have followed the steps in this blog.


For the worker node I did following
01. added 4 partitions and formatted them with ext4, and changed fstab
02. used following commands
for num in 1 2 3 4; do mkdir /mnt/hdfs/$num/{namenode,datanode,mapred}; done
chown -R hdfs:hadoop /mnt/hdfs/
chown -R mapred:hadoop /mnt/hdfs/{1,2,3,4}/mapred
03. configured following files as per this blog;
/etc/hadoop-0.20/conf.MyCluster/core-site.xml
/etc/hadoop-0.20/conf.MyCluster/hdfs-site.xml
/etc/hadoop-0.20/conf.MyCluster/mapred-site.xml


for the head node I did the following:


03. configured following files as per this blog;
/etc/hadoop-0.20/conf.MyCluster/core-site.xml
/etc/hadoop-0.20/conf.MyCluster/hdfs-site.xml
/etc/hadoop-0.20/conf.MyCluster/mapred-site.xml
04. Issued the following command at head node
sudo -u hdfs hadoop namenode -format
but I got following error
Please note am testing on AWS using two EC2 instances and route 53.
http://pastebin.com/Xp0gZ9tR
Highly appreciate your help on to get this up and running

preeth 88 weeks ago

alternatives --display hadoop-0.20-conf


for this cmd i am getting manual conf
how to solve

Chris 84 weeks ago

Nice guide, thanks - worked for me



Chris 81 weeks ago

Now I am getting some file permission issues. I installed as root but my /mnt/hdfs etc (I created with same
names as you) are
drwxr-xr-x 5 hdfs hadoop 4096 Apr 20 19:15 1
Would you recommend adding root to hadoop group and changing -R 775?

dak1n1 81 weeks ago


No, I wouldn't add root to any groups, because the hadoop services don't run as root. They run as 'hdfs' and
'mapred'.
See the permissions section above. It worked for me every time during my hadoop installs 1+ years ago.
# hdfs user must own the /mnt/hdfs directory
chown -R hdfs:hadoop /mnt/hdfs/
# mapred user must own the mapred directories
chown -R mapred:hadoop /mnt/hdfs/{1,2,3,4}/mapred


Though maybe you're talking about the HDFS filesystem itself having permissions issues. If that's the case,
read the section labeled "First-time HDFS use: create core directories".
Otherwise maybe check your logs for a more detailed error message and see the install manual for your
Hadoop version. It's possible that things may have changed, since this guide is a year old.

sarthak 51 weeks ago

my tasktracker is not running



