Aligning with the systems engineering team to propose and deploy new hardware
and software environments required for Hadoop and to expand existing
environments.
Working with data delivery teams to set up new Hadoop users. This job includes
setting up Linux users, setting up Kerberos principals and testing HDFS, Hive, Pig
and MapReduce access for the new users.
Cluster maintenance as well as creation and removal of nodes using tools like
Ganglia, Nagios, Cloudera Manager Enterprise, Dell Open Manage and other tools.
HDFS is a distributed storage filesystem. It runs on top of another filesystem such as ext3 or
ext4. To be efficient, an HDFS node (DataNode) must satisfy the following prerequisites:
It manages the state of an HDFS node and interacts with its blocks.
It needs a lot of I/O for processing and data transfer (it is I/O bound).
The critical components in this architecture are the NameNode and the Secondary
NameNode.
How HDFS manages its files
HDFS is optimized for the storage of large files. You write the file once and access it
many times. In HDFS, a file is split into several blocks. Each block is asynchronously
replicated in the cluster. Therefore, the client sends its files once and the cluster takes
care of replicating its blocks in the background.
A block is a contiguous area, a blob of data on the underlying filesystem. Its default size
is 64 MB, but it can be extended to 128 MB or even 256 MB, depending on your needs. The
block replication, which has a default factor of 3, is useful for two reasons:
Ensure data recovery after the failure of a node. Hard drives used for HDFS must
be configured in JBOD, not RAID.
Increase the number of maps that can work on a block during a MapReduce job
and therefore speed up processing.
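Once a cluster is running, you can see how a file was split into blocks and where the
replicas live with the hadoop fsck command (the path below is only a hypothetical example):
$ hadoop fsck /user/Srinu/sample.txt -files -blocks -locations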
The NameNode manages the metadata of the HDFS cluster. This includes the
filesystem metadata (file names, directories, etc.) and the location of the blocks of each file. The
filesystem structure is entirely mapped into memory.
In order to have persistence over restarts, two files are also used: fsimage, a snapshot of
the filesystem metadata, and edits, a log of every change made since that snapshot. Over
time the edits file grows and can slow down restarts. The Secondary NameNode's role is
to avoid this issue by regularly merging edits with fsimage, thus pushing a new fsimage
and resetting the content of edits. The trigger for this compaction process is configurable.
It can be a time interval (fs.checkpoint.period, one hour by default) or a size threshold on
the edits file (fs.checkpoint.size, 64 MB by default).
The following formula can be applied to know how much memory a NameNode needs:
<Needed memory in GB> = <total storage size in the cluster in MB> / <size of a block in MB> / 1,000,000
In other words, a rule of thumb is to consider that a NameNode needs about 1 GB per 1
million blocks.
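As an illustrative calculation (the cluster size below is an assumption, not a figure from this
guide): with 480 TB of raw storage and the default 64 MB block size, 480,000,000 MB / 64 MB
/ 1,000,000 = 7.5, i.e. roughly 7.5 million blocks and therefore about 7.5 GB of NameNode memory.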
Prerequisites
Install Java, version 1.6 or higher.
We will use a dedicated Hadoop user account for running Hadoop. While that's not
required, it is recommended because it helps to separate the Hadoop installation from
other software applications.
$ sudo addgroup Naresh
$ sudo adduser --ingroup Naresh Srinu
This will add the user Srinu and the group Naresh to your local machine.
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local
machine if you want to use Hadoop on it. For our single-node setup of Hadoop, we
therefore need to configure SSH access to localhost for the Srinu user.
1) We have to generate an SSH key for the Srinu user.
user@ubuntu:~$ su - Srinu
Srinu@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/Srinu/.ssh/id_rsa):
Created directory '/home/Srinu/.ssh'.
Your identification has been saved in /home/Srinu/.ssh/id_rsa.
Your public key has been saved in /home/Srinu/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 Srinu@ubuntu
The key's randomart image is:
[...snipp...]
Srinu@ubuntu:~$
The second line will create an RSA key pair with an empty password. Generally, using an
empty password is not recommended, but in this case it is needed to unlock the key
without your interaction (you don't want to enter the passphrase every time Hadoop
interacts with its nodes).
2) We have to enable SSH access to your local machine with this newly
created key.
Srinu@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine with
the Srinu user. The step is also needed to save your local machine's host key fingerprint
to the Srinu user's known_hosts file.
Srinu@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686
GNU/Linux
Ubuntu 12.04 LTS
[...snipp...]
Srinu@ubuntu:~$
Disabling IPv6
One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related
Hadoop configuration options will result in Hadoop binding to the IPv6
addresses of my Ubuntu box. In my case, I realized that there's no practical point in
enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I
simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.
To disable IPv6 on Ubuntu, open /etc/sysctl.conf in the editor of your choice and add
the following lines to the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have to reboot your machine in order to make the changes take effect.
You can check whether IPv6 is enabled on your machine with the following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled, a value of 1 means it is disabled (that's what we
want).
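Alternatively, IPv6 can be disabled for Hadoop only (a narrower approach) by telling the
JVM to prefer the IPv4 stack. Add the following line to conf/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true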
Hadoop Installation
Download Hadoop from the Apache Download Mirrors and extract the
contents of the Hadoop package to a location of your choice. I
picked /usr/local/hadoop. Make sure to change the owner of all the files to
the Srinu user and Naresh group, for example:
$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R Srinu:Naresh hadoop
Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user Srinu. If you use a
shell other than bash, you should of course update its appropriate configuration files
instead of .bashrc.
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
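It also helps to have Hadoop's bin directory on the PATH so that the hadoop command
used by the aliases above is found without typing the full path:
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin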
Configuration
hadoop-env.sh
The only required environment variable we have to configure for Hadoop in this tutorial
is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used
the installation path in this tutorial, the full path
is /usr/local/hadoop/conf/hadoop-env.sh) and set
the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.
Change
conf/hadoop-env.sh
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
To
conf/hadoop-env.sh
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
conf/*-site.xml
In this section, we will configure the directory where Hadoop will store its data files, the
network ports it listens to, etc. Our setup will use Hadoop's Distributed File
System, HDFS, even though our little cluster only contains our single local machine.
You can leave the settings below as they are, with the exception of
the hadoop.tmp.dir parameter, which you must change to a directory of
your choice. We will use the directory /app/hadoop/tmp.
Now we create the directory and set the required ownerships and permissions:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown Srinu:Naresh /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp
Add the following snippets between the <configuration> ... </configuration> tags in the
respective configuration XML file.
In file conf/core-site.xml:
conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
In file conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
In file conf/hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created. The default is
used if replication is not specified at create time.
</description>
</property>
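Before starting the cluster for the first time, the HDFS filesystem has to be formatted via
the NameNode (do this only on a fresh installation, as it erases all data in HDFS):
Srinu@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
Afterwards, start the single-node cluster with bin/start-all.sh: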
Srinu@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-Srinu-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-Srinu-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-Srinu-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-Srinu-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-Srinu-tasktracker-ubuntu.out
Srinu@ubuntu:/usr/local/hadoop$
JPS
Srinu@ubuntu:/usr/local/hadoop$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode
If there are any errors, examine the log files in the /logs/ directory.
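As a quick smoke test (assuming the examples jar shipped with Hadoop 1.0.3), you can run
one of the bundled MapReduce examples, for instance the pi estimator with 2 maps and 10
samples per map:
Srinu@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.0.3.jar pi 2 10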
Stopping Single Node Cluster.
Srinu@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
Srinu@ubuntu:/usr/local/hadoop$
Prerequisites:
Configure single-node clusters first. Use the same settings (e.g., installation locations and
paths) on both machines, otherwise you might run into problems later when we migrate
the two machines to the final multi-node cluster setup. Now that you have two
single-node clusters up and running, we will modify the Hadoop configuration to make
one Ubuntu box the master (which will also act as a slave) and the other Ubuntu box a slave.
The easiest way is to put both machines in the same network with regard to hardware
and software configuration, for example connect both machines via a single hub or
switch and configure the network interfaces to use a common network such
as 192.168.0.x/24.
To make it simple, we will assign the IP address 192.168.0.1 to the master machine
and 192.168.0.2 to the slave machine. Update /etc/hosts on both machines with the
following line:
/etc/hosts (for master and slave)
192.168.0.1    master
192.168.0.2    slave
SSH Access:
The Srinu user on the master (aka Srinu@master) must be able to connect
1) To its own user account on the master i.e. ssh master in this context and not
necessarily ssh localhost.
2) To the Srinu user account on the slave (aka Srinu@slave) via a password-less
SSH login
You just have to add Srinu@master's public SSH key (which should be
in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of Srinu@slave (in this
user's $HOME/.ssh/authorized_keys). You can do this manually or use
Srinu@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub Srinu@slave
This command will prompt you for the login password for user Srinu on slave, then
copy the public SSH key for you, creating the correct directory and fixing the
permissions as necessary.
The final step is to test the SSH setup by connecting with user Srinu from the master to
the user account Srinu on the slave. The step is also needed to save the slave's host key
fingerprint to Srinu@master's known_hosts file.
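For example (output abbreviated; the host key prompt appears only on the first connection):
Srinu@master:~$ ssh master
Srinu@master:~$ ssh slave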
Hadoop Cluster
We will see how to configure one Ubuntu box as a master node and the other Ubuntu
box as a slave node. The master node will also act as a slave because we only have two
machines available in our cluster but still want to spread data storage and processing to
multiple machines.
The master node will run the master daemons for each layer: NameNode for the HDFS
storage layer, and JobTracker for the MapReduce processing layer. Both machines will
run the slave daemons: DataNode for the HDFS layer, and TaskTracker for the
MapReduce processing layer. Basically, the master daemons are responsible for
coordination and management of the slave daemons while the latter will do the actual
data storage and data processing work.
Configuration
On the master, the conf/masters file defines on which machine the SecondaryNameNode is
started, and the conf/slaves file lists the hosts on which the slave daemons (DataNode and
TaskTracker) are started, respectively (the primary NameNode and the JobTracker will be
started on the same machine on which you run bin/start-all.sh).
To start individually...
bin/hadoop-daemon.sh start [namenode | secondarynamenode | datanode |
jobtracker | tasktracker]
The conf/slaves file on master is used only by scripts like bin/start-dfs.sh
or bin/stop-dfs.sh. For example, if you want to add DataNodes on the fly you can
manually start the DataNode daemon on a new slave machine via bin/hadoop-daemon.sh
start datanode. Using the conf/slaves file on the master simply helps you to
make full cluster restarts easier.
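As a sketch of what this looks like for our two-machine setup (assuming the hostnames
master and slave from /etc/hosts above), the two files on master would contain:
conf/masters
master
conf/slaves
master
slave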
In file conf/core-site.xml (on all machines):
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
In file conf/mapred-site.xml (on all machines):
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
The default value of dfs.replication is 3. However, we have only two nodes available, so
we set dfs.replication to 2.
In file conf/hdfs-site.xml (on all machines):
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified at create time.
</description>
</property>
The HDFS name table is stored on the NameNode's (here: master) local filesystem in
the directory specified by dfs.name.dir. The name table is used by the NameNode to
store tracking and coordination information for the DataNodes.
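A minimal sketch of how dfs.name.dir could be set in conf/hdfs-site.xml on the master
(the directory /app/hadoop/name is only an illustrative choice, not a path from this guide):
<property>
<name>dfs.name.dir</name>
<value>/app/hadoop/name</value>
<description>Directory on the local filesystem where the NameNode stores
the name table (fsimage and edits).</description>
</property>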
1. We begin with starting the HDFS daemons: the NameNode daemon is started
on master, and DataNode daemons are started on all slaves (here: master and slave).
2. Then we start the MapReduce daemons: the JobTracker is started on master, and
TaskTracker daemons are started on all slaves (here: master and slave).
HDFS daemons
Run the command bin/start-dfs.sh on the machine you want the (primary)
NameNode to run on. This will bring up HDFS with the NameNode running on the
machine you ran the previous command on, and DataNodes on the machines listed in
the conf/slaves file. In our case, we will run bin/start-dfs.sh on master:
Java Processes running on Master after bin/start-dfs.sh
Srinu@master:/usr/local/hadoop$ jps
14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode
Srinu@master:/usr/local/hadoop$
Srinu@slave:/usr/local/hadoop$ jps
15183 DataNode
15897 TaskTracker
16284 Jps
Srinu@slave:/usr/local/hadoop$
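MapReduce daemons
Run the command bin/start-mapred.sh on the machine you want the JobTracker to run
on. This will bring up the MapReduce layer with the JobTracker running on that machine
and TaskTracker daemons on the machines listed in the conf/slaves file. In our case, we
run it on master:
Srinu@master:/usr/local/hadoop$ bin/start-mapred.sh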
One of the most attractive features of the Hadoop framework is its utilization of commodity
hardware. However, this leads to frequent DataNode crashes in a Hadoop cluster.
Another striking feature of the Hadoop framework is the ease of scaling in accordance with
the rapid growth in data volume. Because of these two reasons, one of the most common
tasks of a Hadoop administrator is to commission (add) and decommission (remove)
DataNodes in a Hadoop cluster.
The first task is to update the exclude files for both HDFS (hdfs-site.xml) and
MapReduce (mapred-site.xml).
The exclude file:
For the JobTracker, it contains the list of hosts that should be excluded by the JobTracker.
If the value is empty, no hosts are excluded.
For the NameNode, it contains a list of hosts that are not permitted to connect to the
NameNode.
Here is the sample configuration for the exclude file in hdfs-site.xml and
mapred-site.xml:
hdfs-site.xml
<property>
<name>dfs.hosts.exclude</name>
<value>/home/hadoop/excludes</value>
<final>true</final>
</property>
mapred-site.xml
<property>
<name>mapred.hosts.exclude</name>
<value>/home/hadoop/excludes</value>
<final>true</final>
</property>
Note: The full pathname of the files must be specified.
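After the hostnames to be decommissioned have been added to the exclude files, the
NameNode and the JobTracker must re-read their host lists. A sketch of the usual
Hadoop 1.x commands (run from the master as the Hadoop user):
$ hadoop dfsadmin -refreshNodes
$ hadoop mradmin -refreshNodes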
After updating the exclude files and refreshing the node lists as we did in the above steps,
the master will recognize the decommissioned node automatically and declare it dead.
There is no need to follow the same process for removing the TaskTracker, because it is
not as crucial as the DataNode: the DataNode contains the data that you want to remove
safely, without any loss of data.
The TaskTracker can be started or shut down on the fly at any point in time with the
following command.
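A sketch of the usual per-daemon commands, run on the node in question (the same
hadoop-daemon.sh script shown earlier):
$ bin/hadoop-daemon.sh start tasktracker
$ bin/hadoop-daemon.sh stop tasktracker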
The main challenge in running a Hadoop cluster comes from maintenance itself. We will
point out some of the common problems we face every day.
EBay apparently has a large cluster: "Amr Awadallah said eBay have the third largest
Hadoop cluster in existence holding a few petabytes of data and move data between it
and a traditional data warehouse."
These logs are created by the Hadoop daemons, and exist on all machines running at
least one Hadoop daemon. Some of the files end with .log, and others end with .out.
The .out files are only written to when daemons are starting. After daemons have started
successfully, the .out files are truncated. By contrast, all log messages can be found in
the .log files, including the daemon start-up messages that are sent to the .out files.
There is a .log and .out file for each daemon running on a machine. When the
namenode, jobtracker, and secondary namenode are running on the same machine,
then there are six daemon log files: a .log and .out for each of the three daemons.
The .log and .out file names are constructed as follows:
hadoop-<user-running-hadoop>-<daemon>-<hostname>.log
where <user-running-hadoop> is the user running the Hadoop daemons, <daemon>
is the daemon these logs are associated with (for example, namenode or jobtracker), and
<hostname> is the hostname of the machine on which the daemons are running.
For example:
hadoop-hadoop-datanode-ip-10-251-30-53.log
By default, the .log files are rotated daily by log4j. This is configurable
with /etc/hadoop/conf/log4j.properties. Administrators of a Hadoop cluster should
review these logs regularly to look for cluster-specific errors and warnings that might
have to do with daemons running incorrectly. Note that the namenode and
secondarynamenode logs should not be deleted more frequently than
fs.checkpoint.period, so in the event of a secondarynamenode edits log compaction
failure, logs from the namenode and secondarynamenode will be available for
diagnostics.
These logs grow slowly when the cluster is idle. When jobs are running, they grow
very rapidly. Some problems create considerably more log entries, but some problems
only create a few infrequent messages. For example, if the jobtracker can't connect to
the namenode, the jobtracker daemon logs explode with the same error (something
like Retrying connecting to namenode [.]). Lots of log entries here do not
necessarily mean that there is a problem: you have to search through these logs to
look for a problem.
The job configuration XML logs are created by the jobtracker. The jobtracker creates
a .xml file for every job that runs on the cluster. These logs are stored in two
places: /var/log/hadoop and /var/log/hadoop/history. The XML file describes the job
configuration.
The /var/log/hadoop file names are constructed as follows:
Job_<JobID>_conf.xml
Job Statistics
These logs are created by the jobtracker. The jobtracker writes runtime statistics from jobs to
these files. Those statistics include task attempts, time spent shuffling, input splits
given to task attempts, start times of task attempts and other information.
Standard Error
These logs are created by each tasktracker. They contain information written to
standard error (stderr) captured when a task attempt is run. These logs can be used
for debugging. For example, a developer can include System.err.println("some useful
information") calls in the job code. The output will appear in the standard error files.
The parent directory name for these logs is constructed as follows:
/var/log/hadoop/userlogs/attempt_<job-id>_<map-or-reduce>_<attempt-id>
where <job-id> is the ID of the job that this attempt is doing work for, <map-or-reduce>
is m if the task attempt was a mapper, or r if the task attempt was a reducer, and
<attempt-id> is the ID of the task attempt.
For example:
/var/log/hadoop/userlogs/attempt_200908190029_001_m_00001_0
These logs are rotated according to the mapred.userlog.retain.hours property. You can
clear these logs periodically without affecting Hadoop. However, consider archiving
the logs if they are of interest in the job development process. Make sure you do not
move or delete a file that is being written to by a running job.
Command                              Description
hadoop namenode -format              Format the HDFS filesystem (run on the NameNode of a new cluster).
hadoop namenode -upgrade             Start the NameNode with the upgrade option after installing a new Hadoop version.
start-dfs.sh                         Start the HDFS daemons (NameNode, SecondaryNameNode and DataNodes).
stop-dfs.sh                          Stop the HDFS daemons.
start-mapred.sh                      Start the MapReduce daemons (JobTracker and TaskTrackers).
stop-mapred.sh                       Stop the MapReduce daemons.
hadoop namenode -recover -force      Attempt to recover corrupted NameNode metadata.
Command                              Description
hadoop fsck /                        Check the health of the HDFS filesystem and report missing or corrupt blocks.
Command                              Description
hadoop job -kill-task <task-id>      Kill a task
Command                                      Description
hadoop dfsadmin -report                      Report basic filesystem information and statistics for each DataNode.
hadoop dfsadmin -metasave file.txt           Save the NameNode's block and metadata information to a file in the log directory.
hadoop dfsadmin -setQuota 10 /quotatest      Set a name quota (maximum number of files and directories) on a directory.
hadoop dfsadmin -clrQuota /quotatest         Clear Hadoop directory quota
hadoop dfsadmin -refreshNodes                Re-read the hosts and exclude files (used when commissioning or decommissioning nodes).
hadoop fs -count -q /mydir                   Show the quotas and current usage for a directory.
hadoop dfsadmin -setSpaceQuota 100M /mydir   Set a space quota of 100 MB on a directory.
hadoop dfsadmin -clrSpaceQuota /mydir        Clear the space quota on a directory.
hadoop dfsadmin -saveNamespace               Save the current namespace to fsimage and reset the edits log (requires safe mode).
The following dfsadmin commands help the cluster to enter or leave safe mode, which
is also called maintenance mode. In this mode, the NameNode does not accept any
changes to the namespace; it does not replicate or delete blocks.
Command                              Description
hadoop dfsadmin -safemode enter      Put the NameNode into safe mode.
hadoop dfsadmin -safemode leave      Take the NameNode out of safe mode.
hadoop dfsadmin -safemode get        Report whether safe mode is on or off.
File                Description
hadoop-env.sh       Environment variables used by the Hadoop scripts (e.g. JAVA_HOME).
core-site.xml       Core settings such as hadoop.tmp.dir and fs.default.name.
hdfs-site.xml       HDFS settings such as dfs.replication and dfs.name.dir.
mapred-site.xml     MapReduce settings such as mapred.job.tracker.
masters             Host(s) on which the SecondaryNameNode daemon is started.
slaves              Hosts on which the slave daemons (DataNode and TaskTracker) are started.
Command                                      Description
hadoop mradmin -refreshUserToGroupsMappings  Reload MapReduce configuration: refresh the user-to-groups mappings used by the JobTracker.
Command                                                Description
start-balancer.sh                                      Run the HDFS balancer to redistribute blocks evenly across DataNodes.
hadoop dfsadmin -setBalancerBandwidth <bandwidthinbytes>   Set the maximum bandwidth (in bytes per second) each DataNode may use during block balancing.
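For example (illustrative value: 10485760 bytes per second is 10 MB/s per DataNode):
$ hadoop dfsadmin -setBalancerBandwidth 10485760
$ start-balancer.sh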