
HDFS Assignment

Q1) What's the purpose of a distributed file system such as HDFS or AFS?
A distributed file system like AFS or HDFS manages storage across multiple machines,
so the storage capacity is not restricted by the physical limits of a single machine.
By pooling multiple machines into a cluster, very large data sets can be handled.
Regular file systems like ext3, ext4, NTFS and FAT32, on the other hand, manage the
storage of a single machine and are limited by its physical capacity.
Q2) What are the design considerations behind HDFS architecture?
The design considerations for HDFS are
- Very large files (in TB and PB)
- Write once and read many times (as in analytics)
- High throughput rather than low latency (OLAP-style analytical workloads, not interactive/OLTP access)
- Using commodity machines to store the data
Q3) What are the different modes in which Hadoop can be run and what's the
purpose of these modes?
Standalone mode - For developing/debugging applications locally, e.g. from an IDE such as Eclipse
Pseudo Distributed mode - For debugging/testing applications with all the daemons running on a single machine
Fully Distributed mode - For running applications on a full cluster of machines, as in production
Q4) What's the disadvantage of running the NameNode and the DataNode on
the same machine?
The NameNode requires a machine with a lot of memory to store the metadata, and it
should be a reliable machine because the NameNode is a single point of failure. The
DataNode stores the data blocks on the hard disk, so it is recommended to use a machine
with a high hard disk capacity.
Since the purpose and the machine configuration of the NameNode and the DataNode are
different, it is recommended to run them on separate machines, as is done in production.
But for a POC, evaluation or development setup, running the different processes on a
single machine is fine.
Q4) What are the commands for
- deleting the log and data files
- cd $HADOOP_HOME
- rm -rf logs/*
- rm -rf data/*
- Formatting the HDFS
- hadoop namenode -format
- starting/stopping HDFS
- start-dfs.sh and stop-dfs.sh
- make sure that the required processes are running
- jps
Q5) What are the commands for
- copying a file to HDFS
- hadoop fs -put <src> <dest>
- getting the same file back from HDFS
- hadoop fs -get <src> <dest>
- verifying that the file which was put into HDFS is the same as the file that was got back
- diff <file1> <file2>
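For example, a full round trip to confirm the copy is identical (the file name and HDFS path below are hypothetical):
hadoop fs -put sales.csv /user/hadoop/sales.csv        # copy the local file into HDFS
hadoop fs -get /user/hadoop/sales.csv sales_copy.csv   # copy it back under a new local name
diff sales.csv sales_copy.csv                          # no output means the two files are identical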
Q6) What are the different ways for changing the block size in HDFS? What are
the commands and the configuration changes for the same?
a) The dfs.block.size property can be modified in hdfs-site.xml.
b) When putting a file into HDFS, the block size can also be specified on the command line as
hadoop fs -Ddfs.block.size=1048576 -put <source> <target>
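As a rough sketch, a file could be stored with a 64 MB block size and the block layout checked afterwards (the file name and path are made up; 67108864 bytes = 64 MB):
hadoop fs -Ddfs.block.size=67108864 -put weblogs.txt /user/hadoop/weblogs.txt
hadoop fsck /user/hadoop/weblogs.txt -files -blocks    # shows how many blocks the file was split into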
Q7) What are the commands to kill the NameNode and what is the impact of
killing the NameNode?
The process id (pid) of the NameNode can be got by running the jps command. The
NameNode can then be killed with `kill -9 <pid>`. Because the NameNode is a single point
of failure, killing it makes the file system unavailable to clients until the NameNode is
started again; the DataNodes keep running, but no metadata operations can be served.
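A minimal sketch of the sequence:
jps                      # note the pid printed next to NameNode
kill -9 <pid>            # kill the NameNode process
jps                      # the NameNode should no longer be listed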
Q8) With a replication of 1, 2 and 3 how are blocks placed across racks and
nodes?
With a replication factor of 1, the single copy of the block is placed on the node where the
client is running (if it is a DataNode), otherwise on a lightly utilized node. With replication 2,
the second copy is placed on a node in a different rack. With replication 3, the third copy is
placed on the same rack as the second copy, but on a different machine.
Q9) How to increase the replication factor for files globally and also from file to
file?
a) The replication factor can be modified for existing files in HDFS as
hadoop fs -setrep 2 <source>
b) The dfs.replication property can also be set in hdfs-site.xml; this sets the default
replication factor for files created from then on.
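For example, option (a) can be applied to a single file or recursively to a directory (the paths are hypothetical):
hadoop fs -setrep -w 3 /user/hadoop/weblogs.txt    # -w waits until the extra replicas are actually created
hadoop fs -setrep -R 2 /user/hadoop/logs           # -R changes the replication for every file under a directory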
Q10) If a MapReduce job has 10 reducers and the job output has 10 files, what
are the different options for merging the files?
1) The -getmerge option supported by the hadoop fs shell can be used to merge the files
into a single local file.
2) The files can be fetched using the -get option supported by the hadoop fs shell. Once all
the files are in the local file system, they can be merged using the Linux cat command.
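A sketch of both options, assuming the job output sits under a hypothetical /user/hadoop/job-output directory:
hadoop fs -getmerge /user/hadoop/job-output merged.txt   # option 1: merge directly into one local file
hadoop fs -get /user/hadoop/job-output/part-* .          # option 2: fetch the part files locally...
cat part-* > merged.txt                                  # ...and merge them with cat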
Q11) How to change the files at arbitrary locations in HDFS? And what are the
cons of it?
HDFS doesn't support changing a file at an arbitrary location; files can only be appended
to. One way around this is to (see the sketch below)
- get the file from HDFS using the -get command
- modify the file on the local file system
- put the file back into HDFS using the -put command
The downside is that the entire file has to be copied out of HDFS and written back, which
is expensive for large files.
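A minimal sketch of that round trip, assuming a hypothetical file /user/hadoop/config.txt:
hadoop fs -get /user/hadoop/config.txt .           # copy the file out of HDFS
vi config.txt                                      # edit it locally
hadoop fs -rm /user/hadoop/config.txt              # remove the old copy (put will not overwrite an existing file)
hadoop fs -put config.txt /user/hadoop/config.txt  # put the modified file back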
Q12) What are the different ways to add a DataNode to a Hadoop cluster?
a) Add the DataNode's hostname to the slaves file and start the DataNode on that machine
with `bin/hadoop-daemon.sh start datanode`.
b) Add the DataNode's hostname to the slaves file and restart HDFS.
Q13) Does HDFS consider the record boundaries while splitting the data into
blocks?
HDFS doesn't consider record boundaries while splitting the data into blocks. The data in
HDFS is exactly split at the block size. So, a record might be split across multiple blocks.
Q14) How to encrypt/compress the data in HDFS?
HDFS doesn't support transparent encryption/compression of the data, so the data has to
be manually encrypted/compressed before it is put into HDFS.
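A minimal sketch, assuming a local file named data.csv (gzip for compression and openssl for encryption; any equivalent tools would do):
gzip data.csv                                                         # compress locally, producing data.csv.gz
openssl enc -aes-256-cbc -salt -in data.csv.gz -out data.csv.gz.enc   # optionally also encrypt it
hadoop fs -put data.csv.gz.enc /user/hadoop/                          # store the compressed, encrypted file in HDFS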
Q15) What happens when a DataNode doesn't send a Heart Beat to the NameNode?
When a DataNode doesn't send a Heart Beat for some time, the NameNode waits for an
additional grace period and then assumes that the DataNode is dead. The NameNode then
re-replicates the blocks that were on that node to other nodes, to maintain the proper
replication factor of the blocks.
Q16) How does HDFS verify that a block of data got corrupted and what action
does it take when a block got corrupted?
HDFS stores a checksum (a signature) of each block along with the actual block. The
DataNode periodically verifies the blocks against their checksums. If there is a mismatch,
the block replica is assumed to be corrupted, the corrupt copy is discarded and the block is
re-replicated from a healthy replica on another DataNode.
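The health of the blocks can also be checked manually with fsck, for example (the path is just an example):
hadoop fsck / -files -blocks      # reports under-replicated, missing and corrupt blocks for everything under /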
Q17) How does Hadoop know which rack a machine belongs to?
Hadoop has to be made rack aware by setting the `net.topology.script.file.name` property
in core-site.xml to point to a shell script. The script takes one or more hostnames/IPs and
prints the rack to which each host belongs. Hadoop has to be made rack aware so that the
blocks can be placed to use the network efficiently and also to provide fault tolerance
across racks.
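A minimal sketch of such a topology script (the IP ranges and rack names are made up; the script must print one rack for every hostname/IP it is given):
#!/bin/bash
for host in "$@"; do
  case "$host" in
    10.0.1.*) echo "/rack1" ;;
    10.0.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done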
Q18) What is the purpose of installing/configuring ssh on the master/slave
nodes in HDFS?
The start-up scripts in the bin folder use ssh to log in to the remote machines in the cluster
and manage (start/stop) the services there. The ssh client has to be installed on the master
and the ssh server has to be installed on the slaves.
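The usual way to let the scripts log in without prompting for a password is key based ssh, for example (the user and host names are hypothetical):
ssh-keygen -t rsa -P ""          # generate a key pair with an empty passphrase
ssh-copy-id hadoop@slave1        # copy the public key to each slave
ssh hadoop@slave1                # verify that the login now works without a password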
Q19) What are the commands to install ssh server and client on Ubuntu (debian
based) and RHEL (RPM based) systems?
On Debian based systems
sudo apt-get install openssh-client
sudo apt-get install openssh-server

On RPM based systems
yum -y install openssh-clients
yum -y install openssh-server
Q20) What port number does ssh listen to by default?
ssh listens on port 22 by default.
Q21) What happens if during the put operation, the block is replicated only
once and not to the default replication factor three?
As long as `dfs.namenode.replication.min` replicas (which defaults to 1) have been written,
the client/gateway will get a success back. The remaining replications are done
asynchronously in the background.
Q21) Let's assume a block is replicated on node1, node2 and node3. From which
node will the block be read during the get operation and why?
Hadoop figures out which of those nodes is closest (in terms of network distance) to the
machine from which the data is being read and reads from the closest one first. This way
the network is used efficiently.
Q22) How does HDFS decide on which nodes a block will be replicated during a
put operation?
HDFS runs the block placement policy, which determines on which DataNodes the replicas
of a block are placed. Hadoop provides a pluggable framework for the block placement
policy. The default block placement policy has specific rules only for the first 3 replicas of
a block.
Q23) For an HDFS operation to happen, what has to be installed and configured
on the Gateway?
For an HDFS operation to happen from the Gateway, the following have to be installed and
configured on it:
- Linux
- Java
- ssh server
- Hadoop
Q24) What does an HDFS Client need to know about the HDFS cluster to perform
a file system operation?
The HDFS Client needs to know the NameNode hostname/IP and the port number to
perform a file system operation in HDFS. There is no need to know about the DataNode
details.
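For example, the NameNode can be pointed at explicitly on the command line with the -fs generic option (the host name and port below are just examples):
hadoop fs -fs hdfs://namenode-host:8020 -ls /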
Q25) Where does NameNode store the metadata?
The NameNode stores the metadata in memory as well as on the hard disk. It is kept in
memory for quick access and persisted to the hard disk for reliability.
Q26) Where does a DataNode store the blocks of data? What else is stored along
with a block of data?
The DataNode stores the blocks of data on the hard disk. Along with each block of data it
also stores some metadata about the block, such as its CRC checksum.
Q27) How do the start/stop scripts know where to start/stop the different
services related to HDFS?
The start/stop scripts go through the masters/slaves file to figure out where to start/stop
the different services. The masters file corresponds to the location of the Secondary
NameNode/CheckPointNode and the slaves file corresponds to the location of the
TaskTracker/DataNode.
Q28) What are the typical characteristics of machine to run a NameNode and
DataNode?
The NameNode is a Single Point Of Failure and stores the metadata in RAM, so it should
typically be a highly reliable machine with a lot of RAM. The DataNodes store all the
blocks on their hard disks, and there are many of them in a cluster, so the DataNodes
should be commodity machines with a lot of hard disk storage on each machine.
Q29) How does the NameNode know the location of the different DataNodes in
an HDFS cluster?
The slaves file has the list of the DataNodes in the cluster. Strictly, the slaves file is only
used by the start scripts to launch the DataNodes; the NameNode itself learns about the
DataNodes when they register with it and start sending heartbeats.
Q30) How does a DataNode know the location of the NameNode in an HDFS
cluster?
The NameNode hostname/IP and the port number are stored in core-site.xml. The
DataNode uses this to communicate with the NameNode.
Q31) What happens when some of the DataNodes are overloaded in terms of
block storage? How to rectify this?
When a DataNode is overloaded with blocks, there is a higher probability that a large share
of the reads will be directed to that particular DataNode, and reads from HDFS become
slow. As a solution, the Hadoop balancer script can be run to move blocks from the
overloaded nodes to the underloaded nodes.
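The balancer is run with the scripts shipped in the Hadoop distribution, for example:
start-balancer.sh -threshold 10   # move blocks until every DataNode is within 10% of the cluster average utilization
stop-balancer.sh                  # stop a running balancer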
Q32) What are the commands (rwx) to change the permissions on the files?
The HDFS chmod option can be used to change the rwx permission on the files.
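For example (the paths and modes are hypothetical):
hadoop fs -chmod 644 /user/hadoop/report.txt   # rw-r--r--
hadoop fs -chmod -R 755 /user/hadoop/shared    # apply the mode recursively to a directory tree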
Q33) How does the get command know where to get the blocks of data from?
The client/gateway interacts with the NameNode to get the information on which
DataNode the blocks are. Once it gets this information, it will directly contact the
DataNodes to get the actual blocks.
Q34) What does rwx permissions mean in case of files and folders in HDFS?
The read permission is required to read files or list the contents of a directory. The write
permission is required to write a file, or for a directory, to create or delete files or
directories in it. The execute permission is ignored for a file because you can't execute a
file on HDFS (unlike POSIX), and for a directory this permission is required to access its
children.
Q35) What are the commands to modify the user/group to which a file belongs
in HDFS?
The Hadoop chown and chgrp options can be used to modify the user/group to which a
file belongs in HDFS.
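For example (the user, group and paths are hypothetical):
hadoop fs -chown analyst /user/hadoop/report.txt            # change the owner
hadoop fs -chown analyst:analytics /user/hadoop/report.txt  # change owner and group together
hadoop fs -chgrp -R analytics /user/hadoop/shared           # change the group recursively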
Q36) What is the difference between copyFromLocal and the put option in HDFS
`hadoop fs` options?
The HDFS put option supports multiple source file systems, but with copyFromLocal the
source is restricted to the local file reference.
Q37) What is the difference between rm and rmr option in HDFS `hadoop fs`
options?
The rm option deletes files (and empty directories), while rmr deletes a directory and all of
its contents recursively.
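For example (the paths are hypothetical):
hadoop fs -rm /user/hadoop/old.txt      # deletes a single file
hadoop fs -rmr /user/hadoop/old-logs    # deletes the directory and everything under it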
Q38) What are the different options for looking at the contents of a file in HDFS?
a) The hadoop fs -cat option.
b) The HDFS web interface (browsing the file system through the NameNode web UI).
c) The file can be fetched using the -get option and then viewed on the local file system.
Q39) What are the different ways of creating data in HDFS?
The data can be inserted using
- the HDFS fs commands
- SQOOP
- Flume
- NoSQL Database like HBase, Cassandra etc
- HDFS Java API
- and others
