Hadoop Exercises

Matias Fernando Capristo
Facultad de Ingenieria - Universidad de Buenos Aires (U.B.A)

Document developed for LUSSI department of Telecom Bretagne

École Nationale Supérieure des Télécommunications de Bretagne

Based on Cloudera Hadoop Demo 0.3.3 Tutorial

November 2012

Configuring the work environment
Configure the keyboard and the internet access
Exercise 1: Getting familiar with Hadoop
1.a HDFS Access
1.b Running a MapReduce Program
Exercise 2: Our first MapReduce Program - The Inverted index
2.1 Coding the map() method
2.2 Create the Reducer Class
2.3 Run JUnit tests
2.4 Debugging
2.5 Compile your system
2.6 Running and monitoring your program
Exercise 3: Improving the Inverted Index
Exercise 4: The patent inverted index
Exercise 5: Counting things
Bibliography

Configuring the work environment

To start the Cloudera training virtual machine from Linux, use the icon on the desktop.
At the end of the lab, close VMware properly so that you can retrieve your work in the next sessions.
Because HDFS is not stored permanently, you must not shut down the VM; suspend it instead.

Configure the keyboard and the internet access


The keyboard of the VM is configured in English. If you prefer French, follow these steps:
a Go to System -> Preferences -> Keyboard -> Layout
b Click on + to add a language.
c Add the French configuration.
d Set it as default.
e Apply the changes. A password will be requested: it is training. You need to
copy/paste this password from the Unix prompt.

Because you are working inside the virtual machine, you have no internet access: the
system does not know your credentials. To make sure that you can connect to the outside, do the following:
a Open a browser.
b Try to browse any web page. A page will appear asking for your credentials.
c Enter your user and password.

Exercise 1: Getting familiar with Hadoop

On the desktop of the virtual machine you will find a folder named instructions containing some tutorials
and exercises. As a first approach to Hadoop, we will do the exercise Getting familiar with Hadoop, which
is in the exercises folder. Just follow the instructions to interact with Hadoop and practice some
basic commands.
Important: In this exercise you will be asked to update the files using a git command. Since this
may take a long time, you can instead download the files from the web page where this document is located.
1 Download the folder data
2 Copy the content of the folder data, overwriting the content of the folder ~/git/data

1.a HDFS Access

The goal of this exercise is to practice the ls, put and cat commands. These commands interact with
HDFS in a similar way as they do in Unix. All the commands are prefixed with hadoop fs (for example,
hadoop fs -ls).

1.b Running a MapReduce Program

The goal of this exercise is to practice the syntax of the command that runs MapReduce programs,
experimenting with a program already provided by Hadoop.

Exercise 2: Our first MapReduce Program - The Inverted index

Start Eclipse (via the icon on the desktop of the Cloudera VM). A project has already been created for you
called LineIndexer (path: InvertedIndex/stub-src/index/LineIndexer). This project is preloaded with a
source code "skeleton" for the activity. This workspace also contains another project containing all the
Hadoop source code and libraries. This is required as a build dependency for LineIndexer; it's also there
for your reference.
The LineIndexer class is the main "driver" for the program, which configures the MapReduce job. This
class has already been written for you. You may want to examine its body so you understand how
MapReduce jobs are created and dispatched to Hadoop.

2.1 Coding the map() method

If you open the LineIndexMapper class, you will see that the map() method is empty. Complete it with the
following code:
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        // Obtain the name of the file being processed
        String fileName = fileSplit.getPath().getName();
        // Wrap the file name in the Text class
        Text outVal = new Text(fileName);
        // StringTokenizer splits the string into tokens and iterates over them. Here the
        // string is one line of the text, and the tokens are separated by whitespace.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            output.collect(new Text(word), outVal);
        }
    }
This code has an error in its algorithm. The aim of this exercise is to get you familiar with some basic
commands. To find the error, you can run your program and analyse the output (using hadoop commands
like ls or cat). You must use the ant tool to build the executable jar and then use the hadoop
command to run the jar.

Running ant:
a Go to the directory ~/git/exercises/shakespeare
b Run ant
This compiles the files using the build information in the build.xml file and generates
the jar file for your application code.

Running the jar:

a Change to the directory where the jar file is located.
b Run hadoop jar indexer.jar index.LineIndexer

Tips to find the error in the mapper code :

1 The map function takes four parameters which by default correspond to:
a LongWritable key - the byte-offset of the current line in the file
b Text value - the current line from the file
c OutputCollector output - provides the collect() method to output a <key, value> pair
d Reporter reporter - allows us to retrieve some information about the job (like the current
input split, from which the map() code above obtains the file name)

The program will read the input from the "input" folder. Instead of using the all-shakespeare text of
the previous exercise, which is long, you can reduce the input by using the file
"RomeoAndJuliet-Prologue-Part.txt". Put this file in the input folder and remove the all-shakespeare file.
With the input reduced to a few lines, you can get a better understanding of the code and its output.

Since you want to test just the mapper without a reducer, you can run the job without reducer
tasks by adding the parameter -D mapred.reduce.tasks=0, like this:
    hadoop jar indexer.jar index.LineIndexer -D mapred.reduce.tasks=0
or by modifying the runJob() method, adding the setNumReduceTasks line shown below:
    private void runJob() throws IOException {
        JobConf conf = new JobConf(getConf(), LineIndexer.class);
        FileInputFormat.addInputPath(conf, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(conf, new Path(OUTPUT_PATH));
        conf.setNumReduceTasks(0); // run the mappers only, with no reduce phase
        // ... the rest of the method is unchanged
    }

The results of the mappers will be written straight to HDFS with no processing by the reducer.

The output files are written to the "output" folder. You can use the following command to view them:
    training@training-vm:~$ hadoop fs -cat output/<filename> | less

Note: Despite the name of the task (Line Indexer), we will actually refer to the locations of individual
words by the byte offset at which the line starts, not by the "line number" in the conventional sense. This is
because the line number is not available to us. (We will, however, be indexing a line at a time, hence
the name "Line Indexer".) Large files can be broken up into smaller chunks which are passed to mappers
in parallel; these chunks are broken up on the line ends nearest to specific byte boundaries. Since there
is no easy correspondence between lines and bytes, a mapper over the second chunk of the file would
need to have read all of the first chunk to establish its starting line number, defeating the point of parallel
processing.
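To make the note above concrete, here is a minimal plain-Java sketch (no Hadoop involved) of how the byte
offset of each line start follows from the raw bytes of a file. The class name LineOffsets and the method
lineStartOffsets are hypothetical names chosen for illustration, and the sketch assumes UTF-8 text with "\n"
as the line terminator. In the real job, Hadoop computes this offset for you and passes it to map() as the
LongWritable key.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineOffsets {
    // Returns the byte offset at which each line of the text starts.
    // Assumption: the text is encoded as UTF-8 and lines end with '\n'.
    static List<Long> lineStartOffsets(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        List<Long> offsets = new ArrayList<>();
        if (bytes.length > 0) {
            offsets.add(0L); // the first line always starts at byte 0
        }
        for (int i = 0; i < bytes.length; i++) {
            // A new line starts right after each '\n', unless the text ends there
            if (bytes[i] == '\n' && i + 1 < bytes.length) {
                offsets.add((long) (i + 1));
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "Two households\nboth alike\nin dignity";
        System.out.println(lineStartOffsets(text)); // [0, 15, 26]
    }
}
```

A mapper that receives only the third line can report offset 26 immediately, whereas reporting "line 3" would
require counting the line ends in everything that precedes it.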

2.2 Create the Reducer Class

Open the LineIndexReducer class in the project. The line indexer Reducer takes in all the <"word",
"filename@offset"> key/value pairs output by the Mapper for a single word. For example, for a given key,
the pairs look like:
<key, V1>
<key, V2>
<key, Vn >
Given all those <key, value> pairs, the reducer generates a single value string. For the line indexer
problem, the strategy is simply to concatenate all the values into a single large string, using "," to
separate them. The choice of "," is arbitrary; later code can split on the "," to recover the separate
values. So for the given key, the output value string will look like:
V1,V2,...,Vn

To do this, the Reducer code simply iterates over values to get all the value strings, and concatenates them
into our result String. In the following exercises you can test the reducer class and debug it to
find errors if the tests fail.

You will have to iterate over the values collection. If you do not remember much about iterators, you
can check the Java documentation of the java.util.Iterator interface.

Important: To improve performance, instead of concatenating strings in the usual way:
    s1 = s1 + s2; // repeated in a loop, this takes O(n^2) time overall
you can use the StringBuilder class, which provides more efficient string concatenation:
    StringBuilder sb = new StringBuilder();
    sb.append(s1); // append each piece in turn
    sb.append(s2);
    sb.toString(); // returns the fully concatenated string at the end
With StringBuilder the total time is linear, which is much better than O(n^2).
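As an illustration of the strategy described above, the following plain-Java sketch joins all the values of an
iterator with "," using a StringBuilder. It is only a sketch under stated assumptions: ValueJoiner and join are
hypothetical names, the file names in main are made up, and String stands in for Hadoop's Text class so that
the example runs without Hadoop. In the real reducer the values arrive as an iterator of Text objects.

```java
import java.util.Arrays;
import java.util.Iterator;

class ValueJoiner {
    // Concatenates all values from the iterator, separated by ",".
    static String join(Iterator<String> values) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        while (values.hasNext()) {
            if (!first) {
                sb.append(","); // separator goes between values, not after the last one
            }
            sb.append(values.next());
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical "filename@offset" values for a single word
        Iterator<String> values = Arrays.asList("a.txt@0", "a.txt@15", "b.txt@7").iterator();
        System.out.println(join(values)); // a.txt@0,a.txt@15,b.txt@7
    }
}
```

The "first" flag avoids a trailing separator, so later code can split the result on "," without getting an
empty last element.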

2.3 Run JUnit tests

A unit test is an automated piece of code that invokes the unit of work being tested and then checks
some assumptions about the end result of that unit. A unit test is almost always written using a
unit-testing framework. It can be written easily and runs quickly. It is trustworthy, readable, and
maintainable. It is consistent in its results as long as the production code has not changed.
The piece of code under test, usually called the system under test (SUT), should be as small as possible.
One method may have multiple unit tests, according to the usage and outputs of the function.
The goals of a test are:

Ensure that the code meets expectations and specifications: it does what is expected.
Ensure that the code continues to meet expectations over time: it avoids regressions.

We will work with tests in this exercise. Go to the Package Explorer in Eclipse and open the
test/index directory. Right-click on it, then select "Run as..." and "JUnit test". This will use the
Eclipse JUnit plugin to test your Mapper and Reducer. The unit tests have already been written for you.
They use MRUnit, a library developed by Cloudera to test mappers and reducers. The source for this
library is included in your workspace.

2.4 Debugging
Now that you are more familiar with the tests, you can use them to debug your MapReduce operations.
To do this:
1 Put a breakpoint in the code that you want to debug.
2 Locate the test class in the Package Explorer.
3 Right-click on the class and select Debug as... -> JUnit test. The execution of the mapper will
start and will stop at your breakpoint. Then you can inspect variables, step through the
program, etc.

2.5 Compile your system

Open a terminal window, navigate to ~/workspace/LineIndexer/, and compile your Java sources into a jar:
$ cd ~/workspace/LineIndexer
$ ant
This will create a file named indexer.jar in the current directory. If you're curious about the instructions
that ant followed to create the compiled object, read the build.xml file in a text editor.
You can also run the JUnit tests from the ant build file by running:
$ ant test

2.6 Running and monitoring your program

In the previous exercise, you should have loaded your sample input data into Hadoop. If you replaced
all-shakespeare.txt with "RomeoAndJuliet-Prologue-Part.txt", switch back, because now that our
MapReduce program is working, it is better to run it with more data. Before you run the program
again, you also need to remove the output directory that was already created, using the command
hadoop fs -rmr output; otherwise Hadoop will refuse to run the job and print an error message
("Directory already exists").

Once you have only the all-shakespeare.txt file in the input folder, and the output folder is deleted, run:
$ hadoop jar indexer.jar index.LineIndexer
This will read all the files in the input directory in HDFS and compute an inverted index. It will be written to
a directory named output. View your results by using hadoop fs -cat filename. You may want to pipe this
into less.
If your program didn't work, go back and fix places you think are buggy. Recompile using the steps
above, and then re-run the program.

You can use the web user interface in your browser to watch the progress of your job, monitor it,
and view other interesting statistics about it. You can browse http://localhost:50030 to access the
JobTracker web interface.
You can check
which have a description of how to use and understand the Hadoop UI.

Exercise 3 : Improving the Inverted Index

If you take a look at exercise 2, you may notice that the StringTokenizer class does nothing clever
with regard to punctuation or capitalization. You might want to improve your mapper to merge these
related tokens, and avoid indexing the same word several times just because it appears once with a
capital letter, once with a semicolon, etc.
Example. The index of exercise 2 looks like this:
romeo shakespeare.txt@38624
Romeo shakespeare.txt@38625
Romeo; shakespeare.txt@38626,shakespeare.txt@12047

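The output of exercise 3 must look like this (the order of the locations may differ):
romeo shakespeare.txt@38624,shakespeare.txt@38625,shakespeare.txt@38626,shakespeare.txt@12047

Modify the program and run it to obtain the expected result.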
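One possible way to merge such tokens is to normalize each token before it is emitted. The plain-Java sketch
below lowercases the token and strips leading and trailing characters that are neither letters nor digits. It is
not the only valid solution: TokenNormalizer, normalize and tokens are hypothetical names, and the choice to
keep punctuation inside a word (as in "o'er") is an assumption you may want to revisit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

class TokenNormalizer {
    // Lowercases a token and removes leading/trailing characters
    // that are neither letters nor digits.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT)
                    .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    // Splits a line on whitespace and returns the normalized, non-empty tokens.
    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String word = normalize(tokenizer.nextToken());
            if (!word.isEmpty()) { // tokens made only of punctuation are dropped
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Romeo; romeo, ROMEO! --")); // [romeo, romeo, romeo]
    }
}
```

In the mapper, the same normalization would be applied to each word before calling output.collect(), so that
all variants end up under a single key.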

Exercise 4: The patent inverted index

Let's now work with a larger amount of data. We will use the patent data sets, which are available on the
National Bureau of Economic Research (NBER) site. You must download the patent citation data
file. Unzipped, the file is approximately 250 MB, which is small enough to make our
examples runnable in Hadoop's standalone or pseudo-distributed mode.
The patent citation data set contains citations from U.S. patents issued between 1975 and 1999. It has
more than 16 million rows and the first few lines resemble the following:
"CITING","CITED"
3858241,956203
...

The data set is in the standard comma-separated values (CSV) format, with the first line a description of
the columns. Each of the other lines records one particular citation. For example, the second line shows
that patent 3858241 cites patent 956203.
The aim of this exercise is to develop a program that takes the patent citation data and inverts it. For
each patent, we want to find and group the patents that cite it. The output should therefore list each
cited patent together with the patents that cite it.

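You can use the project of the previous exercise as a template. The configuration of the job will look like
this:
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MyJob.class);
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        // Set the mapper and reducer classes, the input/output formats and the output types here.
        // Separator between key and value in each input line (read by KeyValueTextInputFormat)
        job.set("key.value.separator.in.input.line", ",");
        JobClient.runJob(job); // submit the job and wait for it to finish
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
    }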



You must indicate the input and output paths when you run your program from the command line.
They are not hard-coded as in the previous exercise.
The mapper should implement the Mapper interface (Mapper<K1, V1, K2, V2>), using the Text
class for each key and value (Mapper<Text, Text, Text, Text>).
The reducer should implement the Reducer interface, using the Text class for each key and value
(Reducer<Text, Text, Text, Text>).
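The inversion itself can be tried out on a few lines before writing the Hadoop classes. The following plain-Java
sketch mimics the two phases in memory: the "map" step swaps each (citing, cited) pair, and the "reduce" step
joins all the citing patents of one cited patent with ",". CitationInverter and invert are hypothetical names,
String replaces Text so that the sketch runs without Hadoop, and apart from the pair 3858241,956203 the sample
rows in main are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CitationInverter {
    // Input: CSV lines of the form "citing,cited"; the first line is the header.
    // Output: for each cited patent, the comma-separated patents that cite it.
    static Map<String, String> invert(List<String> lines) {
        Map<String, StringBuilder> grouped = new TreeMap<>();
        for (int i = 1; i < lines.size(); i++) { // start at 1 to skip the header
            String[] parts = lines.get(i).split(",");
            if (parts.length != 2) {
                continue; // ignore malformed lines
            }
            String citing = parts[0].trim();
            String cited = parts[1].trim();
            // "map" step: the cited patent becomes the key, the citing one the value
            StringBuilder sb = grouped.get(cited);
            if (sb == null) {
                sb = new StringBuilder();
                grouped.put(cited, sb);
            } else {
                sb.append(","); // "reduce" step: join the values of one key
            }
            sb.append(citing);
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, StringBuilder> e : grouped.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "\"CITING\",\"CITED\"",
                "3858241,956203",
                "4000001,956203",   // made-up row
                "4000001,1000000"); // made-up row
        for (Map.Entry<String, String> e : invert(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In the Hadoop version the grouping by key is done by the framework between the map and reduce phases, so the
mapper only swaps key and value and the reducer only concatenates.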


Exercise 5: Counting things

In this exercise we will count the number of times each patent is cited. The output must contain one line
per patent, giving the patent number followed by its citation count.

A line with the values 10 and 66634, for example, shows that patent 10 is cited 66634 times.


The mapper should implement Mapper<Text, Text, IntWritable, IntWritable>

The reducer should implement Reducer<IntWritable, IntWritable, IntWritable, IntWritable>
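As a way to check the counting logic on a handful of lines, here is a plain-Java sketch that emulates it in
memory: for every citation the "map" step emits the cited patent number with the value 1, and the "reduce"
step sums the values of each key. CitationCounter and countCitations are hypothetical names, Integer stands
in for IntWritable so that the sketch runs without Hadoop, and apart from the pair 3858241,956203 the sample
rows are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CitationCounter {
    // Input: CSV lines of the form "citing,cited"; the first line is the header.
    // Output: for each cited patent number, the number of times it is cited.
    static Map<Integer, Integer> countCitations(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int i = 1; i < lines.size(); i++) { // start at 1 to skip the header
            String[] parts = lines.get(i).split(",");
            if (parts.length != 2) {
                continue; // ignore malformed lines
            }
            int cited = Integer.parseInt(parts[1].trim());
            // "map" emits (cited, 1); "reduce" sums the ones of each key
            counts.merge(cited, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "\"CITING\",\"CITED\"",
                "3858241,956203",
                "4000001,956203",   // made-up row
                "4000001,1000000"); // made-up row
        System.out.println(countCitations(lines)); // {956203=2, 1000000=1}
    }
}
```

Because the keys are integers, the result is sorted numerically, which is convenient when the output is later
loaded into a spreadsheet.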

When the exercise is done, you can load the output into a spreadsheet and plot it. You will see that
you have processed a large data file with a specific format and obtained an analysis of its contents
in a very easy way.


Bibliography

Chuck Lam, Hadoop in Action, Manning Publications Co., 2011.

Cloudera Hadoop Demo 0.3.3, Tutorial and exercises.
Amazon Web Services, Amazon Elastic MapReduce.


