by
Matias Fernando Capristo
fcaprist@fi.uba.ar
Facultad de Ingenieria - Universidad de Buenos Aires (U.B.A)
November 2012
Hadoop Exercises
Index
Configuring the work environment
    Configure the keyboard and the internet access
Exercise 1: Getting familiar with Hadoop
    1.a HDFS Access
    1.b Running a MapReduce Program
Exercise 2: Our first MapReduce Program - The Inverted Index
    2.1 Coding the map() method
    2.2 Create the Reducer Class
    2.3 Run JUnit tests
    2.4 Debugging
    2.5 Compile your system
    2.6 Running and monitoring your program
Exercise 3: Improving the Inverted Index
Exercise 4: The patent inverted index
Exercise 5: Counting things
Bibliography
Configuring the work environment

Configure the keyboard and the internet access

The keyboard of the VM is configured in English. If you prefer it in French, just follow these steps:
a. Go to System -> Preferences -> Keyboard -> Layout.
b. Click + to add a language.
c. Add the French configuration.
d. Set it as the default.
e. Apply the changes. You will be asked for a password: it is training. You need to copy/paste this password from the Unix prompt.
Since you are working inside the virtual machine, you do not have internet access, because the system does not know your credentials. To make sure you can connect to the outside, do the following:
a. Open a browser.
b. Try to browse to any web page. A page will appear asking for your credentials.
c. Enter your user name and password.
Running ant:
a. Go to the directory ~/git/exercises/shakespeare
b. Run ant
This will compile the files using the compilation info located in the build.xml file, and will generate the jar file for your application code.
The program will read the input from the "input" folder. Instead of using the all-shakespeare text of the previous exercise, which is long, you can reduce the input using the file "RomeoAndJulietPrologue-Part.txt": put this file in the input folder and remove all-shakespeare. By reducing the input to a few lines, you can get a better understanding of the code and outputs.
Since you want to test just the mapper without a reducer, you can run the job without reduce tasks by adding the parameter -D mapred.reduce.tasks=0, like this:
hadoop jar indexer.jar index.LineIndexer -D mapred.reduce.tasks=0
or by modifying the runJob() method to add the conf.setNumReduceTasks(0) line, as shown below:
private void runJob() throws IOException {
    JobConf conf = new JobConf(getConf(), LineIndexer.class);
    FileInputFormat.addInputPath(conf, new Path(INPUT_PATH));
    FileOutputFormat.setOutputPath(conf, new Path(OUTPUT_PATH));
    conf.setMapperClass(LineIndexMapper.class);
    conf.setReducerClass(LineIndexReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setNumReduceTasks(0); // added line: run with zero reduce tasks
    JobClient.runJob(conf);
}
The results of the mappers will be written straight to HDFS with no processing by the reducer.
The output file will be written in the "output" folder. You can use the following command to see the output:
training@training-vm:~$ hadoop fs -cat output/<<filename>> | less
Note: Despite the name of the task (Line Indexer), we will actually be referring to locations of individual words by the byte offset at which the line starts, not the "line number" in the conventional sense. This is because the line number is actually not available to us. (We will, however, be indexing a line at a time, thus the name "Line Indexer.") Large files can be broken up into smaller chunks which are passed to mappers in parallel; these chunks are broken up on the line ends nearest to specific byte boundaries. Since there is no easy correspondence between lines and bytes, a mapper over the second chunk in the file would need to have read all of the first chunk to establish its starting line number, defeating the point of parallel processing!
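The byte-offset keys described above can be illustrated in plain Java (this is a sketch for understanding, not Hadoop code; the class and method names below are made up for the example):

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the note above: each line is identified by the
// byte offset at which it starts, which is what the Line Indexer uses as a
// location instead of a line number.
public class LineOffsets {

    // Returns the byte offset of the start of each line in the text.
    public static List<Long> lineStartOffsets(String text) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n", -1)) {
            offsets.add(offset);
            offset += line.getBytes().length + 1; // +1 for the '\n' separator
        }
        return offsets;
    }

    public static void main(String[] args) {
        // "Romeo" is 5 bytes, so the second line starts at offset 6;
        // "Juliet" is 6 bytes, so the third line starts at offset 13.
        System.out.println(lineStartOffsets("Romeo\nJuliet\nVerona")); // [0, 6, 13]
    }
}
```

Note that computing these offsets only requires reading the bytes before each line, not counting newlines from the start of the whole file, which is why chunks can be processed in parallel.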
The reducer receives each word together with the list of all values emitted for that word by the mappers: <V1,V2,...,Vn>.
To do this, the Reducer code simply iterates over the values to get all the value strings, and concatenates them into our result String. In the following exercises you can test the reducer class and debug it to find errors if the tests fail.
Tips:
1. You will have to iterate over the values collection. If you don't remember much about iterators, you can check the documentation at http://docs.oracle.com/javase/1.4.2/docs/api/java/util/Iterator.html
Important: To enhance performance, instead of the usual way of concatenating strings:
s1 = s1 + s2; // time to perform this operation in a loop: O(n^2)
you can use the StringBuilder class, which provides more efficient string operations, e.g.:
StringBuilder sb = new StringBuilder();
sb.append(s1);
sb.append(s2);
sb.toString(); // returns the fully concatenated string at the end.
The time to perform this operation is linear, which is much better than O(n^2).
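The reducer's concatenation step can be sketched in plain Java, outside Hadoop. The class and sample values below are illustrative; only Iterator and StringBuilder come from the discussion above:

```java
import java.util.Arrays;
import java.util.Iterator;

// Plain-Java sketch of what the reducer does with its values: iterate over
// them and concatenate with a StringBuilder, which is O(n) overall instead
// of the O(n^2) behavior of repeated String concatenation.
public class ConcatValues {

    // Joins the values with commas, the way the inverted-index reducer
    // builds its output string.
    public static String join(Iterator<String> values) {
        StringBuilder sb = new StringBuilder();
        while (values.hasNext()) {
            sb.append(values.next());
            if (values.hasNext()) {
                sb.append(",");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Iterator<String> values =
            Arrays.asList("shakespeare.txt@38624", "shakespeare.txt@38625").iterator();
        System.out.println(join(values)); // shakespeare.txt@38624,shakespeare.txt@38625
    }
}
```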
2.3 Run JUnit tests

Unit tests serve two purposes:
- Ensure that the code meets expectations and specifications: it does what is expected.
- Ensure that the code continues to meet expectations over time: avoid regressions.
We will work with tests in this exercise. For this, go to Package Explorer in Eclipse and open the
test/index directory. Right click on AllTests.java, select "Run as..." and "JUnit test." This will use the
Eclipse JUnit plugin to test your Mapper and Reducer. The unit tests have already been written in
MapperTest.java and ReducerTest.java. These use a library developed by Cloudera to test mappers and
reducers, called MRUnit. The source for this library is included in your workspace.
2.4 Debugging
Now that you are more familiar with the tests, you can use them to debug your MapReduce operations. To do this:
1. Put a breakpoint in the code that you want to debug.
2. Locate the AllTests.java class in the package explorer.
3. Right-click on the class and select Debug as... -> JUnit test. The execution of the mapper will start and will stop at your breakpoint. Then you can inspect variables, run the program step by step, etc.
Once you have only the all-shakespeare.txt file in the input folder, and the output folder is deleted, run:
$ hadoop jar indexer.jar index.LineIndexer
This will read all the files in the input directory in HDFS and compute an inverted index, which will be written to a directory named output. View your results with hadoop fs -cat filename; you may want to pipe this into less.
If your program didn't work, go back and fix places you think are buggy. Recompile using the steps
above, and then re-run the program.
Tips:
1. You can use the user interface in your browser to watch the progress of your job and monitor it, along with other interesting statistics about it. You can browse http://localhost:50030 to access the JobTracker.
2. You can check http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/welcome.html, which has a description of how to use and understand the Hadoop UI.
shakespeare.txt@38624,shakespeare.txt@38625,shakespeare.txt@38626,shakespeare.txt@12047
The data set is in the standard comma-separated values (CSV) format, with the first line a description of the columns. Each of the other lines records one particular citation. For example, the second line shows that patent 3858241 cites patent 956203.
The aim of this exercise is to develop a program that will take the patent citation data and invert it. For each patent, we want to find and group the patents that cite it. Our output should be similar to the following:
1        3964859,4647229
10000    4539112
100000   5031388
1000006  4714284
1000007  4766693
You can use as a template the project of the previous exercise. The configuration of the job will look like
this:
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    JobConf job = new JobConf(conf, MyJob.class);
    Path in = new Path(args[0]);
    Path out = new Path(args[1]);
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);
    job.setJobName("MyJob");
    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormat(KeyValueTextInputFormat.class);
    job.setOutputFormat(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.set("key.value.separator.in.input.line", ",");
    JobClient.runJob(job);
    return 0;
}

public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new MyJob(), args);
    System.exit(res);
}
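KeyValueTextInputFormat, configured above with "," as the separator, splits each input line into a key and a value at the first occurrence of the separator. A plain-Java sketch of that splitting behavior (the class and method below are illustrative, not part of Hadoop):

```java
// Plain-Java sketch of how KeyValueTextInputFormat splits a line at the
// first occurrence of the configured separator (here ','): the citing
// patent becomes the key and the cited patent becomes the value.
public class SplitCitation {

    // Splits at the first separator; a line with no separator becomes
    // a key with an empty value.
    public static String[] split(String line, char sep) {
        int i = line.indexOf(sep);
        if (i < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("3858241,956203", ',');
        System.out.println("key=" + kv[0] + " value=" + kv[1]); // key=3858241 value=956203
    }
}
```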
Tips:
1. You must indicate the input and output paths when you run your program from the command line. They are not hard-coded as in the previous exercise.
2. The mapper should implement the Mapper interface (Mapper<K1, V1, K2, V2>) using the Text class for each key and value (Mapper<Text, Text, Text, Text>).
3. The reducer should implement the Reducer interface using the Text class for each key and value (Reducer<Text, Text, Text, Text>).
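The overall inversion the job performs can be simulated in plain Java, which may help before writing the actual mapper and reducer. This is a sketch for understanding only; the class and method names are made up, and real Hadoop code would use the Mapper and Reducer interfaces from the tips above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Plain-Java simulation of the patent-inversion job: the "map step" swaps
// each (citing, cited) pair so the cited patent becomes the key, and the
// "reduce step" concatenates all citing patents for each key.
public class InvertCitations {

    public static TreeMap<String, String> invert(List<String[]> citations) {
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : citations) {
            // map step: emit (cited, citing)
            grouped.computeIfAbsent(pair[1], k -> new ArrayList<>()).add(pair[0]);
        }
        TreeMap<String, String> result = new TreeMap<>();
        for (var entry : grouped.entrySet()) {
            // reduce step: concatenate all citing patents for this key
            result.put(entry.getKey(), String.join(",", entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> citations = List.of(
            new String[] { "3964859", "1" },
            new String[] { "4647229", "1" },
            new String[] { "4539112", "10000" });
        System.out.println(invert(citations)); // {1=3964859,4647229, 10000=4539112}
    }
}
```

In the real job, the grouping of values by key is done by Hadoop's shuffle phase between the mapper and the reducer, not by your own code.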
1    921128
2    552246
3    380319
4    278438
5    210814
6    163149
7    127941
8    102155
9    82126
10   66634
This output shows us, for example, that patent 10 is cited 66634 times.
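The counting variant can be simulated in plain Java the same way as the inversion (a sketch for understanding, not the actual job code; the class name is made up): the reduce step counts the values for each key instead of concatenating them.

```java
import java.util.List;
import java.util.TreeMap;

// Plain-Java sketch of the counting job: the "map step" emits the cited
// patent as the key, and the "reduce step" counts how many values arrive
// for each key instead of concatenating them.
public class CountCitations {

    public static TreeMap<String, Integer> countCited(List<String[]> citations) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String[] pair : citations) {
            // pair[0] cites pair[1]: add one to the count of the cited patent
            counts.merge(pair[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> citations = List.of(
            new String[] { "3964859", "1" },
            new String[] { "4647229", "1" },
            new String[] { "4539112", "10000" });
        System.out.println(countCited(citations)); // {1=2, 10000=1}
    }
}
```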
Tips:
1. When the exercise is done, you can put the output into a spreadsheet and plot it. You will realize that you have processed a big file of data with a specific format and obtained an analysis of its data in a very easy way.
Bibliography