
Imagine that you want to work with the ubiquitous XML and JSON data-serialization formats. These formats work in a straightforward manner in most programming languages, with several tools available to help you with marshaling, unmarshaling, and validating where applicable. Working with XML and JSON in MapReduce, however, poses two equally important challenges. First, MapReduce requires classes that can support reading and writing a particular data-serialization format; if you're working with a custom file format, there's a good chance no such classes exist to support it. Second, MapReduce's power lies in its ability to parallelize reading your input data. If your input files are large (think hundreds of megabytes or more), it's crucial that the classes reading your serialization format be able to split your large files so multiple map tasks can read them in parallel.
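To make that second challenge concrete, here is a minimal sketch of the hook Hadoop provides for controlling splittability (the class name WholeFileTextInputFormat is mine, not from the book). A format that can't safely be split mid-file, such as a multiline XML document, can override isSplitable to force one map task per file, trading parallelism for correctness:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// A hypothetical input format that opts out of splitting. Whole-file
// formats often can't be split safely, because a record boundary may
// not be detectable at an arbitrary byte offset within the file.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // One map task per file; no parallel reads within a file.
        return false;
    }
}

The flip side is the point the paragraph above makes: a format whose reader can find record boundaries at arbitrary offsets (as the built-in text formats can, by scanning for newlines) should return true here, so a single large file fans out across many map tasks.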
We'll start this chapter by tackling the problem of how to work with serialization formats such as XML and JSON. Then we'll compare and contrast data-serialization formats that are better suited to working with big data, such as Avro and Parquet. The final hurdle comes when you need to work with a proprietary file format, or a less common file format for which no read/write bindings exist in MapReduce. I'll show you how to write your own classes to read and write your file format.
XML and JSON formats
This chapter assumes you're familiar with the XML and JSON data formats. Wikipedia provides some good background articles on XML and JSON, if needed. You should also have some experience writing MapReduce programs and should understand the basic concepts of HDFS and MapReduce input and output.
Data serialization support in MapReduce is a property of the input and output classes that read and write MapReduce data. Let's start with an overview of how MapReduce supports data input and output.
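As a concrete illustration of that statement, the driver sketch below (the class name and the use of command-line arguments for paths are my placeholders) shows where those classes plug into a job. Swapping the two set*FormatClass calls is all it takes to change the serialization format a job reads and writes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "format-config-example");
        job.setJarByClass(FormatConfigExample.class);

        // Serialization support lives in these two classes: the input
        // format governs how records are read, the output format how
        // they're written. Change the format, change these classes.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // No mapper or reducer set, so Hadoop's identity defaults run.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}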
3.1 Understanding inputs and outputs in MapReduce
Your data might be XML files sitting behind a number of FTP servers, text log files sitting on a central web server, or Lucene indexes in HDFS.1 How does MapReduce support reading and writing to these different serialization structures across the various storage mechanisms?
Figure 3.1 shows the high-level data flow through MapReduce and identifies the actors responsible for various parts of the flow. On the input side, you can see that some work (Create split) is performed outside of the map phase, and other work is performed as part of the map phase (Read split). All of the output work is performed in the reduce phase (Write output).
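Those three actors map onto two abstract classes in the org.apache.hadoop.mapreduce API. The following is a simplified sketch of their shape, annotated with where each method runs; the Sketch* names are mine, and the real classes are InputFormat and OutputFormat:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// "Create split" and "Read split" belong to the input format.
abstract class SketchInputFormat<K, V> {
    // Runs on the client, before any map task starts.
    abstract List<InputSplit> getSplits(JobContext context)
        throws IOException, InterruptedException;

    // Called once per map task to read the records in its split.
    abstract RecordReader<K, V> createRecordReader(
        InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException;
}

// "Write output" belongs to the output format.
abstract class SketchOutputFormat<K, V> {
    // Called in the reduce phase to write the job's output records.
    abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
        throws IOException, InterruptedException;
}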
1 Apache Lucene is an information retrieval project that stores data in an inverted index data structure optimized for full-text search. More information is available at http://lucene.apache.org/.
