
1.1 Introduction to Big Data

1.1.1 What is Big Data?


Big data is a term that describes the large volume of data, both structured and unstructured,
that inundates a business on a day-to-day basis.
Big data is a term for data sets that are so large or complex that traditional data
processing application software is inadequate to deal with them. Challenges include capture,
storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and
information privacy. The term "big data" often refers simply to the use of predictive analytics,
user behavior analytics, or certain other advanced data analytic methods that extract value from
data, and seldom to a particular size of the data set.
Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat
crime and so on." Scientists, business executives, practitioners of medicine, advertising and
governments alike regularly meet difficulties with large datasets in areas including Internet
search, finance, urban informatics, and business informatics. Scientists encounter limitations in
e-Science work, including meteorology, genomics, connectomics, complex physics simulations,
biological and environmental research.

Data sets grow rapidly - in part because they are increasingly gathered from cheap and numerous
information-sensing mobile devices, aerial (remote sensing), software logs, cameras,
microphones, radio-frequency identification (RFID) readers and wireless sensor networks.
Big data is a term that refers to data sets or combinations of data sets whose size (volume),
complexity (variability), and rate of growth (velocity) make them difficult to capture,
manage, process or analyze using conventional technologies and tools, such as relational
databases and desktop statistics or visualization packages, within the time necessary to make
them useful. While the size used to determine whether a particular data set is considered big data
is not firmly defined and continues to change over time, most analysts and practitioners currently
refer to data sets from 30-50 terabytes (a terabyte is 10^12 bytes, or 1,000 gigabytes) to multiple
petabytes (a petabyte is 10^15 bytes, or 1,000 terabytes) as big data.
The complex nature of big data is primarily driven by the unstructured nature of much of the data
that is generated by modern technologies, such as web logs, radio-frequency identification (RFID),
sensors embedded in devices, machinery and vehicles, Internet searches, social networks such as
Facebook, portable computers, smart phones and other cell phones, GPS devices, and call center
records. In most cases, in order to effectively utilize big data, it must be combined with
structured data (typically from a relational database) from a more conventional business
application, such as Enterprise Resource Planning (ERP) or Customer Relationship Management
(CRM).
Similar to the complexity, or variability, aspect of big data, its rate of growth, or velocity aspect,
is largely due to the ubiquitous nature of modern on-line, real-time data capture devices, systems,
and networks. It is expected that the rate of growth of big data will continue to increase for the
foreseeable future.
Specific new big data technologies and tools have been and continue to be developed. Much of
the new big data technology relies heavily on massively parallel processing (MPP) databases,
which can concurrently distribute the processing of very large sets of data across many servers.
As another example, specific database query tools have been developed for working with the
massive amounts of unstructured data that are being generated in big data environments.

1.1.2 Why is Big Data Important?


When big data is effectively and efficiently captured, processed, and analyzed, companies are
able to gain a more complete understanding of their business, customers, products, competitors,
and so on, which can lead to efficiency improvements, increased sales, lower costs, better
customer service, and/or improved products and services.
For example:
Manufacturing companies deploy sensors in their products to return a stream of telemetry.
Sometimes this is used to deliver services like OnStar, which provides communications, security
and navigation services. Perhaps more importantly, this telemetry also reveals usage patterns,
failure rates and other opportunities for product improvement that can reduce development and
assembly costs.
The proliferation of smart phones and other GPS devices offers advertisers an opportunity to
target consumers when they are in close proximity to a store, a coffee shop or a restaurant. This
opens up new revenue for service providers and offers many businesses a chance to target new
customers.
Retailers usually know who buys their products. Use of social media and web log files from
their ecommerce sites can help them understand who didn't buy and why they chose not to,
information not previously available to them. This can enable much more effective micro customer
segmentation and targeted marketing campaigns, as well as improve supply chain
efficiencies.
Other widely-cited examples of the effective use of big data exist in the following areas:
Using information technology (IT) logs to improve IT troubleshooting and security breach
detection, speed, effectiveness, and future occurrence prevention.
Using voluminous historical call center information more quickly, in order to improve
customer interaction and satisfaction.
Using social media content in order to better and more quickly understand customer
sentiment about you/your customers, and improve products, services, and customer
interaction.
Fraud detection and prevention in any industry that processes financial transactions online,
such as shopping, banking, investing, insurance and health care claims.
Using financial market transaction information to more quickly assess risk and take
corrective action.

1.1.3 Facts about Big Data


Everywhere you turn, you hear the term Big Data. What does Big Data mean to you and how
will it impact you and your business in the near future? Having access to high-quality channel
data in real time can help you make smarter business decisions and allow you to be more agile
than your competitors. Why? Because hidden inside the massive amounts of structured and
unstructured channel data that is collected on a daily basis are key organizational, partner and
end-customer insights.

On that note, below are some facts about big data that demonstrate its importance in today's
business environment.

1. According to the 2015 IDG Enterprise Big Data Research study, businesses will spend an
average of $7.4 million on data-related initiatives in 2016.
2. According to McKinsey, a retailer using Big Data to its fullest potential could increase its
operating margin by more than 60%.
3. Bad data or poor quality data costs organizations as much as 10-20% of their revenue.
4. A 10% increase in data accessibility by a Fortune 1000 company would give that
company approximately $65 million more in annual net income.
5. Big Data will drive $48.6 billion in annual spending by 2019.
6. Data production will be 44 times greater in 2020 than it was in 2009. Individuals create
more than 70% of the digital universe. But enterprises are responsible for storing and
managing 80% of it.
7. It is estimated that Walmart collects more than 2.5 petabytes of data every hour from its
customer transactions. A petabyte is one quadrillion bytes, or the equivalent of about 20
million filing cabinets worth of text.
8. By 2020, one third of all data will be stored in, or will have passed through, the cloud, and
we will have created 35 zettabytes worth of data.
9. A 2015 report by Cap Gemini found that 56% of companies expect to increase their
spending on big data in the next three years.
10. There will be a shortage of talent necessary for organizations to take advantage of Big
Data. By 2018, the United States alone could face a shortage of 140,000 to 190,000
skilled workers with deep analytical skills as well as 1.5 million managers and analysts
with the know-how to use Big Data analytics to make effective decisions.
You can see from the above statistics that Big Data really is a big deal. However, don't just
collect data for the sake of collecting it. If you do, you'll end up with a massive digital
graveyard. To get the most out of your data, you need to have a clear plan in place for how you
will manage and use your data, along with a goal for what you want to accomplish with it.
Remember, unlike wine, data doesn't get better with age. So if you have invested resources to
collect, store, and report data, then you need to put your data to work to get the most value out of
it.

All of the above statistics emphasize the underlying fact that organizations need to have
processes, systems, and tools in place to help them turn raw data into useful and actionable
information. Do you have the right tools in place to get the most out of your channel data? Can
you turn your channel data into channel intelligence?
1.1.4 What Comes Under Big Data?
Big data involves the data produced by different devices and applications. Given below are some
of the fields that come under the umbrella of Big Data.
Black Box Data: This is a component of helicopters, airplanes, jets, and so on. It captures
the voices of the flight crew, recordings from microphones and earphones, and the performance
information of the aircraft.
Social Media Data: Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
Stock Exchange Data: Stock exchange data holds information about the 'buy' and
'sell' decisions made by customers on shares of different companies.
Power Grid Data: Power grid data holds information about the power consumed by a particular
node with respect to a base station.
Transport Data: Transport data includes the model, capacity, distance and availability of a
vehicle.
Search Engine Data: Search engines retrieve lots of data from different databases.

Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it
will be of three types.
Structured data: relational data.
Semi-structured data: XML data.
Unstructured data: Word documents, PDFs, text, media logs.

1.1.5 Where does Big Data come from?


The original big data was web data -- as in the entire Internet! Remember, Hadoop was built to
index the web. These days, big data comes from multiple sources:
Web data: still big data.
Social media data: sites like Facebook, Twitter and LinkedIn generate a large amount of data.
Click stream data: when users navigate a website, the clicks are logged for further analysis
(such as navigation patterns). Click stream data is important in online advertising and
e-commerce.
Sensor data: sensors embedded in roads to monitor traffic, and various other applications, generate
a large volume of data.

1.2 Evolution of Big Data


Big data is still an enigma to many people. It's a relatively new term that was only coined during
the latter part of the last decade. While it may still be ambiguous to many people, since its
inception it's become increasingly clear what big data is and why it's important to so many
different companies.

The term big data doesn't just refer to the enormous amounts of data available today; it also
refers to the whole process of gathering, storing and analyzing that data. Importantly, this
process is being used to make the world a better place.

Since big data as we know it today is so new, there's not a whole lot of past to examine, but what
there is shows just how much big data has evolved and improved in such a short period of time
and hints at the changes that will come in the future. Importantly, big data is now starting to
move past being simply a buzzword that's understood by only a select few. It's become more
mainstream, and those who are actually implementing big data are finding great success.

In the past, big data was a big business tool. Not only were the big businesses the ones with the
huge amounts of information, but they were also the ones who had sufficient capital to get big
data up and running in the first place. It used to be that in order to use big data
technology, a complex and costly on-premise infrastructure had to be installed. Along with that
expensive hardware came the responsibility to assemble an expert team to run and maintain the
system and make sense of the information. It wasn't easy, and it wasn't small-business friendly.

Big data in the cloud changed all of that. It turned out to be the perfect solution for many
companies. It doesn't require any on-premise infrastructure, which greatly reduces the startup
costs. It also doesn't require as many data gurus on the team, because of how much
can be done by the cloud company itself. Big data in the cloud has been one of the key
components in big data's quick ascent in the business and technology world. Big data in the
cloud is also vital because of the growing amount of information each day. It's extremely hard to
scale your infrastructure when you've got an on-premise setup to meet your information needs.
You have to install more hardware for more data, or waste space and money on unused
hardware when the data is less than expected. That problem doesn't exist with big data in the
cloud. Companies can scale up and down as their needs require, without significant financial
cost.

Big data has also evolved in its use since its inception. Today, we see it being used in the
military to reduce injuries, in the NBA to monitor every movement on the floor during a game,
in healthcare to prevent heart disease and cancer, and in music to help artists go big. We're seeing
that it has no limits. It's fundamentally changing the way we do things. There's so much
advancement that's coming to fruition because of it. With the increased availability and
affordability, the changes are only going to increase.

The increase in big data also means that companies are beginning to realize how important it is
to have excellent data analysts and data scientists. Companies are also beginning to implement
executive positions like chief data officer and chief data analyst. The ripple effect is being felt in
education, where universities and colleges are scrambling to provide learning for tomorrow's
data specialists. There's an enormous demand for data-literate people that's continually on the
rise.

It hasn't been around for long, but big data has been constantly evolving and that will only
continue. With an increase in technology and data, consumers can expect to see enormous
differences across a broad spectrum of industries. Big data is here to stay. As it continues to
grow and improve, those who adopt big data to discover the next competitive advantage are
going to find success ahead of their non-big data counterparts.

The explosion of the Internet, social media, technology devices and apps is creating a tsunami of
data. Extremely large sets of data can be collected and analyzed to reveal patterns, trends and
associations related to human behavior and interactions. Big data is being used to better
understand consumer habits, target marketing campaigns, improve operational efficiency, lower
costs, and reduce risk. International Data Corporation (IDC), a global provider of market
intelligence and information technology advisory services, estimates that the global big data and
analytics market will reach $125 billion in 2015.

The challenge for businesses is how to make the best use of this wealth of information. Some
experts break down big data into three subcategories:
Smart data: Information is useful and actionable if it can be organized and segmented
according to a company's needs. Smart data can be combined from multiple sources and
customized to address particular business challenges.
Identity data: Profile data on consumers can be combined with their social media data,
purchasing habits and other behavioral analytics to help companies target their marketing
campaigns much more precisely.
People data: Gleaned largely from social media data sets, people data helps companies to
better understand their customers as individuals and develop programs that address and
anticipate their needs. It seeks to create a shared community of customers with mutual likes,
ideas and sentiments.
Big data sets are so large that traditional processing methods often are inadequate. Big data
challenges include data analysis, capture, management, search, sharing, storage, transfer,
visualization and privacy protection. As companies work through these data processing and management
issues, the focus is shifting to the areas of data strategy and data governance.

1.3 Key Big Data Challenges


Understanding and Utilizing Big Data: In most industries and companies that deal with big data,
it is a daunting task just to understand the data that is available to be used and to determine the
best use of that data based on the company's industry, strategy, and tactics. Moreover, these types of
analyses need to be performed on an ongoing basis, as the data landscape changes at an ever
increasing rate and as executives develop more and more of an appetite for analytics based on
all available information.
New, Complex, and Continuously Emerging Technologies: Since much of the technology
that is required in order to utilize big data is new to most organizations, it will be necessary for
these organizations to learn about these new technologies at an ever-accelerating pace, and
potentially engage with different technology providers and partners than they have used in the
past. As with all technology, firms entering the world of big data will need to balance the
business needs associated with big data against the costs of entering into and remaining
engaged in big data capture, storage, processing, and analysis.
Cloud-Based Solutions: A new class of business software applications has emerged whereby
company data is managed and stored in data centers around the globe. While these solutions
range from ERP, CRM, Document Management, Data Warehouses and Business Intelligence to
many others, the common issue remains the safekeeping and management of confidential
company data. These solutions often offer companies tremendous flexibility and cost savings
compared to more traditional on-premise solutions, but they raise a new dimension
related to data security and the overall management of an enterprise's Big Data paradigm.
Privacy, Security, and Regulatory Considerations: Given the volume and complexity of big
data, it is challenging for most firms to obtain a reliable grasp on the content of all of their data
and to capture and secure it adequately, so that confidential and/or private business and customer
data are not accessed by and/or disclosed to unauthorized parties. The costs of a data privacy
breach can be enormous. For instance, in the health care field, class action lawsuits have been
filed where the plaintiff has sought $1000 per patient record that has been inappropriately
accessed or lost. In the regulatory area, the proper storage and transmission of
personally identifiable information (PII), including that contained in unstructured data such as
emails, can be problematic and necessitate new and improved security measures and
technologies. For companies doing business globally, there are significant differences in privacy
laws between the U.S. and other countries. Lastly, it will be very important for most firms to
tightly integrate their big data, data security/privacy, and regulatory functions.
Archiving and Disposal of Big Data: Since big data loses its value to current decision
making over time, and since it is voluminous and varied in content and structure, it is necessary
to utilize new tools, technologies, and methods to archive and delete big data without sacrificing
the effectiveness of using your big data for current business needs.
The Need for IT, Data Analyst, and Management Resources: It is estimated that there is a
need for approximately 140,000 to 190,000 more workers with deep analytical expertise and
1.5 million more data-literate managers, either retrained or hired. Therefore, it is likely that any
firm that undertakes a big data initiative will need to either retrain existing people or engage
new people in order for its initiative to be successful.
1.4 Characteristics of Big Data
Big Data is important because it enables organizations to gather, store, manage, and manipulate
vast amounts of data at the right speed, at the right time, to gain the right insights. In addition, Big
Data generators must create scalable data (Volume) of different types (Variety) under
controllable generation rates (Velocity), while maintaining the important characteristics of the
raw data (Veracity) and the value (Value) that the collected data can bring to the intended process,
activity or predictive analysis/hypothesis. Indeed, there is no clear definition for Big Data; it has been defined based
on some of its characteristics. Therefore, these five characteristics, also known as the 5Vs
(Volume, Variety, Velocity, Veracity and Value), have been used to define Big Data, as illustrated in Figure 1
below:

Figure 1: Five Vs Big Data Characteristics

Volume: refers to the quantity of data gathered by a company. This data must be used further to
gain important knowledge. Enterprises are awash with ever-growing data of all types, easily
amassing terabytes, even petabytes, of information (e.g. turning 12 terabytes of Tweets per day
into improved product sentiment analysis, or converting 350 billion annual meter readings to
better predict power consumption). Moreover, Demchenko, Grosso, de Laat and Membrey stated
that volume is the most important and distinctive feature of Big Data, imposing specific
requirements on all traditional technologies and tools currently used.
Velocity: refers to the time in which Big Data can be processed. Some activities are very
important and need immediate responses, which is why fast processing maximizes efficiency.
For time-sensitive processes such as fraud detection, Big Data flows must be analyzed and used as
they stream into the organization in order to maximize the value of the information (e.g.
scrutinizing 5 million trade events created each day to identify potential fraud, or analyzing 500 million
daily call detail records in real time to predict customer churn faster).
Variety: refers to the types of data that Big Data can comprise. Big data can consist of any type of
data, structured or unstructured, such as text, sensor data, audio, video, click streams, log files and
so on. The analysis of combined data types brings new problems and situations, such as monitoring
hundreds of live video feeds from surveillance cameras to target points of interest, or exploiting the
80% data growth in images, video and documents to improve customer satisfaction.
Value: refers to the added value that the collected data can bring to the intended process, activity
or predictive analysis/hypothesis. Data value will depend on the events or processes the data
represent, such as stochastic, probabilistic, regular or random ones. Depending on this,
requirements may be imposed to collect all data, store it for a longer period (for some possible
event of interest), and so on. In this respect, data value is closely related to data volume and variety.
Veracity: refers to the degree to which a leader trusts information in order to make a decision.
Therefore, finding the right correlations in Big Data is very important for the business's future.
However, as one in three business leaders do not trust the information used to reach decisions,
generating trust in Big Data presents a huge challenge as the number and types of sources grow.

1.5 BIG DATA TYPES


Big Data encompasses everything, from dollar transactions to tweets to images to audio.
Therefore, taking advantage of Big Data requires that all this information be integrated for
analysis and data management. This is more difficult than it appears. There are two main types of
data concerned here: structured and unstructured. Structured data, as in a data warehouse, is
tagged and sortable, while unstructured data is random and difficult to analyze. The
figure below depicts these types, along with examples:

Figure 4: Big Data Types

1.6 THE ARCHITECTURE FOR BIG DATA


The figure below depicts the Big Data architecture:
Figure 5: Big Data architecture

A. Interfaces and feeds


Before we get into the nitty-gritty of the Big Data technology stack itself, we must understand
how Big Data works in the real world. In fact, what makes Big Data big is that it relies on
picking up lots of data from lots of sources. Therefore, open application programming interfaces
(APIs) are a core part of any Big Data architecture.
In addition, interfaces exist at every level and between every layer of the stack. Without
integration services, Big Data cannot happen. Other important operational database approaches
include columnar databases that store information efficiently in columns, and not rows. This
approach leads to faster performance, as input/output is extremely fast. When geographic data
storage is part of the equation, a spatial database is optimized to store and query data based on
how objects are related in real terms.

B. Redundant physical infrastructure


The supporting physical infrastructure is fundamental to the operation and scalability of a Big
Data architecture. In fact, without the availability of robust physical infrastructures, Big Data
would likely not have become such a strong trend. To support an unanticipated or unpredictable
volume of data, a physical infrastructure for Big Data has to be different from that for traditional
data. The physical infrastructure is based on a distributed computing model. This means
that data may be physically stored in many different locations, allowing it to be linked through
networks, the use of a distributed file system, and various Big Data analytic tools and
applications.
Redundancy is important, as companies must handle a great deal of data from many sources.
Redundancy comes in many forms. For instance, if the company has created a private cloud, it
may want to create redundancy within private areas so that it can scale out to support
changing workloads. If a company needs to limit internal IT growth, it may use external cloud
services to add to its own resources. In some cases, this redundancy may come in the form of
Software as a Service (SaaS), allowing companies to carry out advanced data analysis as a
service. The SaaS approach allows for a faster start and lower costs.

C. Security infrastructure
As Big Data analysis becomes part of the workflow, it becomes vital to secure that data. For
example, a healthcare company probably wants to use Big Data applications to determine
changes in demographics or shifts in patient needs. This data about patients needs to be
protected, both to meet compliance requirements and to protect patient privacy. The company
needs to consider who is allowed to see the data and when they may see it. Also, the company
needs to be able to verify the identity of users, as well as protect the identity of patients. These
types of security requirements must be part of the Big Data fabric from the outset, and not an
afterthought.

D. Operational data sources


Concerning Big Data, a company must ensure that all sources of data will provide a better
viewpoint about the business and allow it to understand how data affects the operational methods
of that company. Traditionally, an operational data source consisted of highly structured data
managed by the line of business in a relational database. However, operational data now has to
cover a broader set of data sources, including unstructured sources such as social media or
customer data.

E. Performance matters
The data architecture also must perform in concert with the supporting infrastructure of the
organization or company. For instance, the company might be interested in running models to
determine whether it is safe to drill for oil in an offshore area, given real-time data on
temperature, salinity, sediment resuspension, and many other biological, chemical, and physical
properties of the water column. It might take days to run this model using a traditional server
configuration. However, using a distributed computing model, a days-long task may take
minutes. Performance might also determine the kind of database the company would use. Under
certain circumstances, stakeholders may want to understand how two very distinct data elements
are related, for example the relationship between social network activity and growth in sales. This is not
the typical query the company could ask of a structured, relational database. A graph database
might be a better choice, as it is tailored to separate the "nodes" or entities from their
"properties" (the information that defines each entity) and the "edges" or relationships between
nodes and properties. Using the right database may also improve performance. Typically, a graph
database may be used in scientific and technical applications.

F. Organizing data services and tools


Indeed, not all the data that organizations use is operational. A growing amount of data comes
from a number of sources that are not quite as organized or straightforward, including data that
comes from machines or sensors and from massive public and private data sources. In the past, most
companies were not able to either capture or store this vast amount of data. It was simply too
expensive or too overwhelming. Even if companies were able to capture the data, they did not have
the tools to do anything with it. Very few tools could make sense of these vast amounts of data.
The tools that did exist were complex to use and did not produce results within a reasonable time
frame. In the end, companies that really wanted to go to the enormous effort of analyzing this
data were forced to work with snapshots of data. This means that stakeholders may miss out on
relevant events, as they may not have been captured in a certain snapshot.

G. Analytical data warehouses and data marts


After a company sorts through the massive amounts of data available, it is often important to
take the subset of data that reveals patterns and put it into a form that's available to the business.
Such so-called warehouses provide compression, multilevel partitioning, and a massively
parallel processing architecture.

H. Reporting and visualization


Companies have always relied on the capability to create reports to give them an understanding
of what the data tells them about everything from monthly sales figures to projections of growth.
Big Data changes the way that data is managed and used. If a company is able to collect,
manage, and analyze enough data, it may use a new generation of tools to help management truly
understand the impact not just of a collection of data elements, but also how these data elements
offer context based on the business problem being addressed. With Big Data, reporting and data
visualization have become tools for looking at the context of how data is related and the impact
of those relationships on the future.

I. Big Data Applications


Traditionally, business has anticipated that data would be used to answer questions about what to
do and when to do it. Data has often been integrated into general-purpose business applications.
With the advent of Big Data, this is changing. Now, companies are seeing the development
of applications that are designed specifically to take advantage of the unique characteristics of
Big Data. Specific emerging applications include areas such as healthcare, manufacturing
management, and traffic management. All of these applications rely on huge volumes, velocities,
and varieties of data to transform the behavior of a market. For example, in healthcare, a Big
Data application might be able to monitor premature infants to determine if data indicates when
intervention is needed. In manufacturing, a Big Data application can be used to prevent a
machine from shutting down during a production run. A Big Data traffic management
application may reduce the number of traffic jams on busy city highways, decreasing the number
of accidents while saving fuel and reducing pollution.

1.7 BIG DATA TECHNOLOGIES


With the evolution of computing technology, businesses may now manage immense volumes of
data that previously could only be dealt with using expensive supercomputers; such computing
power is now much cheaper. As a result, new techniques for distributed computing have become
mainstream. Big Data became paramount as companies like Yahoo!, Google, and Facebook came
to the realization that they needed help in monetizing the massive amounts of data their offerings
were creating. Thus, these companies had to search for new technologies to store, access, and
analyze huge amounts of data in near real time, since such real-time analysis is required in order
to profit from so much data from users. Their resulting solutions have affected the larger data
management market. In particular, below are the top emerging technologies that are helping users
cope with and handle Big Data in a cost-effective manner.
Column-oriented databases
Traditional, row-oriented databases are excellent for online transaction processing with high
update speeds, but they fall short on query performance as the data volumes grow and as data
becomes more unstructured. Column-oriented databases store data with a focus on columns,
instead of rows, allowing for huge data compression and very fast query times. The downside to
these databases is that they will generally only allow batch updates, having a much slower update
time than traditional models.
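To make the row-versus-column distinction concrete, here is a small, purely illustrative Java sketch (not how any particular database is implemented) that sums a single field from a row-oriented layout and from a column-oriented layout; the Sale record and its fields are invented for the example.

import java.util.List;

public class ColumnVsRow {

    // Row-oriented layout: every record keeps all of its fields together.
    record Sale(String product, String region, double amount) {}

    public static void main(String[] args) {
        List<Sale> rows = List.of(
            new Sale("laptop", "EU", 950.0),
            new Sale("phone", "US", 420.0),
            new Sale("tablet", "IN", 310.0));

        // Row store: scanning one field still touches entire records.
        double totalFromRows = 0;
        for (Sale s : rows) {
            totalFromRows += s.amount();
        }

        // Column-oriented layout: each field lives in its own contiguous array,
        // so a scan over 'amount' reads only that column, and repetitive
        // columns such as 'region' compress very well.
        String[] product = {"laptop", "phone", "tablet"};
        String[] region = {"EU", "US", "IN"};
        double[] amount = {950.0, 420.0, 310.0};

        double totalFromColumns = 0;
        for (double a : amount) {
            totalFromColumns += a;
        }

        System.out.println(totalFromRows + " = " + totalFromColumns);
    }
}
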
Schema-less databases, or NoSQL databases
There are several database types that fit into this category, such as key-value stores and
document stores, which focus on the storage and retrieval of large volumes of unstructured,
semi-structured, or even structured data. They achieve performance gains by doing away with
some (or all) of the restrictions traditionally associated with conventional databases, such as
read-write consistency, in exchange for scalability and distributed processing.
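As a hedged illustration of this schema-less, key-value style of access, the sketch below uses the Java client API of HBase (Hadoop's NoSQL database, discussed later in this chapter) to write and read one cell; the table name, the column family and a running cluster are assumptions of the example, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueExample {
    public static void main(String[] args) throws Exception {
        // Cluster settings are read from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table users = conn.getTable(TableName.valueOf("users"))) {   // assumed table

            // Write: no fixed schema, just a row key, a column family,
            // a column qualifier and a value.
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"),
                          Bytes.toBytes("Chennai"));
            users.put(put);

            // Read the value back by row key.
            Result row = users.get(new Get(Bytes.toBytes("user#1001")));
            byte[] city = row.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"));
            System.out.println(Bytes.toString(city));
        }
    }
}
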
MapReduce
This is a programming paradigm that allows for massive job execution scalability against
thousands of servers or clusters of servers. Any MapReduce implementation consists of two
tasks:
The "Map" task, where an input dataset is converted into a different set of key/value
pairs, or tuples;
The "Reduce" task, where several of the outputs of the "Map" task are combined to form
a reduced set of tuples (hence the name).
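To make the Map and Reduce tasks above concrete, the following sketch shows the classic word-count example written against the Hadoop Java MapReduce API; it is a minimal illustration rather than production code, and the input and output paths are supplied on the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: convert each input line into (word, 1) key/value pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    // Reduce task: combine the values for each word into a single count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The Map task turns each line into (word, 1) pairs, and the Reduce task sums the values for each word, which is exactly the key/value-to-reduced-tuples flow described above.
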
Hadoop
Hadoop is by far the most popular implementation of MapReduce, being an entirely open source
platform for handling Big Data. It is flexible enough to be able to work with multiple data
sources, either aggregating multiple sources of data in order to do large scale processing, or even
reading data from a database in order to run processor-intensive machine learning jobs. It has
several different applications, but one of the top use cases is for large volumes of constantly
changing data, such as location-based data from weather or traffic sensors, web-based or social
media data, or machine-to-machine transactional data.
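As a small, hedged example of how an application interacts with Hadoop's distributed storage, the sketch below uses the HDFS FileSystem Java API to write and then read a file; the NameNode address and the file path are assumptions made for the illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/example/readings.txt");   // assumed path

            // Write a small file; HDFS splits it into blocks and replicates
            // them across DataNodes transparently.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("sensor-42,2016-01-01T00:00:00,21.5\n"
                        .getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back from wherever the blocks are stored.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}
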
Hive
Hive is a "SQL-like" bridge that allows conventional BI applications to run queries against a
Hadoop cluster. It was developed originally by Facebook, but has been made open source for
some time now, and it's a higher-level abstraction of the Hadoop framework that allows anyone
to make queries against data stored in a Hadoop cluster just as if they were manipulating a
conventional data store. It amplifies the reach of Hadoop, making it more familiar for BI users.
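For a feel of how this looks in practice, the hedged sketch below submits a query to Hive from Java through JDBC; the HiveServer2 URL, the click_logs table and the credentials are assumptions, and the Hive JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed HiveServer2 endpoint, database and user.
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement()) {

            // A familiar SQL-style query; Hive translates it into jobs that
            // run against the data stored in the Hadoop cluster.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits " +
                "FROM click_logs GROUP BY page ORDER BY hits DESC LIMIT 10");

            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
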
PIG
PIG is another bridge that tries to bring Hadoop closer to the realities of developers and business
users, similar to Hive. Unlike Hive, however, PIG consists of a "Perl-like" language that allows
for query execution over data stored on a Hadoop cluster, instead of a "SQL-like" language. PIG
was developed by Yahoo!, and, just like Hive, has also been made fully open source.
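For a flavor of the data-flow style, the sketch below embeds two Pig Latin statements in a Java program through Pig's PigServer class; the input file, its field layout and the output path are assumptions, and the same statements could just as well be typed into the Pig command line.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlowExample {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for a quick test; MAPREDUCE mode would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // A small data flow: load comma-separated click records (assumed layout),
        // then keep only the rows for one page.
        pig.registerQuery(
            "clicks = LOAD 'input/clicks.csv' USING PigStorage(',') " +
            "AS (user:chararray, page:chararray, ts:long);");
        pig.registerQuery(
            "home_hits = FILTER clicks BY page == '/home';");

        // Translate the flow into jobs and write the result.
        pig.store("home_hits", "output/home_hits");
    }
}
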
WibiData
WibiData is a combination of web analytics with Hadoop, being built on top of HBase, which is
itself a database layer on top of Hadoop. It allows web sites to better explore and work with their
user data, enabling real-time responses to user behavior, such as serving personalized content,
recommendations and decisions.
PLATFORA
Perhaps the greatest limitation of Hadoop is that it is a very low-level implementation of
MapReduce, requiring extensive developer knowledge to operate. Between preparing, testing
and running jobs, a full cycle can take hours, eliminating the interactivity that users enjoyed with
conventional databases. PLATFORA is a platform that turns users' queries into Hadoop jobs
automatically, thus creating an abstraction layer that anyone can exploit to simplify and organize
datasets stored in Hadoop.
Storage Technologies
As the data volumes grow, so does the need for efficient and effective storage techniques. The
main evolutions in this space are related to data compression and storage virtualization.
SkyTree
SkyTree is a high-performance machine learning and data analytics platform focused specifically
on handling Big Data. Machine learning, in turn, is an essential part of Big Data, since the
massive data volumes make manual exploration, or even conventional automated exploration
methods unfeasible or too expensive.

1.8 Limitations of Big Data


Big data analytics has been touted as a panacea to cure all the woes of business. Big data is seen
by many to be the key that unlocks the door to growth and success. Consequently, some experts
predict that during 2015, the average company will spend about $7 million on data analysis.
However, although big data analytics is a remarkable tool that can help with business decisions,
it does have its limitations.
Here are five limitations to the use of big data analytics.

Prioritizing correlations
Data analysts use big data to tease out correlations: when one variable is linked to another.
However, not all these correlations are substantial or meaningful. More specifically, just because
two variables are correlated or linked doesn't mean that a causative relationship exists between
them (i.e., correlation does not imply causation). For instance, between 2000 and 2009, the
number of divorces in the U.S. state of Maine and the per capita consumption of margarine both
similarly decreased. However, margarine and divorce have little to do with each other. A good
consultant will help you figure out which correlations mean something to your business and
which mean little to your business.

The Wrong Questions


Big data can be used to discern correlations and insights using an endless array of questions.
However, it's up to the user to figure out which questions are meaningful. If you end up getting a
right answer to the wrong question, you do yourself, your clients, and your business a costly
disservice.

Security
As with many technological endeavors, big data analytics is prone to data breach. The
information that you provide a third party could get leaked to customers or competitors.

Transferability
Because much of the data you need analyzed lies behind a firewall or on a private cloud, it takes
technical know-how to efficiently get this data to an analytics team. Furthermore, it may be
difficult to consistently transfer data to specialists for repeat analysis.

Inconsistency in data collection


Sometimes the tools we use to gather big data sets are imprecise. For example, Google is famous
for its tweaks and updates that change the search experience in countless ways; the results of a
search on one day will likely be different from those on another day. If you were using Google
search to generate data sets, and these data sets changed often, then the correlations you derive
would change, too.
Ultimately, you need to know how to use big data to your advantage in order for it to be useful.
The use of big data analytics is akin to using any other complex and powerful tool. For instance,
an electron microscope is a powerful tool, too, but it's useless if you know little about how it
works.

1.9 Introduction to Hadoop


Hadoop is an Apache open source framework written in Java that allows distributed processing
of large datasets across clusters of computers using simple programming models. A Hadoop
framework application works in an environment that provides distributed storage and
computation across clusters of computers. Hadoop is designed to scale up from a single server to
thousands of machines, each offering local computation and storage.
Apache Hadoop was born to enhance the usage of big data and solve its major issues. The web
was generating loads of information on a daily basis, and it was becoming very difficult to
manage the data of around one billion pages of content. To address this, Google
invented a new methodology for processing data, popularly known as MapReduce, and later
published a white paper on the MapReduce framework. Doug Cutting and
Mike Cafarella, inspired by the white paper, created Hadoop to apply these concepts to
an open-source software framework which supported the Nutch search engine project.
Considering the original case study, Hadoop was designed with much simpler storage
infrastructure facilities.

Apache Hadoop is the most important framework for working with Big Data. Hadoop's biggest
strength is scalability: it upgrades from working on a single node to thousands of nodes
seamlessly, without any issue.

The different domains of Big Data mean we are able to manage data from videos, text,
transactional data, sensor information, statistical data, social media conversations,
search engine queries, ecommerce data, financial information, weather data, news updates, forum
discussions, executive reports, and so on.

Doug Cutting and his team members developed an open source project known
as HADOOP, which allows you to handle very large amounts of data. Hadoop runs
applications on the basis of MapReduce, where the data is processed in parallel to accomplish
complete statistical analysis on large amounts of data.
It is a framework based on Java programming. It is intended to work on anything from a single
server to thousands of machines, each offering local computation and storage. It supports
large collections of data sets in a distributed computing environment.

The Apache Hadoop software library is a framework that permits the distributed processing of huge
data sets across clusters of computers using easy programming models.

Why Hadoop?
It simplifies dealing with Big Data. This answer immediately resonates with people; it is clear
and succinct, but it is not complete. The Hadoop framework has built-in power and flexibility to
do what you could not do before. In fact, Cloudera presentations at the latest O'Reilly Strata
conference mentioned that MapReduce was initially used at Google and Facebook not primarily
for its scalability, but for what it allowed you to do with the data.
In 2010, the average size of Cloudera's customers' clusters was 30 machines. In 2011 it was 70.
When people start using Hadoop, they do it for many reasons, all concentrated around the new
ways of dealing with the data. What gives them the security to go ahead is the knowledge that
Hadoop solutions are massively scalable, as has been proved by Hadoop running in the world's
largest computer centers and at the largest companies.
As you will discover, the Hadoop framework organizes the data and the computations, and then
runs your code. At times, it makes sense to run your solution, expressed in a MapReduce
paradigm, even on a single machine.
But of course, Hadoop really shines when you have not one, but rather tens, hundreds, or
thousands of computers. If your data or computations are significant enough (and whose aren't
these days?), then you need more than one machine to do the number crunching. If you try to
organize the work yourself, you will soon discover that you have to coordinate the work of many
computers, handle failures and retries, collect the results together, and so on. Enter Hadoop to
solve all these problems for you. Now that you have a hammer, everything becomes a nail:
people will often reformulate their problem in MapReduce terms, rather than create a new
custom computation platform.
No less important than Hadoop itself are its many friends. The Hadoop Distributed File System
(HDFS) provides unlimited file space available from any Hadoop node. HBase is a high-
performance unlimited-size database working on top of Hadoop. If you need the power of
familiar SQL over your large data sets, Hive provides you with an answer. While Hadoop can be
used by programmers and taught to students as an introduction to Big Data, its companion
projects (including ZooKeeper, about which we will hear later on) will make projects possible
and simplify them by providing tried-and-proven frameworks for every aspect of dealing with
large data sets.

How Hadoop solves the Big Data problem


Hadoop is built to run on a cluster of machines
Let's start with an example. Let's say that we need to store lots of photos. We will start with a
single disk. When we exceed a single disk, we may use a few disks stacked on a machine. When
we max out all the disks on a single machine, we need to get a bunch of machines, each with a
bunch of disks.
This is exactly how Hadoop is built. Hadoop is designed to run on a cluster of machines from the
get-go.

Hadoop clusters scale horizontally: more storage and compute power can be achieved by adding
more nodes to a Hadoop cluster. This eliminates the need to buy more and more powerful and
expensive hardware.
Hadoop can handle unstructured / semi-structured data: Hadoop doesn't enforce a 'schema' on the
data it stores. It can handle arbitrary text and binary data, so Hadoop can 'digest' any
unstructured data easily.
Hadoop clusters provide storage and computing: we saw how having separate storage and
processing clusters is not the best fit for Big Data. Hadoop clusters provide storage and
distributed computing all in one.

The Business Case for Hadoop


Hadoop provides storage for big data at reasonable cost
Storing big data using traditional storage can be expensive. Hadoop is built around commodity
hardware, so it can provide fairly large storage for a reasonable cost. Hadoop has been used in
the field at petabyte scale.

One study by Cloudera suggested that enterprises usually spend around $25,000 to $50,000 per
terabyte per year. With Hadoop, this cost drops to a few thousand dollars per terabyte per year.
As hardware gets cheaper and cheaper, this cost continues to drop.

Hadoop allows for the capture of new or more data


Sometimes organizations don't capture a type of data because it is too cost-prohibitive to store
it. Since Hadoop provides storage at reasonable cost, this type of data can now be captured and stored.

One example would be website click logs. Because the volume of these logs can be very high,
not many organizations captured these. Now with Hadoop it is possible to capture and store the
logs.

With Hadoop, you can store data longer


To manage the volume of data stored, companies periodically purge older data. For example,
only logs for the last three months could be stored, while older logs were deleted. With Hadoop
it is possible to store the historical data longer. This allows new analytics to be done on older
historical data.

For example, take click logs from a website. A few years ago, these logs were stored for a brief
period of time to calculate statistics like popular pages. Now with Hadoop, it is viable to store
these click logs for a longer period of time.

Hadoop provides scalable analytics


There is no point in storing all this data if we can't analyze it. Hadoop not only provides
distributed storage, but also distributed processing, which means we can crunch a large
volume of data in parallel. The compute framework of Hadoop is called MapReduce.
MapReduce has been proven at the scale of petabytes.

Hadoop provides rich analytics


Native MapReduce supports Java as a primary programming language. Other languages like
Ruby, Python and R can be used as well.

Of course, writing custom MapReduce code is not the only way to analyze data in Hadoop.
Higher-level abstractions over MapReduce are available. For example, a tool named Pig takes an
English-like data flow language and translates it into MapReduce. Another tool, Hive, takes SQL
queries and runs them using MapReduce.

Business intelligence (BI) tools can provide an even higher level of analysis; such tools are
available for Hadoop as well.

Hadoop History
Hadoop was created by Doug Cutting, who had also created Apache Lucene (a text
search library), and it has its origins in Apache Nutch (an open source search engine), itself a
part of the Apache Lucene project. Apache Nutch was started in 2002 as a working crawler and
search system, but its architecture would not scale up to the billions of pages on the web.
In 2003, Google published a paper describing the architecture of the Google File
System (GFS), which solved the storage needs for the very large files generated as a part of
the web crawl and indexing process.
In 2004, based on the GFS architecture, the Nutch team began implementing an open
source equivalent called the Nutch Distributed Filesystem (NDFS). In 2004, Google also published
its MapReduce paper, and by 2005 the Nutch developers had a working MapReduce implementation
in the Nutch project. Most of the Nutch algorithms had been ported to run using MapReduce and NDFS.
In February 2006, these components moved out of Nutch to form an independent subproject of
Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a
dedicated team and the resources to turn Hadoop into a system that ran at web scale. This was
demonstrated in February 2008 when Yahoo! announced that its production search index was
being generated by a 10,000-core Hadoop cluster.
In January 2008, Hadoop was made its own top-level project at Apache, confirming its
success and its diverse, active community. By this time, Hadoop was being used by many other
companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times.
In April 2008, Hadoop broke a world record to become the fastest system to sort a
terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just
under 3.5 minutes), beating the previous year's winner of 297 seconds.

Milestones in the History of Big Data


Information overload has become one of the most often-repeated mantras of our time.
Books are being digitized, newspapers and magazines now make up just a fraction of today's
media, augmented as it is by wave after wave of tweets and blog posts, and all the while the
gadgets we use to keep up with this digital frenzy become increasingly complex.
Some might complain about the digital revolution, but there's no denying the
immense impact it's had on our lives. With experts claiming that as much as 90% of all the
information in existence is less than two years old, everyone and everything, from governments
and marketers to police and now even farmers, has begun to show an interest in one of the hottest
talking points of our time: big data.
But did you ever wonder where all this data came from? And more to the point, how
did it get so big, and where is it all going? These are just some of the questions that we'll be
attempting to answer in today's short history of big data, charting the five major milestones that
led to its evolution into an entity that promises to change our world forever.

1890: The First Big Data Problem


Back in 1890, when the US government decided to perform a national census, the poor clerks at
the Bureau responsible were faced with the unenviable task of counting more than 60 million
souls in the country by laboriously transferring data from schedules to record sheets using
the slow and heartbreaking method of hand tallying.
Horrified at the prospect, Herman Hollerith came to the rescue with his novel Pantograph
tabulating machine, modeled on train conductors' habit of punching holes into tickets to denote
physical features and thus prevent fraud. Hollerith's idea was a simple punch card which held the
data of Census respondents and could be read in seconds by his electrical tabulating machine.
There's little doubt that Hollerith's invention was a defining moment in the history of data
processing, one that symbolized the beginning of the mechanized data collecting age.
Hollerith's machines successfully tabulated no less than 62,622,250 people in the US, saving the
Census Bureau some $5 million and cutting the Census completion time down from ten years to
less than 24 months.

1965: First Data Center is Conceived


Data didn't really become data until it had a base that it could safely reside in: a database, to be
exact. In 1965, faced with the growing problem of where to keep more than 742 million tax
returns and 175 million sets of fingerprints, the US government decided that its data needed a
smaller home, and began to study the feasibility of transferring all of those records to magnetic
computer tape and storing it all on one big computer.
While the plan was later dropped amid privacy concerns, it would later be remembered
as one that heralded the dawn of the electronic data storage era, nudging all of those pen-
pushing office clerks into oblivion once and for all.

1989: The World Wide Web is Born


Tim Berners-Lee's proposal to leverage the internet proved to be a game-changer in the way we
share and search for information. The British computer scientist probably had little idea of the
immense impact that facilitating the spread of information via hypertext would have on the
world, yet all the same he was remarkably confident of its success:
"The information contained would grow past a critical threshold, so that the usefulness
[of] the scheme would in turn encourage its increased use," he wrote at the time.

1997-2001: Big Data is Defined


In their paper titled "Application-controlled demand paging for out-of-core
visualization," Michael Cox and David Ellsworth were among the first to acknowledge the
problems that big data would present as information overload continued on its relentless path:
"Visualization provides an interesting challenge for computer systems: data sets are
generally quite large, taxing the capacities of main memory, local disk, and even remote disk.
We call this the problem of big data. When data sets do not fit in main memory (in core), or
when they do not fit even on local disk, the most common solution is to acquire more resources."
Cox and Ellsworth's use of the term big data is generally accepted as the first time
anyone had done so, although the honor of actually defining the term must go to one Doug
Laney, who in 2001 described it as a 3-dimensional data challenge of increasing data
volume, velocity and variety, a definition that has since become almost ubiquitous among
industry experts today.

2004: Enter Hadoop


Having dealt with big data problems, conceived data centers, developed a method of sharing data, and
defined exactly what it is, all that was left was to come up with some kind of tool that could
help us actually understand our big data.
Enter Hadoop, the free and open-source software program, named after a toy elephant,
which now underpins some of the world's most popular websites. In the last eight years,
Hadoop has become so big that it powers entire search engines, determining everything from
which ads they show us, to which long-lost friends Facebook pulls out of the hat, and even the
stories you see on your Yahoo homepage.
The creation of Hadoop marks big data's biggest milestone yet. It's an innovation
that's changed the face of big data forever, and with it the lives of everyone on the planet.
Hadoop provides a solution that anyone can use, from players like Google and IBM to even the
smallest of internet marketers, giving everyone the chance to profit from one of the most enigmatic
phenomena of our time.

Organizations Using Hadoop


In the past decades the volume and variety of recorded information have increased drastically, and existing data storage and processing tools could not handle the large amounts of data that began to be created after the Internet revolution. This made Hadoop one of the preferred tools for data-driven companies. Various organizations use Hadoop technology, including:

Hadoop in Facebook:
There are many data-driven companies using Hadoop at great scale; in this section we discuss its implementation in a few of them, such as Facebook, Yahoo, IBM, and health care organizations. Messaging has been one of Facebook's most popular features since its inception.
Other Facebook features, such as the Like button and status updates, are backed by MySQL databases, but applications such as the Facebook messaging system run on top of HBase, Hadoop's NoSQL database framework.
Facebook's data warehousing solution is Hive, which is built on top of HDFS.
Facebook's reporting needs are also served by Hive.
After 2011, with the increasing magnitude of data and the need to improve efficiency, Facebook started implementing Apache Corona, which works much like the YARN framework.
Corona uses a new scheduling framework that separates cluster resource management from job coordination.

Hadoop in Yahoo:
When it comes to the size of the Hadoop cluster, Yahoo beats all, with 42,000 nodes in about 20 YARN (aka MapReduce 2.0) clusters and 600 petabytes of data on HDFS to serve the company's mobile, search, advertising, personalization, media, and communication efforts.
Yahoo uses Hadoop to check around 20.5 billion messages before they enter its email servers and to block spam; Yahoo's spam detection abilities have improved manifold since it started using Hadoop.
Yahoo has been one of the major contributors to the ever-growing Hadoop family.
Yahoo has pioneered many new technologies that have since been embraced by the Hadoop ecosystem.
Notable technologies Yahoo uses apart from MapReduce and HDFS include Apache Tez and Spark.
One of the main vehicles of Yahoo's Hadoop chariot is Pig, which started at Yahoo and still tops the chart: 50-60 percent of jobs are processed using Pig scripts.

Hadoop in Health care companies:


Hadoop in cancer treatment: Patients with the same type of cancer respond differently to the same medicine because of each one's individual genome.
Each person's genome contains around 1.5 gigabytes of data. Understanding how a particular drug responds to a particular genome requires the genomic data to be stored, combined with other data such as demographics and trial outcomes, and finally analyzed to determine which medicine is suitable for which part of the genetic spectrum.
Many top cancer research institutes have applied Hadoop technology to raise the success rate of their cancer treatments.

Hadoop in checking recurrence of heart attacks: UC Irvine Health in the USA equips heart patients being discharged with a wireless scale, so that the weight they measure at home is transferred automatically and wirelessly to a Hadoop cluster in the hospital, where an algorithm estimates the chance of a recurring heart attack by analyzing the risk factors associated with the received weight data.

Hadoop in Telecom industries:


The telecommunication sector is one of the most data-driven industries.
Apart from processing millions of calls per second, it also provides services for web browsing, videos, television, streaming music, movies, text messages, and email.
All these sources have flooded telecom companies with a drastic increase in data, which has increased storage and processing overhead manifold.
Some case studies related to the implementation of Hadoop in the telecom sector are discussed below:

Analyzing call data records: To reduce the call drop rate and improve sound quality, the call details pouring into the company's database in real time have to be analyzed with maximum precision.
Telecom companies have been using tools like Flume to ingest millions of call records per second into Hadoop and then Apache Storm to process them in real time to identify troubling patterns.

Timely servicing of equipment: Replacing equipment on a telecom company's transmission towers is far more expensive than repairing it.
To determine an optimum maintenance schedule (not too early, not too late), companies have been using Hadoop to store unstructured, sensor, and streaming data.
Machine learning algorithms are applied to these data to reduce maintenance costs and to repair equipment in good time, before problems develop.

Hadoop in Financial sectors:


Companies in the financial sector have been using Hadoop to perform deeper analysis of their data, improving operational margins and detecting malicious activities that would go unnoticed in normal scenarios.
Some case studies already in practice in the financial sector are as follows:

Anti-money-laundering practice: Before Hadoop, finance companies stored data selectively, discarding historical data due to storage limitations.
So the sample data available for analytics was not sufficient to give reliable results that could be used to check money laundering.
Now companies use the Hadoop framework for greater storage and processing capacity, to determine the sources of black money and keep it out of the system.
Companies are now able to manage millions of customer names and their transactions in real time, and the rate of detecting suspicious transactions has increased drastically since implementing the Hadoop ecosystem.

Hadoop in Banks
Many banks across the world have been using the Hadoop platform to collect and analyze all the data pertaining to their customers, such as daily transactional data, data from interactions at multiple customer touch points like call centers, home value data, and merchant records.
Banks can analyze all these data to segment customers into one or more groups based on their needs for banking products and services, and target their sales, promotion, and marketing accordingly.
Using a big data Hadoop architecture, many credit-card-issuing banks have implemented fraud detection systems that detect suspicious activity by analyzing a customer's past spending patterns and trends, and have been disabling the cards of suspects.

Hadoop Cluster Architecture:


Apache Hadoop offers a scalable, flexible and reliable distributed computing big data framework for a cluster of systems with storage capacity and local computing power by leveraging commodity hardware. Hadoop follows a Master-Slave architecture for the transformation and analysis of large datasets using the Hadoop MapReduce paradigm. The 3 important Hadoop components that play a vital role in the Hadoop architecture are:
1. Client
2. Master
3. Slaves
The role of each component is described below.

Client:
The client is neither master nor slave. Its role is to load data into the cluster, submit MapReduce jobs describing how the data should be processed, and then retrieve the results once the job completes.
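As a rough illustration of the client's role, the sketch below configures and submits a job through Hadoop's Java API. It is a minimal pass-through job that uses the framework's identity Mapper and Reducer just to show the submission mechanics; a real job would substitute its own Mapper and Reducer subclasses, and the input/output paths are placeholders supplied on the command line.

// Hypothetical client-side job driver: it only describes the job and submits it;
// the actual work runs on the cluster's master and slave nodes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassThroughDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "pass-through example");
        job.setJarByClass(PassThroughDriver.class);
        job.setMapperClass(Mapper.class);      // identity mapper; a real job plugs in its own
        job.setReducerClass(Reducer.class);    // identity reducer; likewise a placeholder
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // data already loaded into HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results retrieved from here later
        // Submit the job and wait for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}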

Masters:
The Master consists of 3 components: NameNode, Secondary NameNode and JobTracker.
NameNode:
NameNode does NOT store the files, only the files' metadata. In a later section we will see that it is actually the DataNode which stores the files.
NameNode oversees the health of the DataNodes and coordinates access to the data stored in them. The NameNode keeps track of all file-system-related information, such as:
Which section of a file is saved in which part of the cluster
Last access time for the files
User permissions, i.e. which users have access to the file
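As a hedged illustration of the kind of metadata the NameNode serves, the sketch below uses the standard HDFS Java client to list file status and block locations; the directory /user/data is a made-up example path.

// Sketch: querying file metadata (served by the NameNode) via the HDFS Java API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/user/data"))) {   // hypothetical directory
            System.out.println(status.getPath()
                    + " perms=" + status.getPermission()            // user permissions
                    + " accessed=" + status.getAccessTime()          // last access time
                    + " modified=" + status.getModificationTime());
            // Which blocks of the file live on which DataNodes (cluster location info).
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("  block@" + loc.getOffset()
                        + " hosts=" + String.join(",", loc.getHosts()));
            }
        }
        fs.close();
    }
}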

JobTracker:
JobTracker coordinates the parallel processing of data using MapReduce.

Secondary Name Node:


Don't get confused by the name "Secondary". The Secondary NameNode is NOT a backup or high-availability node for the NameNode.
So what does the Secondary NameNode do?

The job of the Secondary NameNode is to contact the NameNode periodically (by default every hour).
The NameNode keeps all filesystem metadata in RAM and does not merge it into a clean on-disk image on its own. So if the NameNode crashes, everything held only in RAM is lost and you have no consolidated copy of the filesystem metadata. What the Secondary NameNode does is contact the NameNode every hour, pull a copy of the metadata information, shuffle and merge it into a clean file, and send it back to the NameNode, while keeping a copy for itself. Hence the Secondary NameNode is not a backup; it performs housekeeping.
In case of NameNode failure, the saved metadata can be used to rebuild it.

Slaves:
Slave nodes make up the majority of machines in a Hadoop cluster and are responsible for:
Storing the data
Processing the computation

Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their masters. The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode.

How does Hadoop Work?


Hadoop lets the user connect multiple commodity computers together so that they behave as a single functional distributed system; the clustered machines read the dataset in parallel, produce intermediate results, and integrate them to give the desired output.
Hadoop runs code across a cluster of computers and performs the following tasks:
Data are initially divided into files and directories. Files are divided into uniformly sized blocks, typically 64 MB or 128 MB (see the configuration sketch after this list).
The files are then distributed across the cluster nodes for further processing.
The JobTracker then schedules tasks on individual nodes.
Once all the nodes have finished their work, the output is returned.
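As a small, hedged illustration of the block-size point above, the sketch below sets the standard HDFS block-size and replication properties on a client Configuration object. The values are illustrative only; real clusters normally set these in hdfs-site.xml.

// Sketch: the block size used when files are split is a setting, not a hard-coded constant.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // split new files into 128 MB blocks
        conf.setInt("dfs.replication", 3);                  // keep 3 copies of each block
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default block size: "
                + fs.getDefaultBlockSize(new Path("/")) + " bytes");
        fs.close();
    }
}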

Advantages of Hadoop:
1. Scalable
Hadoop is a highly scalable storage platform, because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional relational database systems (RDBMS) that can't scale to process large amounts of data, Hadoop enables businesses to run applications on thousands of nodes involving many thousands of terabytes of data.

2. Cost effective
Hadoop also offers a cost-effective storage solution for businesses' exploding data sets. The problem with traditional relational database management systems is that it is extremely cost-prohibitive to scale them to process such massive volumes of data. In an effort to reduce costs, many companies in the past would down-sample data and classify it based on assumptions about which data was the most valuable. The raw data would be deleted, as it was too cost-prohibitive to keep. While this approach may have worked in the short term, it meant that when business priorities changed, the complete raw data set was no longer available, as it had been too expensive to store.

3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different types of data
(both structured and unstructured) to generate value from that data. This means businesses can
use Hadoop to derive valuable business insights from data sources such as social media and email conversations. Hadoop can be used for a wide variety of purposes, such as log processing,
recommendation systems, data warehousing, market campaign analysis and fraud detection.

4. Fast
Hadoop's unique storage method is based on a distributed file system that maps data wherever it is located on a cluster. The tools for data processing are often on the same servers where the data is located, resulting in much faster data processing. If you're dealing with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just minutes, and petabytes in hours.

5. Resilient to failure
A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node,
that data is also replicated to other nodes in the cluster, which means that in the event of failure,
there is another copy available for use.

6. The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores.

7. Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA),
rather Hadoop library itself has been designed to detect and handle failures at the application
layer.

8. Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.

9. Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java-based.

Disadvantages of Hadoop:
As the backbone of so many implementations, Hadoop is almost synonymous with big data.

1. Security Concerns
Just managing a complex application such as Hadoop can be challenging. A simple example can be seen in the Hadoop security model, which is disabled by default due to sheer complexity. If whoever manages the platform does not know how to enable it, your data could be at huge risk. Hadoop is also missing encryption at the storage and network levels, a capability that is a major selling point for government agencies and others that prefer to keep their data under wraps.

2. Vulnerable By Nature
Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The
framework is written almost entirely in Java, one of the most widely used yet controversial
programming languages in existence. Java has been heavily exploited by cybercriminals and as a
result, implicated in numerous security breaches.

3. Not Fit for Small Data


While big data is not exclusively made for big businesses, not all big data platforms are suited to small data needs. Unfortunately, Hadoop happens to be one of them. Due to its high-capacity design, the Hadoop Distributed File System lacks the ability to efficiently support the random reading of small files. As a result, it is not recommended for organizations with small quantities of data.

4. Potential Stability Issues


Like all open source software, Hadoop has had its fair share of stability issues. To avoid these
issues, organizations are strongly recommended to make sure they are running the latest stable
version, or run it under a third-party vendor equipped to handle such problems.

5. General Limitations
Alternatives such as Apache Flume, MillWheel, and Google's own Cloud Dataflow have been proposed as possible solutions. What each of these platforms has in common is the ability to improve the efficiency and reliability of data collection, aggregation, and integration. The main point is that companies could be missing out on big benefits by using Hadoop alone.
Chapter 2
Hadoop Design, Architecture & MapReduce

2.1 IO Processing Challenges


Cloud computing is changing the IT landscape in a profound way. Via cloud computing,
dynamically scalable, virtualized resources are provided as a service over the Internet. Large
firms such as Google, Amazon, IBM, Microsoft, or Apple are providing cloud models that
incorporate sound data storage solutions. Today, most companies around the world are
processing vast amounts of data. In 2011, IDC (www.idc.com) estimated the total world wide
data size (labeled the digital data universe) at 1.8 ZB (zettabytes, or 10^21 bytes). As a comparison, a TB (terabyte) equals 10^12 bytes (the binary usage equals 2^40). The ever
larger data pools required by most companies today obviously have a profound impact not only
on the HW storage requirements and user applications, but also on the file system design, the file
system implementation, as well as the actual IO performance and scalability behavior of today's
IT environments. To illustrate, the New York Stock Exchange generates approximately 1 TB of
new trade data each day. CERN's Large Hadron Collider produces approximately 15 PB
(petabytes) of scientific data per year. While hard drive (HDD) storage capacity has increased
significantly over the years, the actual HDD data access rate has not been improved much. To
illustrate, a contemporary 1 TB disk with a 100 MB/second transfer speed hypothetically requires more than 2 1/2 hours to read all the data. Solid State Disks (SSD) may be an option for some IO workloads, but the rather high price per unit (next to the specific workload behavior required to even utilize the performance of the disks) deters many companies from deploying SSDs in large volumes. To improve aggregate IO throughput, the obvious answer is to read/write to/from multiple disks. Assume a hypothetical setup with 100 drives, each holding 1/100 of the 1 TB data pool discussed above; if all disks are accessed in parallel at 100 MB/second, the 1 TB can be fetched in less than 2 minutes. Any distributed IT environment faces several similar challenges.
First, an appropriate redundancy at the HW and the SW level has to be designed into the solution
so that the environment meets the availability, reliability, maintainability, as well as performance
goals and objectives. Second, most distributed environments (the applications) require
combining the distributed data from multiple sources for post-processing and/or visualization
purposes. Ergo, most distributed processing environments are powered by rather large SAN subsystems and a parallel file system solution such as IBM's GPFS, Lustre (GNU/GPL), or Red Hat's GFS.
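The back-of-the-envelope arithmetic used above (one drive versus one hundred drives) can be written out as a small sketch; the figures are the hypothetical ones from the text, not measurements.

// Transfer-time arithmetic: 1 TB at 100 MB/s sequentially vs. spread over 100 drives in parallel.
public class TransferTime {
    public static void main(String[] args) {
        double totalMB = 1_000_000;   // 1 TB expressed as 10^6 MB (decimal units, as in the text)
        double mbPerSecond = 100;     // per-drive transfer rate
        int drives = 100;

        double sequentialSeconds = totalMB / mbPerSecond;            // one drive reads everything
        double parallelSeconds = (totalMB / drives) / mbPerSecond;   // 100 drives read in parallel

        System.out.printf("Single drive : %.1f hours%n", sequentialSeconds / 3600);  // ~2.8 hours
        System.out.printf("100 drives   : %.1f minutes%n", parallelSeconds / 60);    // ~1.7 minutes
    }
}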

2.2 Hadoop Vs Conventional Databases Which One to Choose?


Today's ultra-connected globe is producing enormous amounts of data at high rates. Consequently, big data analytics has turned into a highly effective tool for organizations aiming to leverage piles of precious data for higher revenue and competitive benefit. Amid this big data rush, Hadoop, being a cloud-based system, continues to be intensely marketed as the perfect solution for the big data problems of the business world. While Hadoop has lived amid much of the buzz, there are specific circumstances in which running workloads on a conventional database would be the superior solution.
For businesses wanting to know which option will better serve their big data requirements, below are a few important questions to ask when choosing between Hadoop (including cloud-based Hadoop) and a conventional database.

Is your data structured or unstructured?


Structured Data: Data that resides within the fixed fields of a record or file is called structured data. Since structured data can be inserted, stored, queried, and analyzed in a simple and straightforward way, such data is better served by a conventional database.
Unstructured Data: Data that comes from many different sources, such as emails, text files, videos, images, audio files, and social media sites, is called unstructured data. Being complicated and voluminous, this type of data generally cannot be managed or efficiently queried with a conventional database. Hadoop's ability to join, blend, and analyze large amounts of unstructured data without structuring it first enables businesses to gain deeper insights easily. Therefore Hadoop is the ideal solution for businesses aiming to store and analyze huge amounts of unstructured data.

Do you need a scalable infrastructure?


Businesses with constant and predictable data workloads will be better suited to a conventional database.
Organizations challenged by growing data demands may wish to reap the benefits of Hadoop's scalable infrastructure. Scalability enables servers to support increasing workloads. As a cloud-based solution, Hadoop provides better flexibility and scalability by spinning up servers in a short time to accommodate changing workloads.

Will implementing Hadoop remain affordable?


Affordability is an issue for businesses seeking to take up new technologies. When it comes to Hadoop implementation, businesses have to do their groundwork to ensure that the recognized advantages of implementing Hadoop outweigh the expenses. Otherwise it is better to stay with a conventional database to fulfill data management requirements.
With that said, Hadoop has quite a few points going for it that make implementation a lot more affordable than businesses may realize. To begin with, Hadoop saves money by merging open source systems with virtual servers, and it keeps costs down even more by reducing spending on servers and warehouses.
Hybrid systems that combine Hadoop with conventional relational databases are gaining interest as affordable ways for businesses to gain the advantages of each platform.

Is fast analysis your requirement?


Hadoop was originally created for processing large amounts of distributed data, touching every record in the database. Naturally, this kind of processing takes time. For tasks in which fast processing isn't essential, such as reviewing daily orders, checking historical data, or carrying out analytics where a slower analysis can be accepted, Hadoop is suitable.
On the other hand, in situations where companies demand faster data analysis, a conventional database would be the better option. That's because quick analysis isn't about analyzing substantial unstructured data, which Hadoop does well; it is more about analyzing smaller datasets in real time, which is exactly what conventional databases are well equipped to do.
Hybrid systems are also a good fit to consider, since they let businesses use conventional databases to run smaller, highly interactive workloads while employing Hadoop to analyze large, complicated data sets.

Which one is better?


That would depend. While big data analytics offer deeper insights providing competitive edge,
those advantages may simply be recognized by businesses that work out sufficient research in
considering Hadoop as an analytics tool that perfectly serves their requirements.

2.2.1 Difference between Hadoop and Traditional Databases


There are a lot of differences:

1. Hadoop is not a database. Hbase or Impala may be considered databases but Hadoop is
just a file system (hdfs) with built in redundancy, parallelism.
2. Traditional databases/RDBMS have ACID properties - Atomicity, Consistency, Isolation and Durability. You get none of these out of the box with Hadoop. So if you have to, for example, write code to take money from one bank account and put it into another, you have to (painfully) code all the scenarios, such as what happens if the money is taken out but a failure occurs before it is moved into the other account.
3. Hadoop offers massive scale in processing power and storage at a very low comparable
cost to an RDBMS.
4. Hadoop offers tremendous parallel processing capabilities. You can run jobs in parallel
to crunch large volumes of data.
5. Some people argue that traditional databases do not work well with unstructured data, but it's not as simple as that. I have come across many applications built on a traditional RDBMS that handle a lot of unstructured data, video files, or PDFs and work well.
6. Typically an RDBMS will keep a large chunk of the data in its cache for faster processing while at the same time maintaining read consistency across sessions. I would argue Hadoop does a better job of using the memory cache to process the data, though without offering guarantees such as read consistency.
7. Hive SQL is almost always an order of magnitude slower than SQL run in a traditional database. So if you are expecting SQL in Hive to be faster than in a database, you are in for a disappointment. It will not scale at all for complex analytics.
8. Hadoop is very good for parallel processing problems - like finding a set of keywords in a
large set of documents (this operation can be parallelized). However typically RDBMS
implementations will be faster for comparable data sets.

It is a fact that data has exploded in the past and, with volumes going through the roof, traditional databases, which were developed on the premise of a single CPU and a RAM cache, will no longer be able to support the requirements that businesses have. In all fairness, businesses may also start accepting that they can live with partially or reasonably consistent reports instead of completely consistent (but old) reports. This will be an evolution, and both Hadoop and the RDBMS will have to evolve to address it.
2.3 Hadoop Ecosystem Overview
Big Data has been the buzzword circulating in the IT industry since 2008. The amount of data being generated by social networks, manufacturing, retail, stocks, telecom, insurance, banking, and health care industries is way beyond our imagination.
Before the advent of Hadoop, storage and processing of big data was a big challenge. But now that Hadoop is available, companies have realized the business impact of Big Data and how understanding this data will drive growth. For example:
Banking sectors have a better chance to understand loyal customers, loan defaulters and fraud transactions.

Retail sectors now have enough data to forecast demand.

Manufacturing sectors need not depend on costly mechanisms for quality testing; capturing sensor data and analyzing it reveals many patterns.

E-commerce and social networks can personalize pages based on customer interests.

Stock markets generate humongous amounts of data; correlating it from time to time reveals beautiful insights.

Big Data has many useful and insightful applications.

Hadoop is the straight answer for processing Big Data. The Hadoop ecosystem is a combination of technologies that have a proficient advantage in solving business problems.

The Hadoop ecosystem comprises services like HDFS and MapReduce for storing and processing large data sets. In addition to these services, the ecosystem provides several tools to perform different types of data modeling operations. For example, the ecosystem includes Hive for querying and fetching data stored in HDFS.

Similarly, the ecosystem includes Pig, a data flow language that can also express MapReduce jobs. For data migration and job scheduling, we use further tools in the Hadoop ecosystem.
In order to handle large data sets, Hadoop has a distributed framework which can scale out to thousands of nodes. Hadoop adopts a parallel, distributed approach to process huge amounts of data. The two main components of Apache Hadoop are HDFS (Hadoop Distributed File System) and MapReduce (MR). The basic principle of Hadoop is write once, read many times.

Remember that Hadoop is a framework. If Hadoop were a house, it wouldn't be a very comfortable place to live. It would provide walls, windows, doors, pipes, and wires. The Hadoop ecosystem provides the furnishings that turn the framework into a comfortable home for big data activity that reflects your specific needs and tastes.

The Hadoop ecosystem includes both official Apache open source projects and a wide range of
commercial tools and solutions. Some of the best-known open source examples include Spark,
Hive, Pig, Oozie and Sqoop. Commercial Hadoop offerings are even more diverse and include
platforms and packaged distributions from vendors such as Cloudera, Hortonworks, and MapR,
plus a variety of tools for specific Hadoop development, production, and maintenance tasks.

Most of the solutions available in the Hadoop ecosystem are intended to supplement one or two of Hadoop's four core elements (HDFS, MapReduce, YARN, and Common). However, the commercially available framework solutions provide more comprehensive functionality. The sections below provide a closer look at some of the more prominent components of the Hadoop ecosystem, starting with the Apache projects.

The different components of Hadoop ecosystem are

a. Hadoop Distributed File System


HDFS is the primary storage system of Hadoop. The Hadoop distributed file system (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for big data. HDFS is a distributed filesystem that runs on commodity hardware. HDFS ships with a default configuration that suits many installations; large clusters usually need additional configuration. Users interact directly with HDFS through shell-like commands.

Components of HDFS:
i. NameNode: It is also known as the Master node. The NameNode does not store the actual data or dataset. The NameNode stores metadata, i.e. the number of blocks, their locations, on which rack and which DataNode the data is stored, and other details. It keeps track of files and directories.
Tasks of NameNode
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as naming, closing, and opening files and directories.

ii. DataNode: It is also known as the Slave. The HDFS DataNode is responsible for storing actual data in HDFS. The DataNode performs read and write operations as requested by clients. Each block replica on a DataNode consists of 2 files on the local file system: the first file holds the data and the second records the block's metadata, which includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and performs a handshake, which verifies the namespace ID and the software version of the DataNode. If a mismatch is found, the DataNode shuts down automatically.

Tasks of DataNode
DataNode performs operations like block replica creation, deletion and replication
according to the instruction of NameNode.
DataNode manages data storage of the system.

b. MapReduce
Hadoop MapReduce is the core component of hadoop which provides data processing.
MapReduce is a software framework for easily writing applications that process the vast amount
of structured and unstructured data stored in the Hadoop Distributed File system.
Hadoop MapReduce programs are parallel in nature, thus are very useful for
performing large-scale data analysis using multiple machines in the cluster. By this parallel
processing, speed and reliability of cluster is improved.

Working of MapReduce
MapReduce works by breaking the processing into two phases:
Map phase
Reduce phase
Each phase has key-value pairs as input and output. In addition, programmer also specifies two
functions: map function and reduce function
Map function takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
Reduce function takes the output from the Map as an input and combines those data tuples
based on the key and accordingly modifies the value of the key.

Features of MapReduce
i. Simplicity: MapReduce jobs are easy to run. Applications can be written in many languages, such as Java, C++, and Python.

ii. Scalability: MapReduce can process petabytes of data.

iii. Speed: By means of parallel processing, problems that would take days to solve are solved in hours or minutes by MapReduce.
iv. Fault tolerance: MapReduce takes care of failures. If one copy of the data is unavailable, another machine has a copy of the same key/value pairs, which can be used to solve the same subtask.

c. YARN
YARN provides the resource management. YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads. It allows multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.
YARN has been projected as the data operating system for Hadoop 2. The main features of YARN are:
Flexibility: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming. Thanks to this feature of YARN, other applications can be run alongside MapReduce programs in Hadoop 2.
Efficiency: As many applications can run on the same cluster, the efficiency of Hadoop increases without much effect on quality of service.
Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads. Additional programming models such as graph processing and iterative modeling are now possible for data processing.

d. Hive
Hive is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hive performs three main functions: data summarization, querying, and analysis.
Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop.
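As a hedged sketch of how an application might run such a query, the snippet below uses the standard Hive JDBC driver against a HiveServer2 endpoint; the host, port, credentials, and the "orders" table are placeholders that depend entirely on the cluster, and Hive (or Tez/Spark, depending on the execution engine) turns the query into jobs behind the scenes.

// Sketch: submitting a HiveQL query over JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");   // placeholder endpoint
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT product, COUNT(*) AS cnt FROM orders GROUP BY product")) {
            while (rs.next()) {
                System.out.println(rs.getString("product") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}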

Main parts of Hive are:


Metastore: stores the metadata (table definitions, partitions and schemas).
Driver: manages the lifecycle of a HiveQL statement.
Query compiler: compiles HiveQL into a directed acyclic graph of stages.
Hive server: provides a Thrift interface and a JDBC/ODBC server.

e. Pig
Pig is a high-level data flow platform that runs on top of HDFS. The language used in Pig is called Pig Latin, and it is loosely similar to SQL. It is used to load the data, apply the required filters, and dump the data in the required format. For program execution, Pig requires a Java runtime environment.

Features of Apache Pig:


Extensibility: For carrying out special-purpose processing, users are allowed to create their own functions.
Optimization opportunities: Pig allows the system to optimize execution automatically. This lets the user pay attention to semantics instead of efficiency.
Handles all kinds of data: Pig analyzes both structured and unstructured data.

f. Hbase
HBase is a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access for reading and writing data in HDFS.

Components of Hbase

i. HBase Master: It is not part of the actual data storage, but it negotiates load balancing across all RegionServers.

It maintains and monitors the Hadoop cluster.

It performs administration (an interface for creating, updating and deleting tables).
It controls failover.
The HMaster handles DDL operations.

ii. RegionServer: It is the worker node which handles read, write, update and delete requests from clients. The RegionServer process runs on every node in the Hadoop cluster. The RegionServer runs on the HDFS DataNode.
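As a hedged illustration of the real-time read/write access mentioned above, the sketch below uses the standard HBase Java client; the table "messages" and column family "cf" are hypothetical and would have to exist in the target cluster.

// Sketch: writing and reading a cell through the HBase client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("messages"))) {
            // Write one cell: row key "user1", column cf:text.
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("text"), Bytes.toBytes("hello"));
            table.put(put);

            // Read it back in (near) real time.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("text"));
            System.out.println(Bytes.toString(value));
        }
    }
}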

g. HCatalog
HCatalog is a table and storage management layer for hadoop. HCatalog supports different
components available in hadoop like MapReduce, hive and pig to easily read and write data from
the cluster. HCatalog is a key component of Hive that enables the user to store their data in any
format and structure.
By default, HCatalog supports RCFile, CSV, JSON, sequenceFile and ORC file formats.

Benefits of HCatalog:
Enables notifications of data availability.
With the table abstraction, HCatalog frees the user from overhead of data storage.
Provide visibility for data cleaning and archiving tools.

h. Avro
Avro is a popular data serialization system. Avro is an open source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently, and programs written in different languages can exchange big data using Avro.
Using the serialization service, programs can serialize data into files or messages. Avro stores the data definition and the data together in one message or file, making it easy for programs to dynamically understand the information stored in an Avro file or message.
Avro schema: Avro relies on schemas for serialization and deserialization. Avro requires a schema when data is written or read. When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program.
Dynamic typing: It refers to serialization and deserialization without code generation. It
complements the code generation which is available in Avro for statically typed language as an
optional optimization.

Avro provides:
Rich data structures.
Remote procedure call.
Compact, fast, binary data format.
Container file, to store persistent data.
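The sketch below shows one way these pieces fit together: a record is serialized into an Avro container file using a schema defined at runtime, so the schema travels with the data. The "Person" schema and the output file name are illustrative only.

// Sketch: serializing one record with Avro's generic (code-generation-free) API.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord person = new GenericData.Record(schema);
        person.put("name", "Alice");
        person.put("age", 30);

        // The schema is stored in the container file alongside the data, so any
        // program can read it back later without generated code (dynamic typing).
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("people.avro"));
            writer.append(person);
        }
    }
}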

i. Thrift
It is a software framework for scalable cross-language services development. Thrift is an
interface definition language used for RPC communication. Hadoop does a lot of RPC calls so
there is a possibility of using Apache Thrift for performance or other reasons.

j. Apache Drill
The main purpose of Drill is large-scale data processing, including structured and semi-structured data. It is a low-latency distributed query engine designed to scale to several thousands of nodes and query petabytes of data. Drill is the first distributed SQL query engine that has a schema-free model.

Application of Apache drill


Drill has become an invaluable tool at Cardlytics, a company that provides consumer purchase data for mobile and internet banking. Cardlytics uses Drill to quickly process trillions of records and execute queries.

Features of Apache Drill:


Drill has a specialized memory management system that eliminates garbage collection and optimizes memory allocation and usage. Drill plays well with Hive, allowing developers to reuse their existing Hive deployments.
Extensibility: Drill provides an extensible architecture at all layers, including the query layer, query optimization, and the client API. Any layer can be extended for the specific needs of an organization.
Flexibility: Drill provides a hierarchical columnar data model that can represent complex, highly dynamic data and allow efficient processing.
Dynamic schema discovery: Apache Drill does not require schema or type specification for data in order to start the query execution process. Instead, Drill starts processing the data in units called record batches and discovers the schema on the fly during processing.
Decentralized metadata: Unlike other SQL-on-Hadoop technologies, Drill does not have a centralized metadata requirement. Drill users do not need to create and manage tables in a metastore in order to query data.

k. Apache Mahout
Mahout is an open source framework primarily used for creating scalable machine learning algorithms; it also serves as a data mining library. Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.

Algorithms of Mahout are:


Clustering: Takes items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
Collaborative filtering: Mines user behavior and makes product recommendations (e.g. Amazon recommendations).
Classification: Learns from existing categorizations and then assigns unclassified items to the best category.
Frequent pattern mining: Analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together.

l. Apache Sqoop
Sqoop is used for importing data from external sources into Hadoop components like HDFS, HBase or Hive. It is also used for exporting data from Hadoop to other external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle and MySQL.

Features of Apache Sqoop:


Import sequential datasets from the mainframe: Sqoop satisfies the growing need to move data from the mainframe to HDFS.
Import directly to ORC files: Improves compression and lightweight indexing, and improves query performance.
Parallel data transfer: For faster performance and optimal system utilization.
Efficient data analysis: Improves the efficiency of data analysis by combining structured and unstructured data in a schema-on-read data lake.
Fast data copies: From external systems into Hadoop.

m. Apache Flume
Flume is used for efficiently collecting, aggregating and moving large amounts of data from their origin into HDFS. Flume is a fault-tolerant and reliable mechanism. Flume was created to let data flow from the source into the Hadoop environment. It uses a simple extensible data model that allows for online analytic applications. Using Flume, we can move data from multiple servers into Hadoop immediately.

n. Ambari
Ambari is a management platform for provisioning, managing, monitoring and securing an Apache Hadoop cluster. Hadoop management gets simpler, as Ambari provides a consistent, secure platform for operational control.

Features of Ambari:
Simplified installation, configuration, and management: Ambari easily and efficiently creates and manages clusters at scale.
Centralized security setup: Ambari reduces the complexity of administering and configuring cluster security across the entire platform.
Highly extensible and customizable: Ambari is highly extensible for bringing custom services under management.
Full visibility into cluster health: Ambari ensures that the cluster is healthy and available with a holistic approach to monitoring.

o. Zookeeper
Apache Zookeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services. Zookeeper is used to
manage and coordinate a large cluster of machines.

Features of zookeeper:
Fast: ZooKeeper is fast with workloads where reads are more common than writes. The ideal read/write ratio is around 10:1.
Ordered: ZooKeeper maintains a record of all transactions, which can be used for higher-level abstractions such as synchronization primitives.

p. Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.
In Oozie, users can create Directed Acyclic Graphs of workflows, which can run in parallel and sequentially in Hadoop. Oozie is scalable and can manage the timely execution of thousands of workflows in a Hadoop cluster. Oozie is also very flexible: one can easily start, stop, suspend and rerun jobs, and it is even possible to skip a specific failed node or rerun it.

There are two basic types of Oozie jobs:


Oozie workflow: It is to store and run workflows composed of hadoop jobs e.g.,
MapReduce, pig, Hive.
Oozie coordinator: It runs workflow jobs based on predefined schedules and availability
of data.

2.4 Hadoop MapReduce


MapReduce is a programming model and an associated implementation for processing and
generating big data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a Map() procedure (method) that performs filtering and
sorting (such as sorting students by first name into queues, one queue for each name) and a
Reduce() method that performs a summary operation (such as counting the number of students in
each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure"
or "framework") orchestrates the processing by marshalling the distributed servers, running the
various tasks in parallel, managing all communications and data transfers between the various
parts of the system, and providing for redundancy and fault tolerance.
The model is a specialization of the split-apply-combine strategy for data analysis. It is inspired
by the map and reduce functions commonly used in functional programming, although their
purpose in the MapReduce framework is not the same as in their original forms. The key
contributions of the MapReduce framework are not the actual map and reduce functions (which,
for example, resemble the 1995 Message Passing Interface standard's reduce and scatter
operations), but the scalability and fault-tolerance achieved for a variety of applications by
optimizing the execution engine. As such, a single-threaded implementation of MapReduce will
usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually
only seen with multi-threaded implementations. The use of this model is beneficial only when
the optimized distributed shuffle operation (which reduces network communication cost) and
fault tolerance features of the MapReduce framework come into play. Optimizing the
communication cost is essential to a good MapReduce algorithm.

MapReduce libraries have been written in many programming languages, with different levels of
optimization. A popular open-source implementation that has support for distributed shuffles is
part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google
technology, but has since been genericized. By 2014, Google was no longer using MapReduce as
their primary Big Data processing model, and development on Apache Mahout had moved on to
more capable and less disk-oriented mechanisms that incorporated full map and reduce
capabilities.
MapReduce is a framework for processing parallelizable problems across large datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware).
Processing can occur on data stored either in a filesystem (unstructured) or in a database
(structured). MapReduce can take advantage of the locality of data, processing it near the place it
is stored in order to minimize communication overhead.

"Map" step: Each worker node applies the "map()" function to the local data, and writes
the output to a temporary storage. A master node ensures that only one copy of redundant
input data is processed.
"Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the
"map()" function), such that all data belonging to one key is located on the same worker
node.
"Reduce" step: Worker nodes now process each group of output data, per key, in parallel.
MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential (because multiple instances of the reduction process must be run rather than one), MapReduce can be applied to significantly larger datasets than "commodity" servers can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.

Another way to look at MapReduce is as a 5-step parallel and distributed computation:

1. Prepare the Map() input: the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor will work on, and provides that processor with all the input data associated with that key value.
2. Run the user-provided Map() code: Map() is run exactly once for each K1 key value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors: the MapReduce system designates Reduce processors, assigns the K2 key value each processor will work on, and provides that processor with all the Map-generated data associated with that key value.
4. Run the user-provided Reduce() code: Reduce() is run exactly once for each K2 key value produced by the Map step.
5. Produce the final output: the MapReduce system collects all the Reduce output and sorts it by K2 to produce the final outcome.
These five steps can be logically thought of as running in sequence (each step starts only after the previous step is completed), although in practice they can be interleaved as long as the final result is not affected.

In many situations, the input data might already be distributed ("sharded") among many different
servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers
that would process the locally present input data. Similarly, step 3 could sometimes be sped up
by assigning Reduce processors that are as close as possible to the Map-generated data they need
to process.

2.4.1 Key Features of MapReduce


Scale-out architecture: Servers can be added to increase processing power; if we want to handle more data, all we have to do is add another node.
Security and authentication: Working with HDFS ensures that only approved users can work with the data in the system.
Flexibility: The map and reduce functions can be written in programming languages such as Java and Python.
Resiliency and high availability: Multiple JobTrackers and TaskTrackers ensure that failed jobs are detected independently and restarted automatically.
Optimized scheduling: MapReduce performs scheduling according to prioritization.

2.4.2 Logical view


The Map and Reduce functions of MapReduce are both defined with respect to data structured in
(key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of
pairs in a different domain:

Map(k1, v1) → list(k2, v2)

The Map function is applied in parallel to every pair (keyed by k1) in the input dataset. This
produces a list of pairs (keyed by k2) for each call. After that, the MapReduce framework
collects all pairs with the same key (k2) from all lists and groups them together, creating one
group for each key.
The Reduce function is then applied in parallel to each group, which in turn produces a
collection of values in the same domain:

Reduce(k2, list(v2)) → list(v3)


Each Reduce call typically produces either one value v3 or an empty return, though one call is
allowed to return more than one value. The returns of all calls are collected as the desired result
list.
Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values. This
behavior is different from the typical functional programming map and reduce combination,
which accepts a list of arbitrary values and returns one single value that combines all the values
returned by map.
It is necessary but not sufficient to have implementations of the map and reduce abstractions in
order to implement MapReduce. Distributed implementations of MapReduce require a means of
connecting the processes performing the Map and Reduce phases. This may be a distributed file
system. Other options are possible, such as direct streaming from mappers to reducers, or for the
mapping processors to serve up their results to reducers that query them.

Examples
The prototypical MapReduce example counts the appearance of each word in a set of documents:

function map(String name, String document):


// name: document name
// document: document contents
for each word w in document:
emit (w, 1)

function reduce(String word, Iterator partialCounts):


// word: a word
// partialCounts: a list of aggregated partial counts
sum = 0
for each pc in partialCounts:
sum += pc
emit (word, sum)

Here, each document is split into words, and each word is counted by the map function, using the
word as the result key. The framework puts together all the pairs with the same key and feeds
them to the same call to reduce. Thus, this function just needs to sum all of its input values to
find the total appearances of that word.
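For comparison, here is a hedged sketch of the same word count written against Hadoop's Java MapReduce API; it mirrors the pseudocode above and would be wired into a job with a driver like the one sketched earlier in the Client section. The class and variable names are illustrative, not part of any official example.

// Word count, mirroring the pseudocode above, using Hadoop's Mapper/Reducer API.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map(name, document): emit (word, 1) for every word in the input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit (w, 1)
            }
        }
    }

    // reduce(word, partialCounts): sum all the 1s collected for that word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, sum)
        }
    }
}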

As another example, imagine that for a database of 1.1 billion people, one would like to compute
the average number of social contacts a person has according to age. In SQL, such a query could
be expressed as:

SELECT age, AVG(contacts)


FROM social.person
GROUP BY age
ORDER BY age
Using MapReduce, the K1 key values could be the integers 1 through 1100, each representing a batch of 1 million records, the K2 key value could be a person's age in years, and this computation could be achieved using the following functions:

function Map is
input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
for each social.person record in the K1 batch do
let Y be the person's age
let N be the number of contacts the person has
produce one output record (Y,(N,1))
repeat
end function

function Reduce is
input: age (in years) Y
for each input record (Y,(N,C)) do
Accumulate in S the sum of N*C
Accumulate in Cnew the sum of C
repeat
let A be S/Cnew
produce one output record (Y,(A,Cnew))
end function

The MapReduce System would line up the 1100 Map processors and provide each with its corresponding 1 million input records. The Map step would produce 1.1 billion (Y,(N,1)) records, with Y values ranging between, say, 8 and 103. The MapReduce System would then line up the 96 Reduce processors by shuffling the key/value pairs (because we need one average per age) and provide each with its millions of corresponding input records. The Reduce step would result in the much reduced set of only 96 output records (Y,A), which would be put in the final result file, sorted by Y.

The count information in the record is important if the reduce step is applied more than once. If we did not add the count of the records, the computed average would be wrong. For example:

-- map output #1: age, quantity of contacts


10, 9
10, 9
10, 9
-- map output #2: age, quantity of contacts
10, 9
10, 9
-- map output #3: age, quantity of contacts
10, 10

If we reduce files #1 and #2, we will have a new file with an average of 9 contacts for a 10-year-
old person ((9+9+9+9+9)/5):
-- reduce step #1: age, average of contacts
10, 9
If we reduce it with file #3, we lose the count of how many records we've already seen, so we
end up with an average of 9.5 contacts for a 10-year-old person ((9+10)/2), which is wrong. The
correct answer is 9.166 = 55 / 6 = (9*3+9*2+10*1)/(3+2+1).
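A minimal sketch of the point above, under the same numbers as the worked example: carrying (sum, count) pairs lets partial results be merged in any order without losing information, whereas averaging the averages does not.

// Merging partial (sum, count) pairs keeps the running average correct no matter
// how many times the reduce step is applied. Values mirror the example above.
public class PartialAverage {
    static long[] merge(long[] a, long[] b) {            // each pair is {sum, count}
        return new long[] { a[0] + b[0], a[1] + b[1] };
    }

    public static void main(String[] args) {
        long[] out1 = { 27, 3 };   // map output #1: 9+9+9 contacts over 3 records
        long[] out2 = { 18, 2 };   // map output #2: 9+9 contacts over 2 records
        long[] out3 = { 10, 1 };   // map output #3: 10 contacts over 1 record

        long[] merged = merge(merge(out1, out2), out3);
        System.out.printf("average = %.3f%n",
                (double) merged[0] / merged[1]);          // 55 / 6 ≈ 9.167
    }
}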

2.4.3 Dataflow
The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the
application defines, are:
an input reader
a Map function
a partition function
a compare function
a Reduce function
an output writer

2.4.3.1 Input reader


The input reader divides the input into appropriate size 'splits' (in practice typically 64 MB to
128 MB) and the framework assigns one split to each Map function. The input reader reads data
from stable storage (typically a distributed file system) and generates key/value pairs.

A common example will read a directory full of text files and return each line as a record.

2.4.3.2 Map function


The Map function takes a series of key/value pairs, processes each, and generates zero or more
output key/value pairs. The input and output types of the map can be (and often are) different
from each other.
If the application is doing a word count, the map function would break the line into words and
output a key/value pair for each word. Each output pair would contain the word as the key and
the number of instances of that word in the line as the value.

2.4.3.3 Partition function


Each Map function output is allocated to a particular reducer by the application's partition
function for sharding purposes. The partition function is given the key and the number of
reducers and returns the index of the desired reducer.
A typical default is to hash the key and use the hash value modulo the number of reducers. It is
important to pick a partition function that gives an approximately uniform distribution of data per
shard for load-balancing purposes, otherwise the MapReduce operation can be held up waiting
for slow reducers to finish (i.e. the reducers assigned the larger shares of the non-uniformly
partitioned data).
Between the map and reduce stages, the data are shuffled (parallel-sorted / exchanged between
nodes) in order to move the data from the map node that produced them to the shard in which
they will be reduced. The shuffle can sometimes take longer than the computation time
depending on network bandwidth, CPU speeds, data produced and time taken by map and reduce
computations.

2.4.3.4 Comparison function


The input for each Reduce is pulled from the machine where the Map ran and sorted using the
application's comparison function.

2.4.3.5 Reduce function


The framework calls the application's Reduce function once for each unique key in the sorted
order. The Reduce can iterate through the values that are associated with that key and produce
zero or more outputs.
In the word count example, the Reduce function takes the input values, sums them and generates
a single output of the word and the final sum.
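A matching Python sketch of the word-count Reduce function (again an illustration, not framework code):

def word_count_reduce(word, counts):
    # counts: all partial counts emitted for this word by the mappers
    yield (word, sum(counts))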

2.4.3.6 Output writer


The Output Writer writes the output of the Reduce to the stable storage.

2.4.4 Implementation
Many different implementations of the MapReduce interface are possible. The right choice
depends on the environment. For example, one implementation may be suitable for a small
shared-memory machine, another for a large NUMA multi-processor, and yet another for an
even larger collection of networked machines. This section describes an implementation targeted
to the computing environment in wide use at Google: large clusters of commodity PCs connected
together with switched Ethernet. In our environment:
(1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of
memory per machine.
(2) Commodity networking hardware is used, typically either 100 megabits/second or 1
gigabit/second at the machine level, but averaging considerably less in overall bisection
bandwidth.
(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are
common.
(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A
distributed file system developed in-house is used to manage the data stored on these disks. The
file system uses replication to provide availability and reliability on top of unreliable hardware.
(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped
by the scheduler to a set of available machines within a cluster.

2.4.4.1 Execution Overview


The Map invocations are distributed across multiple machines by automatically partitioning the
input data into a set of M splits. The input splits can be processed in parallel by different
machines. Reduce invocations are distributed by partitioning the intermediate key space into R
pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and
the partitioning function are specified by the user.
Figure 1 shows the overall flow of a MapReduce operation in our implementation. When the user
program calls the MapReduce function, the following sequence of actions occurs (the numbered
labels in Figure 1 correspond to the numbers in the list below):
1. The MapReduce library in the user program first splits the input files into M pieces of
typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional
parameter). It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is special: the master. The rest are workers that are assigned
work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle
workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It
parses key/value pairs out of the input data and passes each pair to the user-defined Map
function. The intermediate key/value pairs produced by the Map function are buffered in
memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the
partitioning function. The locations of these buffered pairs on the local disk are passed back to
the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure
calls to read the buffered data from the local disks of the map workers. When a reduce worker
has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the
same key are grouped together. The sorting is needed because typically many different keys map
to the same reduce task. If the amount of intermediate data is too large to fit in memory, an
external sort is used.
6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate
key encountered, it passes the key and the corresponding set of intermediate values to the user's
Reduce function. The output of the Reduce function is appended to a final output file for this
reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user
program. At this point, the MapReduce call in the user program returns back to the user code.

After successful completion, the output of the MapReduce execution is available in the R output
files (one per reduce task, with file names as specified by the user). Typically, users do not need
to combine these R output files into one file; they often pass these files as input to another
MapReduce call, or use them from another distributed application that is able to deal with input
that is partitioned into multiple files.

2.4.4.2 Master Data Structures


The master keeps several data structures. For each map task and reduce task, it stores the state
(idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).
The master is the conduit through which the location of intermediate file regions is
propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master
stores the locations and sizes of the R intermediate file regions produced by the map task.
Updates to this location and size information are received as map tasks are completed. The
information is pushed incrementally to workers that have in-progress reduce tasks.
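The following Python sketch is only an illustration of this bookkeeping; the names (TaskInfo, MasterState) are assumptions and are not taken from Google's implementation.

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class TaskInfo:
    state: str = "idle"            # "idle", "in-progress", or "completed"
    worker: Optional[str] = None   # worker machine identity for non-idle tasks

@dataclass
class MasterState:
    map_tasks: Dict[int, TaskInfo] = field(default_factory=dict)
    reduce_tasks: Dict[int, TaskInfo] = field(default_factory=dict)
    # completed map task id -> (location, size) of each of its R intermediate regions
    regions: Dict[int, List[Tuple[str, int]]] = field(default_factory=dict)

    def complete_map_task(self, task_id, region_info):
        self.map_tasks[task_id].state = "completed"
        self.regions[task_id] = region_info
        # A real master would now push these locations incrementally to
        # workers that have in-progress reduce tasks.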

2.4.4.3 Fault Tolerance


Since the MapReduce library is designed to help process very large amounts of data
using hundreds or thousands of machines, the library must tolerate machine failures gracefully.

Worker Failure
The master pings every worker periodically. If no response is received from a worker
in a certain amount of time, the master marks the worker as failed. Any map tasks completed by
the worker are reset back to their initial idle state, and therefore become eligible for scheduling
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also
reset to idle and becomes eligible for rescheduling.
Completed map tasks are re-executed on a failure because their output is stored on the
local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not
need to be re-executed since their output is stored in a global file system.
When a map task is executed first by worker A and then later executed by worker B
(because A failed), all workers executing reduce tasks are notified of the reexecution. Any
reduce task that has not already read the data from worker A will read the data from worker B.
MapReduce is resilient to large-scale worker failures. For example, during one
MapReduce operation, network maintenance on a running cluster was causing groups of 80
machines at a time to become unreachable for several minutes. The MapReduce master simply re-
executed the work done by the unreachable worker machines, and continued to make forward
progress, eventually completing the MapReduce operation.

Master Failure
It is easy to make the master write periodic checkpoints of the master data structures
described above. If the master task dies, a new copy can be started from the last checkpointed
state. However, given that there is only a single master, its failure is unlikely; therefore our
current implementation aborts the MapReduce computation if the master fails. Clients can check
for this condition and retry the MapReduce operation if they desire.

Semantics in the Presence of Failures


When the user-supplied map and reduce operators are deterministic functions of their
input values, our distributed implementation produces the same output as would have been
produced by a non-faulting sequential execution of the entire program.
We rely on atomic commits of map and reduce task outputs to achieve this property.
Each in-progress task writes its output to private temporary files. A reduce task produces one
such file, and a map task produces R such files (one per reduce task). When a map task
completes, the worker sends a message to the master and includes the names of the R temporary
files in the message. If the master receives a completion message for an already completed map
task, it ignores the message. Otherwise, it records the names of R files in a master data structure.
When a reduce task completes, the reduce worker atomically renames its temporary
output file to the final output file. If the same reduce task is executed on multiple machines,
multiple rename calls will be executed for the same final output file. We rely on the atomic
rename operation provided by the underlying file system to guarantee that the final file system
state contains just the data produced by one execution of the reduce task.
The vast majority of our map and reduce operators are deterministic, and the fact that
our semantics are equivalent to a sequential execution in this case makes it very easy for
programmers to reason about their program's behavior. When the map and/or reduce operators
are nondeterministic, we provide weaker but still reasonable semantics. In the presence of non-
deterministic operators, the output of a particular reduce task R1 is equivalent to the output for
R1 produced by a sequential execution of the non-deterministic program. However, the output
for a different reduce task R2 may correspond to the output for R2 produced by a different
sequential execution of the non-deterministic program.
Consider map task M and reduce tasks R1 and R2. Let e(Ri) be the execution of Ri that
committed (there is exactly one such execution). The weaker semantics arise because e(R1) may
have read the output produced by one execution of M and e(R2) may have read the output
produced by a different execution of M.

2.4.5 Performance considerations


MapReduce programs are not guaranteed to be fast. The main benefit of this programming model
is that it exploits the optimized shuffle operation of the platform while requiring the programmer
to write only the Map and Reduce parts of the program. In practice, however, the author of a
MapReduce program has to take the shuffle step into consideration; in particular, the partition
function and the amount of data written by the Map function can have a large impact on
performance and scalability.
Additional modules such as the Combiner function can help to reduce the amount of data written
to disk, and transmitted over the network. MapReduce applications can achieve sub-linear
speedups under specific circumstances.
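For the word-count example, a hedged Python sketch of such a Combiner is simply a local application of the Reduce logic to each mapper's own output before the shuffle:

def word_count_combiner(word, local_counts):
    # Collapse this mapper's (word, 1) pairs into one partial count,
    # reducing the data written to disk and sent over the network.
    yield (word, sum(local_counts))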

When designing a MapReduce algorithm, the author needs to choose a good tradeoff between the
computation and the communication costs. Communication cost often dominates the
computation cost, and many MapReduce implementations are designed to write all
communication to distributed storage for crash recovery.

In tuning performance of MapReduce, the complexity of mapping, shuffle, sorting (grouping by


the key), and reducing has to be taken into account. The amount of data produced by the mappers
is a key parameter that shifts the bulk of the computation cost between mapping and reducing.
Reducing includes sorting (grouping of the keys) which has nonlinear complexity. Hence, small
partition sizes reduce sorting time, but there is a trade-off because having a large number of
reducers may be impractical. The influence of split unit size is marginal (unless chosen
particularly badly, say <1 MB). The gain from having some mappers read their input from local
disks is, on average, minor.

For processes that complete quickly, and where the data fits into main memory of a single
machine or a small cluster, using a MapReduce framework usually is not effective. Since these
frameworks are designed to recover from the loss of whole nodes during the computation, they
write interim results to distributed storage. This crash recovery is expensive, and only pays off
when the computation involves many computers and a long runtime. A task
that completes in seconds can simply be restarted in the case of an error, and the likelihood of at
least one machine failing grows quickly with the cluster size. For such problems, implementations
that keep all data in memory and simply restart the computation on node failure, or, when the
data is small enough, non-distributed solutions, will often be faster than a MapReduce system.

2.4.6 Distribution and reliability


MapReduce achieves reliability by parceling out a number of operations on the set of data to
each node in the network. Each node is expected to report back periodically with completed
work and status updates. If a node falls silent for longer than that interval, the master node
(similar to the master server in the Google File System) records the node as dead and sends out
the node's assigned work to other nodes. Individual operations use atomic operations for naming
file outputs as a check to ensure that there are not parallel conflicting threads running. When files
are renamed, it is possible to also copy them to another name in addition to the name of the task
(allowing for side-effects).

The reduce operations operate much the same way. Because of their inferior properties with
regard to parallel operations, the master node attempts to schedule reduce operations on the same
node, or in the same rack as the node holding the data being operated on. This property is
desirable as it conserves bandwidth across the backbone network of the datacenter.

Implementations are not necessarily highly reliable. For example, in older versions of Hadoop
the NameNode was a single point of failure for the distributed filesystem. Later versions of
Hadoop have high availability with an active/passive failover for the "NameNode."
2.4.7 Uses
MapReduce is useful in a wide range of applications, including distributed pattern-based
searching, distributed sorting, web link-graph reversal, Singular Value Decomposition, web
access log stats, inverted index construction, document clustering, machine learning, and
statistical machine translation. Moreover, the MapReduce model has been adapted to several
computing environments like multi-core and many-core systems, desktop grids, multi-cluster,
volunteer computing environments, dynamic cloud environments, mobile environments, and
high-performance computing environments.

At Google, MapReduce was used to completely regenerate Google's index of the World Wide
Web. It replaced the old ad hoc programs that updated the index and ran the various analyses.
Development at Google has since moved on to technologies such as Percolator, FlumeJava and
MillWheel that offer streaming operation and updates instead of batch processing, to allow
integrating "live" search results without rebuilding the complete index.

MapReduce's stable inputs and outputs are usually stored in a distributed file system. The
transient data are usually stored on local disk and fetched remotely by the reducers.

2.4.8 Criticism
2.4.8.1 Lack of novelty
David DeWitt and Michael Stonebraker, computer scientists specializing in parallel databases
and shared-nothing architectures, have been critical of the breadth of problems that MapReduce
can be used for. They called its interface too low-level and questioned whether it really
represents the paradigm shift its proponents have claimed it is. They challenged the MapReduce
proponents' claims of novelty, citing Teradata as an example of prior art that has existed for over
two decades. They also compared MapReduce programmers to CODASYL programmers, noting
both are "writing in a low-level language performing low-level record manipulation."
MapReduce's use of input files and lack of schema support prevents the performance
improvements enabled by common database system features such as B-trees and hash
partitioning, though projects such as Pig (or PigLatin), Sawzall, Apache Hive, YSmart, HBase
and BigTable are addressing some of these problems.

Greg Jorgensen wrote an article rejecting these views. Jorgensen asserts that DeWitt and
Stonebraker's entire analysis is groundless as MapReduce was never designed nor intended to be
used as a database.

DeWitt and Stonebraker have subsequently published a detailed benchmark study in 2009
comparing performance of Hadoop's MapReduce and RDBMS approaches on several specific
problems. They concluded that relational databases offer real advantages for many kinds of data
use, especially on complex processing or where the data is used across an enterprise, but that
MapReduce may be easier for users to adopt for simple or one-time processing tasks.

Google has been granted a patent on MapReduce. However, there have been claims that this
patent should not have been granted because MapReduce is too similar to existing products. For
example, map and reduce functionality can be very easily implemented in Oracle's PL/SQL
database oriented language or is supported for developers transparently in distributed database
architectures such as Clusterpoint XML database or MongoDB NoSQL database.

2.4.8.2 Restricted programming framework


MapReduce tasks must be written as acyclic dataflow programs, i.e. a stateless mapper followed
by a stateless reducer, that are executed by a batch job scheduler. This paradigm makes repeated
querying of datasets difficult and imposes limitations that are felt in fields such as machine
learning, where iterative algorithms that revisit a single working set multiple times are the norm.

2.5 MapReduce Architecture

2.5.1 Job Clients


The job client is the one who submits the job. A job contains the mapper function, the reducer
function, and some configuration that will drive the job.

2.5.2 Job Tracker


The job tracker is the master of the task trackers, which are the slaves that work on the data
nodes. The job tracker's responsibility is to come up with the execution plan and to coordinate
and schedule that plan across the task trackers. It also handles phase coordination.

2.5.3 Task Tracker


The task tracker is the component that breaks the job down into tasks, that is, map and reduce
tasks. Every task tracker has a number of slots. The job tracker takes the compiled map and
reduce functions and places them into task slots, which actually execute the map and reduce
functions. Task trackers also report their status back to the job tracker, so if something fails the
job tracker will know about it and can simply reschedule that task on another task tracker.

2.6 MapReduce with MongoDB


Map-reduce is a data processing paradigm for condensing large volumes of data into useful
aggregated results. For map-reduce operations, MongoDB provides the mapReduce database
command.

Consider the following map-reduce operation:

In this map-reduce operation, MongoDB applies the map phase to each input document (i.e. the
documents in the collection that match the query condition). The map function emits key-value
pairs. For those keys that have multiple values, MongoDB applies the reduce phase, which
collects and condenses the aggregated data. MongoDB then stores the results in a collection.
Optionally, the output of the reduce function may pass through a finalize function to further
condense or process the results of the aggregation.

All map-reduce functions in MongoDB are JavaScript and run within the mongod process. Map-
reduce operations take the documents of a single collection as the input and can perform any
arbitrary sorting and limiting before beginning the map stage. mapReduce can return the results
of a map-reduce operation as a document, or may write the results to collections. The input and
the output collections may be sharded.

NOTE
For most aggregation operations, the Aggregation Pipeline provides better performance and a
more coherent interface. However, map-reduce operations provide some flexibility that is not presently
available in the aggregation pipeline.

Map-Reduce JavaScript Functions


In MongoDB, map-reduce operations use custom JavaScript functions to map, or associate,
values to a key. If a key has multiple values mapped to it, the operation reduces the values for the
key to a single object.
The use of custom JavaScript functions provides flexibility to map-reduce operations. For
instance, when processing a document, the map function can create more than one key and value
mapping, or no mapping at all. Map-reduce operations can also use a custom JavaScript function
to make final modifications to the results at the end of the map and reduce operation, such as
performing additional calculations.

Map-Reduce Behavior
In MongoDB, the map-reduce operation can write results to a collection or return the results
inline. If you write map-reduce output to a collection, you can perform subsequent map-reduce
operations on the same input collection that replace, merge, or reduce new results with
previous results. See mapReduce and Perform Incremental Map-Reduce for details and
examples.
When returning the results of a map reduce operation inline, the result documents must be within
the BSON Document Size limit, which is currently 16 megabytes. For additional information on
limits and restrictions on map-reduce operations, see the mapReduce reference page.
MongoDB supports map-reduce operations on sharded collections. Map-reduce operations can
also output the results to a sharded collection. See Map-Reduce and Sharded Collections.
Views do not support map-reduce operations.

Map Reduce Concurrency


The map-reduce operation is composed of many tasks, including reads from the input collection,
executions of the map function, executions of the reduce function, writes to a temporary
collection during processing, and writes to the output collection.

During the operation, map-reduce takes the following locks:


The read phase takes a read lock. It yields every 100 documents.
The insert into the temporary collection takes a write lock for a single write.
If the output collection does not exist, the creation of the output collection takes a write
lock.
If the output collection exists, then the output actions (i.e. merge, replace, reduce) take a
write lock. This write lock is global, and blocks all operations on the mongod instance.
2.6.1 Map-Reduce Examples
In the mongo shell, the db.collection.mapReduce() method is a wrapper around the mapReduce
command. The following examples use the db.collection.mapReduce() method:
Consider the following map-reduce operations on a collection orders that contains documents of
the following prototype:

{
_id: ObjectId("50a8240b927d5d8b5891743c"),
cust_id: "abc123",
ord_date: new Date("Oct 04, 2012"),
status: 'A',
price: 25,
items: [ { sku: "mmm", qty: 5, price: 2.5 },
{ sku: "nnn", qty: 5, price: 2.5 } ]
}

Return the Total Price Per Customer


Perform the map-reduce operation on the orders collection to group by the cust_id, and calculate
the sum of the price for each cust_id:
1. Define the map function to process each input document:
In the function, this refers to the document that the map-reduce operation is processing.
The function maps the price to the cust_id for each document and emits the cust_id and
price pair.

var mapFunction1 = function() {
    emit(this.cust_id, this.price);
};

2. Define the corresponding reduce function with two arguments keyCustId and valuesPrices:
The valuesPrices is an array whose elements are the price values emitted by the map
function and grouped by keyCustId.
The function reduces the valuesPrice array to the sum of its elements.

var reduceFunction1 = function(keyCustId, valuesPrices) {
    return Array.sum(valuesPrices);
};

3. Perform the map-reduce on all documents in the orders collection using the mapFunction1
map function and the reduceFunction1 reduce function.
db.orders.mapReduce(
mapFunction1,
reduceFunction1,
{ out: "map_reduce_example" }
)
This operation outputs the results to a collection named map_reduce_example. If the
map_reduce_example collection already exists, the operation will replace the contents with the
results of this map-reduce operation.

Calculate Order and Total Quantity with Average Quantity Per Item
In this example, you will perform a map-reduce operation on the orders collection for all
documents that have an ord_date value greater than 01/01/2012. The operation groups by the
item.sku field, and calculates the number of orders and the total quantity ordered for each sku.
The operation concludes by calculating the average quantity per order for each sku value:

1. Define the map function to process each input document:


In the function, this refers to the document that the map-reduce operation is processing.
For each item, the function associates the sku with a new object value that contains the
count of 1 and the item qty for the order and emits the sku and value pair.

var mapFunction2 = function() {
    for (var idx = 0; idx < this.items.length; idx++) {
        var key = this.items[idx].sku;
        var value = {
            count: 1,
            qty: this.items[idx].qty
        };
        emit(key, value);
    }
};

2. Define the corresponding reduce function with two arguments keySKU and countObjVals:
countObjVals is an array whose elements are the objects mapped to the grouped keySKU
values passed by map function to the reducer function.
The function reduces the countObjVals array to a single object reducedVal that
contains the count and the qty fields.
In reducedVal, the count field contains the sum of the count fields from the individual
array elements, and the qty field contains the sum of the qty fields from the individual
array elements.

var reduceFunction2 = function(keySKU, countObjVals) {
    var reducedVal = { count: 0, qty: 0 };

    for (var idx = 0; idx < countObjVals.length; idx++) {
        reducedVal.count += countObjVals[idx].count;
        reducedVal.qty += countObjVals[idx].qty;
    }

    return reducedVal;
};
3. Define a finalize function with two arguments key and reducedVal. The function modifies the
reducedVal object to add a computed field named avg and returns the modified object:
var finalizeFunction2 = function (key, reducedVal) {
    reducedVal.avg = reducedVal.qty / reducedVal.count;
    return reducedVal;
};

4. Perform the map-reduce operation on the orders collection using the mapFunction2,
reduceFunction2, and finalizeFunction2 functions.
db.orders.mapReduce(
    mapFunction2,
    reduceFunction2,
    {
        out: { merge: "map_reduce_example" },
        query: { ord_date: { $gt: new Date('01/01/2012') } },
        finalize: finalizeFunction2
    }
)

This operation uses the query field to select only those documents with ord_date greater than
new Date('01/01/2012'). It then outputs the results to the collection map_reduce_example. If the
map_reduce_example collection already exists, the operation will merge the existing contents
with the results of this map-reduce operation.

2.7 Hadoop Streaming & Pipe Facilities


2.7.1 Hadoop Streaming
Hadoop Streaming is a generic API which allows writing Mappers and Reducers in any language,
but the basic concept remains the same: Mappers and Reducers receive their input and output on
stdin and stdout as (key, value) pairs. Apache Hadoop uses standard UNIX streams between your
application and the Hadoop system. Streaming is a good fit for text processing. The data view is
line oriented and processed as key/value pairs separated by a tab character. The program reads
each line and processes it as required.
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to
create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
For example:
mapred streaming \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc
2.7.1.1 How Streaming Works
In the above example, both the mapper and the reducer are executables that read the
input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce
job, submit the job to an appropriate cluster, and monitor the progress of the job until it
completes.
When an executable is specified for mappers, each mapper task will launch the
executable as a separate process when the mapper is initialized. As the mapper task runs, it
converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the
mapper collects the line-oriented outputs from the stdout of the process and converts each line
into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a
line up to the first tab character is the key and the rest of the line (excluding the tab character)
is the value. If there is no tab character in the line, then the entire line is considered the key and
the value is null. However, this can be customized by setting the -inputformat command option.
When an executable is specified for reducers, each reducer task will launch the
executable as a separate process when the reducer is initialized. As the reducer task runs, it
converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the
meantime, the reducer collects the line-oriented outputs from the stdout of the process and
converts each line into a key/value pair, which is collected as the output of the reducer. By
default, the prefix of a line up to the first tab character is the key and the rest of the line
(excluding the tab character) is the value. However, this can be customized by setting the
-outputformat command option, as discussed later.
This is the basis for the communication protocol between the Map/Reduce framework
and the streaming mapper/reducer.
Users can specify stream.non.zero.exit.is.failure as true or false to make a streaming
task that exits with a non-zero status be treated as a failure or a success, respectively. By default,
streaming tasks exiting with a non-zero status are considered failed tasks.
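As an illustration of this protocol, a word-count job could use a pair of Python scripts as the mapper and reducer. The file names wordcount_mapper.py and wordcount_reducer.py are assumptions for this sketch; each script would be shipped with -file and either made executable (with its shebang line) or invoked via the Python interpreter.

#!/usr/bin/env python3
# wordcount_mapper.py: read lines from stdin, emit one "word<TAB>1" line per word
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

#!/usr/bin/env python3
# wordcount_reducer.py: input arrives sorted by key, so all counts for a
# given word are consecutive "word<TAB>count" lines on stdin
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

The job could then be launched, for example, as:

mapred streaming \
-input myInputDirs \
-output myOutputDir \
-mapper wordcount_mapper.py \
-reducer wordcount_reducer.py \
-file wordcount_mapper.py \
-file wordcount_reducer.py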

2.7.1.2 Streaming Command Options


Streaming supports streaming command options as well as generic command options. The
general command line syntax is shown below.

Note: Be sure to place the generic options before the streaming options, otherwise the command
will fail. For an example, see Making Archives Available to Tasks.

mapred streaming [genericOptions] [streamingOptions]

The Hadoop streaming command options are listed here:


Parameter | Optional/Required | Description
-input directoryname or filename | Required | Input location for the mapper
-output directoryname | Required | Output location for the reducer
-mapper executable or JavaClassName | Optional | Mapper executable. If not specified, IdentityMapper is used as the default
-reducer executable or JavaClassName | Optional | Reducer executable. If not specified, IdentityReducer is used as the default
-file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output
-cmdenv name=value | Optional | Pass an environment variable to streaming commands
-inputreader | Optional | For backwards compatibility: specifies a record reader class (instead of an input format class)
-verbose | Optional | Verbose output
-lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to Context.write
-numReduceTasks | Optional | Specify the number of reducers
-mapdebug | Optional | Script to call when a map task fails
-reducedebug | Optional | Script to call when a reduce task fails

2.7.2 Hadoop pipes


Hadoop Pipes allows C++ code to use Hadoop DFS and map/reduce. Unlike Streaming, which
uses standard input and output to communicate with the map and reduce code, Pipes uses sockets
as the channel over which the tasktracker communicates with the process running the C++ map
or reduce function. JNI is not used.
The primary approach is to split the C++ code into a separate process that does the application
specific code. In many ways, the approach will be similar to Hadoop streaming, but using
Writable serialization to convert the types into bytes that are sent to the process via a socket.

The class org.apache.hadoop.mapred.pipes.Submitter has a public static method to submit a job


as a JobConf and a main method that takes an application and optional configuration file, input
directories, and output directory. The CLI for the main method looks like:

bin/hadoop pipes \
[-input inputDir] \
[-output outputDir] \
[-jar applicationJarFile] \
[-inputformat class] \
[-map class] \
[-partitioner class] \
[-reduce class] \
[-writer class] \
[-program program url] \
[-conf configuration file] \
[-D property=value] \
[-fs local|namenode:port] \
[-jt local|jobtracker:port] \
[-files comma separated list of files] \
[-libjars comma separated list of jars] \
[-archives comma separated list of archives]

The application programs link against a thin C++ wrapper library that handles the
communication with the rest of the Hadoop system. The C++ interface is "swigable" so that
interfaces can be generated for python and other scripting languages. All of the C++ functions
and classes are in the HadoopPipes namespace. The job may consist of any combination of Java
and C++ RecordReaders, Mappers, Partitioners, Combiners, Reducers, and RecordWriters.

Hadoop Pipes has a generic Java class for handling the mapper and reducer (PipesMapRunner
and PipesReducer). They fork off the application program and communicate with it over a
socket. The communication is handled by the C++ wrapper library and the PipesMapRunner and
PipesReducer.

The application program passes in a factory object that can create the various objects needed by
the framework to the runTask function. The framework creates the Mapper or Reducer as
appropriate and calls the map or reduce method to invoke the application's code. The JobConf is
available to the application.

The Mapper and Reducer objects get all of their inputs, outputs, and context via context objects.
The advantage of using the context objects is that their interface can be extended with additional
methods without breaking clients. Although this interface is different from the current Java
interface, the plan is to migrate the Java interface in this direction.

Although the Java implementation is typed, the C++ interface for keys and values is just a byte
buffer. Since STL strings provide precisely the right functionality and are standard, they are
used. The decision not to use stronger types was made to simplify the interface.

The application can also define combiner functions. The combiner will be run locally by the
framework in the application process to avoid the round trip to the Java process and back.
Because the compare function is not available in C++, the combiner will use memcmp to sort the
inputs to the combiner. This is not as general as the Java equivalent, which uses the user's
comparator, but should cover the majority of the use cases. As the map function outputs
key/value pairs, they will be buffered. When the buffer is full, it will be sorted and passed to the
combiner. The output of the combiner will be sent to the Java process.

The application can also set a partition function to control which key is given to a particular
reduce. If a partition function is not defined, the Java one will be used. The partition function
will be called by the C++ framework before the key/value pair is sent back to Java.

The application programs can also register counters with a group and a name and also increment
the counters and get the counter values.

2.7.2.1 Command Options


The following command parameters are supported for hadoop pipes:
Parameter Description

-output <path> Specify the output directory.

-jar <jar file> Specify the jar filename.

-inputformat <class> InputFormat class.

-map <class> Specify the Java Map class.

-partitioner <class> Specify the Java Partitioner.

-reduce <class> Specify the Java Reduce class.

-writer <class> Specify the Java RecordWriter.

-program <executable> Specify the URI of the executable.

-reduces <num> Specify the number of reduces.

2.8 Hadoop HDFS Overview


The Hadoop Distributed File System (HDFS) is a sub-project of the Apache Hadoop
project. This Apache Software Foundation project provides a fault-tolerant file
system designed to run on commodity hardware.
According to The Apache Software Foundation, the primary objective of HDFS is to
store data reliably even in the presence of failures including NameNode failures, DataNode
failures and network partitions. The NameNode is a single point of failure for the HDFS cluster
and a DataNode stores data in the Hadoop file management system.
HDFS uses a master/slave architecture in which one device (the master) controls one or
more other devices (the slaves). An HDFS cluster consists of a single NameNode, a master
server that manages the file system namespace and regulates access to files.
The Hadoop file system was developed using distributed file system design. It runs on
commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and
designed using low-cost hardware.

HDFS holds a very large amount of data and provides easy access. To store such huge
data, the files are stored across multiple machines. These files are stored in a redundant fashion
to rescue the system from possible data losses in case of failure. HDFS also makes applications
available for parallel processing.
The Hadoop Distributed File System (HDFS) splits large data files into parts which are
managed by different machines in the cluster. Each part is replicated across many machines in
the cluster, so that a single machine failure does not result in data being unavailable. In
the Hadoop programming framework, data is record oriented. Specific to the application logic,
individual input data files are broken into various formats. Subsets of these records are then
processed by each process running on a machine in the cluster. Using knowledge from the
DFS, these processes are scheduled by the Hadoop framework based on the location of the record
or data. The files are spread across the DFS as chunks and are computed by the process running
on the node. The Hadoop framework helps prevent unwanted network transfers and strain on the
network by reading data from the local disk directly into the CPU. Thus, with Hadoop one can
achieve high performance due to data locality, with its strategy of moving the computation to
the data.
HDFS stores file system metadata and application data separately. As in other
distributed file systems, like PVFS, Lustre and GFS, HDFS stores metadata on a dedicated
server, called the NameNode. Application data are stored on other servers called DataNodes. All
servers are fully connected and communicate with each other using TCP-based protocols.

2.8.1 Features of HDFS


It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the namenode and datanode help users easily check the status of the
cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.

2.8.2 Goals
1 Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of
hundreds or thousands of server machines, each storing part of the file system's data. The fact
that there are a huge number of components and that each component has a nontrivial probability
of failure means that some component of HDFS is always non-functional. Therefore, detection of
faults and quick, automatic recovery from them is a core architectural goal of HDFS.

2 Streaming Data Access


Applications that run on HDFS need streaming access to their data sets. They are not general
purpose applications that typically run on general purpose file systems. HDFS is designed more
for batch processing rather than interactive use by users. The emphasis is on high throughput of
data access rather than low latency of data access. POSIX imposes many hard requirements that
are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas
has been traded to increase data throughput rates.

3 Large Data Sets


Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to
terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate
data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of
millions of files in a single instance.

4 Simple Coherency Model


HDFS applications need a write-once-read-many access model for files. A file once created,
written, and closed need not be changed. This assumption simplifies data coherency issues and
enables high throughput data access. A MapReduce application or a web crawler application fits
perfectly with this model. There is a plan to support appending-writes to files in the future.

5 Moving Computation is Cheaper than Moving Data


A computation requested by an application is much more efficient if it is executed near the data
it operates on. This is especially true when the size of the data set is huge. This minimizes
network congestion and increases the overall throughput of the system. The assumption is that it
is often better to migrate the computation closer to where the data is located rather than moving
the data to where the application is running. HDFS provides interfaces for applications to move
themselves closer to where the data is located.

6 Portability Across Heterogeneous Hardware and Software Platforms


HDFS has been designed to be easily portable from one platform to another. This facilitates
widespread adoption of HDFS as a platform of choice for a large set of applications.

2.9 HDFS Architecture


HDFS is the file system component of Hadoop. HDFS has a master/slave architecture. An HDFS
cluster consists of a single NameNode, a master server, and many DataNodes, called slaves in the
architecture. HDFS stores file system metadata and application data separately: metadata is kept
on a separate dedicated server called the NameNode, and application data are stored on separate
servers called DataNodes. All servers are fully connected and communicate using TCP-based
protocols. Fig. 2 shows the complete architecture of HDFS.
A. NameNode
The HDFS namespace is a hierarchy of files and directories. Files and directories are
represented on the NameNode by inodes, which record attributes like permissions, modification
and access times, namespace and disk space quotas. The file content is split into large blocks
(typically 128 megabytes, but user selectable file-by-file) and each block of the file is
independently replicated at multiple DataNodes (typically three, but user selectable file-by-file).
The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes (the
physical location of file data). An HDFS client wanting to read a file first contacts the
NameNode for the locations of data blocks comprising the file and then reads block contents
from the DataNode closest to the client. When writing data, the client requests the NameNode to
nominate a suite of three DataNodes to host the block replicas. The client then writes data to the
DataNodes in a pipeline fashion. The current design has a single NameNode for each cluster.
The cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster,
as each DataNode may execute multiple application tasks concurrently.
HDFS keeps the entire namespace in RAM. The inode data and the list of blocks
belonging to each file comprise the metadata of the name system, called the image. The persistent
record of the image stored in the local host's native file system is called a checkpoint. The
NameNode also stores the modification log of the image, called the journal, in the local host's
native file system. For improved durability, redundant copies of the checkpoint and journal can
be made at other servers. During restarts the NameNode restores the namespace by reading the
namespace and replaying the journal. The locations of block replicas may change over time and
are not part of the persistent checkpoint.
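As a purely illustrative Python sketch (names and values are assumptions, not HDFS code), the two mappings the NameNode keeps in RAM can be pictured as:

# namespace tree: file path -> ordered list of block ids (part of the image)
namespace = {
    "/user/data/log.txt": ["blk_1", "blk_2"],
}

# block map: block id -> DataNodes holding replicas (rebuilt from block
# reports; not part of the persistent checkpoint)
block_locations = {
    "blk_1": ["datanode-3", "datanode-7", "datanode-9"],
    "blk_2": ["datanode-1", "datanode-4", "datanode-8"],
}

def locate(path):
    # What a read client effectively asks the NameNode for.
    return [(blk, block_locations[blk]) for blk in namespace[path]]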

B. DataNodes
Each block replica on a DataNode is represented by two files in the local host's native
file system. The first file contains the data itself and the second file is the block's metadata,
including checksums for the block data and the block's generation stamp. The size of the data file equals
the actual length of the block and does not require extra space to round it up to the nominal block
size as in traditional file systems. Thus, if a block is half full it needs only half of the space of the
full block on the local drive.
During startup each DataNode connects to the NameNode and performs a handshake.
The purpose of the handshake is to verify the namespace ID and the software version of the
DataNode. If either does not match that of the NameNode the DataNode automatically shuts
down.
The namespace ID is assigned to the file system instance when it is formatted. The
namespace ID is persistently stored on all nodes of the cluster. Nodes with a different namespace
ID will not be able to join the cluster, thus preserving the integrity of the file system.
The consistency of software versions is important because incompatible version may
cause data corruption or loss, and on large clusters of thousands of machines it is easy to
overlook nodes that did not shut down properly prior to the software upgrade or were not
available during the upgrade.
A DataNode that is newly initialized and without any namespace ID is permitted to join
the cluster and receive the cluster's namespace ID.
After the handshake the DataNode registers with the NameNode. DataNodes
persistently store their unique storage IDs. The storage ID is an internal identifier of the
DataNode, which makes it recognizable even if it is restarted with a different IP address or port.
The storage ID is assigned to the DataNode when it registers with the NameNode for the first
time and never changes after that.
A DataNode identifies block replicas in its possession to the NameNode by sending a
block report. A block report contains the block id, the generation stamp and the length for each
block replica the server hosts. The first block report is sent immediately after the DataNode
registration. Subsequent block reports are sent every hour and provide the NameNode with an
up-to-date view of where block replicas are located on the cluster.
During normal operation DataNodes send heartbeats to the NameNode to confirm that
the DataNode is operating and the block replicas it hosts are available. The default heartbeat
interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode in ten
minutes the NameNode considers the DataNode to be out of service and the block replicas
hosted by that DataNode to be unavailable. The NameNode then schedules creation of new
replicas of those blocks on other DataNodes.
Heartbeats from a DataNode also carry information about total storage capacity,
fraction of storage in use, and the number of data transfers currently in progress. These statistics
are used for the NameNode's space allocation and load-balancing decisions.
The NameNode does not directly call DataNodes. It uses replies to heartbeats to send
instructions to the DataNodes. The instructions include commands to:
replicate blocks to other nodes;
remove local block replicas;
re-register or to shut down the node;
send an immediate block report.
These commands are important for maintaining the overall system integrity and therefore it is
critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands
of heartbeats per second without affecting other NameNode operations.

C. HDFS Client
User applications access the file system using the HDFS client, a code library that
exports the HDFS file system interface.
Similar to most conventional file systems, HDFS supports operations to read, write and
delete files, and operations to create and delete directories. The user references files and
directories by paths in the namespace. The user application generally does not need to know that
file system metadata and storage are on different servers, or that blocks have multiple replicas.
When an application reads a file, the HDFS client first asks the NameNode for the list
of DataNodes that host replicas of the blocks of the file. It then contacts a DataNode directly and
requests the transfer of the desired block. When a client writes, it first asks the NameNode to
choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline
from node-to-node and sends the data. When the first block is filled, the client requests new
DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the
client sends the further bytes of the file. Each choice of DataNodes is likely to be different. The
interactions among the client, the NameNode and the DataNodes are illustrated in Fig. 1.
Figure 1. An HDFS client creates a new file by giving its path to the NameNode. For each block of the file, the
NameNode returns a list of DataNodes to host its replicas. The client then pipelines data to the chosen DataNodes,
which eventually confirm the creation of the block replicas to the NameNode

Unlike conventional file systems, HDFS provides an API that exposes the locations of
a file's blocks. This allows applications like the MapReduce framework to schedule a task to
where the data are located, thus improving the read performance. It also allows an application to
set the replication factor of a file. By default a file's replication factor is three. For critical files or
files which are accessed very often, having a higher replication factor improves their tolerance
against faults and increases their read bandwidth.

D. Image and Journal


The namespace image is the file system metadata that describes the organization of
application data as directories and files. A persistent record of the image written to disk is called
a checkpoint. The journal is a write-ahead commit log for changes to the file system that must be
persistent. For each client-initiated transaction, the change is recorded in the journal, and the
journal file is flushed and synched before the change is committed to the HDFS client. The
checkpoint file is never changed by the NameNode; it is replaced in its entirety when a new
checkpoint is created during restart, when requested by the administrator, or by the
CheckpointNode described in the next section. During startup the NameNode initializes the
namespace image from the checkpoint, and then replays changes from the journal until the image
is up-to-date with the last state of the file system. A new checkpoint and empty journal are
written back to the storage directories before the NameNode starts serving clients.
If either the checkpoint or the journal is missing, or becomes corrupt, the namespace
information will be lost partly or entirely. In order to preserve this critical information HDFS can
be configured to store the checkpoint and journal in multiple storage directories. Recommended
practice is to place the directories on different volumes, and for one storage directory to be on a
remote NFS server. The first choice prevents loss from single volume failures, and the second
choice protects against failure of the entire node. If the NameNode encounters an error writing
the journal to one of the storage directories it automatically excludes that directory from the list
of storage directories. The NameNode automatically shuts itself down if no storage directory is
available.
The NameNode is a multithreaded system and processes requests simultaneously from
multiple clients. Saving a transaction to disk becomes a bottleneck since all other threads need to
wait until the synchronous flush-and-sync procedure initiated by one of them is complete. In
order to optimize this process the NameNode batches multiple transactions initiated by different
clients. When one of the NameNode's threads initiates a flush-and-sync operation, all
transactions batched at that time are committed together. Remaining threads only need to check
that their transactions have been saved and do not need to initiate a flush-and-sync operation.

E. CheckpointNode
The NameNode in HDFS, in addition to its primary role serving client requests, can
alternatively execute either of two other roles, either a CheckpointNode or a BackupNode. The
role is specified at the node startup.
The CheckpointNode periodically combines the existing checkpoint and journal to
create a new checkpoint and an empty journal. The CheckpointNode usually runs on a different
host from the NameNode since it has the same memory requirements as the NameNode. It
downloads the current checkpoint and journal files from the NameNode, merges them locally,
and returns the new checkpoint back to the NameNode.
Creating periodic checkpoints is one way to protect the file system metadata. The
system can start from the most recent checkpoint if all other persistent copies of the namespace
image or journal are unavailable.
Creating a checkpoint lets the NameNode truncate the tail of the journal when the new
checkpoint is uploaded to the NameNode. HDFS clusters run for prolonged periods of time
without restarts during which the journal constantly grows. If the journal grows very large, the
probability of loss or corruption of the journal file increases. Also, a very large journal extends
the time required to restart the NameNode. For a large cluster, it takes an hour to process a week-
long journal. Good practice is to create a daily checkpoint.

F. BackupNode
A recently introduced feature of HDFS is the BackupNode. Like a CheckpointNode,
the BackupNode is capable of creating periodic checkpoints, but in addition it maintains an
in-memory, up-to-date image of the file system namespace that is always synchronized with the
state of the NameNode.
The BackupNode accepts the journal stream of namespace transactions from the active
NameNode, saves them to its own storage directories, and applies these transactions to its own
namespace image in memory. The NameNode treats the BackupNode as a journal store, the same
as it treats journal files in its storage directories. If the NameNode fails, the BackupNode's image
in memory and the checkpoint on disk is a record of the latest namespace state.
The BackupNode can create a checkpoint without downloading checkpoint and journal
files from the active NameNode, since it already has an up-to-date namespace image in its
memory. This makes the checkpoint process on the BackupNode more efficient as it only needs
to save the namespace into its local storage directories.
The BackupNode can be viewed as a read-only NameNode. It contains all file system
metadata information except for block locations. It can perform all operations of the regular
NameNode that do not involve modification of the namespace or knowledge of block locations.
Use of a BackupNode provides the option of running the NameNode without persistent storage,
delegating responsibility for persisting the namespace state to the BackupNode.

G. Upgrades, File System Snapshots


During software upgrades the possibility of corrupting the system due to software bugs
or human mistakes increases. The purpose of creating snapshots in HDFS is to minimize
potential damage to the data stored in the system during upgrades.
The snapshot mechanism lets administrators persistently save the current state of the
file system, so that if the upgrade results in data loss or corruption it is possible to rollback the
upgrade and return HDFS to the namespace and storage state as they were at the time of the
snapshot.
The snapshot (only one can exist) is created at the cluster administrator's option
whenever the system is started. If a snapshot is requested, the NameNode first reads the
checkpoint and journal files and merges them in memory. Then it writes the new checkpoint and
the empty journal to a new location, so that the old checkpoint and journal remain unchanged.
During handshake the NameNode instructs DataNodes whether to create a local
snapshot. The local snapshot on the DataNode cannot be created simply by replicating the data
file directories, as this would require doubling the storage capacity of every DataNode in the
cluster. Instead, each DataNode creates a copy of the storage directory and hard links existing block files
into it. When the DataNode removes a block it removes only the hard link, and block
modifications during appends use the copy-on-write technique. Thus old block replicas remain
untouched in their old directories.
The cluster administrator can choose to roll back HDFS to the snapshot state when
restarting the system. The NameNode recovers the checkpoint saved when the snapshot was
created. DataNodes restore the previously renamed directories and initiate a background process
to delete block replicas created after the snapshot was made. Once the administrator has chosen to
roll back, there is no provision to roll forward. The cluster administrator can recover the storage occupied by the
snapshot by commanding the system to abandon the snapshot, thus finalizing the software
upgrade.
System evolution may lead to a change in the format of the NameNode's checkpoint
and journal files, or in the data representation of block replica files on DataNodes. The layout
version identifies the data representation formats, and is persistently stored in the NameNode's
and the DataNodes' storage directories. During startup each node compares the layout version of
the current software with the version stored in its storage directories and automatically converts
data from older formats to the newer ones. The conversion requires the mandatory creation of a
snapshot when the system restarts with the new software layout version.
HDFS does not separate layout versions for the NameNode and DataNodes because
snapshot creation must be an all-cluster effort rather than a node-selective event. If an upgraded
NameNode purged its image due to a software bug, then backing up only the namespace state would
still result in total data loss, because the NameNode would not recognize the blocks reported by
DataNodes and would order their deletion. Rolling back in this case would recover the metadata,
but the data itself would be lost. A coordinated snapshot is required to avoid such cataclysmic destruction.

2.10 File I/O Operations and Replica Management


2.10.1 File Read and Write
2.10.1.1 Write operation
Figure 3: Writing data to HDFS

Here we consider the case of creating a new file, writing data to it, and then closing the file.
Writing data to HDFS involves seven steps:

Step 1: The client creates the file by calling the create() method on DistributedFileSystem.

Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the
filesystem's namespace, with no blocks associated with it. The namenode performs various
checks to make sure the file doesn't already exist and that the client has the right permissions to
create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file
creation fails and the client is thrown an IOException. The DistributedFileSystem returns an
FSDataOutputStream for the client to start writing data to. Just as in the read case,
FSDataOutputStream wraps a DFSOutputStream, which handles communication with the
datanodes and namenode.

Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an
internal queue, called the data queue. The data queue is consumed by the DataStreamer, which is
responsible for asking the namenode to allocate new blocks by picking a list of suitable
datanodes to store the replicas. The list of datanodes forms a pipeline, and here we'll assume the
replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the
packets to the first datanode in the pipeline, which stores the packet and forwards it to the second
datanode in the pipeline.

Step 4: Similarly, the second datanode stores the packet and forwards it to the third (and last)
datanode in the pipeline.

Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only
when it has been acknowledged by all the datanodes in the pipeline.
Step 6: When the client has finished writing data, it calls close() on the stream.

Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for
acknowledgments before contacting the namenode to signal that the file is complete. The
namenode already knows which blocks the file is made up of (via the DataStreamer asking for block
allocations), so it only has to wait for blocks to be minimally replicated before returning
successfully.
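
To make the write path above concrete, the sketch below uses the public Hadoop FileSystem Java
API. It is a minimal illustration only; the cluster URI, file path, and payload are placeholder
assumptions, not values taken from this text.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // For an hdfs:// URI, FileSystem.get() returns a DistributedFileSystem instance.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Steps 1-2: create() asks the namenode to record the new file and returns
        // an FSDataOutputStream that wraps a DFSOutputStream.
        FSDataOutputStream out = fs.create(new Path("/user/input/example.txt"));
        // Steps 3-5: writes are split into packets and streamed through the datanode pipeline.
        out.write("hello hdfs".getBytes("UTF-8"));
        // Steps 6-7: close() flushes remaining packets and signals the namenode that the file is complete.
        out.close();
        fs.close();
    }
}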

2.10.1.2 Read Operation

Fig 4: Reading the file from HDFS

Fig 4 shows the six steps involved in reading a file from HDFS.
Suppose an HDFS client wants to read a file from HDFS. The steps involved in
reading the file are:

Step 1: First the client opens the file by calling the open() method on a FileSystem object,
which for HDFS is an instance of the DistributedFileSystem class.

Step 2: DistributedFileSystem calls the Namenode, using RPC, to determine the locations of the
blocks for the first few blocks of the file. For each block, the namenode returns the addresses of
all the datanodes that have a copy of that block. The DistributedFileSystem returns an object of
FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and
namenode I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the
datanode addresses for the first few blocks in the file, then connects to the first closest datanode
for the first block in the file.

Step 4: Data is streamed from the datanode back to the client, which calls read() repeatedly on
the stream.

Step 5: When the end of the block is reached, DFSInputStream will close the connection to the
datanode, then find the best datanode for the next block. This happens transparently to the client,
which from its point of view is just reading a continuous stream.

Step 6: Blocks are read in order, with the DFSInputStream opening new connections to
datanodes as the client reads through the stream. It will also call the namenode to retrieve the
datanode locations for the next batch of blocks as needed. When the client has finished reading,
it calls close() on the FSDataInputStream.
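
Analogously, the following is a minimal read sketch using the same public FileSystem API; the
cluster URI and file path are placeholder assumptions.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Steps 1-2: open() returns an FSDataInputStream that wraps a DFSInputStream.
        FSDataInputStream in = fs.open(new Path("/user/output/outfile"));
        try {
            // Steps 3-6: data is streamed block by block until the end of the file is reached.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
        fs.close();
    }
}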

2.10.2 Block Placement


For a large cluster, it may not be practical to connect all nodes in a flat topology. A
common practice is to spread the nodes across multiple racks. Nodes of a rack share a switch,
and rack switches are connected by one or more core switches. Communication between two
nodes in different racks has to go through multiple switches. In most cases, network bandwidth
between nodes in the same rack is greater than network bandwidth between nodes in different
racks. Fig. 3 describes a cluster with two racks, each of which contains three nodes.

Figure 3. Cluster topology example

HDFS estimates the network bandwidth between two nodes by their distance. The
distance from a node to its parent node is assumed to be one. The distance between two nodes is
calculated by summing up their distances to their closest common ancestor. A shorter distance
between two nodes means that they can utilize greater bandwidth to transfer data.
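
The distance calculation can be illustrated with a small sketch. The helper below is hypothetical
(it is not the internal HDFS NetworkTopology API) and assumes node locations are written as
paths such as "/rack1/node1".

public class RackDistance {
    // The distance from a node to its parent is one; the distance between two nodes
    // is the sum of their distances to their closest common ancestor.
    static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        int i = 1; // index 0 is the empty component before the leading '/'
        while (i < pa.length && i < pb.length && pa[i].equals(pb[i])) {
            i++;
        }
        return (pa.length - i) + (pb.length - i);
    }

    public static void main(String[] args) {
        System.out.println(distance("/rack1/node1", "/rack1/node2")); // 2 (same rack)
        System.out.println(distance("/rack1/node1", "/rack2/node3")); // 4 (different racks)
    }
}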
HDFS allows an administrator to configure a script that returns a node's rack
identification given a node's address. The NameNode is the central place that resolves the rack
location of each DataNode. When a DataNode registers with the NameNode, the NameNode
runs the configured script to decide which rack the node belongs to. If no such script is
configured, the NameNode assumes that all the nodes belong to a default single rack.
The placement of replicas is critical to HDFS data reliability and read/write
performance. A good replica placement policy should improve data reliability, availability, and
network bandwidth utilization. Currently HDFS provides a configurable block placement policy
interface so that users and researchers can experiment with and test any policy that is optimal for
their applications.
The default HDFS block placement policy provides a tradeoff between minimizing the
write cost, and maximizing data reliability, availability and aggregate read bandwidth. When a
new block is created, HDFS places the first replica on the node where the writer is located, the
second and the third replicas on two different nodes in a different rack, and the rest are placed on
random nodes with restrictions that no more than one replica is placed at one node and no more
than two replicas are placed in the same rack when the number of replicas is less than twice the
number of racks. The choice to place the second and third replicas on a different rack better
distributes the block replicas for a single file across the cluster. If the first two replicas were
placed on the same rack, for any file, two-thirds of its block replicas would be on the same rack.
After all target nodes are selected, nodes are organized as a pipeline in the order of
their proximity to the first replica. Data are pushed to nodes in this order. For reading, the
NameNode first checks if the client's host is located in the cluster. If yes, block locations are
returned to the client in the order of their closeness to the reader. The block is read from
DataNodes in this preference order. (It is usual for MapReduce applications to run on cluster
nodes, but as long as a host can connect to the NameNode and DataNodes, it can execute the
HDFS client.)
This policy reduces the inter-rack and inter-node write traffic and generally improves
write performance. Because the chance of a rack failure is far less than that of a node failure, this
policy does not impact data reliability and availability guarantees. In the usual case of three
replicas, it can reduce the aggregate network bandwidth used when reading data since a block is
placed in only two unique racks rather than three.
The default HDFS replica placement policy can be summarized as follows:
1. No Datanode contains more than one replica of any block.
2. No rack contains more than two replicas of the same block, provided there are sufficient racks
on the cluster.

2.10.3 Replication management


The NameNode endeavors to ensure that each block always has the intended number of
replicas. The NameNode detects that a block has become under- or over-replicated when a block
report from a DataNode arrives. When a block becomes over-replicated, the NameNode chooses
a replica to remove. The NameNode will prefer not to reduce the number of racks that host
replicas, and secondly prefer to remove a replica from the DataNode with the least amount of
available disk space. The goal is to balance storage utilization across DataNodes without
reducing the block's availability.
When a block becomes under-replicated, it is put in the replication priority queue. A
block with only one replica has the highest priority, while a block with a number of replicas that
is greater than two thirds of its replication factor has the lowest priority. A background thread
periodically scans the head of the replication queue to decide where to place new replicas. Block
replication follows a similar policy as that of new block placement. If the number of existing
replicas is one, HDFS places the next replica on a different rack. If the block has two existing
replicas and they are on the same rack, the third replica is placed on a different rack; otherwise,
the third replica is placed on a different node in the same rack as an existing replica. Here the
goal is to reduce the cost of creating new replicas.
The NameNode also makes sure that not all replicas of a block are located on one rack.
If the NameNode detects that a block's replicas end up on one rack, the NameNode treats the
block as under-replicated and replicates the block to a different rack using the same block
placement policy described above. After the NameNode receives the notification that the replica
is created, the block becomes over-replicated. The NameNode then decides to remove an old
replica, because the over-replication policy prefers not to reduce the number of racks.

2.10.4 Balancer
HDFS block placement strategy does not take into account DataNode disk space
utilization. This is to avoid placing new data, which is more likely to be referenced, at a small
subset of the DataNodes. Therefore data might not always be placed uniformly across DataNodes.
Imbalance also occurs when new nodes are added to the cluster.
The balancer is a tool that balances disk space usage on an HDFS cluster. It takes a
threshold value as an input parameter, which is a fraction in the range of (0, 1). A cluster is
balanced if for each DataNode, the utilization of the node (ratio of used space at the node to total
capacity of the node) differs from the utilization of the whole cluster (ratio of used space in the
cluster to total capacity of the cluster) by no more than the threshold value.
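
As a minimal sketch of this balance criterion (a hypothetical helper, not the Balancer's actual
code), a node can be tested against the cluster-wide utilization as follows:

public class BalanceCheck {
    // A DataNode is balanced if its utilization differs from the cluster utilization
    // by no more than the threshold (a fraction in the range (0, 1)).
    static boolean isBalanced(long nodeUsed, long nodeCapacity,
                              long clusterUsed, long clusterCapacity,
                              double threshold) {
        double nodeUtil = (double) nodeUsed / nodeCapacity;
        double clusterUtil = (double) clusterUsed / clusterCapacity;
        return Math.abs(nodeUtil - clusterUtil) <= threshold;
    }

    public static void main(String[] args) {
        // Cluster at 50% utilization, threshold 0.10: a node at 62% is out of balance,
        // while a node at 58% is within the threshold.
        System.out.println(isBalanced(620, 1000, 5000, 10000, 0.10)); // false
        System.out.println(isBalanced(580, 1000, 5000, 10000, 0.10)); // true
    }
}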
The tool is deployed as an application program that can be run by the cluster
administrator. It iteratively moves replicas from DataNodes with higher utilization to DataNodes
with lower utilization. One key requirement for the balancer is to maintain data availability.
When choosing a replica to move and deciding its destination, the balancer guarantees that the
decision does not reduce either the number of replicas or the number of racks.
The balancer optimizes the balancing process by minimizing the inter-rack data
copying. If the balancer decides that a replica A needs to be moved to a different rack and the
destination rack happens to have a replica B of the same block, the data will be copied from
replica B instead of replica A.
A second configuration parameter limits the bandwidth consumed by rebalancing
operations. The higher the allowed bandwidth, the faster a cluster can reach the balanced state,
but with greater competition with application processes.

2.10.5 Block Scanner


Each DataNode runs a block scanner that periodically scans its block replicas and
verifies that stored checksums match the block data. In each scan period, the block scanner
adjusts the read bandwidth in order to complete the verification in a configurable period. If a
client reads a complete block and checksum verification succeeds, it informs the DataNode. The
DataNode treats it as a verification of the replica.
The verification time of each block is stored in a human-readable log file. At any time
there are up to two files in the top-level DataNode directory, the current and prev logs. New
verification times are appended to the current file. Correspondingly, each DataNode has an
in-memory scanning list ordered by the replica's verification time.
Whenever a read client or a block scanner detects a corrupt block, it notifies the
NameNode. The NameNode marks the replica as corrupt, but does not schedule deletion of the
replica immediately. Instead, it starts to replicate a good copy of the block. Only when the good
replica count reaches the replication factor of the block is the corrupt replica scheduled to be
removed. This policy aims to preserve data as long as possible, so even if all replicas of a block
are corrupt, the policy allows the user to retrieve its data from the corrupt replicas.

2.10.6 Decommissioning
The cluster administrator specifies which nodes can join the cluster by listing the host
addresses of nodes that are permitted to register and the host addresses of nodes that are not
permitted to register. The administrator can command the system to re-evaluate these include and
exclude lists. A present member of the cluster that becomes excluded is marked for
decommissioning. Once a DataNode is marked as decommissioning, it will not be selected as the
target of replica placement, but it will continue to serve read requests. The NameNode starts to
schedule replication of its blocks to other DataNodes. Once the NameNode detects that all blocks
on the decommissioning DataNode are replicated, the node enters the decommissioned state.
Then it can be safely removed from the cluster without jeopardizing any data availability.

2.10.7 Inter-Cluster Data Copy


When working with large datasets, copying data into and out of an HDFS cluster is
daunting. HDFS provides a tool called DistCp for large inter/intra-cluster parallel copying. It is a
MapReduce job; each of the map tasks copies a portion of the source data into the destination file
system. The MapReduce framework automatically handles parallel task scheduling, error
detection and recovery.
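
For example, a directory can be copied between two clusters with a command of the following
form (the NameNode host names nn1 and nn2 and the paths are placeholders):
$ hadoop distcp hdfs://nn1:8020/source hdfs://nn2:8020/destination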

2.11 HDFS Data Organization


2.11.1 Data Blocks
HDFS is designed to support very large files. Applications that are compatible with
HDFS are those that deal with large data sets. These applications write their data only once but
they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS
supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB.
Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on
a different DataNode.

2.11.2 Staging
A client request to create a file does not reach the NameNode immediately. In fact,
initially the HDFS client caches the file data into a temporary local file. Application writes are
transparently redirected to this temporary local file. When the local file accumulates data worth
over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file
name into the file system hierarchy and allocates a data block for it. The NameNode responds to
the client request with the identity of the DataNode and the destination data block. Then the
client flushes the block of data from the local temporary file to the specified DataNode. When a
file is closed, the remaining un-flushed data in the temporary local file is transferred to the
DataNode. The client then tells the NameNode that the file is closed. At this point, the
NameNode commits the file creation operation into a persistent store. If the NameNode dies
before the file is closed, the file is lost.
The above approach has been adopted after careful consideration of target applications
that run on HDFS. These applications need streaming writes to files. If a client writes to a remote
file directly without any client side buffering, the network speed and the congestion in the
network impacts throughput considerably. This approach is not without precedent. Earlier
distributed file systems, e.g. AFS, have used client side caching to improve performance. A
POSIX requirement has been relaxed to achieve higher performance of data uploads.

2.11.3 Replication Pipelining


When a client is writing data to an HDFS file, its data is first written to a local file as
explained in the previous section. Suppose the HDFS file has a replication factor of three. When
the local file accumulates a full block of user data, the client retrieves a list of DataNodes from
the NameNode. This list contains the DataNodes that will host a replica of that block. The client
then flushes the data block to the first DataNode. The first DataNode starts receiving the data in
small portions (4 KB), writes each portion to its local repository and transfers that portion to the
second DataNode in the list. The second DataNode, in turn starts receiving each portion of the
data block, writes that portion to its repository and then flushes that portion to the third
DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode
can be receiving data from the previous one in the pipeline and at the same time forwarding data
to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.

2.12 HDFS Accessibility


HDFS can be accessed from applications in many different ways. Natively, HDFS
provides a Java API for applications to use. A C language wrapper for this Java API is also
available. In addition, an HTTP browser can also be used to browse the files of an HDFS
instance. Work is in progress to expose HDFS through the WebDAV protocol.

2.12.1 FS Shell
HDFS allows user data to be organized in the form of files and directories. It provides a
command line interface called FS shell that lets a user interact with the data in HDFS. The syntax
of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with.
Here are some sample action/command pairs:
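
(The pairs below are representative examples; the directory and file names are placeholders.)

Create a directory named /foodir:        $ $HADOOP_HOME/bin/hadoop fs -mkdir /foodir
List the contents of a directory:        $ $HADOOP_HOME/bin/hadoop fs -ls /foodir
View the contents of a file:             $ $HADOOP_HOME/bin/hadoop fs -cat /foodir/myfile.txt
Remove a directory and its contents:     $ $HADOOP_HOME/bin/hadoop fs -rmr /foodir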

FS shell is targeted for applications that need a scripting language to interact with the stored data.

2.12.2 DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster. These are
commands that are used only by an HDFS administrator. Here are some sample action/
command pairs:
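
(The pairs below are representative examples of DFSAdmin commands.)

Put the cluster in Safemode:                       $ $HADOOP_HOME/bin/hadoop dfsadmin -safemode enter
Generate a report of DataNodes and their status:   $ $HADOOP_HOME/bin/hadoop dfsadmin -report
Re-read the include/exclude host lists:            $ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes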

2.12.3 Browser Interface


A typical HDFS install configures a web server to expose the HDFS namespace
through a configurable TCP port. This allows a user to navigate the HDFS namespace and view
the contents of its files using a web browser.

2.13 HDFS Operations


2.13.1 Starting HDFS
Initially you have to format the configured HDFS file system. Open the namenode (HDFS
server) and execute the following command.
$ hadoop namenode -format

After formatting the HDFS, start the distributed file system. The following command will start
the namenode as well as the data nodes as a cluster.
$ start-dfs.sh

2.13.2 Listing Files in HDFS


After loading the information into the server, we can find the list of files in a directory, or the
status of a file, using ls. Given below is the syntax of ls, to which you can pass a directory or a
filename as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>

2.13.3 Inserting Data into HDFS


Assume we have data in a file called file.txt in the local system which ought to be
saved in the HDFS file system. Follow the steps given below to insert the required file into the
Hadoop file system.

Step 1: You have to create an input directory.


$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input

Step 2: Transfer and store a data file from local systems to the Hadoop file system using the put
command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input

Step 3: You can verify the file using the ls command.


$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

2.13.4 Retrieving Data from HDFS


Assume we have a file in HDFS called outfile. Given below is a simple demonstration
for retrieving the required file from the Hadoop file system.

Step 1: Initially, view the data from HDFS using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile

Step 2: Get the file from HDFS to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/

2.13.5 Shutting Down the HDFS


You can shut down the HDFS by using the following command.
$ stop-dfs.sh

2.14 The Hadoop Schedulers


Since the pluggable scheduler framework (similar to the Linux IO schedulers) was
introduced, several different scheduler algorithms have been designed, developed, and made
available to the Hadoop community. In the next few paragraphs, the FIFO, the Fair, as well as
the Capacity schedulers are briefly introduced.
The FIFO Scheduler -> FIFO reflects the original Hadoop scheduling algorithm that
was integrated into the JobTracker framework. With FIFO scheduling, a JobTracker
basically just pulls the oldest job from the work queue. The FIFO scheduling approach
has no concept of either job priority or job size, but is rather simple to implement and
efficient to execute (very low overhead).
The Fair Scheduler -> Originally, the Fair scheduler was developed by Facebook. The
fundamental design objective for the Fair scheduler revolves around the idea of assigning
resources to jobs in a way that (on average) over time, each job receives an equal share of
the available resources. With the Fair scheduler, there is a certain degree of interactivity
among Hadoop jobs, permitting a Hadoop cluster to better respond to the variety of job
types that are submitted over time. From an implementation perspective, a set of pools is
set up, and the jobs are placed into these pools and so are made available for selection by
the scheduler. Each pool operates on (assigned) shares to balance the resource usage
among the jobs in the pools. The heuristic used is that the more shares a pool has, the
greater its resource usage potential to execute jobs. By default, all pools are set up with equal
shares, but configuration-based pool share adjustments can be made based on job types.
The number of concurrent active jobs can be constrained to minimize congestion and to
allow the workload to be processed in a timely manner. To ensure fairness, each user is
assigned to a pool. Regardless of the shares that are assigned to the pools, if the system is
underutilized (based on the current workload), the active jobs receive the unused shares
(the shares are split among the current jobs). For each job, the scheduler keeps track of
the compute time. Periodically, the scheduler examines the jobs to calculate the delta
between the actual compute time received and the compute time that the job should have
received. The result reflects the deficit for each task. It is the scheduler's
responsibility to schedule the task with the greatest deficit.
The Capacity Scheduler -> Originally, the Capacity scheduler was developed by Yahoo.
The design focus for the Capacity scheduler was on large cluster environments that
execute many independent applications. Hence, the Capacity scheduler provides the
ability to enforce a minimum capacity guarantee, as well as to share excess capacity
among the users. The Capacity scheduler operates on queues. Each queue can be
configured with a certain number of map and reduce slots. Further, each queue can be
assigned a guaranteed capacity, while the overall capacity of the cluster equals the sum
of all the individual queue capacity values. All the queues are actively monitored, and in
scenarios where a queue is not consuming its allocated capacity potential, the excess
capacity can be (temporarily) allocated to other queues. Compared to the Fair scheduler,
the Capacity scheduler controls the prioritization of tasks within a queue. In general,
higher priority jobs are allowed access to the cluster resources earlier than lower priority
jobs. With the Capacity scheduler, queue properties can be adjusted on-the-fly, and hence
do not require any disruption in cluster usage/processing.

While not considered a scheduler per se, Hadoop also supports the scheme of
provisioning virtual cluster environments (within physical clusters). This concept is labeled
Hadoop On Demand (HOD). The HOD approach utilizes the Torque resource manager for node
allocation to size the virtual cluster. Within the virtual environment, the HOD system prepares
the configuration files in an automated manner, and initializes the system based on the nodes that
comprise the virtual cluster. Once initialized, the HOD virtual cluster can be used in a rather
independent manner. A certain level of elasticity is built into HOD, as the system adapts to
changing workload conditions. To illustrate, HOD automatically de-allocates nodes from the
virtual cluster after detecting no active jobs over a certain time period. This shrinking behavior
allows for the most efficient usage of the overall physical cluster assets. HOD is considered a
valuable option for deploying Hadoop clusters within a cloud infrastructure.

2.15 MapReduce Performance Evaluation - NN Algorithm


A Nearest Neighbor (NN) analysis represents a method to classify cases based on their
similarity to other cases. In machine learning, NN algorithms are utilized to recognize data
patterns without requiring an exact match to any of the stored cases. Similar cases are in close
proximity while dissimilar cases are more distant from each other. Hence, the distance between 2
cases is used as the measure of (dis)similarity. Cases that are in close proximity are labeled
neighbors. As a new case is presented, its distance from each of the cases in the model is
computed. The classifications of the most similar cases (aka the nearest neighbors) are tallied,
and the new case is placed into the category that contains the greatest number of nearest
neighbors (the number of nearest neighbors to examine is normally labeled as k).
Hence, a k-nearest neighbor (kNN) analysis retrieves the k nearest points from a
dataset. A kNN query is considered one of the fundamental query types for spatial databases.
A spatial database reflects a database solution that is optimized to store/query data that is related
to objects in space (including points, lines, and polygons). While most database solutions do
understand the concept of various numeric and character data types, additional functionality is
required for database systems to process spatial data types (typically called the geometry). An all
k-nearest neighbor (akNN) analysis depicts a variation of a kNN based study, as an akNN query
determines the k-nearest neighbors for each point in the dataset. An akNN based approach is
being used extensively for batch-based processing of large, distributed (point) datasets. To
illustrate, location-based services that identify for each user their nearby counterparts (or new
friends) are good candidates for an akNN based approach. Given that the user locations are
maintained by the underlying database systems, the resulting recommendation lists can be
generated by issuing an akNN query against the database. Further, akNN queries are often
employed to preprocess data for subsequent data mining purposes.
For this study, the goal is to quantify the performance behavior of processing akNN
queries in a Hadoop environment. The methodology used to execute the queries in a MapReduce
framework closely follows the guidelines discussed in Yokoyama, Afrati, and Vernica. As
extensively discussed in Yokoyama, it is feasible to decompose the given space into cells, and to
execute an akNN query via the MapReduce framework in a distributed, parallel manner.
Compared to some of the other work on kNN solutions in distributed environments, this study
proposes a kd-tree based approach that utilizes a variable number of neighboring cells for the cell
merging process discussed below.
Figure 7: akNN Processing (k = 2, Cell Based)

To briefly illustrate the methodology, 2-dimensional points with x and y axes are
considered. The target space is decomposed into 2^n x 2^n small cells, where the constant n
determines the granularity of the decomposition. As the k-nearest neighbor points for a data
point are normally located in close proximity, the assumption made is that most of the kNN
objects are located in the nearby cells. Hence, the approach is based on classifying the data
points into the corresponding cells and computing candidate kNN points for each point. This
process can easily be parallelized, and hence is well suited for a MapReduce framework. It has to be
pointed out though that the approach may not be able to determine the kNN points in a single
processing cycle and therefore, additional processing steps may be necessary (the crux of the
issue is that data points in other nearby cells may belong to the k-nearest neighbors). To illustrate
this problem, Figure 7 depicts an akNN processing scenario for k = 2. Figure 7 outlines that the
query can locate 2 NN points for A by just considering the inside (boundary) of cell 0. In other
words, the circle centered around A already covers the target 2 NN objects without having to go
beyond the boundary of cell 0.
On the other hand, the circle for B overlaps with cells 1, 2, and 3, respectively. In such
a scenario, there is a possibility to locate the 2 NN objects in 4 different cells. Ergo, in some
circumstances, it may not be feasible to draw the circle for a point just based on the cell
boundary. For point C, there is only 1 point available in cell 1, and hence cell 1 violates the k = 2
requirement. Therefore, it is a necessary design choice for this study to prohibit scenarios where
cells contain fewer than k points. This is accomplished by 1st identifying the number of points
within each cell, and 2nd by merging a cell with fewer than k points with a neighboring cell to ensure
that the number of points in the merged cell is >= k. At that point, the boundary circle can be drawn.
The challenge with this approach is though that an additional counting cycle prior to the NN
computation is required. The benefit of the approach is that during the counting phase, cells with
no data points can be identified and hence, can be eliminated from the actual processing cycle.
The entire processing cycle encompasses 4 steps that are discussed below. The input dataset
reflects a set of records that are formatted in a [id, x, y] fashion (the parameters n and k are
initially specified by the user).

MapReduce S1 Step: Acquiring Distribution and Cell Merging Information. In this
initial step, the entire space is decomposed into 2^n x 2^n cells and the number of points per cell
is determined. The actual counting procedure is implemented as a MapReduce process. The Map
function receives the formatted input and computes the cell number for each data point based on
its coordinates. The format of the output record is a key-value pair [cell_id, data_point]. The
Reduce function receives a set of records (with the same cell ids) and sums the value parts of the
records. The output reflects a format [cell_id, sum].
To improve performance in Hadoop, an optional Combiner in the MapReduce process
can be used. A Combiner is used to aggregate intermediate data within a Map task, a process that
may reduce the size of the intermediate dataset. After the MapReduce S1 phase, cell merging is
performed. For this study, a kd-tree structure was implemented to merge neighboring cells.
In a nutshell, the cell merging function generates the mapping information (how each cell id is
matched up with a potentially new cell id in the merged cell decomposition scheme) that is used
as additional input into the MapReduce S2 step.
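
A minimal Java sketch of the S1 counting job is given below. It assumes input records of the
form id,x,y with coordinates normalized to [0, 1) (an assumption of this sketch, not a detail
stated above) and emits a count of 1 per point keyed by the linearized cell number, so that the
Reduce (or Combiner) function can sum the counts per cell.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CellCount {
    static final int N = 8;              // granularity n, as used in the study
    static final int CELLS = 1 << N;     // 2^n cells per axis

    public static class CellMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final IntWritable cellId = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");   // record format: id,x,y
            double x = Double.parseDouble(f[1]);
            double y = Double.parseDouble(f[2]);
            int col = Math.min(CELLS - 1, (int) (x * CELLS));
            int row = Math.min(CELLS - 1, (int) (y * CELLS));
            cellId.set(row * CELLS + col);              // linearized cell number
            ctx.write(cellId, ONE);
        }
    }

    // Also usable as a Combiner to aggregate counts within a Map task.
    public static class SumReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        @Override
        protected void reduce(IntWritable cell, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(cell, new IntWritable(sum));      // output: [cell_id, sum]
        }
    }
}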

MapReduce S2 Step: akNN Computation. In the 2nd step, the input records for each
cell are collected and candidate kNN points (for each point in the cell region) are computed. The
Map function receives the original data points and computes the corresponding cell id. The
output records are formatted as [cell_id, id, coordinates], where the id represents the point id, and
the coordinates reflect the actual coordinates of the point. The Reduce function receives the
records corresponding to a cell, formatted as [id, coordinates]. It calculates the distance for each
2-point combination and computes the kNN points for each point in the cell. The output records
are formatted as [id, coordinates, cell_id, kNN_list], where the id is used as the key, and the
kNN_list reflects the list of the kNN points for point id (formatted as [(o1, d1), ..., (ok, dk)],
where oi represents the i-th NN point and di the corresponding distance).

MapReduce S3 Step: Update kNN Points. In the 3rd step, the actual boundary circles
are determined. In this step, depending on the data set, additional kNN point processing
scenarios may apply. The Map function basically receives the result of the MapReduce S2 step
and for each point, the boundary circle is computed. Two different scenarios are possible here:
1. The boundary circle does not overlap with other cells: In this scenario, no further
processing is required, and the output (key value pairs) is formatted as [cell_id, id,
coordinates, kNN_list, true]. The cell_id reflects the key while the Boolean true denotes
the fact that the processing cycle is completed. In other words, in the corresponding
Reduce function, no further processing is necessary and only the record format is
converted.
2. The boundary circle overlaps with other cells: In this scenario, there is a possibility that
neighboring cells may contain kNN points, and hence an additional (check) process is
required. To accomplish that, the [key, value] pairs are formatted as [cell_idx, id,
coordinates, kNN_list, false]. The key cell_idx reflects the cell id of one of the
overlapped cells while the Boolean false indicates a non-completed process. In scenarios
where n cells overlap, n corresponding [key, value] pairs are used.

The Shuffle operation submits records with the same cell ids to the corresponding
node, and the records are used as input into the Reduce function. As different types of records
may exist as input, it is necessary to first classify the records and second to update the kNN
points in scenarios where additional checks are required. The output records are formatted as [id,
coordinates, cell_id, kNN_list], where the id field represents the key.

MapReduce S4 Step: This final processing step may be necessary as multiple update
scenarios of kNN points per point are possible. In this step, if necessary, these kNN lists are
fused, and a final kNN (result) list is generated. The Map function receives the results of the S3
step and outputs the kNN list. The output format of the records is [id, kNN_list], where the id
represents the key. The Reduce function receives the records with the same keys and generates
the integrated list formatted as [id, kNN_list]. If multiple cells per point scenarios are possible,
the Shuffle process clusters the n outputs and submits the data to the Reduce function.
Ultimately, the kNN points per reference point are determined, and the processing cycle
completes.

2.15.1 Hadoop akNN MapReduce Benchmarks


For this study, several large-dataset benchmark studies were conducted, focusing on
Hadoop MapReduce performance for solving akNN related data projects. The Hadoop (1.0.2)
cluster used for the study consisted of 12 Ubuntu 12.04 server nodes that were all equipped with
Intel Xeon 3060 processors (12GB RAM, 4M Cache, 2.40GHz, Dual Core). Each node was
configured with 4 1TB hard disks in a JBOD (Just a Bunch Of Disks) setup. The interconnect
consisted of a 10Gbit (switched) network. The benchmark data set consisted of 4,000,000,
8,000,000, 16,000,000, and 32,000,000 reference points, respectively. For this study, the Hadoop
replica factor was set to 1, while the HDFS block size was left at the default value of 64MB. The
Hadoop cluster utilized the FIFO scheduler. The (above discussed) granularity n was set to 8. In
this report, the results for the 8,000,000 reference point dataset runs are elaborated on. For the 1st
set of benchmarks, the number of Reduce tasks was scaled from 1 to 16, while k was set to 4,
and the number of Map tasks was held constant at 8 (see Figure 8).

Figure 8: akNN MapReduce Performance - Varying Number of Reduce Tasks

As Figure 8 outlines, the average total execution time decreases as the number of Reduce tasks is
increased (total execution time equals the sum of the steps S1, S2, and if necessary, S3, and
S4). With 16 Reduce tasks, the 8,000,000 data points are processed in approximately 262
seconds on the Hadoop cluster described above. The largest performance increase (delta of
approximately 505 seconds) was measured while increasing the number of Reduce tasks from 2
to 4. The study showed that with the data set at hand, only approximately 29.3% of the data
points require executing the MapReduce S3 step discussed above. In other words, due to the
rather high data point density in the target space (8,000,000 reference points), over 70% of the
data points only require executing steps S1 and S2, respectively. Further, the study revealed that
while increasing the number of worker threads, the execution time for step S1 increases. This is
due to the fact that the input size per Reducer task diminishes, and as the data processing cost is
rather low, increasing the number of Reducer tasks actually adds overhead to the aggregate
processing cycle. For the 2nd set of benchmark runs, the number of Map and Reduce tasks was
set to 8 and 24, respectively, while the number of k Nearest Neighbors was scaled from 4 to 32.
As Figure 9 depicts, the average total execution time increases as k is scaled up. The study
disclosed that while scaling k, the processing cost for the MapReduce steps S3 and S4 increases
significantly. The cost increase is mainly due to the increase of (larger size) intermediate records
that have to be processed, as well as due to the increased number of data points that require steps
S3 and S4 to be processed. To illustrate, increasing k from 16 to 32 resulted in an average record
size increase of approximately 416 bytes, while at the same time an additional 10.2% of the data
points required executing steps S3 and S4. From an average total execution time perspective, the
delta between the k=16 and k=32 runs was approximately 534 seconds, while the delta for the k=8
and k=16 runs was only approximately 169 seconds. To summarize, while processing the akNN
MapReduce framework in parallel, it is possible to reduce the total execution time rather
considerably (see Figure 8).

Figure 9: akNN MapReduce Performance - Varying Number of k Nearest Neighbors


CHAPTER 3: Hadoop Installation

3.1 Ubuntu Introduction


Ubuntu is a Debian-based Linux operating system for personal computers, tablets and
smartphones (where the Ubuntu Touch edition is used). It also runs on network servers, usually
with the Ubuntu Server edition, either on physical or virtual servers (such as on mainframes) or
in containers, with enterprise-class features. It runs on the most popular architectures, including
server-class ARM-based systems.
Ubuntu is published by Canonical Ltd, who offer commercial support. It is based on
free software and named after the Southern African philosophy of ubuntu, which Canonical Ltd.
suggests can be loosely translated as "humanity to others" or "I am what I am because of who we
all are". Since Ubuntu 11.04 Natty Narwhal, Ubuntu has used Unity as its default user interface
for the desktop, but following the release of Ubuntu 17.10 it will move to the GNOME 3 desktop
instead, as work on Unity ends. Ubuntu is the most popular operating system running in hosted
environments, so-called "clouds", as it is the most popular server Linux distribution.
Development of Ubuntu is led by UK-based Canonical Ltd., a company of South
African entrepreneur Mark Shuttleworth. Canonical generates revenue through the sale of
technical support and other services related to Ubuntu. The Ubuntu project is publicly committed
to the principles of open-source software development; people are encouraged to use free
software, study how it works, improve upon it, and distribute it.
Here you can find information on how to install and configure various server
applications. It is a step-by-step, task-oriented guide for configuring and customizing your
system.
This guide assumes you have a basic understanding of your Ubuntu system.

Support
There are a couple of different ways that Ubuntu Server Edition is supported:
commercial support and community support. The main commercial support (and development
funding) is available from Canonical, Ltd. They supply reasonably priced support contracts on a
per desktop or per server basis. For more information see the Ubuntu Advantage page.
Community support is also provided by dedicated individuals and companies that wish
to make Ubuntu the best distribution possible. Support is provided through multiple mailing lists,
IRC channels, forums, blogs, wikis, etc. The large amount of information available can be
overwhelming, but a good search engine query can usually provide an answer to your questions.

3.2 Ubuntu Installation


3.2.1 Preparing to Install
3.2.1.1 System Requirements
Ubuntu 16.04 LTS Server Edition supports three (3) major architectures: Intel x86, AMD64 and
ARM. The table below lists recommended hardware specifications. Depending on your needs,
you might manage with less than this. However, most users risk being frustrated if they ignore
these suggestions.

Recommended Minimum Requirements

Install Type         CPU              RAM              Hard Drive Space
                                                       Base System      All Tasks Installed
Server (Standard)    1 gigahertz      512 megabytes    1 gigabyte       1.75 gigabytes
Server (Minimal)     300 megahertz    192 megabytes    700 megabytes    1.4 gigabytes

The Server Edition provides a common base for all sorts of server applications. It is a minimalist
design providing a platform for the desired services, such as file/print services, web hosting,
email hosting, etc.

3.2.1.2 Server and Desktop Differences


There are a few differences between the Ubuntu Server Edition and the Ubuntu
Desktop Edition. It should be noted that both editions use the same apt repositories, making it
just as easy to install a server application on the Desktop Edition as it is on the Server Edition.
The differences between the two editions are the lack of an X window environment in
the Server Edition and the installation process.

Kernel Differences:
Ubuntu versions 10.10 and prior actually had different kernels for the server and
desktop editions. Ubuntu no longer has separate -server and -generic kernel flavors. These have
been merged into a single -generic kernel flavor to help reduce the maintenance burden over the
life of the release.

Note: When running a 64-bit version of Ubuntu on 64-bit processors you are not limited by
memory addressing space.

To see all kernel configuration options you can look through /boot/config-4.4.0-server. Also,
Linux Kernel in a Nutshell is a great resource on the options available.

3.2.1.3 Backing Up
Before installing Ubuntu Server Edition you should make sure all data on the system is backed
up. See Backups for backup options.

If this is not the first time an operating system has been installed on your computer, it is likely
you will need to re-partition your disk to make room for Ubuntu.

Any time you partition your disk, you should be prepared to lose everything on the disk should
you make a mistake or should something go wrong during partitioning. The programs used in
installation are quite reliable, and most have seen years of use, but they also perform destructive
actions.

3.2.2 Installing from CD


The basic steps to install Ubuntu Server Edition from CD are the same as those for
installing any operating system from CD. Unlike the Desktop Edition, the Server Edition does
not include a graphical installation program. The Server Edition uses a console menu based
process instead.

1. Download and burn the appropriate ISO file from the Ubuntu web site.

2. Boot the system from the CD-ROM drive.

3. At the boot prompt you will be asked to select a language.

4. From the main boot menu there are some additional options to install Ubuntu Server
Edition. You can install a basic Ubuntu Server, check the CD-ROM for defects, check the
system's RAM, boot from first hard disk, or rescue a broken system. The rest of this
section will cover the basic Ubuntu Server install.

5. The installer asks which language it should use. Afterwards, you are asked to select your
location.

6. Next, the installation process begins by asking for your keyboard layout. You can ask the
installer to attempt auto-detecting it, or you can select it manually from a list.

7. The installer then discovers your hardware configuration, and configures the network
settings using DHCP. If you do not wish to use DHCP at the next screen choose "Go
Back", and you have the option to "Configure the network manually".

8. Next, the installer asks for the system's hostname.

9. A new user is set up; this user will have root access through the sudo utility.

10. After the user settings have been completed, you will be asked if you want to encrypt
your home directory.

11. Next, the installer asks for the system's Time Zone.

12. You can then choose from several options to configure the hard drive layout. Afterwards
you are asked which disk to install to. You may get confirmation prompts before
rewriting the partition table or setting up LVM depending on disk layout. If you choose
LVM, you will be asked for the size of the root logical volume. For advanced disk
options see Advanced Installation.
13. The Ubuntu base system is then installed.

14. The next step in the installation process is to decide how you want to update the system.
There are three options:

i. No automatic updates: this requires an administrator to log into the machine and
manually install updates.

ii. Install security updates automatically: this will install the unattended-upgrades
package, which will install security updates without the intervention of an
administrator. For more details see Automatic Updates.

iii. Manage the system with Landscape: Landscape is a paid service provided by
Canonical to help manage your Ubuntu machines. See the Landscape site for
details.

15. You now have the option to install, or not install, several package tasks. See Package
Tasks for details. Also, there is an option to launch aptitude to choose specific packages
to install. For more information see Aptitude.

16. Finally, the last step before rebooting is to set the clock to UTC.

Note: If at any point during installation you are not satisfied by the default setting, use the "Go
Back" function at any prompt to be brought to a detailed installation menu that will allow you to
modify the default settings.

At some point during the installation process you may want to read the help screen provided by
the installation system. To do this, press F1.

Package Tasks
During the Server Edition installation you have the option of installing additional packages from
the CD. The packages are grouped by the type of service they provide.

1. DNS server: Selects the BIND DNS server and its documentation.

2. LAMP server: Selects a ready-made Linux/Apache/MySQL/PHP server.

3. Mail server: This task selects a variety of packages useful for a general purpose mail
server system.

4. OpenSSH server: Selects packages needed for an OpenSSH server.

5. PostgreSQL database: This task selects client and server packages for the PostgreSQL
database.
6. Print server: This task sets up your system to be a print server.

7. Samba File server: This task sets up your system to be a Samba file server, which is
especially suitable in networks with both Windows and Linux systems.

8. Tomcat Java server: Installs Apache Tomcat and needed dependencies.

9. Virtual Machine host: Includes packages needed to run KVM virtual machines.

10. Manually select packages: Executes aptitude allowing you to individually select
packages.

Installing the package groups is accomplished using the tasksel utility. One of the important
differences between Ubuntu (or Debian) and other GNU/Linux distributions is that, when
installed, a package is also configured to reasonable defaults, eventually prompting you for
additional required information. Likewise, when installing a task, the packages are not only
installed, but also configured to provide a fully integrated service.
Once the installation process has finished you can view a list of available tasks by
entering the following from a terminal prompt:
tasksel --list-tasks

Note: The output will list tasks from other Ubuntu based distributions such as Kubuntu and
Edubuntu. Note that you can also invoke the tasksel command by itself, which will bring up a
menu of the different tasks available.

You can view a list of which packages are installed with each task using the --task-packages
option. For example, to list the packages installed with the DNS Server task enter the following:
tasksel --task-packages dns-server

The output of the command should list:


bind9-doc
bind9utils
bind9

If you did not install one of the tasks during the installation process, but for example you decide
to make your new LAMP server a DNS server as well, simply insert the installation CD and from
a terminal:
sudo tasksel install dns-server

3.2.3 Upgrading
There are several ways to upgrade from one Ubuntu release to another. This section
gives an overview of the recommended upgrade method: do-release-upgrade.

do-release-upgrade
The recommended way to upgrade a Server Edition installation is to use the do-release-
upgrade utility. Part of the update-manager-core package, it does not have any graphical
dependencies and is installed by default.
Debian based systems can also be upgraded by using apt dist-upgrade. However, using
do-release-upgrade is recommended because it has the ability to handle system configuration
changes sometimes needed between releases.
To upgrade to a newer release, from a terminal prompt enter:
do-release-upgrade

It is also possible to use do-release-upgrade to upgrade to a development version of Ubuntu. To
accomplish this, use the -d switch:
do-release-upgrade -d

Note: Upgrading to a development release is not recommended for production environments.

For further stability of an LTS release there is a slight change in behaviour if you are currently
running an LTS version. LTS systems are only automatically considered for an upgrade to the
next LTS via do-release-upgrade with the first point release. So, for example, 14.04 will only
upgrade once 16.04.1 is released. If you want to upgrade before that, e.g. on a subset of machines
to evaluate the LTS upgrade for your setup, the same -d switch used for upgrading to a
development release has to be used.

3.2.4 Advanced Installation

3.2.4.1 Software RAID


Redundant Array of Independent Disks "RAID" is a method of using multiple disks to
provide different balances of increasing data reliability and/or increasing input/output
performance, depending on the RAID level being used. RAID is implemented in either software
(where the operating system knows about both drives and actively maintains both of them) or
hardware (where a special controller makes the OS think there's only one drive and maintains the
drives 'invisibly').
The RAID software included with current versions of Linux (and Ubuntu) is based on
the 'mdadm' driver and works very well, better even than many so-called 'hardware' RAID
controllers. This section will guide you through installing Ubuntu Server Edition using two
RAID1 partitions on two physical hard drives, one for / and another for swap.

Partitioning
Follow the installation steps until you get to the Partition disks step, then:
1. Select Manual as the partition method.
2. Select the first hard drive, and agree to "Create a new empty partition table on this
device?".
Repeat this step for each drive you wish to be part of the RAID array.
3. Select the "FREE SPACE" on the first drive then select "Create a new partition".
4. Next, select the Size of the partition. This partition will be the swap partition, and a
general rule for swap size is twice that of RAM. Enter the partition size, then choose
Primary, then Beginning.
Note: A swap partition size of twice the available RAM capacity may not always be
desirable, especially on systems with large amounts of RAM. Calculating the swap
partition size for servers is highly dependent on how the system is going to be used.
5. Select the "Use as:" line at the top. By default this is "Ext4 journaling file system",
change that to "physical volume for RAID" then "Done setting up partition".
6. For the / partition once again select "Free Space" on the first drive then "Create a new
partition".
7. Use the rest of the free space on the drive and choose Continue, then Primary.
8. As with the swap partition, select the "Use as:" line at the top, changing it to "physical
volume for RAID". Also select the "Bootable flag:" line to change the value to "on".
Then choose "Done setting up partition".
9. Repeat steps three through eight for the other disk and partitions.

RAID Configuration
With the partitions setup the arrays are ready to be configured:
1. Back in the main "Partition Disks" page, select "Configure Software RAID" at the top.
2. Select "yes" to write the changes to disk.
3. Choose "Create MD device".
4. For this example, select "RAID1", but if you are using a different setup choose the
appropriate type (RAID0 RAID1 RAID5).
Note: In order to use RAID5 you need at least three drives. Using RAID0 or RAID1 only
two drives are required.
5. Enter the number of active devices, "2", or the number of hard drives you have for the
array. Then select "Continue".
6. Next, enter the number of spare devices "0" by default, then choose "Continue".
7. Choose which partitions to use. Generally they will be sda1, sdb1, sdc1, etc. The numbers
will usually match and the different letters correspond to different hard drives.
For the swap partition choose sda1 and sdb1. Select "Continue" to go to the next step.
8. Repeat steps three through seven for the / partition choosing sda2 and sdb2.
9. Once done select "Finish".

Formatting
There should now be a list of hard drives and RAID devices. The next step is to format and set
the mount point for the RAID devices. Treat the RAID device as a local hard drive, format and
mount accordingly.
1. Select "#1" under the "RAID1 device #0" partition.
2. Choose "Use as:". Then select "swap area", then "Done setting up partition".
3. Next, select "#1" under the "RAID1 device #1" partition.
4. Choose "Use as:". Then select "Ext4 journaling file system".
5. Then select the "Mount point" and choose "/ - the root file system". Change any of the
other options as appropriate, then select "Done setting up partition".
6. Finally, select "Finish partitioning and write changes to disk".

If you choose to place the root partition on a RAID array, the installer will then ask if
you would like to boot in a degraded state. See Degraded RAID for further details.
The installation process will then continue normally.

Degraded RAID
At some point in the life of the computer a disk failure event may occur. When this
happens, using Software RAID, the operating system will place the array into what is known as a
degraded state.
If the array has become degraded then, due to the chance of data corruption, Ubuntu
Server Edition will by default boot to the initramfs after thirty seconds. Once the initramfs has booted
there is a fifteen-second prompt giving you the option to go ahead and boot the system, or to
attempt a manual recovery. Booting to the initramfs prompt may or may not be the desired
behavior, especially if the machine is in a remote location. Booting to a degraded array can be
configured several ways:
1. The dpkg-reconfigure utility can be used to configure the default behavior, and during the
process you will be queried about additional settings related to the array, such as
monitoring, email alerts, etc. To reconfigure mdadm enter the following:
sudo dpkg-reconfigure mdadm

2. The dpkg-reconfigure mdadm process will change the /etc/initramfs-tools/conf.d/mdadm


configuration file. The file has the advantage of being able to pre-configure the system's
behavior, and can also be manually edited:
BOOT_DEGRADED=true

Note: The configuration file can be overridden by using a Kernel argument.


3. Using a Kernel argument will allow the system to boot to a degraded array as well:
i. When the server is booting press Shift to open the Grub menu.
ii. Press e to edit your kernel command options.
iii. Press the down arrow to highlight the kernel line.
iv. Add "bootdegraded=true" (without the quotes) to the end of the line.
v. Press Ctrl+x to boot the system.

Once the system has booted you can either repair the array (see RAID Maintenance for
details) or, in the case of a major hardware failure, copy important data to another machine.

RAID Maintenance
The mdadm utility can be used to view the status of an array, add disks to an array, remove disks,
etc:
1. To view the status of an array, from a terminal prompt enter:
sudo mdadm -D /dev/md0
The -D tells mdadm to display detailed information about the /dev/md0 device. Replace
/dev/md0 with the appropriate RAID device.
2. To view the status of a disk in an array:
sudo mdadm -E /dev/sda1
The output is very similar to that of the mdadm -D command; adjust /dev/sda1 for each disk.

3. If a disk fails and needs to be removed from an array enter:


sudo mdadm --remove /dev/md0 /dev/sda1
Change /dev/md0 and /dev/sda1 to the appropriate RAID device and disk.

4. Similarly, to add a new disk:


sudo mdadm --add /dev/md0 /dev/sda1

Sometimes a disk can change to a faulty state even though there is nothing physically
wrong with the drive. It is usually worthwhile to remove the drive from the array then re-add it.
This will cause the drive to re-sync with the array. If the drive will not sync with the array, it is a
good indication of hardware failure.
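
As an illustration, a typical remove-and-re-add sequence looks like the following (the array and
disk names are examples; substitute your own):
# mark the disk as faulty, remove it, then add it back to trigger a re-sync
sudo mdadm --fail /dev/md0 /dev/sda1
sudo mdadm --remove /dev/md0 /dev/sda1
sudo mdadm --add /dev/md0 /dev/sda1
# watch the re-sync progress
watch -n1 cat /proc/mdstat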
The /proc/mdstat file also contains useful information about the system's RAID
devices:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
10016384 blocks [2/2] [UU]

unused devices: <none>

The following command is great for watching the status of a syncing drive:
watch -n1 cat /proc/mdstat
Press Ctrl+c to stop the watch command.

If you do need to replace a faulty drive, after the drive has been replaced and synced, grub will
need to be installed. To install grub on the new drive, enter the following:
sudo grub-install /dev/md0
Replace /dev/md0 with the appropriate array device name.

3.2.4.2 Logical Volume Manager (LVM)


Logical Volume Manager, or LVM, allows administrators to create logical volumes out of one or
multiple physical hard disks. LVM volumes can be created on both software RAID partitions and
standard partitions residing on a single disk. Volumes can also be extended, giving greater
flexibility to systems as requirements change.

Overview
A side effect of LVM's power and flexibility is a greater degree of complication. Before diving
into the LVM installation process, it is best to get familiar with some terms.
1. Physical Volume (PV): physical hard disk, disk partition or software RAID partition
formatted as LVM PV.
2. Volume Group (VG): is made from one or more physical volumes. A VG can be
extended by adding more PVs. A VG is like a virtual disk drive, from which one or more
logical volumes are carved.
3. Logical Volume (LV): is similar to a partition in a non-LVM system. A LV is formatted
with the desired file system (EXT3, XFS, JFS, etc), it is then available for mounting and
data storage.

Installation
As an example this section covers installing Ubuntu Server Edition with /srv mounted
on a LVM volume. During the initial install only one Physical Volume (PV) will be part of the
Volume Group (VG). Another PV will be added after install to demonstrate how a VG can be
extended.
There are several installation options for LVM: "Guided - use the entire disk and set up
LVM", which will also allow you to assign a portion of the available space to LVM; "Guided -
use entire disk and set up encrypted LVM"; or manually set up the partitions and configure LVM. At
this time the only way to configure a system with both LVM and standard partitions during
installation is to use the Manual approach.

1. Follow the installation steps until you get to the Partition disks step, then:
2. At the "Partition Disks screen choose "Manual".
3. Select the hard disk and on the next screen choose "yes" to "Create a new empty partition
table on this device".
4. Next, create standard /boot, swap, and / partitions with whichever filesystem you prefer.
5. For the LVM /srv, create a new Logical partition. Then change "Use as" to "physical
volume for LVM" then "Done setting up the partition".
6. Now select "Configure the Logical Volume Manager" at the top, and choose "Yes" to
write the changes to disk.
7. For the "LVM configuration action" on the next screen, choose "Create volume group".
Enter a name for the VG such as vg01, or something more descriptive. After entering a
name, select the partition configured for LVM, and choose "Continue".
8. Back at the "LVM configuration action" screen, select "Create logical volume". Select
the newly created volume group, and enter a name for the new LV, for example srv since
that is the intended mount point. Then choose a size, which may be the full partition
because it can always be extended later. Choose "Finish" and you should be back at the
main "Partition Disks" screen.
9. Now add a filesystem to the new LVM. Select the partition under "LVM VG vg01, LV
srv" (or whatever name you have chosen), then choose "Use as:". Set up a file system as
normal, selecting /srv as the mount point. Once done, select "Done setting up the
partition".
Finally, select "Finish partitioning and write changes to disk". Then confirm the changes
and continue with the rest of the installation.

There are some useful utilities to view information about LVM:


1. pvdisplay: shows information about Physical Volumes.
2. vgdisplay: shows information about Volume Groups.
3. lvdisplay: shows information about Logical Volumes.

Extending Volume Groups


Continuing with srv as an LVM volume example, this section covers adding a second
hard disk, creating a Physical Volume (PV), adding it to the volume group (VG), extending the
logical volume srv, and finally extending the filesystem. This example assumes a second hard
disk has been added to the system; this hard disk will be named /dev/sdb and we
will use the entire disk as a physical volume (you could choose to create partitions and use them
as different physical volumes).

Note: Make sure you don't already have an existing /dev/sdb before issuing the commands
below. You could lose some data if you issue those commands on a non-empty disk.

1. First, create the physical volume, in a terminal execute:


sudo pvcreate /dev/sdb

2. Now extend the Volume Group (VG):


sudo vgextend vg01 /dev/sdb

3. Use vgdisplay to find out the free physical extents - Free PE / size (the size you can
allocate). We will assume a free size of 511 PE (equivalent to 2GB with a PE size of
4MB) and we will use the whole free space available. Use your own PE and/or free
space.
The Logical Volume (LV) can now be extended by different methods, we will only see
how to use the PE to extend the LV:
sudo lvextend /dev/vg01/srv -l +511
The -l option allows the LV to be extended using physical extents (PE). The -L option allows the LV to be
extended using megabytes, gigabytes, terabytes, and so on.
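For example, to grow the same LV by a size rather than by extents (the 2G value below is
illustrative), the -L form would be:
# extend the logical volume by 2 gigabytes instead of specifying extents
sudo lvextend -L +2G /dev/vg01/srv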

4. Even though you are supposed to be able to expand an ext3 or ext4 filesystem without
unmounting it first, it may be a good practice to unmount it anyway and check the
filesystem, so that you don't mess up the day you want to reduce a logical volume (in that
case unmounting first is compulsory).
The following commands are for an EXT3 or EXT4 filesystem. If you are using another
filesystem there may be other utilities available.
sudo umount /srv
sudo e2fsck -f /dev/vg01/srv
The -f option of e2fsck forces checking even if the system seems clean.

5. Finally, resize the filesystem:


sudo resize2fs /dev/vg01/srv

6. Now mount the partition and check its size.


sudo mount /dev/vg01/srv /srv && df -h /srv
3.2.4.3 iSCSI
The iSCSI protocol can be used to install Ubuntu on systems with or without hard disks attached.

Installation on a diskless system


The first steps of a diskless iSCSI installation are identical to the Installing from CD section up
to "Hard drive layout".

1. The installer will display a warning with the following message:


No disk drive was detected. If you know the name of the driver needed by your disk
drive, you can select it from the list.

2. Select the item in the list titled login to iSCSI targets.

3. You will be prompted to Enter an IP address to scan for iSCSI targets with a description
of the format for the address. Enter the IP address for the location of your iSCSI target
and navigate to <continue> then hit ENTER

4. If authentication is required in order to access the iSCSI device, provide the username in
the next field. Otherwise leave it blank.

5. If your system is able to connect to the iSCSI provider, you should see a list of available
iSCSI targets where the operating system can be installed. The list should be similar to
the following :
Select the iSCSI targets you wish to use.
iSCSI targets on 192.168.1.29:3260:
[ ] iqn.2016-03.TrustyS-iscsitarget:storage.sys0
<Go Back> <Continue>

6. Select the iSCSI target that you want to use with the space bar. Use the arrow keys to
navigate to the target that you want to select.

7. Navigate to <Continue> and hit ENTER.

If the connection to the iSCSI target is successful, you will be prompted with the [!!]
Partition disks installation menu. The rest of the procedure is identical to any normal installation
on attached disks. Once the installation is completed, you will be asked to reboot.

Installation on a system with disk attached


Again, the iSCSI installation on a normal server with one or many disks attached is identical to
the Installing from CD section until we reach the disk partitioning menu. Instead of using any of
the Guided selection, we need to perform the following steps :

1. Navigate to the Manual menu entry


2. Select the Configure iSCSI Volumes menu entry

3. Choose the Log into iSCSI targets

4. You will be prompted to Enter an IP address to scan for iSCSI targets, with a description
of the format for the address. Enter the IP address and navigate to <continue> then hit
ENTER.

5. If authentication is required in order to access the iSCSI device, provide the username in
the next field or leave it blank.

6. If your system is able to connect to the iSCSI provider, you should see a list of available
iSCSI targets where the operating system can be installed. The list should be similar to
the following :
Select the iSCSI targets you wish to use.
iSCSI targets on 192.168.1.29:3260:
[ ] iqn.2016-03.TrustyS-iscsitarget:storage.sys0
<Go Back> <Continue>

7. Select the iSCSI target that you want to use with the space bar. Use the arrow keys to
navigate to the target that you want to select

8. Navigate to <Continue> and hit ENTER.

9. If successful, you will come back to the menu asking you to Log into iSCSI targets.
Navigate to Finish and hit ENTER

The newly connected iSCSI disk will appear in the overview section as a device
prefixed with SCSI. This is the disk that you should select as your installation disk. Once
identified, you can choose any of the partitioning methods.

Note: Depending on your system configuration, there may be other SCSI disks attached to the
system. Be very careful to identify the proper device before proceeding with the installation.
Otherwise, irreversible data loss may result from performing an installation on the wrong disk.

Rebooting to an iSCSI target


The procedure is specific to your hardware platform. As an example, here is how to reboot to
your iSCSI target using iPXE:
iPXE> dhcp
Configuring (net0 52:54:00:a4:f2:a9)....... ok
iPXE> sanboot iscsi:192.168.1.29::::iqn.2016-03.TrustyS-iscsitarget:storage.sys0

If the procedure is successful, you should see the Grub menu appear on the screen.
3.3 Hadoop Installation and Deployment
Now that we have been introduced to Hadoop and learned about its core components, HDFS and
YARN and their related processes, as well as different deployment modes for Hadoop, let's look
at the different options for getting a functioning Hadoop cluster up and running.

3.3.1 Installation Platforms and Prerequisites


Before you install Hadoop there are a few installation requirements, prerequisites, and
recommendations of which you should be aware.

3.3.1.1 Operating System Requirements


The vast majority of Hadoop implementations are platformed on Linux hosts. This is due to a
number of reasons:

The Hadoop project, although cross-platform in principle, was originally targeted at
Linux. It was several years after the initial release that a Windows-compatible
distribution was introduced.

Many of the commercial vendors only support Linux.

Many other projects in the open source and Hadoop ecosystem have compatibility issues
with non-Linux platforms.

That said, there are options for installing Hadoop on Windows, should this be your platform of
choice. We will use Linux for all of our exercises and examples in this book, but consult the
documentation for your preferred Hadoop distribution for Windows installation and support
information if required.

If you are using Linux, choose a distribution you are comfortable with. All major distributions
are supported (Red Hat, Centos, Ubuntu, SLES, etc.). You can even mix distributions if
appropriate; for instance, master nodes running Red Hat and slave nodes running Ubuntu.

Caution: Don't Use Logical Volume Manager (LVM) in Linux


If you are using Linux to deploy Hadoop nodes, master or slaves, it is strongly recommended
that you not use LVM in Linux. This will restrict performance, especially on slave nodes.

3.3.1.2 Hardware Requirements


Although there are no hard and fast requirements, there are some general heuristics
used in sizing instances, or hosts, appropriately for roles within a Hadoop cluster. First, you need
to distinguish between master and slave node instances, and their requirements.

Master Nodes
A Hadoop cluster relies on its master nodes, which host the NameNode and
ResourceManager, to operate, although you can implement high availability for each subsystem.
Failure and failover of these components is not desired. Furthermore, the master node processes,
particularly the NameNode, require a large amount of memory to operate efficiently, as you will
see when we dive into the internals of HDFS. Therefore, when specifying hardware requirements the
following guidelines can be used for medium to large-scale production Hadoop implementations:

16 or more CPU cores (preferably 32)


128GB or more RAM (preferably 256GB)
RAID Hard Drive Configuration (preferably with hot-swappable drives)
Redundant power supplies
Bonded Gigabit Ethernet or 10Gigabit Ethernet

This is only a guide, and as technology moves on quickly, these recommendations will change as
well. The bottom line is that you need carrier class hardware with as much CPU and memory
capacity as you can get!

Slave Nodes
Slave nodes do the actual work in Hadoop, both for processing and storage, so they will benefit
from more CPU and memory (physical memory, not virtual memory). That said, slave nodes are
designed with the expectation of failure, which is one of the reasons blocks are replicated in
HDFS. Slave nodes can also be scaled out linearly. For instance, you can simply add more nodes
to add more aggregate storage or processing capacity to the cluster, which you cannot do with
master nodes. With this in mind, economic scalability is the objective when it comes to slave
nodes. The following is a guide for slave nodes for a well-subscribed, computationally intensive
Hadoop cluster; for instance, a cluster hosting machine learning and in memory processing using
Spark.

16-32 CPU cores


64-512 GB of RAM
12-24 1-4 TB hard disks in a JBOD Configuration

Note: JBOD
JBOD is an acronym for just a bunch of disks, meaning directly attached storage that is not in a
RAID configuration, where each disk operates independently of the other disks. RAID is not
recommended for block storage on slave nodes as the access speed is limited by the slowest disk
in the array, unlike JBOD where the average speed can be greater than that of the slowest disk.
JBOD has been proven to outperform RAID 0 for block storage by 30% to 50% in benchmarks
conducted at Yahoo!.

Caution: Storing Too Much Data on Any Slave Node May Cause Issues
As slave nodes typically host the blocks in a Hadoop filesystem, and as storage costs,
particularly for JBOD configurations, are relatively inexpensive, it may be tempting to allocate
excess block storage capacity to each slave node. However, as you will learn in the next hour on
HDFS, you need to consider the network impact of a failed node, which will trigger re-
replication of all blocks that were stored on the slave node.
Slave nodes are designed to be deployed on commodity-class hardware. While
they still need ample processing power in the form of CPU cores and memory, as they will be
executing computational and data transformation tasks, they don't require the same degree of
fault tolerance that master nodes do.

Networking Considerations
Fully distributed Hadoop clusters are very chatty, with control messages, status updates
and heartbeats, block reports, data shuffling, and block replication, and there is often heavy
network utilization between nodes of the cluster. If you are deploying Hadoop on-premise, you
should always deploy Hadoop clusters on private subnets with dedicated switches. If you are
using multiple racks for your Hadoop cluster (you will learn more about this in Hour 21,
Understanding Advanced HDFS), you should consider redundant core and top of rack
switches.
Hostname resolution is essential between nodes of a Hadoop cluster, so both forward
and reverse DNS lookups must work correctly between each node (master-slave and slave-slave)
for Hadoop to function. Either DNS or hosts files can be used for resolution. IPv6 should also
be disabled on all hosts in the cluster.
Time synchronization between nodes of the cluster is essential as well, as some
components, such as Kerberos, which is discussed in Hour 22, Securing Hadoop, rely on this
being the case. It is recommended you use ntp (Network Time Protocol) to keep clocks
synchronized between all nodes.
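
As a quick sanity check, the following commands (hostnames, addresses, and package names are
illustrative and vary by distribution) verify resolution and enable time synchronization on a
RHEL/CentOS host:
# forward and reverse lookups should return consistent results on every node
getent hosts hadoopnode0
getent hosts 192.168.1.10
# install and enable ntp so clocks stay synchronized
sudo yum install -y ntp
sudo systemctl enable ntpd
sudo systemctl start ntpd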

3.3.1.3 Software Requirements


As discussed, Hadoop is almost entirely written in Java and compiled to run in a Java Runtime
Environment (JRE); therefore Java is a prerequisite to installing Hadoop. Current prerequisites
include:

Java Runtime Environment (JRE) 1.7 or above


Java Development Kit (JDK) 1.7 or above (required if you will be compiling Java
classes such as MapReduce applications)

Other ecosystem projects will have their specific prerequisites; for instance, Apache Spark
requires Scala and Python as well, so you should always refer to the documentation for these
specific projects.

3.3.2 Installing Hadoop


You have numerous options for installing Hadoop and setting up Hadoop clusters. As
Hadoop is a top-level Apache Software Foundation (ASF) open source project, one method is to
install directly from the Apache builds on http://hadoop.apache.org/. To do this you first need
one or more hosts, depending upon the mode you wish to use, with appropriate hardware
specifications, an appropriate operating system, and a Java runtime environment available (all of
the prerequisites and considerations discussed in the previous section).
Once you have this, it is simply a matter of downloading and unpacking the desired
release. There may be some additional configuration to be done afterwards, but then you simply
start the relevant services (master and slave node daemons) on their designated hosts and you are
up and running.

Non-Commercial Hadoop
Let's deploy a Hadoop cluster using the latest Apache release now.

Try It Yourself: Installing Hadoop Using the Apache Release

In this exercise we will install a pseudo-distributed mode Hadoop cluster using the latest Hadoop
release downloaded from hadoop.apache.org.

As this is a test cluster the following specifications will be used in our example:
Red Hat Enterprise Linux 7.2 (The installation steps would be similar using other Linux
distributions such as Ubuntu)
2 CPU cores
8GB RAM
30GB HDD
hostname: hadoopnode0

1. Disable SELinux (this is known to cause issues with Hadoop):


$ sudo sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config

2. Disable IPv6 (this is also known to cause issues with Hadoop):


$ sudo sed -i "\$anet.ipv6.conf.all.disable_ipv6 = 1" /etc/sysctl.conf
$ sudo sed -i "\$anet.ipv6.conf.default.disable_ipv6 = 1" /etc/sysctl.conf
$ sudo sysctl -p
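
Optionally, confirm the setting has been applied; the following should print 1 once sysctl -p
has run (and again after the reboot in the next step):
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6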

3. Reboot

4. Run the sestatus command to ensure SELinux is not enabled:


$ sestatus

5. Install Java. We will install the OpenJDK, which will install both a JDK and JRE:
$ sudo yum install java-1.7.0-openjdk-devel

a. Test that Java has been successfully installed by running the following command:
$ java -version

If Java has been installed correctly you should see output similar to the following:
java version "1.7.0_101"
OpenJDK Runtime Environment (rhel-2.6.6.1.el7_2-x86_64..)
OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)

Note that depending upon which operating system you are deploying on, you may
have a version of Java and a JDK installed already. In these cases it may not be
necessary to install the JDK, or you may need to set up alternatives so you do not
have conflicting Java versions.

6. Locate the installation path for Java, and set the JAVA_HOME environment variable:
$ export JAVA_HOME=/usr/lib/jvm/REPLACE_WITH_YOUR_PATH/
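
If you are unsure of the installation path, one way to locate it (assuming the OpenJDK package
installed above) is to resolve the java binary; JAVA_HOME is the JVM directory above the bin
(or jre/bin) directory:
# prints the resolved path of the java executable
$ readlink -f $(which java)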

7. Download Hadoop from your nearest Apache download mirror. You can obtain the link
by selecting the binary option for the version of your choice at
http://hadoop.apache.org/releases.html. We will use Hadoop version 2.7.2 for our
example.
$ wget http://REPLACE_WITH_YOUR_MIRROR/hadoop-2.7.2.tar.gz

8. Unpack the Hadoop release, move it into a system directory, and set an environment
variable from the Hadoop home directory:
$ tar -xvf hadoop-2.7.2.tar.gz
$ mv hadoop-2.7.2 hadoop
$ sudo mv hadoop/ /usr/share/
$ export HADOOP_HOME=/usr/share/hadoop

9. Create a directory which we will use as an alternative to the Hadoop configuration


directory:
$ sudo mkdir -p /etc/hadoop/conf

10. Create a mapred-site.xml file (I will discuss this later) in the Hadoop configuration
directory:
$ sudo cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template
$HADOOP_HOME/etc/hadoop/mapred-site.xml

11. Add JAVA_HOME environment variable to hadoop-env.sh (file used to source


environment variables for Hadoop processes):
$ sed -i "\$aexport JAVA_HOME=/REPLACE_WITH_YOUR_JDK_PATH/"
$HADOOP_HOME/etc/hadoop/hadoop-env.sh
Substitute the correct path to your Java home directory as defined in Step 6.

12. Create a symbolic link between the Hadoop configuration directory and the /etc/hadoop/conf
directory created in Step 9:
$ sudo ln -s $HADOOP_HOME/etc/hadoop/* /etc/hadoop/conf/

13. Create a logs directory for Hadoop:


$ mkdir $HADOOP_HOME/logs

14. Create users and groups for HDFS and YARN:


$ sudo groupadd hadoop
$ sudo useradd -g hadoop hdfs
$ sudo useradd -g hadoop yarn
15. Change the group and permissions for the Hadoop release files:
$ sudo chgrp -R hadoop /usr/share/hadoop
$ sudo chmod -R 777 /usr/share/hadoop

16. Run the built in Pi Estimator example included with the Hadoop release.
$ cd $HADOOP_HOME
$ sudo -u hdfs bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-
2.7.2.jar pi 16 1000

As we have not started any daemons or initialized HDFS, this program runs in
LocalJobRunner mode (recall that I discussed this in Hour 2, Understanding the Hadoop
Cluster Architecture). If this runs correctly you should see output similar to the
following:

...
Job Finished in 2.571 seconds
Estimated value of Pi is 3.14250000000000000000
Now let's configure a pseudo-distributed mode Hadoop cluster from your installation.

17. Use the vi editor to update the core-site.xml file, which contains important information
about the cluster, specifically the location of the namenode:
$ sudo vi /etc/hadoop/conf/core-site.xml
# add the following config between the <configuration>
# and </configuration> tags:
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoopnode0:8020</value>
</property>

Note that the value for the fs.defaultFS configuration parameter needs to be set to
hdfs://HOSTNAME:8020, where the HOSTNAME is the name of the NameNode host,
which happens to be the localhost in this case.
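
As an optional check (not one of the original steps), you can ask Hadoop which value it has
picked up by running the following from $HADOOP_HOME; it should print the hdfs:// URI you just
configured:
$ bin/hdfs getconf -confKey fs.defaultFS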

18. Adapt the instructions in Step 17 to similarly update the hdfs-site.xml file, which contains
information specific to HDFS, including the replication factor, which is set to 1 in this
case as it is a pseudo-distributed mode cluster:
sudo vi /etc/hadoop/conf/hdfs-site.xml
# add the following config between the <configuration>
# and </configuration> tags:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

19. Adapt the instructions in Step 17 to similarly update the yarn-site.xml file, which
contains information specific to YARN. Importantly, this configuration file contains the
address of the ResourceManager for the cluster; in this case it happens to be the
localhost, as we are using pseudo-distributed mode:
$ sudo vi /etc/hadoop/conf/yarn-site.xml
# add the following config between the <configuration>
# and </configuration> tags:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoopnode0</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

20. Adapt the instructions in Step 17 to similarly update the mapred-site.xml file, which
contains information specific to running MapReduce applications using YARN:
$ sudo vi /etc/hadoop/conf/mapred-site.xml
# add the following config between the <configuration>
# and </configuration> tags:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

21. Format HDFS on the NameNode:


$ sudo -u hdfs bin/hdfs namenode -format
Enter [Y] to re-format if prompted.

22. Start the NameNode and DataNode (HDFS) daemons:


$ sudo -u hdfs sbin/hadoop-daemon.sh start namenode
$ sudo -u hdfs sbin/hadoop-daemon.sh start datanode

23. Start the ResourceManager and NodeManager (YARN) daemons:


$ sudo -u yarn sbin/yarn-daemon.sh start resourcemanager
$ sudo -u yarn sbin/yarn-daemon.sh start nodemanager

24. Use the jps command included with the Java JDK to see the Java processes that are
running:
$ sudo jps

You should see output similar to the following:


2374 DataNode
2835 Jps
2280 NameNode
2485 ResourceManager
2737 NodeManager
25. Create user directories and a tmp directory in HDFS and set the appropriate permissions
and ownership:
$ sudo -u hdfs bin/hadoop fs -mkdir -p /user/<your_user>
$ sudo -u hdfs bin/hadoop fs -chown <your_user>:<your_user> /user/<your_user>
$ sudo -u hdfs bin/hadoop fs -mkdir /tmp
$ sudo -u hdfs bin/hadoop fs -chmod 777 /tmp
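
As an optional check, you can list the root of HDFS to confirm the directories and their
ownership (output will vary):
$ sudo -u hdfs bin/hadoop fs -ls /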

26. Now run the same Pi Estimator example you ran in Step 16. This will now run in pseudo-
distributed mode:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 16
1000

The output you will see in the console will be similar to that in Step 16. Open a browser
and go to localhost:8088. You will see the YARN ResourceManager Web UI (which I
discuss in Hour 6, Understanding Data Processing in Hadoop) (Figure 3.1):

Figure 3.1 YARN ResourceManager Web UI.

Congratulations! You have just set up your first pseudo-distributed mode Hadoop cluster.

3.3.3 Using a Commercial Hadoop Distribution


The commercial Hadoop landscape is well established. With the advent of the ODPi
(the Open Data Platform initiative), a once-numerous array of vendors and derivative
distributions has been consolidated to a much simpler landscape which includes three primary
pure-play Hadoop vendors:

Cloudera
Hortonworks
MapR

Importantly, enterprise support agreements and subscriptions can be purchased from the various
Hadoop vendors for their distributions. Each vendor also supplies a suite of management utilities
to help you deploy and manage Hadoop clusters. Let's look at each of the three major pure-play
Hadoop vendors and their respective distributions.

3.3.3.1 Cloudera
Cloudera was the first mover in the commercial Hadoop space, establishing their first
commercial release in 2008. Cloudera provides a Hadoop distribution called CDH (Cloudera
Distribution of Hadoop), which includes the Hadoop core and many ecosystem projects. CDH is
entirely open source.
Cloudera also provides a management utility called Cloudera Manager (which is not
open source). Cloudera Manager provides a management console and framework enabling the
deployment, management, and monitoring of Hadoop clusters, and which makes many
administrative tasks such as setting up high availability or security much easier. The mix of open
source and proprietary software is often referred to as open core. A screenshot showing Cloudera
Manager is pictured in Figure 3.2.

Figure 3.2 Cloudera Manager

As mentioned, Cloudera Manager can be used to deploy Hadoop clusters, including
master nodes, slave nodes, and ecosystem technologies. Cloudera Manager distributes
installation packages for Hadoop components through a mechanism called parcels. As Hadoop
installations are typically isolated from public networks, Cloudera Manager, which is technically
not part of the cluster and will often have access to the Internet, will download parcels and
distribute these to new target hosts nominated to perform roles in a Hadoop cluster or to existing
hosts to upgrade components.
Deploying a fully distributed CDH cluster using Cloudera Manager would involve the
following steps at a high level:
1. Install Cloudera Manager on a host that has access to other hosts targeted for roles
in the cluster.
2. Specify target hosts for the cluster using Cloudera Manager.
3. Use Cloudera Manager to provision Hadoop services, including master and slave
nodes.

Cloudera also provides a Quickstart virtual machine, which is a pre-configured pseudo-
distributed Hadoop cluster with the entire CDH stack, including core and ecosystem components,
as well as a Cloudera Manager installation. This virtual machine is available for VirtualBox,
VMware, and KVM, and works with the free editions of each of the virtualization platforms. The
Cloudera Quickstart VM is pictured in Figure 3.3.

Figure 3.3 Cloudera Quickstart VM

The Quickstart VM is a great way to get started with the Cloudera Hadoop offering. To find out
more, go to http://www.cloudera.com/downloads.html.

3.3.3.2 Hortonworks
Hortonworks provides a pure open source Hadoop distribution and is a founding member
of the Open Data Platform initiative (ODPi) discussed in Hour 1. Hortonworks delivers a
distribution called HDP (Hortonworks Data Platform), which is a complete Hadoop stack
including the Hadoop core and selected ecosystem components. Hortonworks uses the Apache
Ambari project to provide its deployment configuration management and cluster monitoring
facilities. A screenshot of Ambari is shown in Figure 3.4.

Figure 3.4 Ambari console

The simplest method to deploy a Hortonworks Hadoop cluster would involve the following
steps:
1. Install Ambari using the Hortonworks installer on a selected host.
2. Add hosts to the cluster using Ambari.
3. Deploy Hadoop services (such as HDFS and YARN) using Ambari.

Hortonworks provides a fully functional, pseudo-distributed HDP cluster with the complete
Hortonworks application stack in a virtual machine called the Hortonworks Sandbox. The
Hortonworks Sandbox is available for VirtualBox, VMware, and KVM. The Sandbox virtual
machine includes several demo applications and learning tools to use to explore Hadoop and its
various projects and components. The Hortonworks Sandbox welcome screen is shown in Figure
3.5.
Figure 3.5 Hortonworks Sandbox

3.3.3.3 MapR
MapR delivers a Hadoop-derived software platform that implements an API-compatible
distributed filesystem called MapRFS (the MapR distributed Filesystem). MapRFS has been
designed to maximize performance and provide read-write capabilities not offered by native
HDFS. MapR delivers three versions of their offering called the Converged Data Platform.
These include:
M3 or Converged Community Edition (free version)
M5 or Converged Enterprise Edition (supported version)
M7 (M5 version that includes MapR's custom HBase-derived data store)

Like the other distributions, MapR has a demo offering called the MapR Sandbox, which is
available for VirtualBox or VMware. It is pictured in Figure 3.6.
Figure 3.6 MapR Sandbox VM.

MapR's management offering is called the MapR Control System (MCS), which offers a central
console to configure, monitor and manage MapR clusters. It is shown in Figure 3.7.
Figure 3.7 MapR Control System (MCS).

3.3.4 Deploying Hadoop in the Cloud


The rise and proliferation of cloud computing and virtualization technologies has
definitely been a game changer for the way organizations think about and deploy technology,
and Hadoop is no exception. The availability and maturity around IaaS (Infrastructure-as-a-
Service), PaaS (Platform-as-a-Service) and SaaS (Software-as-a-Service) solutions makes
deploying Hadoop in the cloud not only viable but, in some cases, desirable.
There are many public cloud variants that can be used to deploy Hadoop including
Google, IBM, Rackspace, and others. Perhaps the most pervasive cloud platform to date has been
AWS (Amazon Web Services), which I will use as the basis for our discussions.
Before you learn about deployment options for Hadoop in AWS, let's go through a
quick primer and background on some of the key AWS components. If you are familiar with
AWS, feel free to jump straight to the Try it Yourself exercise on deploying Hadoop using AWS
EMR.
3.3.4.1 EC2
Elastic Compute Cloud (EC2) is Amazon's web service-enabled virtual computing
platform. EC2 enables users to create virtual servers and networks in the cloud. The virtual
servers are called instances. EC2 instances can be created with a variety of different instance
permutations. The Instance Type property determines the number of virtual CPUs and the
amount of memory and storage an EC2 instance has available to it. An example instance type is
m4.large. A complete list of the different EC2 instance types available can be found at
https://aws.amazon.com/ec2/instance-types/ .
EC2 instances can be optimized for compute, memory, storage and mixed purposes and
can even include GPUs (Graphics Processing Units), a popular option for machine learning and
deep analytics.
There are numerous options for operating systems with EC2 instances. All of the
popular Linux distributions are supported, including Red Hat, Ubuntu, and SLES, as well as various
Microsoft Windows options.
EC2 instances are created in security groups. Security groups govern network
permissions and Access Control Lists (ACLs). Instances can also be created in a Virtual Private
Cloud (VPC). A VPC is a private network, not exposed directly to the Internet. This is a popular
option for organizations looking to minimize exposure of EC2 instances to the public Internet.
EC2 instances can be provisioned with various storage options, including instance
storage or ephemeral storage, which are terms for volatile storage that is lost when an instance is
stopped, and Elastic Block Store (EBS), which is persistent, fault-tolerant storage. There are
different options with each, such as SSD (solid state) for instance storage, or provisioned IOPS
with EBS.
Additionally, AWS offers Spot instances, which enable you to bid on spare Amazon
EC2 computing capacity, often available at a discount compared to normal on-demand EC2
instance pricing.
EC2, as well as all other AWS services, is located in an AWS region. There are
currently nine regions, which include the following:

US East (N. Virginia)


US West (Oregon)
US West (N. California)
EU (Ireland)
EU (Frankfurt)
Asia Pacific (Singapore)
Asia Pacific (Sydney)
Asia Pacific (Tokyo)
South America (Sao Paulo)

3.3.4.2 S3
Simple Storage Service (S3) is Amazon's cloud-based object store. An object store
manages data (such as files) as objects. These objects exist in buckets. Buckets are logical, user-
created containers with properties and permissions. S3 provides APIs for users to create and
manage buckets as well as to create, read, and delete objects from buckets.
The S3 bucket namespace is global, meaning that any buckets created must have a
globally unique name. The AWS console or APIs will let you know if you are trying to create a
bucket with a name that already exists.
S3 objects, like files in HDFS, are immutable, meaning they are write once, read many.
When an S3 object is created and uploaded, an ETag is created, which is effectively a signature
for the object. This can be used to ensure integrity when the object is accessed (downloaded) in
the future.
There are also public buckets in S3 containing public data sets. These are datasets
provided for informational or educational purposes, but they can be used for data operations such
as processing with Hadoop. Public datasets, many of which are in the tens or hundreds of
terabytes, are available, and topics range from historical weather data to census data, and from
astronomical data to genetic data.

3.3.4.3 Elastic MapReduce (EMR)


Elastic MapReduce (EMR) is Amazon's Hadoop-as-a-Service platform. EMR clusters
can be provisioned using the AWS Management Console or via the AWS APIs. Options for
creating EMR clusters include number of nodes, node instance types, Hadoop distribution, and
additional ecosystem applications to install.
EMR clusters can read data and output results directly to and from S3. They are
intended to be provisioned on demand, run a discrete workflow, a job flow, and terminate. They
do have local storage, but they are not intended to run in perpetuity. You should only use this
local storage for transient data.
EMR is a quick and scalable deployment method for Hadoop. More information about
EMR can be found at https://aws.amazon.com/elasticmapreduce/.
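
Clusters can also be created programmatically with the AWS CLI. The following is only a rough
sketch with illustrative values; instance types, release labels, and required roles change over
time, so consult the current EMR documentation before using anything like it:
# create a small three-node EMR cluster running Hadoop (values are examples only)
aws emr create-cluster \
  --name "test-hadoop-cluster" \
  --release-label emr-5.0.0 \
  --applications Name=Hadoop \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles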

3.3.4.4 AWS Pricing and Getting Started


AWS products, including EC2, S3, and EMR, are charged based upon usage. Each EC2
instance type within each region has an instance per hour cost associated with it. The usage costs
per hour are usually relatively low and the medium- to long-term costs are quite reasonable, but
the more resources you use for a longer period of time, the more you are charged.
If you have not already signed up with AWS, you're in luck! AWS has a free tier
available for new accounts that enables you to use certain instance types and services for free for
the first year. You can find out more at https://aws.amazon.com/free/. This page walks you
through setting up an account with no ongoing obligations.
Once you are up and running with AWS, you can create an EMR cluster by navigating
to the EMR link in the AWS console as shown in Figure 3.8.
Figure 3.8 AWS console: EMR option.

Then click Create Cluster on the EMR welcome page as shown in Figure 3.9, and simply follow
the dialog prompts.

Figure 3.9 AWS EMR welcome screen.


You can use an EMR cluster for many of our exercises. However, be aware that leaving the
cluster up and running will incur usage charges.

3.4 Single-Node Installation


3.4.1 Prerequisites
1. Java 6 JDK
Hadoop requires a working Java 1.5+ (aka Java 5) installation.

Update the source list


user@ubuntu:~$ sudo apt-get update


Install Sun Java 6 JDK

Note: If you already have Java JDK installed on your system, then you need not run the above
command.

To install it
user@ubuntu:~$ sudo apt-get install sun-java6-jdk

The full JDK will be placed in /usr/lib/jvm/java-6-openjdk-amd64.

After installation, check whether the Java JDK is correctly installed with the following command:
user@ubuntu:~$ java -version

2. Adding a dedicated Hadoop system user


We will use a dedicated Hadoop user account for running Hadoop.
user@ubuntu:~$ sudo addgroup hadoop_group
user@ubuntu:~$ sudo adduser --ingroup hadoop_group hduser1

This will add the user hduser1 and the group hadoop_group to the local machine. Add hduser1 to
the sudo group
user@ubuntu:~$ sudo adduser hduser1 sudo

3. Configuring SSH
The Hadoop control scripts rely on SSH to perform cluster-wide operations. For example, there is
a script for stopping and starting all the daemons in the cluster. To work seamlessly, SSH needs
to be set up to allow password-less login for the Hadoop user from machines in the cluster. The
simplest way to achieve this is to generate a public/private key pair and share it across
the cluster.
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine.
For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for
the hduser1 user we created earlier.

We have to generate an SSH key for the hduser1 user.


user@ubuntu:~$ su hduser1
hduser1@ubuntu:~$ ssh-keygen -t rsa -P ""

The second line will create an RSA key pair with an empty password.

Note: -P "" here indicates an empty password.

You have to enable SSH access to your local machine with this newly created key which is done
by the following command.
hduser1@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to the local machine with the hduser1 user.
This step is also needed to save your local machine's host key fingerprint to the hduser1 user's
known_hosts file.
hduser@ubuntu:~$ ssh localhost

If the SSH connection fails, we can try the following (optional):


Enable debugging with ssh -vvv localhost and investigate the error in detail.
Check the SSH server configuration in /etc/ssh/sshd_config. If you made any changes to
the SSH server configuration file, you can force a configuration reload with sudo
/etc/init.d/ssh reload.

3.4.2 Installation
3.4.2.1 Main Installation
Now, I will start by switching to hduser1
hduser@ubuntu:~$ su - hduser1

Now, download and extract Hadoop 1.2.0
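The download and extraction commands are not shown above; a typical sequence (the mirror URL is a
placeholder, as in the earlier Apache example) is:
hduser1@ubuntu:~$ wget http://REPLACE_WITH_YOUR_MIRROR/hadoop-1.2.0.tar.gz
hduser1@ubuntu:~$ tar -xzf hadoop-1.2.0.tar.gz
hduser1@ubuntu:~$ sudo mv hadoop-1.2.0 /usr/local/hadoop
hduser1@ubuntu:~$ sudo chown -R hduser1:hadoop_group /usr/local/hadoop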


Setup Environment Variables for Hadoop

Add the following entries to .bashrc file


# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
3.4.2.2 Configuration

hadoop-env.sh
Change the file: conf/hadoop-env.sh
#export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to the following (in the same file):

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64 (for 64 bit)
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386 (for 32 bit)

conf/*-site.xml
Now we create the directory and set the required ownerships and permissions
hduser@ubuntu:~$ sudo mkdir -p /app/hadoop/tmp
hduser@ubuntu:~$ sudo chown hduser1:hadoop_group /app/hadoop/tmp
hduser@ubuntu:~$ sudo chmod 750 /app/hadoop/tmp

The last line gives the owner read, write, and execute permissions and the group read and
execute permissions on the /app/hadoop/tmp directory.
Error: If you forget to set the required ownerships and permissions, you will see a
java.io.IOException when you try to format the name node.

Paste the following between the <configuration> and </configuration> tags


In file conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>

In file conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>

In file conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>

3.4.2.3 Formatting the HDFS filesystem via the NameNode


To format the filesystem (which simply initializes the directory specified by the dfs.name.dir
variable), run the command:
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

3.4.2.4 Starting your single-node cluster


Before starting the cluster, we need to give the required permissions to the directory with the
following command
hduser@ubuntu:~$ sudo chmod -R 777 /usr/local/hadoop

Run the command


hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

This will startup a Namenode, Datanode, Jobtracker and a Tasktracker on the machine.
hduser@ubuntu:/usr/local/hadoop$ jps
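
If everything started correctly, the jps output should include entries similar to the following
(process IDs will differ, and a SecondaryNameNode may also be listed):
NameNode
DataNode
JobTracker
TaskTracker
Jps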

3.4.2.5 Errors
1. If by chance your datanode is not starting, then you have to erase the contents of the
folder /app/hadoop/tmp
The command that can be used
hduser@ubuntu:~$ sudo rm -Rf /app/hadoop/tmp/*

2. You can also check with netstat if Hadoop is listening on the configured ports.
The command that can be used
hduser@ubuntu:~$ sudo netstat -plten | grep java

3. If there are any errors, examine the log files in the logs/ directory of the Hadoop installation.

3.4.2.6 Stopping your single-node cluster


Run the command to stop all the daemons running on your machine.
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh

3.4.3 Error Points


If datanode is not starting, then clear the tmp folder before formatting the namenode using the
following command
hduser@ubuntu:~$ rm -Rf /app/hadoop/tmp/*

Note:
The masters and slaves files should contain localhost.
In /etc/hosts, the IP of the system should be mapped to the alias localhost.
Set the Java home path in hadoop-env.sh as well as in .bashrc (see the example below).
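
For example (the path below is illustrative; use the JDK location on your own system):
# in ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
# in conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64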

3.5 Multi-Node Installation


3.5.1 From single-node clusters to a multi-node cluster
We will build a multi-node cluster by merging two or more single-node clusters into one multi-node
cluster, in which one Ubuntu box will become the designated master (but will also act as a slave)
and the other box will become a slave only.

3.5.2 Prerequisites
Configure the single-node clusters first; here we have used two single-node clusters. Shut down
each single-node cluster with the following command
user@ubuntu:~$ bin/stop-all.sh

3.5.3 Networking
The easiest approach is to put both machines in the same network with regard to hardware and
software configuration.
Update /etc/hosts on both machines. Put the aliases for the IP addresses of all the machines.
Here we are creating a cluster of 2 machines; one is the master and the other is slave1.
hduser@master:~$ sudo gedit /etc/hosts

Add the following lines for two node cluster


10.105.15.78 master (IP address of the master node)
10.105.15.43 slave1 (IP address of the slave node)
3.5.4 SSH access
The hduser user on the master (aka hduser@master) must be able to connect:
1. to its own user account on the master - i.e. ssh master in this context.
2. to the hduser user account on the slave (i.e. hduser@slave1) via a password-less SSH
login.

Add the hduser@master public SSH key using the following command
hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave1

Connect with user hduser from the master to the user account hduser on the slave.

1. From master to master


hduser@master:~$ ssh master

2. From master to slave


hduser@master:~$ ssh slave1

3.5.5 Hadoop Cluster Overview


This section describes how to configure one Ubuntu box as the master node and the other Ubuntu box
as a slave node.

3.5.5.1 Configuration
conf/masters
The machine on which bin/start-dfs.sh is running will become the primary NameNode. This file
should be updated on all the nodes. Open the masters file in the conf directory
hduser@master/slave:~$ cd /usr/local/hadoop/conf
hduser@master/slave:~$ sudo gedit masters

Add the following line


master

conf/slaves
This file should be updated on all the nodes as master is also a slave. Open the slaves file in the
conf directory
hduser@master/slave:~/usr/local/hadoop/conf$ sudo gedit slaves

Add the following lines


master
slave1

conf/*-site.xml (all machines)


Open this file in the conf directory
hduser@master:~/usr/local/hadoop/conf$ sudo gedit core-site.xml

Change the fs.default.name parameter (in conf/core-site.xml), which specifies the NameNode
(the HDFS master) host and port.

conf/core-site.xml (ALL machines .ie. Master as well as slave)


<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>

conf/mapred-site.xml
Open this file in the conf directory
hduser@master:~$ cd /usr/local/hadoop/conf
hduser@master:~$ sudo gedit mapred-site.xml

Change the mapred.job.tracker parameter (in conf/mapred-site.xml), which specifies the


JobTracker (MapReduce master) host and port.

conf/mapred-site.xml (ALL machines)


<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>

conf/hdfs-site.xml
Open this file in the conf directory
hduser@master:~$ cd /usr/local/hadoop/conf
hduser@master:~$ sudo gedit hdfs-site.xml

Change the dfs.replication parameter (in conf/hdfs-site.xml) which specifies the default block
replication. We have two nodes available, so we set dfs.replication to 2.

conf/hdfs-site.xml (ALL machines)


Changes to be made
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>

3.5.5.2 Formatting the HDFS filesystem via the NameNode


Format the cluster's HDFS file system
hduser@master:~/usr/local/hadoop$ bin/hadoop namenode -format

3.5.5.3 Starting the multi-node cluster


Starting the cluster is performed in two steps.
1. We begin with starting the HDFS daemons: the NameNode daemon is started on master,
and DataNode daemons are started on all slaves (here: master and slave).
2. Then we start the MapReduce daemons: the JobTracker is started on master, and
TaskTracker daemons are started on all slaves (here: master and slave).

The cluster is started by running the following commands on the master


hduser@master:~$ cd /usr/local/hadoop
hduser@master:~$ bin/start-all.sh

By this command:
The NameNode daemon is started on master, and DataNode daemons are started on all
slaves (here: master and slave).
The JobTracker is started on master, and TaskTracker daemons are started on all slaves
(here: master and slave)

To check the daemons running, run the following commands


hduser@master:~$ jps

On the slave, the DataNode and TaskTracker daemons should be running.


hduser@slave:~/usr/local/hadoop$ jps
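
On a healthy two-node cluster the output should include entries similar to the following (process
IDs omitted; a SecondaryNameNode may also appear on the master):
# on master
NameNode
DataNode
JobTracker
TaskTracker
Jps
# on slave1
DataNode
TaskTracker
Jps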

3.5.5.4 Stopping the multi-node cluster


To stop the multi-node cluster, run the following commands on the master:
hduser@master:~$ cd /usr/local/hadoop
hduser@master:~/usr/local/hadoop$ bin/stop-all.sh

3.5.6 Error Points


1. The number of slaves should equal the replication factor set in hdfs-site.xml, where the
number of slaves = all slaves + the master (if the master is also considered to be a slave).

2. When you start the cluster, clear the tmp directory on all the nodes (master+slaves) using
the following command
hduser@master:~$ rm -Rf /app/hadoop/tmp/*

3. The configuration of the /etc/hosts, masters and slaves files should be the same on both the
master and the slave nodes.

4. If the namenode is not getting started, run the following commands:


To give all permissions on the hadoop folder to hduser:
hduser@master:~$ sudo chmod -R 777 /app/hadoop

This command deletes the junk files which get stored in the tmp folder of Hadoop:
hduser@master:~$ sudo rm -Rf /app/hadoop/tmp/*

3.6 Single-Node Cluster vs. Multi-Node Cluster


Single Node or Pseudo-Distributed Cluster is one in which all the essential daemons (like
NameNode, DataNode, JobTracker and TaskTracker) run on the same machine. The default
replication factor for a single-node cluster is 1. A single-node cluster is basically used to simulate
a full-cluster-like environment and to test Hadoop applications; unlike standalone mode, HDFS is
accessible in this mode. (By default, Hadoop is configured to run in a non-distributed or
standalone mode, as a single Java process: there are no daemons running, everything runs in a
single JVM instance, and HDFS is not used.) In pseudo-distributed mode the different Hadoop
daemons run in different JVM instances, but on a single machine, thus simulating a cluster on a
small scale, and HDFS is used instead of the local filesystem.
Multi Node or Fully Distributed Cluster is a typical Hadoop cluster which follows a master-
slave architecture. It basically comprises one master machine (running the NameNode and
JobTracker daemons) and one or more slave machines (running the DataNode and TaskTracker
daemons). The default replication factor for a multi-node cluster is 3. It is basically used for full-
stack development of Hadoop applications and projects, with the daemons distributed across the
machines of the cluster.

3.7 Q&A
Q. Why do master nodes normally require a higher degree of fault tolerance than slave
nodes?

A. Slave nodes are designed to be implemented on commodity hardware with the expectation
of failure, and this enables slave nodes to scale economically. The fault tolerance and
resiliency built into HDFS and YARN enables the system to recover seamlessly from a failed
slave node. Master nodes are different; they are intended to be always on. Although there are
high availability implementation options for master nodes, failover is not desirable. Therefore,
more local fault tolerance, such as RAID disks, dual power supplies, etc., is preferred for
master nodes.
Q. What does JBOD stand for, and what is its relevance for Hadoop?

A. JBOD is an acronym for Just a Bunch of Disks, which means spinning disks that operate
independently of one another, in contrast to RAID, where disks operate as an array. JBOD is
recommended for slave nodes, which are responsible for HDFS block storage. This is because
the average speed of all disks on a slave node is greater than the speed of the slowest disk. By
comparison, RAID read and write speeds are limited by the speed of the slowest disk in the
array.

Q. What are the advantages to deploying Hadoop using a commercial distribution?

A. Commercial distributions contain a stack of core and ecosystem components that are
tested with one another and certified for the respective distribution. The commercial vendors
typically include a management application, which is very useful for managing multi-node
Hadoop clusters at scale. The commercial vendors also offer enterprise support as an option as
well.

Quiz
1. True or False: A Java Runtime Environment (JRE) is required on hosts that run Hadoop
services.

2. Which AWS PaaS product is used to deploy Hadoop as a service?

A. EC2

B. EMR

C. S3

D. DynamoDB

3. Slave nodes are typically deployed on what class of hardware?

4. The open source Hadoop cluster management utility used by Hortonworks is called ______.

Answers
1. True. Hadoop services and processes are written in Java, are compiled to Java bytecode, and
run in Java Virtual Machines (JVMs), and therefore require a JRE to operate.
2. B. Elastic MapReduce (EMR).

3. Commodity.

4. Ambari.
