
Big Data All-Stars

Real-World Stories and Wisdom from the Best in Big Data

Presented by Datanami | Sponsored by MapR

Introduction
Those of us looking to take a significant step towards creating a data-driven business sometimes need a little inspiration from those who have traveled the path we are looking to tread. This book presents a series of real-world stories from those on the big data frontier who have moved beyond experimentation to creating sustainable, successful big data solutions within their organizations. Read these stories to get an inside look at nine big data all-stars who have been recognized by MapR and Datanami as having achieved great success in the expanding field of big data.

Use the examples in this guide to help you develop your own methods, approaches, and best practices for creating big data solutions within your organization. Whether you are a business analyst, data scientist, enterprise architect, IT administrator, or developer, you'll gain key insights from these big data luminaries, insights that will help you tackle the big data challenges you face in your own company.

Table of Contents
How comScore Uses Hadoop and MapR to Build its Business
Michael Brown, CTO at comScore
comScore uses MapR to manage and scale their Hadoop cluster of 450 servers, create more files, process more data faster, and produce better streaming and random I/O results. MapR allows comScore to easily access data in the cluster and just as easily store it in a variety of warehouse environments.

Making Good Things Happen at Wells Fargo
Paul Cao, Director of Data Services for Wells Fargo's Capital Markets business
Wells Fargo uses MapR to serve the company's data needs across the entire banking business, which involve a variety of data types including reference data, market data, and structured and unstructured data, all under the same umbrella. Using NoSQL and Hadoop, their solution requires the utmost in security, ease of ingest, ability to scale, high performance, and, particularly important for Wells Fargo, multi-tenancy.

Coping with Big Data at Experian: Don't Wait, Don't Stop
Tom Thomas, Director of IT at Experian
Experian uses MapR to store in-bound source data. The files are then available for analysts to query with SQL via Hive, without the need to build and load a structured database. Experian is now able to achieve significantly more processing power and storage space, and clients have access to deeper data.

Trevor Mason and Big Data: Doing What Comes Naturally
Trevor Mason, Vice President Technology Research at IRI
IRI used MapR to maximize file system performance, facilitate the use of a large number of smaller files, and send files via FTP from the mainframe directly to the cluster. With Hadoop, they have been able to speed up data processing while reducing mainframe load, saving more than $1.5 million.

Leveraging Big Data to Economically Fuel Growth
Kevin McClowry, Director of Analytics Application Development at TransUnion
TransUnion uses a hybrid architecture made of commercial databases and Hadoop so that their analysts can work with data in a way that was previously out of reach. The company is introducing the analytics architecture worldwide and sizing it to fit the needs and resources of each country's operation.

Making Big Data Work for a Major Oil & Gas Equipment Manufacturer
Warren Sharp, Big Data Engineer at National Oilwell Varco (NOV)
NOV created a data platform for time-series data from sensors and control systems to support deep analytics and machine learning. The organization is now able to build, test, and deliver complicated condition-based maintenance models and applications.

The NIH Pushes the Boundaries of Health Research with Data Analytics
Chuck Lynch, Chief Knowledge Officer at National Institutes of Health
The National Institutes of Health created a five-server cluster that enables the office to effectively apply analytical tools to newly-shared data. NIH can now do things with health science data it couldn't do before, and in the process, advance medicine.

Keeping an Eye on the Analytic End Game at UnitedHealthcare
Alex Barclay, Vice President of Advanced Analytics at UnitedHealthcare
UnitedHealthcare uses Hadoop as a basic data framework and built a single platform equipped with the tools needed to analyze information generated by claims, prescriptions, plan participants, care providers, and claim review outcomes. They can now identify mispaid claims in a systematic, consistent way.

Creating Flexible Big Data Solutions for Drug Discovery
David Tester, Application Architect at Novartis Institutes for Biomedical Research
Novartis Institutes for Biomedical Research built a workflow system that uses Hadoop for performance and robustness. Bioinformaticians use their familiar tools and metadata to write complex workflows, and researchers can take advantage of the tens of thousands of experiments that public organizations have conducted.

How comScore Uses Hadoop and MapR to Build its Business

Michael Brown, CTO at comScore

When comScore was founded in 1999, Mike Brown, the company's first engineer, was immediately immersed in the world of Big Data.

The company was created to provide digital marketing intelligence and digital media analytics in the form of custom solutions in online audience measurement, e-commerce, advertising, search, video and mobile. Brown's job was to create the architecture and design to support the founders' ambitious plans.

It worked. Over the past 15 years comScore has built a highly successful business and a customer base composed of some of the world's top companies: Microsoft, Google, Yahoo!, Facebook, Twitter, craigslist, and the BBC, to name just a few. Overall the company has more than 2,100 clients worldwide. Measurements are derived from 172 countries with 43 markets reported.

To service this extensive client base, well over 1.8 trillion interactions are captured monthly, equal to about 40% of the monthly page views of the entire Internet. This is Big Data on steroids.
Brown, who was named CTO in 2012, continues to grow and evolve the company's IT infrastructure to keep pace with this constantly increasing data deluge. "We were a Dell shop from the beginning. In 2002 we put together our own grid processing stack to tie all our systems together in order to deal with the fast-growing data volumes," Brown recalls.
Introducing Unified Digital Measurement
In addition to its ongoing business, in 2009 the company embarked on a new initiative called Unified Digital Measurement (UDM), which directly addresses the frequent disparity between census-based site analytics data and panel-based audience measurement data. UDM blends the two into a best-of-breed approach that combines person-level measurement from the two-million-person comScore global panel with census-informed consumption to account for 100 percent of a client's audience.

UDM helped prompt a new round of IT infrastructure upgrades. "The volume of data was growing rapidly and processing requirements were growing dramatically as well," Brown says. "In addition, our clients were asking us to turn the data around much faster. So we looked into building our own stack again, but decided we'd be better off adopting a well-accepted, open source, heavy-duty processing model: Hadoop."

With the implementation of Hadoop, comScore continued to expand its server cluster. Multiple servers also meant they had to solve the Hadoop shuffle problem. During the high-volume, parallel processing of data sets coming in from around the world, data is scattered across the server farm. To count the number of events, all this data has to be gathered, or shuffled, into one location.

comScore needed a Hadoop platform that could not only scale, but also provide data protection and high availability, as well as being easy to use.

It was requirements like these that led Brown to adopt the MapR distribution for Hadoop.

He was not disappointed: by using the MapR distro, the company is able to more easily manage and scale their Hadoop cluster, create more files and process more data faster, and produce better streaming and random I/O results than other Hadoop distributions. "With MapR we see a 3X performance increase running the same data and the same code; the jobs just run faster."

In addition, the MapR solution provides the requisite data protection and disaster recovery functions: "MapR has built in to the design an automated DR strategy," Brown notes.

Solving the Shuffle

He said they leveraged a feature in MapR known as volumes to directly address the shuffle problem. "It allows us to make this process run superfast. We reduced the processing time from 36 hours to three hours; no new hardware, no new software, no new anything, just a design change. This is just what we needed to colocate the data for efficient processing."

Using volumes to optimize processing was one of several unique solutions that Brown and his team applied to processing comScore's massive amounts of data. Another innovation is pre-sorting the data before it is loaded into the Hadoop cluster. Sorting optimizes the data's storage compression ratio, from the usual ratio of 3:1 to a highly compressed 8:1 with no data loss. And this leads to a cascade of benefits: more efficient processing with far fewer IOPS, less data to read from disk, and less equipment, which in turn means savings on power, cooling and floor space.
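
To see why pre-sorting pays off, here is a small, self-contained sketch (not comScore's actual pipeline; the synthetic log format is invented for illustration). Grouping similar records together before compression lets a general-purpose codec such as gzip find longer repeated runs, so the same data compresses further:

    import gzip
    import random

    # Build a synthetic log with many repeated site/event values in random order.
    random.seed(42)
    sites = [f"site{i:04d}" for i in range(500)]
    events = ["view", "click", "play", "share"]
    rows = [f"{random.choice(sites)},{random.choice(events)},{random.randint(0, 9)}"
            for _ in range(200_000)]

    def gzip_size(lines):
        """Return the gzip-compressed size, in bytes, of the given lines."""
        return len(gzip.compress("\n".join(lines).encode()))

    unsorted_size = gzip_size(rows)
    sorted_size = gzip_size(sorted(rows))  # pre-sorting groups identical prefixes together

    print(f"unsorted: {unsorted_size:,} bytes  sorted: {sorted_size:,} bytes")
    print(f"improvement: {unsorted_size / sorted_size:.2f}x")

The exact gain depends on the data's layout and cardinality; the 3:1-to-8:1 figure above is comScore's own result on its datasets.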
"HDFS is great internally," says Brown. "But to get data in and out of Hadoop, you have to do some kind of HDFS export. With MapR, you can just mount HDFS as NFS and then use native tools whether they're in Windows, Unix, Linux or whatever. NFS allowed our enterprise to easily access data in the cluster and just as easily store it in a variety of warehouse environments."
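
As a rough illustration of what that NFS access enables (the mount point and file name below are hypothetical, not comScore's), ordinary file APIs can read and write cluster data directly, with no HDFS-specific export step:

    import csv
    from pathlib import Path

    # Hypothetical NFS mount point exposed by the cluster's NFS gateway.
    CLUSTER_MOUNT = Path("/mapr/demo.cluster/projects/reporting")

    def export_daily_counts(counts, day):
        """Write aggregated counts straight to the cluster using plain file I/O."""
        out_path = CLUSTER_MOUNT / f"daily_counts_{day}.csv"
        with out_path.open("w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["site", "events"])
            writer.writerows(sorted(counts.items()))
        return out_path

    # Any NFS client (Windows, Unix, Linux) can read the same file back with its
    # native tools, or feed it into a downstream warehouse loader unchanged.

Because the cluster looks like an ordinary filesystem, existing ETL scripts and warehouse loaders keep working without Hadoop-specific plumbing.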
For the near future, Brown says the comScore IT infrastructure will continue to scale to meet new customer demand. The Hadoop cluster has grown to 450 servers with 17,000 cores and more than 10 petabytes of disk.

MapR's distro of Hadoop is also helping to support a major new product announced in 2012 and enjoying rapid growth. Known as validated Campaign Essentials (vCE), the new measurement solution provides a holistic view of campaign delivery and a verified assessment of ad-exposed audiences via a single, third-party source. vCE also allows the identification of non-human traffic and fraudulent delivery.

Start Small
When asked if he had any advice for his peers in IT who are also wrestling with Big Data projects, Brown commented, "We all know we have to process mountains of data, but when you begin developing your environment, start small. Cut out a subset of the data and work on that first while testing your code and making sure everything functions properly. Get some small wins. Then you can move on to the big stuff."

Making Good Things Happen at Wells Fargo

Paul Cao, Director of Data Services for Wells Fargo's Capital Markets business

When Paul Cao joined Wells Fargo several years ago, his timing was perfect. Big Data analytic technology had just made a major leap forward, providing him with the tools he needed to implement an ambitious program designed to meet the company's analytic needs.

Wells Fargo is big: a nationwide, community-based financial services company with $1.8 trillion in assets. It provides its various services through 8,700 locations as well as on the Internet and through mobile apps. The company has some 265,000 employees and offices in 36 countries. They generate a lot of data.

Cao has been working with data for twenty years. Now, as the Director of Data Services for Wells Fargo's Capital Markets business, he is creating systems that support the Business Intelligence and analytic needs of its far-flung operations.

Meeting Customer and Regulatory Needs

"We receive massive amounts of data from a variety of different systems, covering all types of securities (equity, fixed income, FX, etc.) from around the world," Cao says. "Many of our models reflect the interactions between these systems; it's multi-layered. The analytic solutions we offer are not only driven by customers' needs, but by regulatory considerations as well.

"We serve the company's data needs across the entire banking business and so we work with a variety of data types including reference data, market data, and structured and unstructured data, all under the same umbrella," he continues. "Because of the broad scope of the data we are dealing with, we needed tools that could handle the volume, speed and variety of data as well as all the requirements that had to be met in order to process that data. Just one example is market tick data. For North American cash equities, we are dealing with up to three million ticks per second, a huge amount of data that includes all the different price points for the various equity stocks and the movement of those stocks."

Enterprise NoSQL on Hadoop


Cao says that given his experience with various Big Data solutions in the past and the recent revolution in the technology, he and his team were well aware of the limitations of more traditional relational databases. So they concentrated their attention on solutions that support NoSQL and Hadoop. They wanted to deal with vendors like MapR that could provide commercial support for the Hadoop distribution rather than relying on open source channels. The vendors had to meet criteria such as their ability to provide the utmost in security, ease of ingest, ability to scale, high performance, and, particularly important for Wells Fargo, multi-tenancy.

Cao explains that he is partnering with the Wells Fargo Enterprise Data & Analytics and Enterprise Technology Infrastructure teams to develop a platform servicing many different kinds of capital markets related data, including files of all sizes and real-time and batch data from a variety of sources within Wells Fargo. Multi-tenancy is a must to cost-efficiently and securely share IT resources and allow different business lines, data providers and data consumer applications to coexist on the same cluster with true job isolation and customized security. The MapR solution, for example, provides powerful features to logically partition a physical cluster to provide separate administrative control, data placement, job execution and network access.

Dramatic Change to Handling Data


"The new technology we are introducing is not an incremental change; this is a dramatic change in the way we are handling data," Cao says. "Among our challenges is to get users to accept working with the new Hadoop and NoSQL infrastructure, which is so different from what they were used to. Within Data Services, we have been fortunate to have people who not only know the new technology, but really know the business. This domain expertise is essential to an understanding of how to deploy and apply the new technologies to solve essential business problems and work successfully with our users."

When asked what advice he would pass on to others working with Big Data, Cao reiterates his emphasis on gaining a solid understanding of the new technologies along with a comprehensive knowledge of their business domain.

"This allows you to marry business and technology to solve business problems," he concludes. "You'll be able to understand your users' concerns and work with them to make good things happen."

Coping with Big Data at Experian: Don't Wait, Don't Stop

Tom Thomas, Director of IT at Experian

Experian is no stranger to Big Data. The company can trace its origins back to 1803, when a group of London merchants began swapping information on customers who had failed to meet their debts.

Fast forward 211 years. The rapid growth of the credit reference industry and the market for credit risk management services set the stage for the reliance on increasing amounts of consumer and business data that has culminated in an explosion of Big Data. Data that is Experian's lifeblood.

With global revenues of $4.8 billion ($2.4 billion in North America) and 16,000 employees worldwide (6,000 in North America), Experian is an international information services organization working with a majority of the world's largest companies. It has four primary business lines: credit services, decision analytics, direct-to-consumer products, and a marketing services group.

Tom Thomas is the director of the Data Development Technology Group within the Consumer Services Division. "Our group provides production operations support as well as technology solutions for our various business units including Automotive, Business, Collections, Consumer, Fraud, and various Data Lab joint-development initiatives," he explains. "I work closely with Norbert Frohlich and Dave Garnier, our lead developers. They are responsible for the design and development of our various solutions, including those that leverage MapR Hadoop environments."

Processing More Data in Less Time


Until recently, the Group had been "getting by," as Thomas puts it, with solutions running on a couple of Windows servers and a SAN. But as the company added new products and new sets of data quality rules, more data had to be processed in the same or less time. It was time to upgrade. But simply adding to the existing Windows/SAN system wasn't an option: too cumbersome and expensive.

So the group upgraded to a Linux-based HPC cluster with, for the time being, six nodes. Says Thomas, "We have a single customer solution right now. But as we get new customers who can use this kind of capability, we can add additional nodes and storage and processing capacity at the same time."
NFS Provides Direct Access to Data

"All our solutions leverage MapR NFS functionality," he continues. "This allows us to transition from our previous internal or SAN storage to Hadoop by mounting the cluster directly. In turn, this provides us with access to the data via HDFS and Hadoop environment tools, such as Hive."

ETL tools like DMX-h from Syncsort also figure prominently in the new infrastructure, as does MapR NFS. MapR is the only distribution for Apache Hadoop that leverages the full power of the NFS protocol for remote access to shared disks across the network.

"Our first solution includes well-known and defined metrics and aggregations," Thomas says. "We leverage DMX-h to determine metrics for each record and pre-aggregate other metrics, which are then stored in Hadoop to be used in downstream analytics as well as real-time rules-based actions. Our second solution follows a traditional data operations flow, except in this case we use DMX-h to prepare in-bound source data that is then stored in MapR Hadoop. Then we run Experian-proprietary models that read the data via Hive and create client-specific and industry-unique results."

Data Analysts Use SQL to Query on Hadoop


"Our latest endeavor copies data files from a legacy dual application server and SAN product solution to a MapR Hadoop cluster quite easily, as facilitated by the MapR NFS functionality," Thomas continues. "The files are then available for analysts to query with SQL via Hive without the need to build and load a structured database. Since we are just starting to work with this data, we are not stuck with that initial database schema that we would have developed, and thus eliminated that rework time. Our analysts have Tableau and DMX-h available to them, and will generate our initial reports and any analytics data files. Once the useful data, reports, and results formats are firmed up, we will work on optimizing production."
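
As a hedged sketch of what "SQL via Hive, without loading a database" can look like in practice (the table, columns, host and path below are invented for illustration), an external Hive table can be declared directly over the files landed on the cluster and then queried from Python:

    from pyhive import hive  # assumes a reachable HiveServer2 endpoint

    conn = hive.Connection(host="hive.example.internal", port=10000, username="analyst")
    cur = conn.cursor()

    # Schema-on-read over files already copied onto the cluster via NFS;
    # no separate load into a structured database is required.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS inbound_source (
            record_id STRING,
            source_system STRING,
            amount DOUBLE,
            received_date STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/inbound/current'
    """)

    # Analysts then query with ordinary SQL from scripts, Hive, or Tableau.
    cur.execute("""
        SELECT source_system, COUNT(*) AS records, SUM(amount) AS total_amount
        FROM inbound_source
        GROUP BY source_system
    """)
    for source_system, records, total_amount in cur.fetchall():
        print(source_system, records, total_amount)

Because the table is external, dropping it later leaves the underlying files untouched, so the schema can be reworked freely while the data stays put.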

Developers Garnier and Frohlich point out that by taking advantage of the Hadoop cluster, the team was able to realize substantially more processing power and storage space without the costs associated with traditional blade servers equipped with SAN storage. Two of the servers from the cluster are also application servers running SmartLoad code and components. The result is a more efficient use of hardware, with no need for separate servers to run the application.

Improved Speed to Market

Here's how Thomas summarizes the benefits of the upgraded system to both the company and its customers: "We are realizing increased processing speed, which leads to shorter delivery times. In addition, reduced storage expenses mean that we can store more, not acquire less. Both the company's internal operations and our clients have access to deeper data supporting and aiding insights into their business areas.

"Overall, we are seeing reduced storage expenses while gaining processing and storage capabilities and capacities," he adds. "This translates into an improved speed to market for our business units. It also positions our Group to grow our Hadoop ecosystem to meet future Big Data requirements."

And when it comes to being a Big Data All Star in today's information-intensive world, Thomas' advice is short and to the point: Don't wait and don't stop.

Trevor Mason and Big Data: Doing What Comes Naturally

Trevor Mason, Vice President Technology Research at IRI

Mason is the vice president for Technology Research at IRI, a 30-year-old Chicago-based company that provides information, analytics, business intelligence and domain expertise for the world's leading CPG, retail and healthcare companies.

"I've always had a love of mathematics and proved to be a natural when it came to computer science," Mason says. "So I combined both disciplines and it has been my interest ever since. I joined IRI 20 years ago to work with Big Data (although it wasn't called that back then). Today I head up a group that is responsible for forward-looking research into tools and systems for processing, analyzing and managing massive amounts of data. Our mission is two-fold: keep technology costs as low as possible while providing our clients with the state-of-the-art analytic and intelligence tools they need to drive their insights."

Big Data Challenges


Recent challenges facing Mason and his team included a mix of business and technological issues. They were attempting to realize significant cost reductions by reducing mainframe load, and to continue to reduce the mainframe support risk that is increasing due to the imminent retirement of key mainframe support personnel. At the same time, they wanted to build the foundations for a more cost-effective, flexible and expandable data processing and storage environment.

The technical problem was equally challenging. The team wanted to achieve random extraction rates averaging 600,000 records per second, peaking to over one million records per second, from a 15 TB fact table. This table feeds a large multi-TB downstream client-facing reporting farm. Given IRI's emphasis on economy, the solution had to be very efficient, using only 16 to 24 nodes.

"We looked at traditional warehouse technologies, but Hadoop was by far the most cost effective solution," Mason says. "Within Hadoop we investigated all the main distributions and various hardware options before settling on MapR on a Cisco UCS (Unified Computing System) cluster."

The fact table resides on the mainframe, where it is updated and maintained daily. These functions are very complex and proved costly to migrate to the cluster. However, the extraction process, which represents the majority of the current mainframe load, is relatively simple, Mason says.
"The solution was to keep the update and maintenance processes on the mainframe and maintain a synchronized copy on the Hadoop cluster by using our mainframe change logging process," he notes. "All extraction processes go against the Hadoop cluster, significantly reducing the mainframe load. This met our objective of maximum performance with minimal new development."

The team chose MapR to maximize file system performance, facilitate the use of a large number of smaller files, and take full advantage of its NFS capability so files could be sent via FTP from the mainframe directly to the cluster.
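
IRI's change-logging mechanism isn't described in detail, but the general pattern of keeping a read-only copy in sync by replaying a change log is easy to sketch. The record layout and operation codes below are illustrative assumptions, not IRI's format:

    import csv

    def apply_change_log(snapshot, change_log_path):
        """Replay insert/update/delete records from a change-log file onto a snapshot.

        The snapshot is a dict keyed by record id; the change log is assumed to be
        a CSV with columns: op (I/U/D), record_id, plus the remaining fact columns.
        """
        with open(change_log_path, newline="") as fh:
            for row in csv.DictReader(fh):
                op, record_id = row.pop("op"), row.pop("record_id")
                if op in ("I", "U"):
                    snapshot[record_id] = row      # insert or overwrite
                elif op == "D":
                    snapshot.pop(record_id, None)  # delete if present
        return snapshot

    # Conceptual daily cycle: FTP the day's change log onto the NFS-mounted cluster,
    # replay it against yesterday's copy, and serve all extractions from the
    # refreshed copy instead of the mainframe.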

Shaking up the System


They also gave their system a real workout. Recalls Mason, "To maximize efficiency we had to see how far we could push the hardware and software before it broke. After several months of pushing the system to its limits, we weeded out several issues, including a bad disk, a bad node, and incorrect OS, network and driver settings. We worked closely with our vendors to root out and correct these issues."

Overall, he says, the development took about six months followed by two months of final testing and running in parallel with the regular production processes. He also stressed that "Much kudos go to the IRI engineering team and Zaloni consulting team who worked together to implement all the minute details needed to create the current fully functional production system in only six months."

To accomplish their ambitious goals, the team took some unique approaches. For instance, the methods they used to organize the data and structure the extraction process allowed them to achieve extraction rates of between two million and three million records per second on a 16-node cluster.

They also developed a way to always have a consistent view of the data used in
the extraction process while continuously updating it.

By far one of the most effective additions to the IRI IT infrastructure was the implementation of Hadoop. Before Hadoop, the technology team relied on the mainframe running 24/7 to process the data in accordance with their customers' tight timelines. With Hadoop, they have been able to speed up the process while reducing mainframe load. The result: annual savings of more than $1.5 million.

Says Mason, "Hadoop is not only saving us money, it also provides a flexible platform that can easily scale to meet future corporate growth. We can do a lot more in terms of offering our customers unique analytic insights; the Hadoop platform and all its supporting tools allow us to work with large datasets in a highly parallel manner."
"IRI specialized in Big Data before the term became popular; this is not new to us," he concludes. "Big Data has been our business now for more than 30 years. Our objective is to continue to find ways to collect, process and manage Big Data efficiently so we can provide our clients with leading insights to drive their business growth."

And finally, when asked what advice he might have for others who would like to become Big Data All Stars, Mason is very clear: "Find and implement efficient and innovative ways to solve critical Big Data processing and management problems that result in tangible value to the company."

Leveraging Big Data to Economically Fuel Growth

Kevin McClowry, Director of Analytics Application Development at TransUnion

Kevin McClowry has been working with Big Data since even before the term hit the mainstream. And these days, McClowry, currently the lead architect over analytics with TransUnion, LLC, is looking to Big Data technologies to add even more impetus to his organization's growth.

"My role is to build systems that enable new insights and innovative product development," he says. "And as we grow, these systems need to go beyond traditional consumer financial information. When people hear TransUnion, they immediately think of their credit score, and that is a huge part of our business. What they don't realize is that we also have been providing services across a number of industries for years: insurance, telecommunications, banking, automotive, to name a few, and we have a wealth of information. Enabling our analysts to more effectively experiment within and across data domains is what keeps that needle of innovation moving."

But growth is rarely achieved without combatting some degree of inertia. Within most organizations, it is the customer-facing, mission-critical systems that warrant the financial investment in enterprise-class commercial solutions. In contrast, R&D environments and innovation centers tend to receive discretionary funding. "We started down this road because we knew we wanted to do something great, but we were feeling the pinch from our current technology stack," McClowry says.

Data to the People


"The first problem I wanted to address was the amount of time our analysts spent requesting, waiting on, and piecing together disparate data from across the organization. But there was a lot of data, so we wanted to incorporate lower-cost storage platforms to reduce our investment."

So McClowry went to a hybrid architecture that made use of commercial databases for the most desirable, well-understood data assets, where the cost can be justified by the demand, surrounded by a more cost-effective Hadoop platform for everything else. "The more recent, enterprise data assets are what most people want, and we've invested in that. But the trends in historical data or the newly acquired sources that don't have the same demand are where a lot of the unseen potential lies. We put that data in Hadoop, and get value from it without having those uncomfortable conversations about why our analysts need it around; most of us have been there before."

Since adopting this new tiered architecture, McClowry's organization has seen the benefits. "We're seeing analysts work with data in a way that was previously out of reach, and it's fantastic. When the new normal for our folks starts to include trillion-row datasets at their fingertips, that's fun."

New Tools, Enabling Better Insights

The next step for McClowry's team was to enable tools that would make that data accessible and usable by the company's statisticians and analysts.

"We're finding that a lot of the Data Scientists we're hiring, particularly those coming out of academia, are proficient with tools like R and Python. So we're making sure they have those tools at their disposal. And that is having a great influence on our other team members and leading them to adopt these tools."

But that was not the only type of analyst that McClowry's team is looking to empower. "I'm consistently impressed by the number of skilled analysts we have hidden within our organization who have simply been lacking the opportunity to fall down the rabbit hole. These are the people who have invaluable tribal knowledge about our data, and we're seeing them use that knowledge and data visualization tools like Tableau to tell some really powerful stories."

Hitting the Road


McClowry's team is globalizing their Big Data capabilities, introducing the analytics architecture worldwide and sizing it to fit the needs and resources of each country's operation. In the process, they are building an international community that is taking full advantage of Big Data, a community that did not exist before.

"We are trying to foster innovation and growth. Embracing these new Big Data platforms and architectures has helped lay that foundation. Nurturing the expertise and creativity within our analysts is how we'll build on that foundation."

And when asked what he would say to other technologists introducing Big Data into their organizations, McClowry advises, "You'll be tempted and pressured to over-promise the benefits. Try to keep the expectations grounded in reality. You can probably reliably avoid costs, or rationalize a few silos to start out, and that can be enough as a first exercise. But along the way, acknowledge every failure and use it to drive the next foray into the unknown; those unintended consequences are often where you get the next big idea. And as far as technologies go, these tools are not just for the techno-giants anymore and there is no indication that there will be fewer of them tomorrow. Organizations need to understand how they will and will not leverage them."

Making Big Data Work for a Major Oil & Gas Equipment Manufacturer

Warren Sharp, Big Data Engineer at National Oilwell Varco (NOV)

Big data requires a big vision. This was one of the primary reasons that Warren Sharp was asked to join National Oilwell Varco (NOV) a little over six months ago. NOV is a worldwide leader in the design, manufacture and sale of equipment and components used in oil and gas drilling and production operations and the provision of oilfield services to the upstream oil and gas industry.

Sharp, whose title is Big Data Engineer in NOV's Corporate Engineering and Technology Group, honed his Big Data analytic skills with a previous employer, a leading waste management company that was collecting information about driver behavior by analyzing GPS data for 15,000 trucks around the country.

The goals are more complicated and challenging at NOV. Says Sharp, "We are creating a data platform for time-series data from sensors and control systems to support deep analytics and machine learning. This platform will efficiently ingest and store all time-series data from any source within the organization and make it widely available to tools that talk Hadoop or SQL. The first business use case is to support Condition-Based Maintenance efforts by making years of equipment sensor information available to all machine learning applications from a single source."

MapR at NOV
For Sharp, using the MapR data platform was a given; he was already familiar with its features and capabilities. Coincidentally, his boss-to-be at NOV had already come to the same conclusion six months earlier and made MapR a part of their infrastructure. "Learning that MapR was part of the infrastructure was one of the reasons I took the job," comments Sharp. "I realized we had compatible ideas about how to solve Big Data problems."

"MapR is relatively easy to install and set up, and the POSIX-compliant, NFS-enabled clustered file system makes loading data onto MapR very easy," Sharp adds. "It is the quickest way to get started with Hadoop and the most flexible in terms of using ecosystem tools." The next step was to figure out which tools in the Hadoop ecosystem to include to create a viable solution.

Querying OpenTSDB

The initial goal was to load large volumes of data into OpenTSDB, a time-series database. However, Sharp realized that other Hadoop SQL-based tools could not query the native OpenTSDB data table easily. So he designed a partitioned Hive table to store all ingested data as well. This hybrid storage approach supported options to negotiate the tradeoffs between storage size and query time, and has yielded some interesting results. For example, Hive allowed data to be accessed by common tools such as Spark and Drill for analytics with query times in minutes, whereas OpenTSDB offered near-instantaneous visualization of months and years of data. The ultimate solution, says Sharp, was to ingest data into a canonical partitioned Hive table for use by Spark and Drill and use Hive to generate files for the OpenTSDB import process.

Coping with Data Preparation


Storage presented another problem. "Hundreds of billions of data points use a lot of storage space," he notes. "Storage space is less expensive now than it's ever been, but the physical size of the data also affects read times of the data while querying. Understanding the typical read patterns of the data allows us to lay down the data in MapR in a way to maximize the read performance. Moreover, partitioning data by its source and date leads to compact daily files."

Sharp found both ORC (Optimized Row Columnar) format and Spark were essential tools for handling time-series data and analytic queries over larger time ranges.
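
A minimal sketch of that layout idea, assuming Spark and using illustrative column names rather than NOV's actual schema: readings are partitioned by source and date and written as ORC, so queries that filter on those columns touch only the relevant daily files.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("sensor-ingest-sketch").getOrCreate()

    # Assume raw readings with columns: sensor_id, ts (epoch seconds), value.
    raw = spark.read.json("/data/raw/sensor_readings")

    curated = (
        raw.withColumn("ds", F.to_date(F.from_unixtime("ts")))    # daily partition key
           .withColumn("source", F.substring("sensor_id", 1, 4))  # illustrative source key
    )

    # One compact set of ORC files per (source, day); partition pruning keeps
    # time-range queries from scanning the whole history.
    (curated.write
            .mode("append")
            .partitionBy("source", "ds")
            .format("orc")
            .save("/data/curated/sensor_readings"))

A partitioned Hive table declared over the same directory can then serve Spark, Drill and the OpenTSDB export jobs from one canonical copy.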

Bottom Line
As a result of his efforts, he has created a very compact, lossless storage mechanism for sensor data. Each terabyte of storage has the capacity to store 750 billion to 5 trillion data points. This is equivalent to 20,000 to 150,000 sensor-years of 1 Hz data and will allow NOV to store all sensor data on a single MapR cluster.
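
Those figures are easy to sanity-check: a sensor sampling at 1 Hz produces about 31.5 million points per year, so the quoted densities per terabyte correspond to roughly the stated range of sensor-years.

    SECONDS_PER_YEAR = 365 * 24 * 3600  # ~31.5 million 1 Hz samples per sensor-year

    for points_per_tb in (750e9, 5e12):
        sensor_years = points_per_tb / SECONDS_PER_YEAR
        print(f"{points_per_tb:.2e} points/TB ~= {sensor_years:,.0f} sensor-years at 1 Hz")

    # Prints roughly 23,800 and 158,500 sensor-years, in line with the
    # 20,000 to 150,000 range quoted above.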

"Our organization now has platform data capabilities to enable Condition-Based Maintenance," Sharp says. "All sensor data are accessible by any authorized user or application at any time for analytics, machine learning, and visualization with Hive, Spark, OpenTSDB and other vendor software. The Data Science and Product teams have all the tools and data necessary to build, test, and deliver complicated CBM models and applications."

Becoming a Big Data All Star

When asked what advice he might have for other potential Big Data All Stars, Sharp comments, "Have a big vision. Use cases are great to get started; vision is critical to creating a sustainable platform.

"Learn as much of the ecosystem as you can: what each tool does and how it can be applied. End-to-end solutions won't come from a single tool or implementation, but rather by assembling the use of a broad range of available Big Data tools to create solutions."

The NIH Pushes the Boundaries of Health Research with Data Analytics

Chuck Lynch, Chief Knowledge Officer at National Institutes of Health

Few things probably excite a data analyst more than data on a mission, especially when that mission has the potential to literally save lives. That fact might make the National Institutes of Health the mother lode of gratifying work projects for the data analysts who work there. In fact, the NIH is 27 separate Institutes and Centers under one umbrella title, all dedicated to the most advanced biomedical research in the world.

At approximately 20,000 employees strong, including some of the most prestigious experts in their respective fields, the NIH is generating a tremendous amount of data on healthcare research. From studies on cancer, to infectious diseases, to AIDS, or women's health issues, the NIH probably has more data on each topic than nearly anyone else. Even the agency's library, the National Library of Medicine, is the largest of its kind in the world.

Data Lake Gives Access to Research Data


Big data has been a very big thing for the NIH for some time. But this fall the NIH will benefit from a new ability to combine and compare separate institute grant data sets in a single data lake.

With the help of MapR, the NIH created a five-server cluster (with approximately 150 terabytes of raw storage) that will be able to accumulate that data, manipulate the data and clean it, and then apply analytics tools against it, explains Chuck Lynch, a senior IT specialist with the NIH Office of Portfolio Analysis, in the Division of Program Coordination, Planning, and Strategic Initiatives.

If Lynch's credentials seem long, they actually get longer. Add to the above the Office of the Director, which coordinates the activities of all of the institutes. Each individual institute in turn has its own director, and a separate budget, set by Congress.

"What the NIH does is basically drive the biomedical research in the United States in two ways," Lynch explains. "There's an intramural program where we have the scientists here on campus do biomedical research in laboratories. They are highly credentialed and many of them are world famous.

"Additionally, we have an extramural program where we issue billions of dollars in grants to universities and to scientists around the world to perform biomedical research, both basic and applied, to advance different areas of research that are of concern to the nation and to the world," Lynch says.

This is all really great stuff, but it just got a lot better. The new cluster enables the office to effectively apply analytical tools to the newly-shared data. The hope is that the NIH can now do things with health science data it couldn't do before, and in the process advance medicine.

Expanding Access to Knowledge Stores

As Lynch notes, big data is not about having volumes of information. It is about the ability to apply analytics to data to find new meaning and value in it. That includes the ability to see new relationships between seemingly unrelated data, and to discover gaps in those relationships. As Lynch describes it, analytics helps you better know what you don't know. "If done well, big data raises as many questions as it provides answers," he says.

The challenge for the NIH was the large number of institutes collecting and managing their own data. Lynch refers to them as "knowledge stores" of the scientific research being done.

"We would tap into these and do research on them, but the problem was that we really needed to have all the information at one location where we could manipulate it without interfering with the [original] system of record," Lynch says.

"For instance, we have an organization that manages all of the grants, research, and documentation, and we have the Library of Medicine that handles all of the publications in medicine. We need that information, but it's very difficult to tap into those resources and have it all accumulated to do the analysis that we need to do. So the concept that we came up with was building a data lake," Lynch recalls.

That was exactly one year ago, and the NIH initially hoped to undertake the project itself.

"We have a system here at NIH that we're not responsible for called Biowulf, which is a play on Beowulf. It's a high-speed computing environment but it's not data intensive. It's really computationally intensive," Lynch explains. "We first talked to them but we realized that what they had wasn't going to serve our purposes."

So the IT staff at NIH worked on a preliminary design, and then engaged vendors to help formulate a more formal design. From that process the NIH chose MapR to help it develop the cluster.

"We used end-of-year funding in September of last year to start the procurement of the equipment and the software," Lynch says. "That arrived here in the November/December timeframe and we started to coordinate with our office of information technology to build the cluster out. Implementation took place in the April to June timeframe, and the cluster went operational in August."

Training Mitigates Learning Curve


"What we're doing is that we're in the process of testing the system and basically wringing out the bugs," Lynch notes. "Probably the biggest challenge that we've faced is our own learning curve: trying to understand the system. The challenge that we have right now as we begin to put data into the system is how do we want to deploy that data? Some of the data lends itself to the different elements of the MapR ecosystem. What should we be putting it into? Not just raw data, but should we be using Pig or Hive or any of the other ecosystem elements?"

Key to the project's success so far, and going forward, is training.

"Many of the people here are biomedical scientists. The vast majority of them have PhDs in biomedical science or chemistry or something. We want them to be able to use the system directly," Lynch says. "We had MapR come in and give us training and also give our IT people training on administering the MapR system and using the tools."

But that is the beginning of the story, not the conclusion. As Lynch notes, "The longer journey is now to use it for true big data analysis; to find tools that we can apply to it; to index; to get metadata; to look at the information that we have there and to start finding things in the data that we had never seen before.

"Our view is that applying big data analytics to the data that we have will help us discover relationships that we didn't realize existed," Lynch continues.

"Success for us is being able to answer the questions being given to us by senior leadership," Lynch says. "For example, is the research that we're doing in a particular area productive? Are we getting value out of research? Is there something that we're missing? Is there overlap in the different types of research or are there gaps in the research? And in what we are funding are we returning value to the public?"

Next Steps

So what is the next step for the NIH?

"To work with experts in the field to find better ways of doing analysis with the data and make it truly a big data environment as opposed to just a data lake," Lynch says. "That will involve a considerable amount of effort and will take us some time to put that together. I think what we're interested in doing is comparing and contrasting different methods and analytic techniques. That is the long haul."

"The knowledge that we're dealing with is so complex," Lynch concludes. "In this environment it is a huge step forward and I think it is going to resonate with the biomedical community at large. There are other biomedical organizations that look to NIH to drive methods and approaches and best practices. I think this is the start of a new best practice."

Keeping an Eye on the Analytic End Game at UnitedHealthcare

Alex Barclay, Vice President of Advanced Analytics at UnitedHealthcare

When Alex Barclay received his Ph.D. in mathematics from the University of California San Diego in 1999, he was already well on his way to a career focused on big data. Barclay brought his interest and expertise in analytics to Fair Isaac, a software analytics company, and then Experian, the credit reporting company.

Then, two years ago, he joined UnitedHealthcare and brought his experience with Big Data and a mature analytic environment to help advance the company's Payment Integrity analytic capabilities.

UnitedHealthcare offers the full spectrum of health benefit programs to individuals, employers, military service members, retirees and their families, and Medicare and Medicaid beneficiaries, and contracts directly with more than 850,000 physicians and care professionals, and 6,000 hospitals and other care facilities nationwide.

For Barclay, as Vice President of Advanced Analytics for the company's Payment Integrity organization, these numbers translated into huge amounts of data flowing into and through the company. When he first surveyed the internal big data landscape, he found an ad hoc approach to analytics characterized by data silos and a heavily rule-based, fragmented data environment.

Building a New Environment


"In the other environments I've been in, we had an analytic sandbox development platform and a comprehensive, integrated analytic delivery platform," said Barclay. "This is what I wanted to set up for UnitedHealthcare. The first thing I did was partner with our IT organization to create a big data environment that we didn't have to re-create every year, one that would scale over time."

The IT and analytic team members used Hadoop as the basic data framework and built a single platform equipped with the tools needed to analyze information generated by claims, prescriptions, plan participants and contracted care providers, and associated claim review outcomes. "We spent about a year pulling this together; I was almost an IT guy as opposed to an analytics guy," Barclay added.

Rather than tackling a broad range of organizational entities, Barclay has concentrated his team's efforts on a single, key function: payment integrity. "The idea is to use analytics to ensure that once a claim is received we pay the correct amount, no more, no less, including preventing fraudulent claims," he said.

The Payment Integrity organization handles more than 1 million claims every day. The footprint for this data is about 10 terabytes and growing, according to Barclay. It is also complex; data is generated by 16 different platforms, so although the claim forms are similar, they are not the same and must be rationalized.

Another major challenge to revamping the organization's approach to big data was finding the right tools.

"The tools landscape for analytics is very dynamic; it changes practically every day," said Barclay. "What's interesting is that some of the tools we were looking at two years ago and rejected because they didn't yet have sufficient capability for our purposes have matured over time and now can meet our needs."

Embracing Big Data

"It's working," said Barclay. "We have been able to help identify mispaid claims in a systematic, consistent way. And we have encouraged the company to embrace big data analytics and move toward broadening the landscape to include other aspects of our business including clinical, care provider networks, and customer experience."

"We emphasize innovation. For example, we apply artificial intelligence and deep learning methodologies to better understand our customers and meet their needs. We are broadening our analytic scope to look beyond claims and understand our customers' health care needs. We're really just at the beginning of a long and rewarding Big Data journey."

When asked if he had any advice for his fellow data analysts who might be implementing a big data analytic solution, Barclay replied: "Be patient. Start slow and grow with a clear vision of what you want to accomplish. If you don't have a clearly defined use case, you can get lost in the mud in a hurry. With Payment Integrity, in spite of some early challenges, we created something that is real and has tangible payback. We always knew what the end game was supposed to look like."

Creating Flexible Big Data Solutions for Drug Discovery

David Tester, Application Architect at Novartis Institutes for Biomedical Research

He didn't know it at the time, but when high school student David Tester acquired his first computer, a TI-82 graphing calculator from Texas Instruments, he was on a path that would inevitably lead to Big Data. Tester's interest in computation continued through undergraduate and graduate schools, culminating in a Ph.D. from the University of Oxford.

He followed a rather unusual path, investigating where formal semantics and logic interacted with statistical reasoning. He notes that this problem is one that computers are still not very good at solving as compared to people. Although computers are excellent for crunching statistics and for following an efficient chain of logic, they still fall short where those two tasks combine: using statistical heuristics to guide complex chains of logic.

Data Science Techniques for Drug Discovery


Since joining the Novartis Institutes for Biomedical Research over two years ago, Tester has been making good use of his academic background. He works as an application architect charged with devising new applications of data science techniques for drug research. His primary focus is on genomic data, specifically Next Generation Sequencing (NGS) data, a classic Big Data application.

In addition to dealing with vast amounts of raw heterogeneous data, one of the major challenges facing Tester and his colleagues is that best practices in NGS research are an actively moving target. Additionally, much of the cutting-edge research requires heavy interaction with diverse data from external organizations. For these reasons, making the most of the latest NGS research in the literature ends up having two major parts.

Firstly, it requires workflow tools that are robust enough to process vast amounts of raw NGS data yet flexible enough to keep up with quickly changing research techniques.

Secondly, it requires a way to meaningfully integrate data from Novartis with data from these large external organizations, such as 1000 Genomes, NIH's GTEx (Genotype-Tissue Expression) and TCGA (The Cancer Genome Atlas), paying particular attention to clinical, phenotypical, experimental and other associated data. Integrating these heterogeneous datasets is labor intensive, so they only want to do it once. However, researchers have diverse analytical needs that can't be met with any one database. These seemingly conflicting requirements suggest a need for a moderately complex solution.

Finding the Answers
To solve the first part of this NGS Big Data problem, Tester and his team built a workflow system that allows them to process NGS data robustly while being responsive to advances in the scientific literature. Although NGS data involves high data volumes that are ideal for Hadoop, a common problem is that researchers have come to rely on many tools that simply don't work on native HDFS. Since these researchers previously couldn't use systems like Hadoop, they have had to maintain complicated bookkeeping logic to parallelize for optimum efficiency on traditional HPC.

This workflow system uses Hadoop for performance and robustness and MapR to provide the POSIX file access that lets bioinformaticians use their familiar tools. Additionally, it uses the researchers' own metadata to allow them to write complex workflows that blend the best aspects of Hadoop and traditional HPC.

As a result of their efforts, the flexible workflow tool is now being used for a variety of different projects across Novartis, including video analysis, proteomics, and metagenomics. An additional benefit is that the integration of data science infrastructure into pipelines built partly from legacy bioinformatics tools can be achieved in days, rather than months.

Built-in Flexibility
For the second part of the problem, integrating highly diverse public datasets, the team used Apache Spark, a fast, general engine for large-scale data processing. Their specific approach to dealing with heterogeneity was to represent the data as a vast knowledge graph (currently trillions of edges) that is stored in HDFS and manipulated with custom Spark code. This use of a knowledge graph lets Novartis bioinformaticians easily model the complex and changing ways that biological datasets connect to one another, while the use of Spark allows them to perform graph manipulations reliably and at scale.
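
The article doesn't show the Novartis code itself, but the general idea of storing a knowledge graph as edge triples and manipulating it with Spark can be sketched as follows; the file path, predicate names and example traversal are illustrative assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("knowledge-graph-sketch").getOrCreate()

    # Edges stored as (subject, predicate, object) triples, e.g. as Parquet on the cluster.
    edges = spark.read.parquet("/data/graph/edges")

    # Example traversal: count distinct genes linked, via experiments, to each tissue type.
    gene_exp = (edges.filter(F.col("predicate") == "measured_in")
                     .select(F.col("subject").alias("gene"),
                             F.col("object").alias("experiment")))
    exp_tissue = (edges.filter(F.col("predicate") == "has_tissue")
                       .select(F.col("subject").alias("experiment"),
                               F.col("object").alias("tissue")))

    genes_by_tissue = (gene_exp.join(exp_tissue, "experiment")
                               .groupBy("tissue")
                               .agg(F.countDistinct("gene").alias("genes")))
    genes_by_tissue.show()

    # The same edge table can be re-projected into analyst-specific schemas and
    # exported to whichever endpoint database a team prefers.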

30
On the analytics side, researchers can access data directly through a Spark API or through a number of endpoint databases with schemas tailored to their specific analytic needs. Their toolchain allows entire schemas with 100 billion rows to be created quickly from the knowledge graph and then imported into the analysts' favorite database technologies.

Spark and MapR Enable Research Advantage

As a result, these combined Spark and MapR-based workflow and integration layers allow the company's life science researchers to meaningfully take advantage of the tens of thousands of experiments that public organizations have conducted, a significant competitive advantage.

"In some ways I feel that I've come full circle from my days at Oxford by using my training in machine learning and formal logic and semantics to bring together all these computational elements," Tester adds. "This is particularly important because, as the cost of sequencing continues to drop exponentially, the amount of data that's being produced increases. We will need to design highly flexible infrastructures so that the latest and greatest analytical tools, techniques and databases can be swapped into our platform with minimal effort as NGS technologies and scientific requirements change. Designing platforms with this fact in mind eases user resistance to change and can make interactions between computer scientists and life scientists more productive."

To his counterparts wrestling with the problems and promise of Big Data in other companies, Tester says that if all you want to do is increase scale and bring down costs, Hadoop and Spark are great. But most organizations' needs are not so simple. For more complex Big Data requirements, the many tools that are available can be looked at merely as powerful components that can be used to fashion a novel solution. The trick is to work with those components creatively in ways that are sensitive to the users' needs, by drawing upon non-Big Data branches of computer science like artificial intelligence and formal semantics, while also designing the system for flexibility as things change. He thinks that this is an ultimately more productive way to tease out the value of your Big Data implementation.

Summary
Whether you are a business analyst, data scientist, enterprise architect, IT administrator, or developer, the examples in this guide provide concrete applications of big data technologies to help you develop your own methods, approaches, and best practices for creating big data solutions within your organization. Moving beyond experimentation to implementing sustainable big data solutions is necessary to impact the growth of your business.

Are you or your colleagues Big Data All-Stars?


If you are pushing the boundaries of big data, we'd love to hear from you. Drop us a note at BigDataAllStars@mapr.com and tell us about your journey or share how your colleagues are innovating with big data.

Originally published in Datanami


Creating Flexible Big Data Solutions for Drug Discovery
Datanami, January 19, 2015
Coping with Big Data at Experian: Don't Wait, Don't Stop
Datanami, September 1, 2014
Trevor Mason and Big Data: Doing What Comes Naturally
Datanami, October 20, 2014
Leveraging Big Data to Economically Fuel Growth
Datanami, November 18, 2014
How comScore Uses Hadoop and MapR to Build its Business
Datanami, December 1, 2014
Keeping an Eye on the Analytic End Game at UnitedHealthcare
Datanami, September 7, 2015
The NIH Pushes the Boundaries of Health Research with Data Analytics
Datanami, September 21, 2015
Making Big Data Work for a Major Oil & Gas Equipment Manufacturer
Datanami, November 16, 2015
Making Good Things Happen at Wells Fargo
Datanami, January 11, 2016

