
Defining the Data Lake

An Introduction to Big Data Lakes and Common Use Cases

www.zaloni.com
Introduction
A data lake is a central location in which to store all your data, regardless of its source
or format. It is typically, although not always, built using Hadoop. The data can be
structured or unstructured. You can then use a variety of storage and processing
tools, typically tools in the extended big data ecosystem, to extract value quickly
and inform key organizational decisions.

Because of the growing variety and volume of data, data lakes are an emerging and
powerful architectural approach, especially as enterprises turn to mobile,
cloud-based applications, and the Internet of Things (IoT) as right-time delivery
mediums for big data.

Data Lake versus EDW


The differences between enterprise data warehouses (EDW) and data lakes are
significant. An EDW is fed data from a broad variety of enterprise applications.
Naturally, each application's data has its own schema, requiring the data to be
transformed to conform to the EDW's own predefined schema. Designed to collect
only data that is controlled for quality and conforming to an enterprise data model,
the EDW is capable of answering only a limited number of questions.

Data lakes, on the other hand, are fed information in its native form. Little or no
processing is performed for adapting the structure to an enterprise schema. The
biggest advantage of data lakes is flexibility. By allowing the data to remain in its
native format, a far greater, and timelier, stream of data is available for analysis.
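The schema-on-read flexibility described above can be sketched in a few lines: raw records land in the lake exactly as produced, and a schema is applied only when a question is asked. The field names and records below are invented purely for illustration.

```python
import json

# Raw events land in the lake exactly as produced -- no upfront schema.
raw_events = [
    '{"user": "a17", "action": "login", "ts": "2017-03-01T08:00:00"}',
    '{"user": "b42", "action": "purchase", "amount": 19.99}',  # extra field
    '{"user": "a17", "action": "logout"}',                     # missing field
]

def read_with_schema(lines, fields):
    """Schema-on-read: project each raw record onto the fields the
    current analysis needs, tolerating missing or extra attributes."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two different "schemas" can be applied over the same raw data at query time.
actions = [r["action"] for r in read_with_schema(raw_events, ["user", "action"])]
print(actions)  # ['login', 'purchase', 'logout']
```

Because nothing was discarded at ingest, the same three raw lines can later answer questions no one anticipated when the data was collected.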

Some of the benefits of a data lake include:

- Ability to derive value from unlimited types of data
- Ability to store all types of structured and unstructured data, from CRM data
  to social media posts
- More flexibility: you don't have to have all the answers up front
- Ability to store raw data: you can refine it as your understanding and insight
  improve
- Unlimited ways to query the data
- Application of a variety of tools to gain insight into what the data means
- Elimination of data silos
- Democratized access to data via a single, unified view of data across the
  organization when using an effective data management platform

Key Attributes of a Data Lake
To be classified as a data lake, a big data repository should exhibit three key
characteristics:

1. A single shared repository of data, typically stored within a distributed file
system (DFS). Data lakes preserve data in its original form and capture changes to
data and contextual semantics throughout the data lifecycle. This approach is
especially useful for compliance and internal auditing activities. It is an
improvement over the traditional EDW, where, if data has undergone transformations,
aggregations and updates, it is challenging to piece data together when needed, and
organizations struggle to determine the provenance of data.

2. Includes orchestration and job scheduling capabilities (e.g., via YARN). Workload
execution is a prerequisite for the enterprise, and YARN provides resource
management and a central platform to deliver consistent operations, security and
data governance tools across Hadoop clusters, ensuring analytic workflows have
access to the data and the computing power they require.

3. Contains a set of applications or workflows to consume, process or act upon the
data. Easy user access is one of the hallmarks of a data lake, because organizations
preserve the data in its original form. Whether structured, unstructured or
semi-structured, data is loaded and stored as-is. Data owners can then consolidate
customer, supplier and operations data, eliminating technical, and even political,
roadblocks to sharing data.

Zaloni's reference data lake architecture highlights a zone-based approach to
governing data in the data lake:
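As a rough illustration of how a zone-based layout can govern data, the sketch below promotes datasets through directories whose names encode their trust level. The zone names and path scheme are assumptions made for the example, not Zaloni's exact terminology.

```python
import posixpath

# Hypothetical zone layout: data moves from raw toward refined
# as quality checks and transformations are applied.
ZONES = ["raw", "trusted", "refined"]

def zone_path(zone, source, dataset, filename):
    """Build a governed path like raw/crm/customers/2017-03-01.csv."""
    if zone not in ZONES:
        raise ValueError("unknown zone: %s" % zone)
    return posixpath.join(zone, source, dataset, filename)

def promote(path, to_zone):
    """Return the dataset's path one zone up, after it passes validation."""
    zone, rest = path.split("/", 1)
    if ZONES.index(to_zone) != ZONES.index(zone) + 1:
        raise ValueError("can only promote one zone at a time")
    return posixpath.join(to_zone, rest)

p = zone_path("raw", "crm", "customers", "2017-03-01.csv")
print(promote(p, "trusted"))  # trusted/crm/customers/2017-03-01.csv
```

Keeping the zone in the path makes provenance visible to every tool that touches the file: a consumer can tell at a glance whether a dataset is raw landing data or has passed through validation.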

Data lakes are becoming more and more central to enterprise data strategies. Data lakes
best address today's data realities: much greater data volumes and varieties, higher
expectations from users, and the rapid globalization of economies.

Most common use cases for data lakes today
Although the potential use cases are limitless, today data lakes are seeing success
with these common use cases:

1. EDW augmentation: Offloading data from a traditional enterprise data
warehouse (EDW) to Hadoop or the cloud brings storage cost savings and
increases bandwidth in the EDW for business intelligence (BI) processes.

2. Agile analytics: A "fail fast" approach to data science in which hypotheses,
testing, iterating and improvements are in a constant cycle using real-time
data, which can result in more innovative insights that add business value.

3. Enterprise reporting: The ability to do ad hoc reporting using an
enterprise-wide data source is key to understanding the business in real time
and reducing risk.

4. Data monetization: More enterprises are leveraging their data to better
understand current customers and also develop new products and services
that resonate with consumers.

5. Data science: Some enterprises are working to support an enterprise-wide
data science capability, particularly for predictive modeling and analytics
using machine learning and large datasets.

You want a data lake, not a data swamp


Data lakes can provide big wins for an organization. But implementing, managing
and scaling a data lake can be problematic. The key to realizing continuous
business value from a data lake is in applying, from day 1, the appropriate data
management and governance needed for data visibility, reliability, security
and privacy that can then allow broader access to data by multiple users (with
permissions).
Organizations that apply data management best practices to their big data lakes
are experiencing significant benefits, whether it's reducing costs, enhancing or
adding new revenue streams, or reducing business risk.

Case studies in data lake success


Companies are making serious strides with big data implementations, addressing
old problems in new ways and creating new opportunities to enhance their
business, their customer loyalty and their competitive advantage. Below we
outline four distinct use cases in four different industries, each of which
leveraged an integrated data lake management platform to build a clean, actionable
data lake and was well-rewarded for the effort:

1. Fraud detection in the insurance industry


2. EDW augmentation in market research
3. Hospital re-admission risk reduction in healthcare
4. Network analytics in telecommunications
1. Improving fraud detection and reduction

Workers' compensation fraud, including employee, employer and medical provider
fraud, is estimated to cost states $1 billion to $3 billion annually. One reason
is that many fraudsters are able to manipulate billing faster than investigators
can audit. To support a new real-time anti-fraud analytics solution for one of
the country's largest providers of workers' compensation insurance, Zaloni built
a data lake solution.

At a Glance:
Industry: Insurance
Company Description: Large US-based provider of workers' compensation
Technical Use Case: Data Lake 360: Agile Analytics
Business Use Case: Fraud and Payment Integrity
Deployment: On premises

Before the data lake: The company's business analysts were using manual
processes to build and run SQL queries from several systems, which could take
hours or days to get results. The company wanted to unify its data silos and
legacy data platforms to support a new anti-fraud analytics solution and enable
non-technical staff to quickly perform self-service data analytics through an
intuitive user interface.

With the data lake: Zaloni's big data management platform enabled real-time
data ingestion and automated fraud detection. We also created a centralized data
catalog that provided an intuitive user interface for search and ad-hoc analytics of
all data.

Results: The new system improved data quality and the company's ability to
detect patterns of potential fraud hidden in vast amounts of claims data, and
enabled attorneys to quickly build cases with original claim documents. Data was
used to prosecute more than $150 million in fraudulent cases. With queries now
taking seconds versus hours or days, the company was able to analyze more than
10 million claims, 100 million bill line item details and other records.

2. Enjoying more cost-effective storage and faster data processing

A leading provider of market and shopper information, predictive analytics and
business intelligence to 95% of the Fortune 500 consumer packaged goods (CPG)
and retail companies needed a more cost-effective and efficient approach to
process, analyze and manage data. Zaloni designed and built the backend solution
architecture for an enterprise data warehouse (EDW) augmentation solution to
provide a more cost-effective, flexible and expandable data processing and
storage environment.

At a Glance:
Industry: Market Research
Company Description: Top 10 Market Research Firm
Technical Use Case: EDW Augmentation
Deployment: On premises

Before the data lake: Tasked with managing massive volumes of data (ingesting
hundreds of gigabytes of data from external data providers every week), the
company struggled to keep costs low while providing clients with state-of-the-art
analytic and intelligence tools.

With the data lake: Using Hadoop for the aggregate POS dataset and servicing
the extractions that populate the company's custom, in-memory analytics farm
enabled the company to offload more data, faster, and realize substantial savings
in a very short period of time.

Results: The company realized $5.2 million in annual savings and $4.4 million in
projected additional savings (upon completion of the fine-grained data-at-rest
modification project). Additionally, the company saw a nearly 50% reduction in
mainframe MIPS (millions of instructions per second, a measure of processing
power and CPU consumption) and better throughput with Zaloni Bedrock managing
the data lake.

3. Reducing re-admission risk

One of the US's leading healthcare companies wanted to more effectively use
big data to deliver the best care more affordably. Specifically, it wanted to be
able to better determine readmission risk for patient discharges and predictively
classify readmissions as potentially preventable or not preventable. Zaloni
helped it achieve this by developing a Hadoop-based managed data lake housing
more than 60 million records of historical patient data.

At a Glance:
Industry: Healthcare
Company Description: Healthcare Provider and Health Plan
Technical Use Case: Data Lake 360: Agile Analytics
Business Use Case: Population Health: Readmission Risk
Deployment: On premises

Before the data lake: The company was unable to seamlessly access high volumes
of both new and existing data from disparate sources and analyze it for patterns
and trends.

With the data lake: The healthcare customer had a clear view of what data was
in the data lake; could track its source, format and lineage; and enabled users
to more efficiently search, browse and find the data needed for analytics.

Results: Zaloni's solution enabled the healthcare customer to reduce costs and
improve outcomes by ultimately lowering readmission rates, due to significantly
improved understanding of potentially preventable readmissions, the ability to
develop better algorithms to more accurately predict readmissions, and a new
ability to identify patients with the highest risk of readmission early in their
initial hospitalization and proactively adjust treatment plans sooner to account
for that risk.
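A readmission classifier of the sort described could, in highly simplified form, score each discharge against a set of learned feature weights. The sketch below is purely illustrative: the logistic model, features, weights and threshold are invented for the example and are not the company's actual model.

```python
import math

# Purely illustrative feature weights -- a real model would be trained
# on historical patient records, not hard-coded.
WEIGHTS = {"prior_admissions": 0.8, "length_of_stay": 0.1, "chronic_conditions": 0.5}
BIAS = -3.0

def readmission_risk(patient):
    """Logistic-style risk score in [0, 1] for a discharged patient."""
    z = BIAS + sum(WEIGHTS[f] * patient.get(f, 0) for f in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def potentially_preventable(patient, threshold=0.5):
    """Flag discharges whose predicted readmission risk exceeds the threshold."""
    return readmission_risk(patient) >= threshold

high = {"prior_admissions": 4, "length_of_stay": 12, "chronic_conditions": 3}
low = {"prior_admissions": 0, "length_of_stay": 2, "chronic_conditions": 0}
print(potentially_preventable(high), potentially_preventable(low))  # True False
```

The value of the data lake in this scenario is upstream of the model: with 60 million historical records searchable in one place, analysts can assemble training data and iterate on feature choices far faster than when the data sat in disparate systems.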

4. Reporting compliance and improving customer experience

A large telecom company was required to archive large and growing volumes
of wireless call detail records (CDRs) to comply with government regulations.
Although it was a significant cost to store all of this data, the company
identified an opportunity to leverage the data to provide better customer
service, grow its business and ensure it was getting the expected return on
billions invested in capital expenses. To enable this goal, Zaloni architected
and built a managed data lake to serve as the single repository for all
traffic-related, inventory and provisioning data (CDRs, SNMP, server logs).

At a Glance:
Industry: Telecommunications
Company Description: South American telecommunications giant
Technical Use Case: Data Lake 360: Agile Analytics
Business Use Case: NOC/SOC: Network Analytics
Deployment: On premises

Before the data lake: The company's wireless network generated 4 terabytes
of data per day from voice, data and SMS CDRs. This data was created by 11
different servers and switches with 8-10 different record layouts. As a result,
there was no centralized repository to store all these CDRs. In addition, the
upstream mediation system, which is responsible for merging CDR records into a
single record for each session, was sending duplicate CDRs or not stitching the
call records completely.

With the data lake: The company was better able to perform load balancing,
government reporting, network analysis, and parallel processing that removed
duplicate data and reconstructed call records from partial or lost records.
With a managed data lake, the telecom company not only gained a compliance
solution, but was able to more efficiently manage the network and continue to
provide a great customer experience as the volume of subscriber usage continued
to grow significantly.
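The deduplication and session-stitching step can be sketched as follows. The record fields and session key are assumptions for illustration, not the company's actual mediation schema.

```python
from collections import defaultdict

# Hypothetical partial CDRs: several records per call session, with
# duplicates of the kind the upstream mediation system was emitting.
cdrs = [
    {"session": "s1", "seq": 1, "bytes": 100},
    {"session": "s1", "seq": 1, "bytes": 100},  # duplicate record
    {"session": "s1", "seq": 2, "bytes": 250},
    {"session": "s2", "seq": 1, "bytes": 75},
]

def stitch(records):
    """Drop exact duplicates by (session, seq), then merge each
    session's partial records into one consolidated call record."""
    seen = set()
    sessions = defaultdict(list)
    for r in records:
        key = (r["session"], r["seq"])
        if key in seen:
            continue
        seen.add(key)
        sessions[r["session"]].append(r)
    return {
        sid: {"segments": len(parts), "bytes": sum(p["bytes"] for p in parts)}
        for sid, parts in sessions.items()
    }

print(stitch(cdrs))
# {'s1': {'segments': 2, 'bytes': 350}, 's2': {'segments': 1, 'bytes': 75}}
```

In the data lake this logic would run as a parallel job over each day's partitions rather than over an in-memory list, but the per-session grouping is the same.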

Results: The company was able to double data ingestion, to 8 terabytes/day.
Also, the company avoided costly fines and penalties by meeting the immediate
need for government reporting requirements, and gained insights into network
utilization to avoid network congestion in near real-time.

Data Lake Management Platform
Bedrock is a fully integrated data lake management platform that provides
visibility, governance, and reliability. By simplifying and automating common
data management tasks, customers can focus time and resources on building the
insights and analytics that drive their business.

Self-Service Data Preparation
Mica provides the on-ramp for self-service data discovery, curation, and
governance of data in the data lake. Mica provides business users with an
enterprise-wide data catalog through which to discover data sets, interact with
them and derive real business insights.

Zaloni Professional Services
Your trusted partner for building production big data lakes. Zaloni has more
than 400 staff years of big data experience working globally across the US,
Latin America, Europe, the Middle East and Asia. Zaloni Professional Services
offers expert big data consulting and training services, helping clients plan,
prepare, implement and deploy data lake solutions.

Professional Services Include:
- Use Case Discovery and Definition
- Data Lake Assessment Services
- Solution Architecture Services
- Data Lake Build Services
- Data Lake Analytics Application Development
- Data Science Services

About Zaloni
Delivering on the Business of Big Data
Zaloni is a provider of enterprise data lake management solutions. Our software platforms, Bedrock and Mica, enable customers
to gain competitive advantage through organized, actionable big data lakes. Serving
the Fortune 500, Zaloni has helped its customers build production implementations
at many of the world's leading companies.

To learn more about Zaloni:


Call us: +1 919.323.4050 | E-mail: info@zaloni.com | Visit: www.zaloni.com
Find Us on Social Media:

Twitter: @zaloni
