A data lake is a place to store practically unlimited amounts of data. It is relatively inexpensive and massively scalable. Where a data mart is like a store of bottled water, a data lake is a large body of water in a more natural state.
Cisco Confidential 1 2010 Cisco and/or its affiliates. All rights reserved.
Cisco Data Lake
March 3, 2014

Data Lake Definition
Current Hadoop Landscape
Why to Build Data Lake
Benefits
Data Lake Design
Data Lake - a place to store practically unlimited amounts of data of any format, schema and type that is relatively inexpensive and massively scalable. Data processing software like Hadoop can transform the data from its raw state to a finished product.
--Revelytix
If you think of a datamart as a store of bottled water, cleansed and packaged and structured for easy consumption, the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
--Pentaho

The difference between a data lake and a data warehouse is that in a data warehouse, the data is pre-categorized at the point of entry, which can dictate how it's going to be analyzed.
--Forbes
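The contrast drawn in the Forbes quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event records, the column list, and both function names are hypothetical and do not come from the deck. The warehouse-style loader fixes its columns at the point of entry, while the lake-style reader keeps the raw lines and applies a schema only when a consumer asks for one.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (assumed sample data).
RAW_EVENTS = [
    '{"device": "router-1", "status": "up", "latency_ms": 12}',
    '{"device": "router-2", "status": "down"}',
    '{"device": "switch-7", "status": "up", "latency_ms": 3, "site": "SJC"}',
]

# Warehouse style: the column list is decided before any data is loaded.
WAREHOUSE_COLUMNS = ("device", "status")


def load_schema_on_write(raw_lines):
    """Categorize at entry: keep only the predeclared columns, drop the rest."""
    return [
        {col: json.loads(line).get(col) for col in WAREHOUSE_COLUMNS}
        for line in raw_lines
    ]


def read_with_schema(raw_lines, fields):
    """Lake style: leave raw lines untouched and project fields at read time."""
    return [
        {name: json.loads(line).get(name) for name in fields}
        for line in raw_lines
    ]


warehouse_rows = load_schema_on_write(RAW_EVENTS)
latency_view = read_with_schema(RAW_EVENTS, ("device", "latency_ms"))
site_view = read_with_schema(RAW_EVENTS, ("device", "site"))
```

In this sketch the warehouse rows can no longer answer a latency question, because that field was discarded at load time. The raw lines can still serve both the latency view and the site view.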
Current Hadoop Landscape
[Diagram: Current Hadoop Landscape]
Data Sources: Databases (ERP, SFDC, Database N); Unstructured Data (Docs, Cases, Content; IoE, Machine Data, Clickstream)
Hadoop Platform (separate data sets per project): IB, Contracts, Hierarchies, Network Logs; IB, Cases, Hierarchies, Customer, Network Logs; Customer, Hierarchies, Cisco.com logs; Customer Network Config, Product Quality; Bookings, Hierarchies, etc.
Projects: CPAI, Collab, CSTG, Marketing, Data Science Program
Data Consumption: Service Renewal Opportunities; Marketing Campaigns; Data Science Program Exercises
Every project team spends resources in bringing its data
Difficult to track data elements availability in the platform
Redundant platform resource utilization for data acquisition & maintenance
Data quality and reliability issues
Project teams develop their data acquisition flows manually
Data Lake
[Diagram: Data Lake]
Data Sources: Databases (ERP, SFDC, Database N); Unstructured Data (Docs, Cases, Content; IoE, Machine Data, Clickstream)
Hadoop Platform, Data Lake (EDS): IB, Contracts, Cases, Hierarchies, Bookings, Customers, Supply Chain, etc.; Network Logs, Cisco.com logs, Documents, etc.; Customer Network Config, Product Quality
Data Consumption: Service Renewal Opportunities; Marketing Campaigns; Data Science Program Exercises
Consumers: CPAI, Marketing, Data Science, CSTG

Data reuse: bring data once, consumed by multiple projects
Data stored in raw format can be used by a variety of apps and tools
Automated framework can be quickly configured to get data from any source
Better resource utilization: frees resources in source systems and the Hadoop platform
Quick project deliveries
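The "bring data once, consumed by multiple projects" idea, together with a configurable acquisition framework, might look roughly like the sketch below. Everything named here is an assumption made for illustration: `SourceConfig`, `DataLake`, `fake_extract`, and the `erp_bookings` source are hypothetical and not part of the Cisco framework described in the deck.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical description of one source; field names are illustrative."""
    name: str      # logical data set name, e.g. "erp_bookings"
    kind: str      # "database" or "file"
    location: str  # table name or path in the source system


@dataclass
class DataLake:
    datasets: dict = field(default_factory=dict)  # name -> raw records
    pulls: dict = field(default_factory=dict)     # name -> extractions from source

    def ingest(self, config, extract):
        """Land a source once in raw form; later requests reuse the landed copy."""
        if config.name not in self.datasets:
            self.datasets[config.name] = list(extract(config))
            self.pulls[config.name] = self.pulls.get(config.name, 0) + 1
        return self.datasets[config.name]


def fake_extract(config):
    """Stand-in extractor; a real one would query a database or read files."""
    return [{"source": config.name, "row": i} for i in range(3)]


lake = DataLake()
bookings = SourceConfig(name="erp_bookings", kind="database", location="BOOKINGS")

marketing_rows = lake.ingest(bookings, fake_extract)  # first project triggers the pull
science_rows = lake.ingest(bookings, fake_extract)    # second project reuses the copy
```

Adding a new source in this sketch means writing one more `SourceConfig` entry rather than a new hand-built acquisition flow, and the source system is queried only once however many projects consume the data.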
High Level Data Lake Architecture
[Diagram: High Level Data Lake Architecture]
Data Sources: Databases (ERP, SFDC, Database N); Unstructured Data (Docs, Cases, Content; IoE, Machine Data, Clickstream)
Hadoop Platform, Data Lake (EDS): IB, Contracts, Cases, Hierarchies, Bookings, Customers, Supply Chain, etc.; Network Logs, Cisco.com logs, Documents, etc.
Consumers: CPAI, Marketing, Data Science
Other labels: ETL Offload; Tidal
Data Lake Load Process
Hadoop Edge Node; Data Lake Metadata (TD)

Data Lake Population and Consumption

[Diagram labels: Any Source; Structured Sources; Unstructured Sources (Docs, Cases, Content; IoE, Machine data, Clickstream); CG1; TD; Data Lake (Source Like Structure); Transformed Layer; ETL Offload (3NF Model); T L; S O R; F1 to F6]

What data model to use?
Data Lake: Source Like Structure
Processed Data: 3NF Model

What are Sources to Data Lake?
Any structured / unstructured data source

Do we build a transformed layer?
Yes

Hadoop SSOT to be computed in one place, consumed by many platforms
Functional Areas can consume from Data Lake, not allowed to share with other Functional Areas
EDS governs Data Lake and Transformed Layer

Thank you.