
============================================================

BigData Analytics & Hadoop - Training - Session 1


============================================================
Trainer Name - Prashant Shantigrama
Email ID - prashanth.grama@hotmail.com
Course Contents
================
Introduction to BigData & Analytics
Introduction to Hadoop
Role of Hadoop in current architecture
Hadoop Requirements
Understanding Hadoop Generations
Creating Hadoop Lab
Understanding Hadoop Execution Modes
Creating Hadoop Standalone Setup
Creating Hadoop PseudoDistributed Setup
Creating Hadoop Fully Distributed Setup
Practice !!!
---
Understanding HDFS architecture
Understanding Replication Factor and Block Size
Working on Hadoop Administration Essentials
Setting Replication Factor
Setting Block Size
Adding a New Datanode
Decommissioning a Datanode
Essential Hadoop File System Commands
Introduction to MapReduce Paradigm
Understanding MapReduce Development in Java
Working on WordCount Program
Advanced MapReduce Program Examples (Industry-oriented Examples)
---
Introduction to Hadoop Ecosystem Components
Working on Pig
Working on Hive
Working on Sqoop
Working on Flume
Industry oriented examples using Pig & Hive
Real-time data acquisition from Twitter
---
Introduction to Hadoop V2 YARN architecture
Introduction to Cloudera OS, Hortonworks
Introduction to NoSQL Databases
Introduction to other Hadoop ecosystem components (Ambari, Oozie, Mahout, Cassandra, Spark, Storm)
================================================================================
What is BigData ??.....
201201hourly.txt -----> 520 MB .txt file
Notepad ----------- 25% of CPU, 1 GB+ of RAM --------- Failed to open the file
Wordpad ----------- 25% of CPU, 1 GB+ of RAM --------- Successfully opened the file

BigData is a term that describes a data-related problem. It is NOT a technology, technique, software, or hardware
BigData is all about the ability of the software to handle (storage, processing/retrieval) the data
Oracle - single machine license ---- 12 TB (license threshold)
MySQL ----- 18 TB
IBM Definition of BigData
=========================
1) Volume ---- Size of data
2) Velocity ---- Speed at which data is changing/generated
3) Variety ---- Types of data
        a) Structured ---- Data residing in RDBMS --- (Data is in alignment with Schema)
        b) Semi-structured ---- XML, JSON --- (Schema to handle the data is present along with data)
        c) Unstructured ---- Text, emails, multimedia .... (Schema information is NOT present)
Gartner agrees with the IBM definition of BigData ..... IF & ONLY IF there exists a 4th V
4) Veracity / Value ---- If I process this data, will I get any outcome/business value out of it? .... If YES, then we consider this BigData .... ELSE .... it is just garbage !!!!
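
The three kinds of Variety can be illustrated with a small Python sketch. The sample record and the column list are made up for illustration only; the point is where the schema lives in each case.

```python
import csv
import io
import json

# Structured: the record holds values only; the schema lives outside the data
# (in an RDBMS it is the table definition). This column list is hypothetical.
schema = ["id", "name", "salary"]
row = next(csv.reader(io.StringIO("1,steve,10000")))
structured = dict(zip(schema, row))

# Semi-structured: the field names travel together with the data.
semi_structured = json.loads('{"id": 1, "name": "steve", "salary": 10000}')

# Unstructured: free text, no schema information is present at all.
unstructured = "Steve joined last year and earns around ten thousand."

print(structured)               # values mapped using the external schema
print(sorted(semi_structured))  # field names recovered from the data itself
print(unstructured)
```

Note that the structured record cannot be interpreted without the separate column list, while the JSON record can.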
Analytics
=========
Analytics is all about processing data to find insights that generate value for the business
Value of Analytics -- enabling smarter decision making
=======================================================
--------------------------------------------------------------------------------
                                            Traditional   Targeting using
                                            Marketing     Analytics
--------------------------------------------------------------------------------
Number of Customers Targeted                10000         1000
Cost per customer Targeted (assume $2)      $2            $2
Number of Responses                         200           100
Response Rate                               2%            10%
Total Revenues (assume $100 per response)   20000         10000
Total Cost of the Campaign                  20000         2000
Total profit                                0             8000
--------------------------------------------------------------------------------
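
The figures in the campaign comparison can be re-derived with a few lines of Python. This is only a sketch of the arithmetic; the function name and parameter names are chosen here for illustration, and the $2 cost and $100 revenue are the assumptions stated in the comparison itself.

```python
def campaign(customers, responses, cost_per_customer=2, revenue_per_response=100):
    """Return response rate, revenue, cost and profit for one campaign."""
    revenue = responses * revenue_per_response
    cost = customers * cost_per_customer
    return {
        "response_rate": responses / customers,
        "revenue": revenue,
        "cost": cost,
        "profit": revenue - cost,
    }

# Inputs taken from the comparison: customers targeted and number of responses.
traditional = campaign(customers=10000, responses=200)
targeted = campaign(customers=1000, responses=100)

for label, result in (("Traditional", traditional), ("Analytics", targeted)):
    print(label, result)
```

Targeting fewer customers halves the revenue but cuts the cost by a factor of ten, which is where the profit comes from.
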
Introduction to Hadoop
======================
Hadoop is a Java-based framework used to handle BigData
Table: Employee
Schema: ID: int, name: varchar(30), salary: varchar(30)
1, steve, 10000 ------> Yes
2, Donald,Trump, 10000 ----> No (More values than the number of columns)
Hillary, 2, 10000 -----> No (data type mismatch)
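
The accept/reject decisions for the Employee records above can be sketched in Python. This is a simplified stand-in for what an RDBMS does on insert, not actual database behaviour; the `validate` function and its messages are hypothetical.

```python
def validate(line):
    """Check one comma-separated record against the Employee schema
    (ID: int, name: varchar(30), salary: varchar(30))."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != 3:
        return False, "expected 3 values, got %d" % len(values)
    emp_id, name, salary = values
    try:
        int(emp_id)
    except ValueError:
        return False, "data type mismatch: ID is not an int"
    if len(name) > 30 or len(salary) > 30:
        return False, "varchar(30) length exceeded"
    return True, "ok"

records = ["1, steve, 10000", "2, Donald,Trump, 10000", "Hillary, 2, 10000"]
results = [validate(r) for r in records]
for record, (accepted, reason) in zip(records, results):
    print(record, "->", "Yes" if accepted else "No", "(" + reason + ")")
```

Only the first record passes; the other two are rejected before they can ever be stored.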
To perform analytics.....a system should be able to
-> System should be able to accept all kinds of data (even garbage !!! value of the data not yet realized)
-> System should NOT have threshold limits (like how Oracle has 12TB limit)
-> System should be able to process all kinds of data

To achieve these objectives .......... we use Hadoop !!!!
