Confidential

Hadoop Analytics
Analytics for Beginners

 Introduction to Big Data Business Analytics
 Applications of Analytics
 Analytics Technology and Resources
 Models and Algorithms
 Key roles for a successful analytics project
 Main phases of the life cycle
 State of the practice in analytics and the role of data scientists
 Developing core deliverables for stakeholders

Business Statistics

 Descriptive Statistics
 Probability and Sampling
 Inferential Statistics
 Hypothesis Testing
 Advanced Hypothesis Testing

Predictive Analytics

 Predictive modeling and Analysis - Regression Analysis
 Multicollinearity
 Correlation analysis
 Multiple correlation
 Least squares

An Introduction to R

 Analytics Tools and Exploring R
 Data Structures in R
 Data Manipulation in R
 Dataframe factor

Functions & plots In R

 Measuring the central tendency – mean and median

Office Address: SM tower,2nd Floor, Above Jijamat sahakari bank, Near karve nagar bus stop, karve
nagar, Pune-411052 Contact no. 902871122/9975813299 , www.cyberdyneitservvices.com
Contact us- info@cyberdyneitservices.com OR hr@cyberdyneitservices.com

 Measuring spread – variance and standard deviation
 Visualizing numeric variables – boxplots
 Visualizing numeric variables – histograms
 Visualizing numeric variables – qqplot
 Understanding numeric data – uniform and normal distributions
 Measuring the central tendency – the mode
 Exploring relationships between variables
 Visualizing relationships – scatterplots
 Exploring numeric variables

Read and Write Operations in R

 Reading from CSV
 Reading from URL
 Reading from Excel
 Writing to CSV & PMML

Integrating R

 Implementing Association rule mining in R
 Integrating R with Hadoop using RHadoop and RMR package
 Writing MapReduce Jobs in R and executing them on Hadoop
 Implementing Machine Learning Algorithms on larger Data Sets with Apache Mahout

Databases and Introduction to Machine Learning Concepts

 Use SQL databases to store and organize data
 Access stored data with the MySQL query language
 Introduction to Machine Learning
 Supervised and Unsupervised Learning Techniques

Regression Methods and Supervised Learning Techniques

 Creating predictive models
 Classification Using Nearest Neighbors
 Linear Regression
 Multiple linear regression model
 Logistic Regression
 Decision Tree Classifier
 Clustering

 What is a Random Forest?
 Features of Random Forest
 Out-of-Bag Error Estimate
 Naive Bayes Classifier

Unsupervised Machine Learning Techniques

 Introduction to K-Means Clustering
 K-means in Euclidean space
 K-means as optimization
 Understanding TF-IDF and Cosine Similarity and their application to the Vector Space Model

Deep learning

 Deep Networks
 Optimization for Training Deep Models
 Convolutional Networks
 Understanding Support Vector Machines
 Retrieve data using SQL statements
 Using kernels for non-linear spaces

 Live Project

Hadoop 2 Development with Spark


Big Data Introduction

 What is Big Data
 Evolution of Big Data
 Benefits of Big Data
 Operational vs Analytical Big Data
 Need for Big Data Analytics
 Big Data Challenges

Hadoop cluster

 Master Nodes
o Name Node
o Secondary Name Node
o Job Tracker
 Client Nodes
 Slaves
 Hadoop configuration
 Setting up a Hadoop cluster

HDFS

 Introduction to HDFS
 HDFS Features
 HDFS Architecture
 Blocks
 Goals of HDFS
 The Name node & Data Node
 Secondary Name node
 The Job Tracker
 The Process of a File Read
 How does a File Write work
 Data Replication
 Rack Awareness
 HDFS Federation
 Configuring HDFS
 HDFS Web Interface

 Fault tolerance
 Name node failure management
 Access HDFS from Java

YARN

 Introduction to YARN
 Why YARN
 Classic MapReduce v/s YARN
 Advantages of YARN
 YARN Architecture
o Resource Manager
o Node Manager
o Application Master
 Application submission in YARN
 Node Manager containers
 Resource Manager components
 YARN applications
 Scheduling in YARN
o Fair Scheduler
o Capacity Scheduler
 Fault tolerance

MapReduce

 What is MapReduce
 Why MapReduce
 How MapReduce works
 Difference between Hadoop 1 & Hadoop 2
 Identity mapper & reducer
 Data flow in MapReduce
 Input Splits
 Relation Between Input Splits and HDFS Blocks
 Flow of Job Submission in MapReduce
 Job submission & Monitoring
 MapReduce algorithms
o Sorting
o Searching
o Indexing
o TF-IDF

Hadoop Fundamentals

 What is Hadoop
 History of Hadoop
 Hadoop Architecture
 Hadoop Ecosystem Components
 How does Hadoop work
 Why Hadoop & Big Data
 Hadoop Cluster introduction
 Cluster Modes
o Standalone
o Pseudo-distributed
o Fully-distributed
 HDFS Overview
 Introduction to MapReduce
 Hadoop in demand

HDFS Operations

 Starting HDFS
 Listing files in HDFS
 Writing a file into HDFS
 Reading data from HDFS
 Shutting down HDFS

HDFS Command Reference

 Listing contents of directory
 Displaying and printing disk usage
 Moving files & directories
 Copying files and directories
 Displaying file contents

Java Overview For Hadoop

 Object oriented concepts
 Variables and Data types
 Static data type
 Primitive data types
 Objects & Classes
 Java Operators
 Method and its types
 Constructors
 Conditional statements

 Looping in Java
 Access Modifiers
 Inheritance
 Polymorphism
 Method overloading & overriding
 Interfaces

MapReduce Programming

 Hadoop data types
 The Mapper Class
o Map method
 The Reducer Class
o Shuffle Phase
o Sort Phase
o Secondary Sort
o Reduce Phase
 The Job class
o Job class constructor
 JobContext interface
 Combiner Class
o How Combiner works
o Record Reader
o Map Phase
o Combiner Phase
o Reducer Phase
o Record Writer
 Partitioners
o Input Data
o Map Tasks
o Partitioner Task
o Reduce Task
o Compilation & Execution

Hadoop Ecosystems
Pig

 What is Apache Pig

 Why Apache Pig
 Pig features
 Where should Pig be used
 Where not to use Pig
 The Pig Architecture
 Pig components
 Pig v/s MapReduce
 Pig v/s SQL
 Pig v/s Hive
 Pig Installation
 Pig Execution Modes & Mechanisms
 Grunt Shell Commands
 Pig Latin - Data Model
 Pig Latin Statements
 Pig data types
 Pig Latin operators
 Case Sensitivity
 Grouping & Co Grouping in Pig Latin
 Sorting & Filtering
 Joins in Pig Latin
 Built-in Functions
 Writing UDFs
 Macros in Pig

HBase

 What is HBase
 History of HBase
 The NoSQL Scenario
 HBase & HDFS
 Physical Storage
 HBase v/s RDBMS
 Features of HBase
 HBase Data model
 Master server
 Region servers & Regions
 HBase Shell
 Create table and column family
 The HBase Client API

Spark

 Introduction to Apache Spark

 Features of Spark
 Spark built on Hadoop
 Components of Spark
 Resilient Distributed Datasets
 Data Sharing using Spark RDD
 Iterative Operations on Spark RDD
 Interactive Operations on Spark RDD
 Spark shell
 RDD transformations
 Actions
 Programming with RDD
o Start Shell
o Create RDD
o Execute Transformations
o Caching Transformations
o Applying Action
o Checking output
 GraphX overview

Impala

 Introducing Cloudera Impala
 Impala Benefits
 Features of Impala
 Relational databases vs Impala
 How Impala works
 Architecture of Impala
 Components of Impala
o The Impala Daemon
o The Impala Statestore
o The Impala Catalog Service
 Query Processing Interfaces
 Impala Shell Command Reference
 Impala Data Types
 Creating & deleting databases and tables
 Inserting & overwriting table data
 Record Fetching and ordering
 Grouping records
 Using the Union clause
 Working of Impala with Hive
 Impala v/s Hive v/s HBase

MongoDB Overview

 Introduction to MongoDB
 MongoDB v/s RDBMS
 Why & Where to use MongoDB
 Databases & Collections
 Inserting & querying documents
 Schema Design
 CRUD Operations

Oozie & Hue Overview

 Introduction to Apache Oozie
 Oozie Workflow
 Oozie Coordinators
 Property File
 Oozie Bundle system
 CLI and extensions
 Overview of Hue

Hive

 What is Hive
 Features of Hive
 The Hive Architecture
 Components of Hive
 Installation & configuration
 Primitive types
 Complex types
 Built in functions
 Hive UDFs
 Views & Indexes
 Hive Data Models
 Hive vs Pig
 Co-groups
 Importing data
 Hive DDL statements
 Hive Query Language
 Data types & Operators
 Type conversions
 Joins
 Sorting & controlling data flow
 local vs mapreduce mode
 Partitions
 Buckets

Sqoop

 Introducing Sqoop
 Sqoop installation
 Working of Sqoop
 Understanding connectors
 Importing data from MySQL to Hadoop HDFS
 Selective imports
 Importing data to Hive
 Importing to HBase
 Exporting data to MySQL from Hadoop
 Controlling import process

Flume

 What is Flume
 Applications of Flume
 Advantages of Flume
 Flume architecture
 Data flow in Flume
 Flume features
 Flume Event
 Flume Agent
o Sources
o Channels
o Sinks
 Log Data in Flume

Zookeeper Overview

 Zookeeper Introduction
 Distributed Application
 Benefits of Distributed Applications
 Why use Zookeeper
 Zookeeper Architecture
 Hierarchical Namespace
 Znodes
 Stat structure of a Znode
 Electing a leader

Kafka Basics

 Messaging Systems
o Point-to-Point
o Publish-Subscribe
 What is Kafka
 Kafka Benefits
 Kafka Topics & Logs
 Partitions in Kafka
 Brokers
 Producers & Consumers
 What are Followers
 Kafka Cluster Architecture
 Kafka as a Pub-Sub Messaging
 Kafka as a Queue Messaging
 Role of Zookeeper
 Basic Kafka Operations
o Creating a Kafka Topic
o Listing out topics
o Starting Producer
o Starting Consumer
o Modifying a Topic
o Deleting a Topic
 Integration With Spark

Scala Basics

 Introduction to Scala
 Spark & Scala interdependence
 Objects & Classes
 Class definition in Scala
 Creating Objects
 Scala Traits
 Basic Data Types
 Operators in Scala
 Control structures
 Fields in Scala
 Functions in Scala
 Collections in Scala
o Mutable collection
o Immutable collection

 Live Project

Hadoop Analytics using R (For Data Scientists)
Hadoop Administration

 Introduction to Big Data and Hadoop
 Types Of Data
 Characteristics Of Big Data
 Business Benefits Of Big Data Technology
 Hadoop And Traditional RDBMS
 Hadoop Core Services

Hadoop Installation and Configuration

 Ubuntu Server-Introduction
 Hadoop and Multi-Node Installation
 Create a Clone of Hadoop Virtual Machine
 Perform Clustering of the Hadoop Environment

Hadoop Distributed File System

 Introduction to Hadoop Distributed File System
 Goals of HDFS
 HDFS Architecture
 Design of HDFS
 Hadoop Storage Mechanism
 Measures of Capacity Execution
 HDFS Storage Architecture Heterogeneous Storage
 HDFS Commands

The MapReduce Framework

 Understanding MapReduce
 The Map and Reduce Phase
 WordCount in MapReduce

 Running MapReduce Job

Planning Hadoop Cluster

 Architecture of Hadoop Cluster
 Workflow of Hadoop Cluster
 HDFS Writes
 Preparing for HDFS Writes
 Pipelined HDFS Write
 NameNode Functionality
 Replicating Missing Replicas
 HDFS Reads
 Factors for Planning Hadoop Cluster
 Single-Node and Multi-Node Cluster Configuration
 HDFS Block replication and rack awareness
 Topology and Components of Hadoop Cluster

Cluster Maintenance

 Checking HDFS Status
 Breaking the cluster
 Copying Data Between Clusters
 Adding and Removing Cluster Nodes
 Rebalancing the cluster
 Name Node Metadata Backup
 Cluster Upgrading

Advanced Cluster Configuration Features

 Hadoop Configuration Overview
 Types of Configuration Files
 Hadoop Cluster and Map Reduce Configuration Parameters with Values
 Hadoop Environment Setup
 Include and Exclude Configuration Files

Managing and Scheduling Jobs

 Managing Jobs
 The FIFO and Fair Schedulers
 How to stop and start jobs running on the cluster

Cluster Monitoring, Troubleshooting, and Optimizing

 General System conditions to Monitor
 Name Node and Job Tracker Web UIs
 View and Manage Hadoop's Log files
 Ganglia Monitoring Tool
 Common cluster issues and their resolutions

YARN

 Introduction to YARN
 Need for YARN
 YARN Architecture
 YARN Installation and Configuration

Extending Hadoop

 Simplifying information access
 Enabling SQL-like querying with Hive
 Installing Pig to create MapReduce jobs
 Imposing a tabular view on HDFS with HBase
 Configuring Oozie to schedule workflows

Installing and Managing Hadoop Ecosystem

 Sqoop
 Flume
 Hive
 Pig
 HBase
 Oozie

Live Project
