Hadoop Syllabus

Confidential
Hadoop Analytics
Analytics for Beginners
 Introduction to Big Data Business Analytics

 Applications of Analytics
 Analytics Technology and Resources
 Models and Algorithms
 Key roles for successful analytic project
 Main phases of life cycle
 State of the practice in analytics and the role of data scientists
 Developing core deliverables for stakeholders
Business Statistics
 Descriptive Statistics
 Probability and Sampling
 Inferential Statistics
 Hypothesis Testing
 Advanced Hypothesis Testing
Predictive Analytics
 Predictive modeling and Analysis - Regression Analysis

 Multicollinearity
 Correlation analysis
 Multiple correlation
 Least squares
An Introduction to R
 Analytics Tools and Exploring R

 Data Structures in R
 Data Manipulation in R
 Data frames and factors
Functions & Plots in R
 Measuring the central tendency – the mean and median
Office Address: SM Tower, 2nd Floor, Above Jijamat Sahakari Bank, Near Karve Nagar bus stop, Karve Nagar, Pune-411052
Contact no. 902871122/9975813299, www.cyberdyneitservvices.com
Contact us: info@cyberdyneitservices.com or hr@cyberdyneitservices.com
 Measuring spread – variance and standard deviation

 Visualizing numeric variables – boxplots
 Visualizing numeric variables – histograms
 Visualizing numeric variables – qqplot
 Understanding numeric data – uniform and normal distributions
 Measuring the central tendency – the mode
 Exploring relationships between variables
 Visualizing relationships – scatterplots
 Exploring numeric variables
Read and Write Operations in R
 Reading from CSV

 Reading from URL
 Reading from Excel
 Writing to CSV & PMML
Integrating R
 Implementing Association rule mining in R

 Integrating R with Hadoop using RHadoop and RMR package
 Writing MapReduce Jobs in R and executing them on Hadoop
 Implementing Machine Learning Algorithms on larger Data Sets with Apache Mahout
Databases and Introduction to Machine Learning Concepts
 Use SQL databases to store and organize data

 Access stored data with the MySQL query language
 Introduction to Machine Learning
 Supervised and Unsupervised Learning Techniques
Regression Methods and Supervised Learning Techniques
 Creating predictive models

 Classification Using Nearest Neighbors
 Linear Regression
 Multiple linear regression model
 Logistic Regression
 Decision Tree Classifier
 Clustering
 What is a Random Forest?

 Features of Random Forest
 Out-of-Bag Error Estimate
 Naive Bayes Classifier
Unsupervised Machine Learning Techniques
 Introduction to K-Means Clustering

 K-means in Euclidean space
 K-means as optimization
 Understanding TF-IDF and Cosine Similarity and their application to the Vector Space Model
Deep learning
 Deep Networks
 Optimization for Training Deep Models
 Convolutional Networks
 Understanding Support Vector Machines
 Retrieve data using SQL statements
 Using kernels for non-linear spaces
 Live Project
Hadoop 2 Development with Spark

Big Data Introduction
 What is Big Data

 Evolution of Big Data
 Benefits of Big Data
 Operational vs Analytical Big Data
 Need for Big Data Analytics
 Big Data Challenges
Hadoop cluster
 Master Nodes
o Name Node
o Secondary Name Node
o Job Tracker
 Client Nodes
 Slaves
 Hadoop configuration
 Setting up a Hadoop cluster
HDFS
 Introduction to HDFS
 HDFS Features
 HDFS Architecture
 Blocks
 Goals of HDFS
 The Name node & Data Node
 Secondary Name node
 The Job Tracker
 The Process of a File Read
 How does a File Write work
 Data Replication
 Rack Awareness
 HDFS Federation
 Configuring HDFS
 HDFS Web Interface
 Fault tolerance
 Name node failure management
 Access HDFS from Java
Yarn
 Introduction to Yarn
 Why Yarn
 Classic MapReduce v/s Yarn
 Advantages of Yarn
 Yarn Architecture
o Resource Manager
o Node Manager
o Application Master
 Application submission in YARN
 Node Manager containers
 Resource Manager components
 Yarn applications
 Scheduling in Yarn
o Fair Scheduler
o Capacity Scheduler
 Fault tolerance
MapReduce
 What is MapReduce
 Why MapReduce
 How MapReduce works
 Difference between Hadoop 1 & Hadoop 2
 Identity mapper & reducer
 Data flow in MapReduce
 Input Splits
 Relation Between Input Splits and HDFS Blocks
 Flow of Job Submission in MapReduce
 Job submission & Monitoring
 MapReduce algorithms
o Sorting
o Searching
o Indexing
o TF-IDF
Hadoop Fundamentals
 What is Hadoop
 History of Hadoop
 Hadoop Architecture
 Hadoop Ecosystem Components
 How does Hadoop work
 Why Hadoop & Big Data
 Hadoop Cluster introduction
 Cluster Modes
o Standalone
o Pseudo-distributed
o Fully-distributed
 HDFS Overview
 Introduction to MapReduce
 Hadoop in demand
HDFS Operations
 Starting HDFS
 Listing files in HDFS
 Writing a file into HDFS
 Reading data from HDFS
 Shutting down HDFS
HDFS Command Reference
 Listing contents of directory

 Displaying and printing disk usage
 Moving files & directories
 Copying files and directories
 Displaying file contents
Java Overview For Hadoop
 Object oriented concepts

 Variables and Data types
 Static data type
 Primitive data types
 Objects & Classes
 Java Operators
 Method and its types
 Constructors
 Conditional statements
 Looping in Java
 Access Modifiers
 Inheritance
 Polymorphism
 Method overloading & overriding
 Interfaces
MapReduce Programming
 Hadoop data types

 The Mapper Class
o Map method
 The Reducer Class
o Shuffle Phase
o Sort Phase
o Secondary Sort
o Reduce Phase
 The Job class
o Job class constructor
 JobContext interface
 Combiner Class
o How Combiner works
o Record Reader
o Map Phase
o Combiner Phase
o Reducer Phase
o Record Writer
 Partitioners
o Input Data
o Map Tasks
o Partitioner Task
o Reduce Task
o Compilation & Execution
Hadoop Ecosystems
Pig
 What is Apache Pig
 Why Apache Pig

 Pig features
 Where should Pig be used
 Where not to use Pig
 The Pig Architecture
 Pig components
 Pig v/s MapReduce
 Pig v/s SQL
 Pig v/s Hive
 Pig Installation
 Pig Execution Modes & Mechanisms
 Grunt Shell Commands
 Pig Latin - Data Model
 Pig Latin Statements
 Pig data types
 Pig Latin operators
 Case Sensitivity
 Grouping & Cogrouping in Pig Latin
 Sorting & Filtering
 Joins in Pig Latin
 Built-in Functions
 Writing UDFs
 Macros in Pig
HBase
 What is HBase
 History Of HBase
 The NoSQL Scenario
 HBase & HDFS
 Physical Storage
 HBase v/s RDBMS
 Features of HBase
 HBase Data model
 Master server
 Region servers & Regions
 HBase Shell
 Create table and column family
 The HBase Client API
Spark
 Introduction to Apache Spark
 Features of Spark
 Spark built on Hadoop
 Components of Spark
 Resilient Distributed Datasets
 Data Sharing using Spark RDD
 Iterative Operations on Spark RDD
 Interactive Operations on Spark RDD
 Spark shell
 RDD transformations
 Actions
 Programming with RDD
o Start Shell
o Create RDD
o Execute Transformations
o Caching Transformations
o Applying Action
o Checking output
 GraphX overview
Impala
 Introducing Cloudera Impala

 Impala Benefits
 Features of Impala
 Relational databases vs Impala
 How Impala works
 Architecture of Impala
 Components of Impala
o The Impala Daemon
o The Impala Statestore
o The Impala Catalog Service
 Query Processing Interfaces
 Impala Shell Command Reference
 Impala Data Types
 Creating & deleting databases and tables
 Inserting & overwriting table data
 Record Fetching and ordering
 Grouping records
 Using the Union clause
 Working of Impala with Hive
 Impala v/s Hive v/s HBase
MongoDB Overview
 Introduction to MongoDB
 MongoDB v/s RDBMS
 Why & Where to use MongoDB
 Databases & Collections
 Inserting & querying documents
 Schema Design
 CRUD Operations
Oozie & Hue Overview
 Introduction to Apache Oozie

 Oozie Workflow
 Oozie Coordinators
 Property File
 Oozie Bundle system
 CLI and extensions
 Overview of Hue
Hive
 What is Hive
 Features of Hive
 The Hive Architecture
 Components of Hive
 Installation & configuration
 Primitive types
 Complex types
 Built in functions
 Hive UDFs
 Views & Indexes
 Hive Data Models
 Hive vs Pig
 Co-groups
 Importing data
 Hive DDL statements
 Hive Query Language
 Data types & Operators
 Type conversions
 Joins
 Sorting & controlling data flow
 Local vs MapReduce mode
 Partitions
 Buckets
Sqoop
 Introducing Sqoop
 Sqoop installation
 Working of Sqoop
 Understanding connectors
 Importing data from MySQL to Hadoop HDFS
 Selective imports
 Importing data to Hive
 Importing to HBase
 Exporting data to MySQL from Hadoop
 Controlling import process
Flume
 What is Flume
 Applications of Flume
 Advantages of Flume
 Flume architecture
 Data flow in Flume
 Flume features
 Flume Event
 Flume Agent
o Sources
o Channels
o Sinks
 Log Data in Flume
Zookeeper Overview
 Zookeeper Introduction
 Distributed Application
 Benefits of Distributed Applications
 Why use Zookeeper
 Zookeeper Architecture
 Hierarchical Namespace
 Znodes
 Stat structure of a Znode
 Electing a leader
Kafka Basics
 Messaging Systems
o Point-to-Point
o Publish-Subscribe
 What is Kafka
 Kafka Benefits
 Kafka Topics & Logs
 Partitions in Kafka
 Brokers
 Producers & Consumers
 What are Followers
 Kafka Cluster Architecture
 Kafka as a Pub-Sub Messaging
 Kafka as a Queue Messaging
 Role of Zookeeper
 Basic Kafka Operations
o Creating a Kafka Topic
o Listing out topics
o Starting Producer
o Starting Consumer
o Modifying a Topic
o Deleting a Topic
 Integration With Spark
Scala Basics
 Introduction to Scala
 Spark & Scala interdependence
 Objects & Classes
 Class definition in Scala
 Creating Objects
 Scala Traits
 Basic Data Types
 Operators in Scala
 Control structures
 Fields in Scala
 Functions in Scala
 Collections in Scala
o Mutable collection
o Immutable collection
 Live Project
Hadoop Analytics using R (For Data Scientists)
Hadoop Administration
 Introduction to Big Data and Hadoop

 Types Of Data
 Characteristics Of Big Data
 Business Benefits Of Big Data Technology
 Hadoop And Traditional RDBMS
 Hadoop Core Services
Hadoop Installation and Configuration
 Ubuntu Server-Introduction
 Hadoop and Multi-Node Installation
 Create a Clone of Hadoop Virtual Machine
 Perform Clustering of the Hadoop Environment
Hadoop Distributed File System
 Introduction to Hadoop Distributed File System

 Goals of HDFS
 HDFS Architecture
 Design of HDFS
 Hadoop Storage Mechanism
 Measures of Capacity Execution
 HDFS Storage Architecture and Heterogeneous Storage
 HDFS Commands
The MapReduce Framework
 Understanding MapReduce
 The Map and Reduce Phase
 WordCount in MapReduce
 Running MapReduce Job
Planning Hadoop Cluster
 Architecture of Hadoop Cluster

 Workflow of Hadoop Cluster
 HDFS Writes
 Preparing for HDFS Writes
 Pipelined HDFS Write
 NameNode Functionality
 Replicating Missing Replicas
 HDFS Reads
 Factors for Planning Hadoop Cluster
 Single-Node and Multi-Node Cluster Configuration
 HDFS Block replication and rack awareness
 Topology and Components of Hadoop Cluster
Cluster Maintenance
 Checking HDFS Status

 Breaking the cluster
 Copying Data Between Clusters
 Adding and Removing Cluster Nodes
 Rebalancing the cluster
 Name Node Metadata Backup
 Cluster Upgrading
Advanced Cluster Configuration Features
 Hadoop Configuration Overview

 Types of Configuration Files
 Hadoop Cluster and Map Reduce Configuration Parameters with Values
 Hadoop Environment Setup
 Include and Exclude Configuration Files
Managing and Scheduling Jobs
 Managing Jobs
 The FIFO and Fair Schedulers
 How to stop and start jobs running on the cluster
Cluster Monitoring, Troubleshooting, and Optimizing
 General System conditions to Monitor

 Name Node and Job Tracker Web UIs
 View and Manage Hadoop's Log files
 Ganglia Monitoring Tool
 Common cluster issues and their resolutions
YARN
 Introduction to YARN
 Need for YARN
 YARN Architecture
 YARN Installation and Configuration
Extending Hadoop
 Simplifying information access

 Enabling SQL-like querying with Hive
 Installing Pig to create MapReduce jobs
 Imposing a tabular view on HDFS with HBase
 Configuring Oozie to schedule workflows
Installing and Managing Hadoop Ecosystem
 Sqoop
 Flume
 Hive
 Pig
 HBase
 Oozie
Live Project
