
excel.onlineclasses@gmail.com

http://www.excelonlineclasses.co.nr/

Excel Online Classes offers the following services:
Online Training
Development
Testing
Job support
Technical Guidance
Job Consultancy
Any needs of IT Sector


HDFS

Nagarjuna K

HDFS
A distributed file system designed to run on commodity hardware.
Provides high-throughput access to application data; suitable for applications with large datasets.

Assumptions & Goals
Hardware failure
Streaming data access
Large datasets
Simple coherency model
Moving computation is cheaper than moving data

Assumptions & Goals: Hardware Failure
An HDFS instance consists of many machines, each storing part of the data.
The chance that some machine goes down cannot be avoided.
Detection of faults and automatic recovery is a core architectural goal of HDFS.
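
A rough back-of-the-envelope sketch of why failure has to be treated as normal once many machines are involved. It assumes machines fail independently and with the same probability; the function name and the example numbers are illustrative assumptions, not measurements from any real cluster.

```python
def prob_any_failure(machines: int, p_fail: float) -> float:
    """Probability that at least one of `machines` machines fails,
    assuming independent failures with the same probability `p_fail`."""
    if machines < 0 or not 0.0 <= p_fail <= 1.0:
        raise ValueError("machines must be >= 0 and p_fail must be in [0, 1]")
    # P(no machine fails) = (1 - p)^n, so P(at least one fails) = 1 - (1 - p)^n
    return 1.0 - (1.0 - p_fail) ** machines


if __name__ == "__main__":
    # Hypothetical per-machine failure probability of 0.1% per day.
    for n in (1, 100, 1000):
        print(n, "machines:", round(prob_any_failure(n, 0.001), 3))
```

Even with a small per-machine probability, the chance that something in the cluster fails grows quickly with the number of machines.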

Assumptions & Goals: Streaming Data Access
HDFS is designed for batch processing rather than interactive use.
The emphasis is on data throughput, not on low-latency data access.


Assumptions & Goals: Streaming Data Access
HDFS is built on the idea of a write-once, read-many-times pattern.
Over time, a dataset is generated and placed in HDFS.
Analysis is done on a large part of the data rather than on the first few records.
The time to read the whole dataset matters more than the time to retrieve the first or the last record.

Assumptions & Goals: Large Datasets
A typical file ranges from gigabytes to terabytes in size.


Assumptions & Goals: Simple Coherency Model
HDFS is built on the idea of a write-once, read-many-times pattern.
This assumption enables high-throughput access.


Assumptions & Goals: Moving Computation or Data?
Computation-intensive programming
Data-intensive programming


Where HDFS doesn't fit
Low-latency data access
Lots of small files
Multiple writers, arbitrary file modifications

Where HDFS doesn't fit
Low-latency data access
  High latency time
Lots of small files
  Each file (say, 10 KB in size) takes up a block in HDFS
  Compress
  All the metadata is stored in memory on the HDFS NameNode
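
The small-files problem can be made concrete with a rough estimate of the in-memory metadata (held by the NameNode in HDFS). This is only a sketch: the figure of about 150 bytes per file or block object is a commonly quoted rule of thumb and is treated here as an assumption, and the function and constant names are hypothetical.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not an exact HDFS figure


def metadata_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Rough metadata footprint: one object per file plus one per block."""
    if num_files < 0 or blocks_per_file < 0:
        raise ValueError("counts must be non-negative")
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_tb = 1024 ** 4
    small_files = -(-one_tb // (10 * 1024))       # 1 TB as 10 KB files (ceiling division)
    large_files = one_tb // (64 * 1024 * 1024)    # 1 TB as 64 MB files
    print("10 KB files:", metadata_bytes(small_files), "bytes of metadata")
    print("64 MB files:", metadata_bytes(large_files), "bytes of metadata")
```

The same amount of data costs far more metadata memory when it is spread over many tiny files, which is why lots of small files fit HDFS poorly.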

Where HDFS doesn't fit
Multiple writers, arbitrary file modifications
  A single writer writes a file in HDFS.
  Appends are allowed only at the end of the file. Multiple writers to the same file, or writes at arbitrary offsets, are not supported (currently).
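
The single-writer, append-only rule can be illustrated with a toy in-memory model. This is a sketch for intuition only: the class and method names are hypothetical and this is not the HDFS client API.

```python
class AppendOnlyFile:
    """Toy model of a file with one writer and append-only writes."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._writer = None  # name of the single current writer, if any

    def open_for_write(self, writer: str) -> None:
        # Only one writer may hold the file at a time.
        if self._writer is not None:
            raise PermissionError("file already has a writer")
        self._writer = writer

    def append(self, writer: str, chunk: bytes) -> int:
        # Data can only be added at the end; returns the new length.
        if writer != self._writer:
            raise PermissionError("not the current writer")
        self._data.extend(chunk)
        return len(self._data)

    def write_at(self, writer: str, offset: int, chunk: bytes) -> None:
        # Writes at arbitrary offsets are deliberately not supported.
        raise NotImplementedError("writes at arbitrary offsets are not supported")

    def close(self, writer: str) -> None:
        if writer != self._writer:
            raise PermissionError("not the current writer")
        self._writer = None

    def read(self) -> bytes:
        return bytes(self._data)
```

A second call to `open_for_write` while a writer is active raises `PermissionError`, and `write_at` always raises `NotImplementedError`, mirroring the two restrictions listed above.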

Blocks
A disk has a block size: the minimum amount of data that can be read or written, typically 512 bytes.
File system blocks are a few multiples of the disk block size, typically a few KB.

Blocks
In a classical file system, a single block may contain data from only a single file, which leads to internal fragmentation.
Newer file systems solve this problem by:
  block suballocation
  tail merging


Blocks
HDFS also has a block size: 64 MB.
Unlike a normal file system, a file smaller than 64 MB does not occupy 64 MB of underlying storage.
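
A minimal sketch of how a file maps onto 64 MB blocks, showing that the last (or only) block takes just the space the data needs. The helper name is hypothetical, and replication is ignored for simplicity.

```python
from typing import List

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size given above


def block_lengths(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the length in bytes of each block a file of `file_size` bytes uses."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, tail = divmod(file_size, block_size)
    # Full blocks first, then one partial block holding only the leftover bytes.
    return [block_size] * full_blocks + ([tail] if tail else [])


if __name__ == "__main__":
    mb = 1024 * 1024
    print(block_lengths(10 * 1024))   # a 10 KB file: one short block
    print(block_lengths(150 * mb))    # a 150 MB file: two full blocks and a tail
```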

Why BIG BLOCK size?
Throughput vs. latency
  Time to seek to the start of a block
  Time to read the whole block


Why BIG BLOCK size?
Seek time = 10 ms
Transfer rate (throughput) = 100 MB/s
To make the seek time 1% of the transfer time, the block size must be about 100 MB.
The default is 64 MB.
As transfer rates increase, the block size can be increased.
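
The arithmetic behind the 100 MB figure can be checked with a short sketch. The function name is hypothetical; the inputs are the seek time, transfer rate, and 1% target stated above.

```python
def block_size_mb(seek_ms: float, rate_mb_per_s: float, seek_fraction: float) -> float:
    """Block size (in MB) at which seek time is `seek_fraction` of transfer time."""
    if seek_ms <= 0 or rate_mb_per_s <= 0 or not 0 < seek_fraction <= 1:
        raise ValueError("inputs must be positive and seek_fraction in (0, 1]")
    # seek_time = seek_fraction * transfer_time  =>  transfer_time = seek_time / seek_fraction
    transfer_s = (seek_ms / 1000.0) / seek_fraction
    # block_size = transfer_rate * transfer_time
    return rate_mb_per_s * transfer_s


if __name__ == "__main__":
    # 10 ms seek, 100 MB/s transfer, seek time at 1% of transfer time
    print(block_size_mb(10, 100, 0.01), "MB")
```

With a 10 ms seek and a 1% target, the transfer must take 1 second, and at 100 MB/s that is 100 MB of data. Doubling the transfer rate doubles the block size needed to keep the same ratio.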

hadoop fsck / -files -blocks
Gives information about all the files and blocks in the file system.
  Replication: under-replicated, over-replicated, etc.
  Corrupt blocks, etc.

File Permissions on HDFS
A client's identity is determined by the user name and groups of the process it runs as.
A file system shared this way should not be used in a hostile environment.
Going forward: Kerberos authentication.

Hadoop File Systems
HDFS is just one implementation of the Hadoop file system abstraction.
org.apache.hadoop.fs.FileSystem represents a file system in Hadoop.
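
The idea of one abstract file system type with several interchangeable implementations can be sketched as follows. This is a Python analogy only: every class and method name here is hypothetical, and none of it is the org.apache.hadoop.fs.FileSystem API, which is a Java class.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class AbstractFileSystem(ABC):
    """Hypothetical stand-in for a generic file system interface."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class InMemoryFileSystem(AbstractFileSystem):
    """Keeps file contents in a dictionary."""

    def __init__(self) -> None:
        self._files = {}

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = bytes(data)

    def read(self, path: str) -> bytes:
        return self._files[path]


class LocalDirFileSystem(AbstractFileSystem):
    """Stores files under a root directory on the local disk."""

    def __init__(self, root: str) -> None:
        self._root = root

    def _full(self, path: str) -> str:
        return os.path.join(self._root, path.lstrip("/"))

    def write(self, path: str, data: bytes) -> None:
        full = self._full(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as fh:
            fh.write(data)

    def read(self, path: str) -> bytes:
        with open(self._full(path), "rb") as fh:
            return fh.read()


def copy_file(src: AbstractFileSystem, dst: AbstractFileSystem, path: str) -> None:
    """Client code written against the abstraction, not a concrete class."""
    dst.write(path, src.read(path))


if __name__ == "__main__":
    memory_fs = InMemoryFileSystem()
    memory_fs.write("/logs/a.txt", b"hello")
    with tempfile.TemporaryDirectory() as root:
        disk_fs = LocalDirFileSystem(root)
        copy_file(memory_fs, disk_fs, "/logs/a.txt")
        print(disk_fs.read("/logs/a.txt"))
```

Because `copy_file` depends only on the abstract type, it works unchanged with any implementation, which is the same reason client code written against a generic file system interface can run on HDFS or on another file system.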
