Big Data

ABOUT THE BOOK

Web-scale applications like social networks, real-time analytics, or e-commerce sites deal with data
whose volume and velocity exceed the limits of traditional database systems. These applications
require architectures built around clusters of machines to store and process data of any size or
speed. Fortunately, scale and simplicity are not mutually exclusive.
This book requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with
traditional databases is helpful.
What's Inside
Introduction to big data systems
Real-time processing of web-scale data
Tools like Hadoop, Cassandra, and Storm
Extensions to traditional database skills
₹ 599 / ISBN: 9789351198062 / Pages: 328
Authors: Nathan Marz with James Warren
SUMMARY
Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware
along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable,
easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic
example, this book guides readers through the theory of big data systems, how to implement them in practice, and
how to deploy and operate them once they're built.
ABOUT THE AUTHORS

Nathan Marz is the creator of Apache Storm and the originator of the Lambda Architecture
for big data systems. James Warren is an analytics architect with a background in machine
learning and scientific computing.
/dtechpress
/dtechpress
/dreamtechpress
dreamtechpress.wordpress.com
TABLE OF CONTENTS
1 A new paradigm for Big Data
1.1 How this book is structured
1.2 Scaling with a traditional database
1.3 NoSQL is not a panacea
1.4 First principles
1.5 Desired properties of a Big Data system
1.6 The problems with fully incremental architectures
1.7 Lambda Architecture
1.8 Recent trends in technology
1.9 Example application: SuperWebAnalytics.com
1.10 Summary
PART 1 BATCH LAYER
2 Data model for Big Data
2.1 The properties of data
2.2 The fact-based model for representing data
2.3 Graph schemas
2.4 A complete data model for
2.5 Summary
3 Data model for Big Data: Illustration
3.1 Why a serialization framework?
3.2 Apache Thrift
3.3 Limitations of serialization frameworks
3.4 Summary
4 Data storage on the batch layer
4.1 Storage requirements for the master dataset
4.2 Choosing a storage solution for the batch layer
4.3 How distributed filesystems work
4.4 Storing a master dataset with a distributed filesystem
4.5 Vertical partitioning
4.6 Low-level nature of distributed filesystems
4.7 Storing the SuperWebAnalytics.com master dataset on a distributed filesystem
4.8 Summary
5 Data storage on the batch layer: Illustration
5.1 Using the Hadoop Distributed File System
5.2 Data storage in the batch layer with Pail
5.3 Storing the master dataset for
5.4 Summary
6 Batch layer
6.1 Motivating examples
6.2 Computing on the batch layer
6.3 Recomputation algorithms vs. incremental algorithms
6.4 Scalability in the batch layer
6.5 MapReduce: a paradigm for Big Data computing
6.6 Low-level nature of MapReduce
6.7 Pipe diagrams: a higher-level way of thinking about batch computation
6.8 Summary
7 Batch layer: Illustration
7.1 An illustrative example
7.2 Common pitfalls of data-processing tools
7.4 Composition
7.5 Summary
8 An example batch layer: Architecture and algorithms
8.1 Design of the SuperWebAnalytics.com batch layer
8.2 Workflow overview
8.3 Ingesting new data
8.4 URL normalization
8.5 User-identifier normalization
8.6 Deduplicate pageviews
8.7 Computing batch views
8.8 Summary
9 An example batch layer: Implementation
9.1 Starting point
9.2 Preparing the workflow
9.3 Ingesting new data
9.4 URL normalization
9.5 User-identifier normalization
9.6 Deduplicate pageviews
9.7 Computing batch views
9.8 Summary
PART 2 SERVING LAYER
10 Serving layer
10.1 Performance metrics for the serving layer
10.2 The serving layer solution to the normalization/denormalization problem
10.3 Requirements for a serving layer database
10.4 Designing a serving layer for
10.5 Contrasting with a fully incremental solution
10.6 Summary
11 Serving layer: Illustration
11.1 Basics of ElephantDB
11.2 Building the serving layer for
11.3 Summary
PART 3 SPEED LAYER
12 Realtime views
12.1 Computing realtime views
12.2 Storing realtime views
12.3 Challenges of incremental computation
12.4 Asynchronous versus synchronous updates
12.5 Expiring realtime views
12.6 Summary
13 Realtime views: Illustration
13.1 Cassandra's data model
13.2 Using Cassandra
13.3 Summary
14 Queuing and stream processing
14.1 Queuing
14.2 Stream processing
14.3 Higher-level, one-at-a-time stream processing
14.4 SuperWebAnalytics.com speed layer
14.5 Summary
15 Queuing and stream processing: Illustration
15.1 Defining topologies with Apache Storm
15.2 Apache Storm clusters and deployment
15.3 Guaranteeing message processing
15.4 Implementing the uniques-over-time speed layer
15.5 Summary
16 Micro-batch stream processing
16.1 Achieving exactly-once semantics
16.2 Core concepts of micro-batch stream processing
16.3 Extending pipe diagrams for micro-batch processing
16.4 Finishing the speed layer for
16.5 Another look at the bounce-rate-analysis example
16.6 Summary
17 Micro-batch stream processing: Illustration
17.1 Using Trident
17.2 Finishing the SuperWebAnalytics.com speed layer
17.3 Fully fault-tolerant, in-memory, micro-batch processing
17.4 Summary
18 Lambda Architecture in depth
18.1 Defining data systems
18.2 Batch and serving layers
18.3 Speed layer
18.4 Query layer
18.5 Summary
Published by:
DREAMTECH PRESS
19-A, Ansari Road, Daryaganj
New Delhi-110 002, INDIA
Tel: +91-11-2324 3463-73, Fax: +91-11-2324 3078
Email: feedback@dreamtechpress.com
Website: www.dreamtechpress.com
Distributed by:
WILEY INDIA PVT. LTD.
4435-36/7, Ansari Road, Daryaganj
New Delhi-110 002, INDIA
Tel: +91-11-4363 0000, Fax: +91-11-2327 5895
Email: csupport@wiley.com
Website: www.wileyindia.com
Regional Offices: Bangalore: Tel: +91-80-2313 2383, Fax: +91-80-2312 4319, Email: blrsales@wiley.com
Mumbai: Tel: +91-22-2788 9263, 2788 9272, Telefax: +91-22-2788 9263, Email: mumsales@wiley.com