
Big Data? Big Opportunities?

David Tarrant @davetaz

Aim

Provide some practical steps to managing


big open data projects.

Outcomes

Define Big Data


Describe different approaches to managing big open
data projects
Apply enterprise tools to analyse a big dataset quickly

WHAT IS BIG DATA?

Big Data
Datasets that are too large and
complex to manipulate with
standard methods or tools.

Excel
Workbook WAS limited to 65,536 rows (2^16, aka 16-bit)

A 64-bit operating system's addressing limit is 2^64:


18,446,744,073,709,551,615
(about 18.4 quintillion)
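The two limits above can be checked with a couple of lines of Python; this is just a sanity check on the arithmetic, not part of the original deck.

```python
# Quick sanity check of the addressing limits mentioned above.

EXCEL_LEGACY_ROWS = 2**16          # 65,536 rows in pre-2007 Excel workbooks
ADDRESS_LIMIT_64BIT = 2**64 - 1    # largest value a 64-bit address can hold

print(f"{EXCEL_LEGACY_ROWS:,}")    # prints 65,536
print(f"{ADDRESS_LIMIT_64BIT:,}")  # prints 18,446,744,073,709,551,615
```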

What is big data?

Volume
Velocity
Variety
Veracity

What is big data?

Volume
We create around 4 zettabytes of data per day.
That's around 4 sextillion bytes per day
(128-bit OS required).

Velocity
Variety
Veracity

What is big data?

Volume

Velocity
The data is created quicker than
we can curate its storage.

Variety
Veracity

What is big data?

Volume
Velocity

Variety
The data is continuously changing
in structure, format and detail.

Veracity

What is big data?

Volume
Velocity
Variety

Veracity
The data quality is highly variable and
affected by changing perceptions of
truth and fact.

Big Data
Taken collectively, all digital data is big
data. Looking at a single facet, however,
may reveal a dataset that conforms to
only one or two of the Vs.

Can you name a dataset that shows the
characteristics of all 4 Vs?

A few more Vs
Value and Viability
More data does not mean better results.
In fact, often entirely the opposite is true.
Sample selection is critical to all good statistical studies.
Not being able to control selection may lead to an incorrect conclusion.
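The sampling point above can be illustrated with a toy example: an uncontrolled "convenience" selection reaches a very different conclusion from a controlled random sample. The data and numbers here are made up for illustration.

```python
# Biased vs random sampling: uncontrolled selection misleads.
import random

random.seed(42)
population = list(range(1, 1001))   # values 1..1000, true mean is 500.5

biased_sample = population[:100]    # "convenience" sample: first 100 rows only
random_sample = random.sample(population, 100)

print(sum(biased_sample) / 100)     # prints 50.5 -- far from the true mean
print(sum(random_sample) / 100)     # close to the true mean of 500.5
```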

Conclusion
The majority of datasets
are large.

Lots of rows with lots of joins that can be
processed quickly, if you know how to exploit
the available computing power.

Outcomes

Define Big Data


Discuss the different approaches to managing big open
data projects
Apply enterprise tools to process a large dataset quickly

Download the data, buy a Mac Pro


12 Cores
64GB RAM
7,500

Download the data, build a cluster


80 Cores
4TB RAM
50,000+


http://browser.primatelabs.com/geekbench3/913858

Take the data to the cloud



32 Cores
244GB RAM
$0.3293 per hour
http://aws.amazon.com/ec2/instance-types/

Separate your
pipelines

[Diagram: compute – data – compute; the data store sits between independent compute pipelines.]

http://aws.amazon.com/datasets/

Queries: Google speed

[Diagram: a web server issues a query against the data; a MAP step fans the work out across many compute nodes, and a REDUCE step combines their partial results back at the web server.]
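The map and reduce steps in the diagram can be sketched in a few lines of Python. Real services (Hadoop, Google BigQuery and the like) distribute these steps across many compute nodes; this toy word-count version runs both steps in a single process, with made-up input documents.

```python
# A minimal sketch of the map-reduce pattern: "map" fans work out
# across independent inputs, "reduce" folds the partial results
# back into a single answer.
from collections import Counter
from functools import reduce

documents = [
    "big data big opportunities",
    "open data projects",
    "big open data",
]

# MAP: each document independently produces its own word counts
# (on a cluster, each of these would run on a different node).
mapped = [Counter(doc.split()) for doc in documents]

# REDUCE: combine the per-document counts into one total.
totals = reduce(lambda a, b: a + b, mapped, Counter())

print(totals["data"])  # prints 3
print(totals["big"])   # prints 3
```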

Map-reduce services

Google BigQuery

Outcomes

Define Big Data


Discuss the different approaches to managing big open
data projects
Apply enterprise tools to process a large dataset quickly

Exercise

Visualising 6m rows of
data in Socrata.


Thank you

David Tarrant @davetaz
