
UNIVERSITÀ DEGLI STUDI DI CAMERINO

SCHOOL OF ADVANCED STUDIES


DOTTORATO DI RICERCA IN SCIENZE DELL'INFORMAZIONE E SISTEMI
COMPLESSI - XXVI CICLO
SETTORE SCIENTIFICO DISCIPLINARE INF/01 INFORMATICA
SCUOLA DI SCIENZE E TECNOLOGIE

Big Data in Official Statistics

Supervisore: Alberto Polzonetti
Co-supervisore: Barbara Re
Dottorando: Carlo Vaccari

_________________________________________________________________
ANNO ACCADEMICO 2013 - 2014

UNIVERSITY OF CAMERINO
SCHOOL OF ADVANCED STUDIES
DOCTOR OF PHILOSOPHY IN INFORMATION SCIENCE AND COMPLEX
SYSTEMS - XXVI CYCLE
SCHOOL OF SCIENCE AND TECHNOLOGY

Big Data in Official Statistics

Advisor: Alberto Polzonetti
Co-advisor: Barbara Re
PhD Candidate: Carlo Vaccari

_________________________________________________________________
ACADEMIC YEAR 2013 - 2014

No! Try not. Do, or do not. There's no try.
(Yoda, Grand Master of the Jedi Order)

Abstract of the Dissertation
The explosion in the amount of available data, the so-called data deluge, is
forcing a redefinition of many scientific and technological fields, with Big
Data emerging in every domain as a potential source of data.
Official statistics institutions started some years ago to open up to external
data sources such as administrative data. The advent of Big Data is
introducing important innovations: the availability of additional external data
sources, of previously unknown size and of questionable consistency, poses
new challenges to the institutions of official statistics, imposing a general
rethinking that involves tools, software, methodologies and organizations.
The relative novelty of Big Data as a field of study requires first of all an
introductory phase, to address the problems of definition and to identify the
technologies involved and the possible fields of application.
The challenges that the use of Big Data poses to institutions that deal with
official statistics are then presented in detail, after a brief discussion of the
relationship between the new "data science" and statistics.
Although at an early stage, there is already a limited but growing number of
practical experiences in the use of Big Data as a data source for statistics by
public (and private) institutions. The review of these experiences can serve as
a stimulus to address, in a more conscious and organized way, the challenges
that the use of this data source poses to all producers of official statistics.
The worldwide spread of data sources (web, e-commerce, sensors) has also
prompted the statistical community to take joint action to tackle the complex
set of methodological, technical and legal problems. Many national statistical
institutes, together with the most prestigious international organizations, have
therefore initiated joint projects that will develop in the coming years to
address the complex issues raised by Big Data for statistical methodology and
computer technology.

Contents
List of Figures ........................................................................................................................................... 7
Chapter 1 Introduction .............................................................................................................................. 8
1.1 Data sources for official statistics ................................................................................................... 8
1.2 Research questions and Thesis contribution ................................................................................. 10
1.3 Methodology ..................................................................................................................................11
1.4 Structure of the Thesis .................................................................................................................. 14
Chapter 2: Big Data definition and characteristics ................................................................................. 16
2.1 Data deluge ................................................................................................................................... 16
2.2 Big Data definitions ...................................................................................................................... 19
2.3 Big Data technologies ................................................................................................................... 26
2.3.1 MPP Massively Parallel Processing .................................................................................... 26
2.3.2 NoSQL (column oriented databases) ..................................................................................... 28
2.3.3 Hadoop and MapReduce ........................................................................................................ 35
2.3.4 Hadoop ecosystem ................................................................................................................. 41
2.3.5 Big Data and the Cloud .......................................................................................................... 43
2.3.6 Visual analytics, Big Data and visualization .......................................................................... 46
2.4 Applications enabled by Big Data ................................................................................................. 50
2.5 Considerations on Technology and their usage in Official Statistics ............................................ 54
Chapter 3: Big Data usage in Official Statistics - challenges ................................................................. 57
3.1 Data science and statistics ............................................................................................................. 57
3.2 Challenges posed to official statistics ........................................................................................... 59
3.3 Remarks from early experiences ................................................................................................... 64
Chapter 4: First applications of Big Data in statistics. ............................................................................ 66
4.1 First experiences in Google........................................................................................................... 66
4.2 Prices ............................................................................................................................................. 71
4.2.1 Billion Prices Project and PriceStats ...................................................................................... 71
4.2.2 Netherlands experience .......................................................................................................... 76
4.2.3 Other experiences on Prices ................................................................................................... 78
4.3 Traffic data .................................................................................................................................... 80
4.3.1 Netherlands ............................................................................................................................ 80
4.3.2 Colombia ................................................................................................................................ 84
4.4 Social media data .......................................................................................................................... 84
4.5 Mobile phones data ....................................................................................................................... 87
4.5.1 Estonia.................................................................................................................................... 87
4.5.2 New Zealand .......................................................................................................................... 89
4.6 Data about Information Society .................................................................................................... 91
4.6.1 Eurostat .................................................................................................................................. 91
4.6.2 Italy ........................................................................................................................................ 92

4.7 Considerations on first examples of Big Data usage in Official Statistics ................................... 95
Chapter 5: International cooperation on Big Data in Official Statistics ................................................. 99
5.1 High Level Group ......................................................................................................................... 99
5.2 MSIS 2013 .................................................................................................................................. 102
5.3 Task Team ................................................................................................................................... 105
5.3.1 Project proposed by the task team........................................................................................ 107
5.4 Sandbox subproject ......................................................................................................................112
5.4.1 Goals of Sandbox ..................................................................................................................112
5.4.2 Specific objectives ................................................................................................................112
5.4.3 Basis for the Recommendations ............................................................................................113
5.4.4 Recommendations and Resource Requirements ...................................................................114
5.5 Recommendations on Big Data coming from international cooperation .....................................117
Conclusions and future actions ............................................................................................................. 120
Bibliography.......................................................................................................................................... 123

List of Figures
Figure 1 Methodology schema ................................................................................................................ 13
Figure 2 Growth of global storage 2005 - 2015 ...................................................................................... 17
Figure 3 The four "V" for Big Data ........................................................................................................ 19
Figure 4 Massively Parallel Processing configuration for Big Data ....................................................... 27
Figure 5 Oracle RAC configuration for Big Data ................................................................................... 28
Figure 6 A bit of fun on non-relational databases ................................................................................... 29
Figure 7 Still a bit of fun on relationships and NoSQL .......................................................................... 30
Figure 8 Schema for Columnar Databases .............................................................................................. 31
Figure 9 Graphical definition for graph Databases ................................................................................. 33
Figure 10 Graphical representation of CAP Theorem with NoSQL tools .............................................. 34
Figure 11 HDFS Hadoop File System schema ....................................................................................... 36
Figure 12 Hadoop way of working ......................................................................................................... 37
Figure 13 Schema for MapReduce way of working ............................................................................... 39
Figure 14 MapReduce algorithm example .............................................................................................. 40
Figure 15 Google Trends: searches on "Cloud Computing" and "Big Data" ......................................... 44
Figure 16 Tableplot usage for Netherlands Census data ......................................................................... 48
Figure 17 Tableplot usage in statistical production process .................................................................... 49
Figure 18 Comparison between Google Flu Trends estimate and US data on influenza........................ 67
Figure 19 Google Flu Trends updated model after H1N1 pandemic ...................................................... 69
Figure 20 Seasonal Correlation between Winter and the searches for "Pizzoccheri" word.................... 70
Figure 21 Comparison between Argentina Inflation rate and Price Index computed by Pricestats ........ 73
Figure 22 Comparison of official US inflation and Pricestats computed one......................................... 75
Figure 23 Comparison of official Argentina inflation and Pricestats computed one .............................. 76
Figure 24 Price for international flights by days before departure ......................................................... 77
Figure 25 Comparison between Tweets on Prices and Rice Price level ................................................. 79
Figure 26 Schema for Netherlands Data Warehouse for Traffic Information ......................................... 81
Figure 27 Dutch traffic time profile ........................................................................................................ 82
Figure 28 Dutch traffic: normalized number of vehicles for three length categories ............................. 83
Figure 29 Comparison between sentiment analysis from social media and consumer confidence ........ 86
Figure 30 New Zealand: usage of mobile phones for dates close to the quake ...................................... 89
Figure 31 Schema for Istat analysis of Big Data coming from the Internet ........................................... 94
Figure 32 Schema of international groups working on statistical standards......................................... 100
Figure 33 Web page of MSIS 2013 meeting ......................................................................................... 103
Figure 34 Web page for Big Data Inventory ......................................................................................... 106

Chapter 1 Introduction

1.1 Data sources for official statistics

Official statistical institutions are organized following internationally adopted
principles. The United Nations Statistical Commission1, established in 1947,
is the highest entity of the global statistical system, bringing together the
Chief Statisticians of member states from around the world. The Commission
consists of 24 member countries elected on the basis of geographical
distribution: five from Africa, four from Asia, four from Eastern Europe, four
from Latin America and seven from Western Europe and other countries.
In 1994 the United Nations Statistical Commission adopted the fundamental
principles of official statistics. These principles are:
a) Relevance, impartiality and equal access
b) Professional standards and ethics
c) Accountability and transparency
d) Prevention of misuse
e) Sources of official statistics
f) Confidentiality
g) Legislation
h) National coordination
i) Use of international standards
j) International cooperation
Here we want to focus on the fifth principle, Sources of official statistics. This
principle states that "Data for statistical purposes may be drawn from all types
of sources, be they statistical surveys or administrative records. Statistical
agencies are to choose the source with regard to quality, timeliness, costs and
the burden on respondents."
1. "United Nations Statistics Division - UN Statistical Commission." Web. December 2013. http://unstats.un.org/unsd/statcom/commission.htm
In the past, statistical organizations collected data using statistical surveys
with questionnaires filled in by respondents. At first, statistical questionnaires
were filled in by interviewers writing on paper (PAPI), then by interviewers
conducting telephone interviews (CATI) [Lyberg1995], then using computers
(CAPI) [Baker1995]; afterwards optical readers began to be used and, more
recently, questionnaires have moved to the Web (CAWI) [Couper 2001]. The
problem has always been the so-called statistical burden on respondents and
the cost of the resources involved in the process [Saralaguei 2013].
In the 1990s statistical organizations came under pressure to improve the
efficiency of the statistical production process, to make savings in costs and
staff resources. At the same time, there were growing political demands to
reduce the burden placed on the respondents to statistical surveys. Given
these pressures, statisticians were increasingly forced to consider alternatives
to the traditional survey approach as a way of collecting data.
The solution was clear: many non-statistical organizations collected data in
various forms, and although these data were rarely direct substitutes for those
collected via statistical surveys, they often offered possibilities, sometimes
through the combination of multiple sources, to replace, fully or partially,
direct statistical data collection [Brackstone 1999].
These sources are called Administrative Sources and have been defined as
follows: "Administrative source is the organisational unit responsible for
implementing an administrative regulation (or group of regulations), for
which the corresponding register of units and the transactions are viewed as a
source of statistical data."2 Starting from the 1990s, administrative sources
became central in many statistical processes and all developed countries
started using them to reduce the costs and resources devoted to statistical
surveys [Wallgren 2007].
Now we are facing a similar shift: statistical organizations are always pressed
by governments and by public opinion for costs reduction and for reduction
of statistical burden. But today we can use a new source of data: more and
more data are generated on the web and produced by sensors supported by a
huge number of electronic devices around us. The amount of data and the

2. "OECD Glossary of Statistical Terms - Administrative Source Definition." Web. December 2013. http://stats.oecd.org/glossary/detail.asp?ID=7
frequency at which they are produced have led to the concept of 'Big Data'
[Mayer-Schonberger 2013].
Many statistical organizations have already started to investigate the
possibility of using Big Data as a source to complement and support official
statistics.
The use of Big Data in official statistics presents many challenges, falling
into the following categories:
a. Legislative, i.e. with respect to the access and use of data
b. Privacy, i.e. managing public trust and acceptance of data re-use and its
link to other sources
c. Financial, i.e. potential costs of sourcing data vs. benefits
d. Management, e.g. policies and directives about the management and
protection of the data
e. Methodological, i.e. data quality and suitability of statistical methods
f. Technological, i.e. issues related to information technology.

1.2 Research questions and Thesis contribution


This thesis aims at contributing to the field of Big Data, focusing on the
usage of such data in the field of Official Statistics, i.e. statistics managed
and published by bodies such as National Statistical Institutes and
International Statistical Organizations.
The specific research questions are reported below:
1. How can the technologies used for Big Data support Official Statistics?
2. What are the current Big Data uses in statistical institutions?
3. Which challenges must be addressed when introducing Big Data in the
traditional statistical process?
4. Which considerations can be derived from analysing the first examples of
Big Data usage by Official Statistics institutions?

5. Which are the first achievements and which are the open research fields
still to be addressed?
According to the research questions, we tried to help advance research by
providing the following contributions:
1. Suggestions on the means to integrate Big Data technologies with
technologies already used in Statistical organisations;
2. Summary of examples of Big Data use in Official Statistics;
3. Possible solutions to challenges coming from Big Data usage in Official
Statistics;
4. Recommendations on the possible Big Data usage inside Official
Statistics, describing actions to be taken to maximize results and risky
actions to avoid;
5. Description of international activities recently started to address the
issues coming from Big Data with related open research fields.

1.3 Methodology
In this section the methodological approach is described, explaining how the
research work was carried out in order to answer the proposed research
questions. Figure 1 shows the methodology followed.
The first phase is the selection of the research domain and the evaluation of
context knowledge. Here the context starts from the analysis of the data
currently managed by Official Statistics institutions: mainly data coming
from surveys based on questionnaires submitted to large samples of the
population, and statistical data derived from administrative sources
(Administrative Data).
Although traditional techniques permit the production of good-quality data,
there are problems related to many factors: the cost of traditional surveys; the
time needed for statistical production in a world that increasingly requires
more timeliness for policy decisions; the growing refusal to participate in
surveys due to the increasing statistical burden; the increasing difficulty of
using CATI techniques due to the growing number of people who do not
have telephones or want to keep their numbers confidential; and the cost and
labor required to transform administrative data into quality statistical data.
In this context, Big Data can provide an additional source of investigation,
due to their availability and to the fact that they can produce relevant and
timely information. Big Data provide such information on topics like people's
connectivity, their mobility, the prices of e-commerce transactions, job-search
networks, transactions in the property market and some people's value
orientations. It is a very rich reservoir of information that Official Statistics
can also try to use: in practice, a third source of data that cannot be ignored,
besides traditional survey data and administrative data.
The use of Big Data in official statistics, however, poses technological
problems on the one hand, and on the other requires changes in skills and
new considerations about the significance of the information produced. In
fact, statisticians need to select, from the huge amount of existing data, only
those that are actually useful from the perspective of official statistics. The
main issue is how Official Statistics institutions can use Big Data while
guaranteeing the data quality today assured by traditional statistical methods,
reducing costs, improving timeliness, avoiding risks for privacy and security,
and assuring the veracity and reliability of the data.
Therefore we first describe the large increase in data availability (the so-called
data deluge) due to the growth of the Web and to the development of the
Internet of Things. Then we list the classifications and taxonomies available
in the literature for data that fall into the category of Big Data.
In order to use this valuable source of data, newly arising problems in
information technology cannot be ignored. We will therefore analyze the
main issues that Big Data pose for storage and processing, to identify the
technological challenges that must be resolved before Big Data can actually
be used as a data source. This will allow an examination of the hardware and
software technologies available today to deal with Big Data.

Figure 1 Methodology schema

Then the analysis of the first recent experiences of Big Data usage to improve
official statistics will be carried out. The uses made so far allow us to provide
a set of suggestions and considerations about the methods, organizational
activities and technologies that statistical institutions can adopt to start using
Big Data as a way to improve the quality of official statistics, followed by a
description of the internationally coordinated activities that are addressing the
research fields already open.
The data exist, and so do the technologies; what is perhaps missing is a
decision-making process and an integration of competences between
traditional approaches to the production of statistics and the opportunities
offered by Big Data. It is necessary to overcome the natural distrust of
traditional producers, helping them to address the real problems regarding the
reliability, accuracy and representativeness of the data, but also to use new
skills to take advantage of an enormous wealth of information that can be
extracted at low cost.
It is a process that official statistics already went through when it started to
use administrative sources, which today are an established part of the
production of statistical data. At least at the beginning, Big Data will probably
be used mainly to supplement and enrich the traditional sources of statistical
data, rather than replace them entirely.

1.4 Structure of the Thesis

The thesis is organized as follows:


Chapter 2 introduces the Big Data topic, summarizes the definitions
used in the literature, analyzes the technologies that are the basis of the
phenomenon and allow its success, and lists some of the areas where the
use of Big Data is most promising; finally, it gives some considerations
about the software to be used in statistical organizations and the
necessary skills;
Chapter 3 is devoted to the challenges that the introduction of Big Data
poses to official statistics: starting from the relationship between data
science and statistics, it lists the main challenges (methodological,
technological, legal, organizational, etc.) and finally analyzes the results
highlighted by the first international experiences;
Chapter 4 presents the first international experiences of Big Data usage
in statistics: starting from the first experiments in the Google world and
then passing through the various statistical domains involved (consumer
price indices, transport statistics, economic statistics using data from
social media and phone calls, statistics on the information society); at
the end we give some recommendations coming from the analysis of the
problems met in these first experiences;
Chapter 5 describes the joint initiatives on Big Data taken by
international bodies and the most advanced national statistical offices:
these actions will lead in the near future to the creation of shared
environments in which to experiment, in a coordinated way, with new
methods and new software tools on large statistical archives.

Chapter 2: Big Data definition and characteristics

In this chapter we start with the analysis of the explosion in the quantity of
information in recent years; then we list the most promising definitions of the
term Big Data and describe the most important technologies that make its
usage possible. Finally we give some considerations about the usage of
traditional statistical software tools together with Big Data technologies, the
problem of the current skills of statistical institutions' staff and some issues
about IT infrastructure.

2.1 Data deluge

In a recent article in Information Management we can read: "We create 2.5
quintillion [10^18] bytes of data every day, with 90% of the data in the world
created in the last two years alone... Every hour, Wal-Mart handles 1 million
transactions, feeding a database of 2.5 petabytes [10^15 bytes], which is
almost 170 times the data in the Library of Congress. The entire collection of
the junk delivered by the U.S. Postal Service in one year is equal to 5
petabytes, while Google processes that amount of data in just one hour. The
total amount of information in existence is estimated at a little over a
zettabyte [10^21 bytes]" [Bettino 2012].
A recent study provided to the US Government by the TechAmerica
Foundation Big Data Commission [Mills 2012] stated: "Since 2000, the
amount of information the federal government captures has increased
exponentially. In 2009, the U.S. Government produced 848 petabytes of data
and U.S. healthcare data alone reached 150 exabytes. Five exabytes (10^18
bytes) of data would contain all words ever spoken by human beings on
earth. At this rate, Big Data for U.S. healthcare will soon reach zettabyte
(10^21 bytes) scale and soon yottabytes (10^24 bytes)."
The information explosion, the so-called data deluge [Bell 2009], is forcing
us to use new units of measurement with which we have yet to become
familiar. While our personal archives are exceeding the terabyte (one trillion
bytes), we are beginning to use higher multiples.
In the following table we give the units used to measure data, starting from
the most familiar ones up to those that are needed to measure Big Data.

Value in bytes        Symbol   Name
10^9  (= 1000^3)      GB       gigabyte
10^12 (= 1000^4)      TB       terabyte
10^15 (= 1000^5)      PB       petabyte
10^18 (= 1000^6)      EB       exabyte
10^21 (= 1000^7)      ZB       zettabyte
10^24 (= 1000^8)      YB       yottabyte
Table 1: units of measure for data
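
As a small worked example of these units, the following Python sketch (an illustrative helper, not part of any statistical package discussed here) converts a raw byte count into the decimal multiples listed in Table 1.

    UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

    def human_readable(num_bytes: float) -> str:
        """Express a byte count using the decimal (powers of 1000) units of Table 1."""
        value, unit = float(num_bytes), UNITS[0]
        for unit in UNITS:
            if value < 1000 or unit == UNITS[-1]:
                break
            value /= 1000.0
        return f"{value:.2f} {unit}"

    # Example: the estimated 2.5 quintillion bytes created every day
    print(human_readable(2.5e18))   # -> "2.50 EB"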

Figure 2 Growth of global storage 2005 - 2015

All of this data would be useless if we couldn't store it, and that's where
Moore's Law [Schaller 1997] comes in. Following this law, which states that
the number of transistors on integrated circuits doubles every two years, since
the early 80s processor speed has increased from 10 MHz to 3.6 GHz - an
increase of 360 times (not counting increases in word length and number of
cores). But we've seen much bigger increases in storage capacity, on every
level. RAM has moved from $1,000/MB to roughly $25/GB, a price reduction
of about 40,000 times, together with a reduction in size and an increase in
speed. The first gigabyte disk drives appeared in 1982, weighing more than
100 kilograms; now terabyte drives are consumer equipment, and a 32 GB
microSD card weighs about half a gram. Whether you look at bits per gram,
bits per dollar, or raw capacity, storage availability has grown faster than
CPU speed.
Not only is the volume of data growing, but the nature of these data is also
changing, mainly with the wide usage of social media and of services offered
via mobile phones. The bulk of this information can be called "data exhaust",
in other words the digitally trackable or storable actions, choices, and
preferences that people generate as they go about their daily lives.3
But how big is "Big"? Even if Big Data environments of petabyte size are
reported, a survey [Devlin 2012] shows that a significant sample of
companies considers "big" a daily flow of managed data (loaded, analyzed,
processed) greater than 1 TB (normally between 1 TB and 750 TB), while a
common Big Data management system usually ranges between 110 GB and
300 TB.

3. "Word Spy - Data Exhaust." Web. December 2013. http://www.wordspy.com/words/dataexhaust.asp
2.2 Big Data definitions

Here we will try to list some of the most used definitions of Big Data. We
know that defining Big Data is not an easy task: Opentracker, a Web company
specialized in web tracking and offering a service called (Big)Data-as-a-
Service, collected more than thirty definitions of Big Data4. We can start
from the classic Wikipedia5 definition, "Big Data is a collection of data sets so
large and complex that it becomes difficult to process using on-hand database
management tools or traditional data processing applications", but many
authors adhere to the "Four Vs" definition that points to the four
characteristics of Big Data, namely volume, variety, velocity, and veracity.6
4. "Definitions of Big Data." Opentracker. Web. December 2013. http://www.opentracker.net/article/definitions-big-data
5. "Big Data." Wikipedia. Wikimedia Foundation. Web. December 2013. https://en.wikipedia.org/wiki/Big_data
6. "What is Big Data." 8digits blog. Web. December 2013. http://blog.8digits.com/what-is-big-data-2/

Figure 3 The four "V" for Big Data

The convergence of these four dimensions helps to define Big Data:

Volume (the amount of data): it refers to the mass quantities of data that
organizations are trying to use to improve decision-making processes.
Data volumes continue to increase at an unprecedented rate. However,
what constitutes truly high volume varies by industry and even
geography, and is smaller than the petabytes and zettabytes often
referenced. Many companies consider datasets between one terabyte
and one petabyte to be Big Data. Still, everyone can agree that whatever
is considered high volume today will be even higher tomorrow;
Variety (different types of data and data sources): variety is about
managing the complexity of multiple data types, including structured,
semi-structured and unstructured data. Organizations need to integrate
and analyze data from a complex array of both traditional and non-
traditional information sources, from within and outside the enterprise.
With the explosion of sensors, smart devices and social media
technologies, data is being generated in countless forms, including text,
web data, tweets, sensor data, audio, video, click streams, log files and
more;
Velocity (data in motion): the speed at which data is created, processed
and analyzed continues to accelerate. Higher velocity is due to both the
real-time nature of data creation, and the need to incorporate streaming
data into business processes. Today, data is continually being generated
at a rate that is impossible for traditional systems to capture, store and
analyze. For time-sensitive processes such as multi-channel instant
marketing, data must be analyzed in real time to be of value to the
business;
Veracity (data uncertainty): it refers to the level of reliability associated
with certain types of data. The quest for high data quality is an
important Big Data requirement and challenge, but even the best data
cleansing methods cannot remove the inherent unpredictability of some
data, like the weather, the economy, or a customer's buying decisions.
The need to acknowledge and plan for uncertainty is a dimension of Big
Data that has been introduced as executives try to better understand the
uncertain world around them.
The above-mentioned TechAmerica Foundation Big Data Commission [Mills
2012] gave this definition: "Big Data is a term that describes large volumes
of high velocity, complex and variable data that require advanced techniques
and technologies to enable the capture, storage, distribution, management,
and analysis of the information."

In the document "Big Data: The next frontier for innovation, competition,
and productivity" [Manyika 2011] the McKinsey Global Institute uses this
definition: "Big Data refers to datasets whose size is beyond the ability of
typical database software tools to capture, store, manage, and analyze. This
definition is intentionally subjective and incorporates a moving definition of
how big a dataset needs to be in order to be considered Big Data - i.e., we
don't define Big Data in terms of being larger than a certain number of
terabytes (thousands of gigabytes). We assume that, as technology advances
over time, the size of datasets that qualify as Big Data will also increase. Also
note that the definition can vary by sector, depending on what kinds of
software tools are commonly available and what sizes of datasets are common
in a particular industry. With those caveats, Big Data in many sectors today
will range from a few dozen terabytes to multiple petabytes (thousands of
terabytes)."
Tim O'Reilly gave a very short definition that perhaps includes all the others:
"Big Data is what happened when the cost of storing information became less
than the cost of making the decision to throw it away."
IDC defines them as follows: "Big Data technologies describe a new
generation of technologies and architectures, designed to economically
extract value from very large volumes of a wide variety of data, by enabling
high-velocity capture, discovery, and/or analysis" [Gantz 2011].
Many other definitions focus on the sources of Big Data, trying to list the
different types of existing data.
One possible taxonomy, coming from a consulting company [Devlin 2012]:
1. Human-sourced information: all information ultimately originates
from people. This information is the highly subjective record of human
experiences, previously recorded in books and works of art, and later in
photographs, audio and video. Human-sourced information is now
almost entirely digitized and electronically stored everywhere from
tweets to movies. Structuring and standardization (for example,
modeling) define a common version of the truth that allows the
business to convert human-sourced information to more reliable
process-mediated data. This starts with data entry and validation in
operational systems and continues with the cleansing and reconciliation
processes as data moves to Business Intelligence (Social Networks
growth);
2. Process-mediated data: business processes record and monitor
business events of interest, such as registering a customer,
manufacturing a product, taking an order, etc. The process-mediated
data thus collected is highly structured and includes transactions,
reference tables and relationships, as well as the metadata that sets its
context. Process-mediated data has long been the vast majority of what
IT managed and processed, in both operational and BI systems
(Traditional Business systems);
3. Machine-generated data: the output of sensors and machines
employed to measure and record the events and situations in the
physical world is machine-generated data, and from simple sensor
records to complex computer logs, it is well structured and considered
to be highly reliable. As sensors proliferate and data volumes grow, it is
becoming an increasingly important component of the information
stored and processed by many businesses. Its well-structured nature is
amenable to computer processing, but its size and speed are often beyond
traditional approaches, such as the enterprise data warehouse, for
handling process-mediated data; standalone high-performance relational
and NoSQL databases are regularly used (Internet of Things).
The UNECE task team on Big Data (see chapter 5) proposed another
taxonomy in 2013:
1. Social Networks (human-sourced information): this information is
the record of human experiences, previously recorded in books and
works of art, and later in photographs, audio and video. Human-sourced
information is now almost entirely digitized and stored everywhere
from personal computers to social networks. Data are loosely structured
and often ungoverned. Subcategories:
Social Networks: Facebook, Twitter, Tumblr etc.
Blogs and comments
Personal documents
Pictures: Instagram, Flickr, Picasa etc.
Videos: Youtube etc.
Internet searches
Mobile data content: text messages
User-generated maps
E-Mail
2. Traditional Business systems (process-mediated data): these
processes record and monitor business events of interest, such as
registering a customer, manufacturing a product, taking an order, etc.
The process-mediated data thus collected is highly structured and
includes transactions, reference tables and relationships, as well as
the metadata that sets its context. Traditional business data is the vast
majority of what IT managed and processed, in both operational and
Business Intelligence systems. Usually structured and stored in
relational database systems (Some sources belonging to this class
may fall into the category of "Administrative data").
Data produced by Public Agencies
Medical records
Data produced by businesses
Commercial transactions
Banking/stock records
E-commerce
Credit cards
3. Internet of Things (machine-generated data): derived from the
phenomenal growth in the number of sensors and machines used to
measure and record the events and situations in the physical world.
The output of these sensors is machine-generated data, and from
simple sensor records to complex computer logs, it is well structured.
As sensors proliferate and data volumes grow, it is becoming an
increasingly important component of the information stored and
processed by many businesses. Its well-structured nature is suitable

23
for computer processing, but its size and speed is beyond traditional
approaches.
Data from sensors
Fixed sensors
o Home automation
o Weather/pollution sensors
o Traffic sensors/webcam
o Scientific sensors
o Security/surveillance videos/images
Mobile sensors (tracking)
o Mobile phone location
o Cars
o Satellite images
Data from computer systems
Logs
Web logs
Another taxonomy, somewhat transversal, coming from the international
statistical community, can be found in the HLG document on Big Data7:
1. Administrative (arising from the administration of a program, be it
governmental or not), e.g. electronic medical records, hospital visits,
insurance records, bank records, food banks;
2. Commercial or transactional: (arising from the transaction between
two entities), e.g. credit card transactions, on-line transactions
(including from mobile devices);
3. From sensors, e.g. satellite imaging, road sensors, climate sensors;
4. From tracking devices, e.g. tracking data from mobile telephones,
GPS;

7. "UNECE Statistics Wikis." What Does Big Data Mean for Official Statistics? Web. December 2013. http://www1.unece.org/stat/platform/pages/viewpage.action?pageId=77170614
5. Behavioral, e.g. online searches (about a product, a service or any other
type of information), online page view;
6. Opinion, e.g. comments on social media.

Global Pulse8, an initiative based in the Executive Office of the United
Nations Secretary-General, has developed a loose taxonomy of the types of
new, digital data sources that are relevant to global development:
1. Data Exhaust: passively collected transactional data from people's
use of digital services like mobile phones, purchases, web searches,
etc., and/or operational metrics and other real-time data collected by
UN agencies, NGOs and other aid organisations to monitor their
projects and programmes (e.g. stock levels, school attendance); these
digital services create networked sensors of human behaviour;
2. Online Information: web content such as news media and social
media interactions (e.g. blogs, Twitter), news articles, obituaries, e-
commerce, job postings; this approach considers web usage and content
as a sensor of human intent, sentiments, perceptions, and wants;
3. Physical Sensors: satellite or infrared imagery of changing
landscapes, traffic patterns, light emissions, urban development and
topographic changes, etc.; this approach focuses on remote sensing of
changes in human activity;
4. Citizen Reporting or Crowd-sourced Data: information actively
produced or submitted by citizens through mobile phone-based surveys,
hotlines, user-generated maps, etc.

8. "United Nations Global Pulse." Home. Web. December 2013. http://www.unglobalpulse.org/
2.3 Big Data technologies

Talking about technologies enabling the use of Big Data, there are three
fundamental technological strategies for storing and providing fast access to
large data sets:
1. Improved hardware performance and capacity: use faster CPUs, use
more CPU cores (this requires parallel/threaded operations to take
advantage of multi-core CPUs), increase disk capacity and data transfer
throughput, increase network throughput (MPP);
2. Reducing the size of data accessed: data compression and data
structures that, by design, limit the amount of data required for queries
(e.g., bitmaps, column-oriented databases) (NoSQL);
3. Distributing data and parallel processing: putting data on more disks to
parallelize disk I/O, put slices of data on separate compute nodes that
can work on these smaller slices in parallel, use massively distributed
architectures with emphasis on fault tolerance and performance
monitoring with higher-throughput networks to improve data transfer
between nodes (Hadoop and MapReduce).
As keywords of these three classes of technologies we can use: MPP
(Massively Parallel Processing), NoSQL (Not Only SQL), Hadoop and
MapReduce.

2.3.1 MPP Massively Parallel Processing

The Massively Parallel Processing [DeWitt 1992] relational database
architecture spreads data over a number of independent servers, or nodes, in a
manner transparent to those using the database. In Big Data environments,
analytic MPP systems, usually called shared-nothing databases [Taniar 2008],
are often used: the nodes that make up the cluster operate independently,
communicating via a network but not sharing disk or memory resources.
With modern multi-core CPUs, MPP databases can be configured to treat
each core as a node and run tasks in parallel on a single server.
Figure 4 Massively Parallel Processing configuration for Big Data

By distributing data across nodes and running database operations across
those nodes in parallel, MPP databases are able to provide fast performance
even when handling very large data stores. The massively parallel, or
shared-nothing, architecture allows MPP databases to scale performance in a
near-linear fashion as nodes are added, i.e., a cluster with eight nodes will
run twice as fast as a cluster with four nodes for the same data.
The collection of servers that make up an MPP system is known as a cluster.
A key element in MPP performance is distributing the data evenly across
all the nodes in the cluster. This requires identifying a key whose value is
random enough that, even over time, the data does not concentrate in one
node or a subset of nodes. MPP databases have algorithms that help keep the
data distributed; using the wrong fields for distribution may lead to a skewed
data distribution and poor performance.
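
To illustrate how the choice of distribution key drives data placement, the following Python sketch (purely illustrative: the function names are invented, and real MPP products implement this logic internally) hashes a candidate key to assign rows to nodes and shows the skew produced by a low-cardinality key.

    import hashlib
    from collections import Counter

    def node_for(value: str, num_nodes: int) -> int:
        """Map a distribution-key value to a node via a stable hash."""
        digest = hashlib.md5(value.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_nodes

    def distribution_counts(rows, key, num_nodes=4):
        """Count how many rows each node would receive for a given key column."""
        counts = Counter(node_for(str(row[key]), num_nodes) for row in rows)
        return [counts.get(n, 0) for n in range(num_nodes)]

    # Toy order table: customer_id is random enough, country is not
    rows = [{"order_id": i, "customer_id": i * 7919 % 1000, "country": "IT" if i % 10 else "FR"}
            for i in range(10_000)]

    print(distribution_counts(rows, "customer_id"))  # roughly even counts across the 4 nodes
    print(distribution_counts(rows, "country"))      # rows land on at most two nodes: skewed distribution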

Figure 5 Oracle RAC configuration for Big Data

MPP databases have the advantages of being able to scale simply by adding
hardware and of using standard SQL, so that they can be easily integrated
with ETL (extract/transform/load) [Vassiliadis 2002], visualization and
display tools, without requiring the introduction of new skills.

2.3.2 NoSQL (column oriented databases)

The traditional Relational Database Management System (RDBMS)
technology [Codd 1979], which introduced standard access to data using
SQL, was developed at a time when structured information could be
accessed, categorized and normalized with relative ease. RDBMSs were
designed to meet a wide array of different query types, looking at corporate
data which was processed in a highly structured way by traditional software.
The idea of a record, with its fixed areas of data entry and limited
information types, is synonymous with this usage.
The first evolution trying to overcome the logic of RDBMSs was the
column-oriented database [Abadi 2009], in which the data is stored by
columns and, when possible, turned into bitmaps or compressed in other
ways to reduce the amount of data stored. Compressing columns reduces the
data to be stored; the combination of compressed data and retrieving only the
requested columns speeds query performance by reducing the amount of I/O
required and increasing the amount of query data that can reside in memory.

Figure 6 A bit of fun on non-relational databases

The first examples of column-oriented databases maintained standard SQL,
and already in the 90s many products were available (e.g. Sybase IQ
[MacNicol 2004]); recently, however, the NoSQL model [Cattel 2011] has
established itself, a data model that goes beyond the relational model. NoSQL
once stood for "No SQL"; today it is generally agreed that it means "not only
SQL". These are database products designed to handle extremely large data
sets.
In NoSQL databases the data schema is not fixed a priori but can be flexibly
adapted to the actual form of the data. Such flexibility is the key to reaching
high scalability: the physical data representation is simplified, few constraints
on the data exist, and data can be easily partitioned and stored on different
machines. The obvious drawback lies in the lack of strong consistency
guarantees (e.g., ACID properties and foreign key constraints are no longer
valid) and in the difficulty of performing joins between different datasets
[Stonebracker 2010].

Figure 7 Still a bit of fun on relationships and NoSQL

There are a variety of different database types that today fall within the
general NoSQL category; the most important are the following (a small
sketch contrasting the relational and document models follows the list):
Key-value systems, using a hash table with a unique key and a pointer to
a data item [Seeger 2009]. Key-value databases do not require a schema
(unlike RDBMSs) and offer great flexibility and scalability; they do not
offer ACID (Atomicity, Consistency, Isolation, Durability) capability,
and they require implementers to think about data placement, replication,
and fault tolerance, as these are not expressly controlled by the
technology itself. Key-value databases are not typed and most of the data
is stored as strings. These include Memcached, Dynamo and Voldemort.
Amazon's S3 uses Dynamo as its storage mechanism. Also used
extensively is Riak9, an open-source fault-tolerant key-value NoSQL
database;
9. "Riak Docs." Riak. Web. December 2013. http://docs.basho.com/riak/latest/
Columnar systems, used to store and process very large amounts of
data distributed over many machines [Schindler 2012]. Relational
databases are row-oriented, as the data in each row of a table is stored
together. In a columnar, or column-oriented, database, the data is stored
across rows. It is very easy to add columns, and they may be added row
by row, offering great flexibility, performance, and scalability. When
you have volume and variety of data, you might want to use a columnar
database: it is very adaptable, as you simply continue to add columns.
The most important example is Google's BigTable, where rows are
identified by a row key, with the data sorted and stored by this key.
BigTable has served as the basis of a number of NoSQL systems,
including Hadoop's Cassandra (open-sourced by Facebook), HBase
and Hypertable;

Figure 8 Schema for Columnar Databases


Document Databases: similar to key-value, but based on versioned
documents that are collections of other key-value collections. The
structure of the documents and their parts is often provided by
JavaScript Object Notation (JSON) and/or Binary JSON (BSON).
Document databases are most useful when you have to produce a lot of
reports and they need to be dynamically assembled from elements that
change frequently. The best known of these are MongoDB and
CouchDB;
Graph Database systems, where the fundamental structure is the
node-relationship [Angles 2008]. This structure is most useful when
you must deal with highly interconnected data. Nodes and relationships
support properties, key-value pairs where the data is stored. These
databases are navigated by following the relationships. This kind of
storage and navigation is not possible in RDBMSs, due to the rigid table
structures and the inability to follow connections between the data
wherever they might lead us. Examples are the open-source Neo4J, the
triplestore Allegro and Virtuoso.
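
As a small sketch of the document model mentioned above (plain Python dictionaries and JSON are used here as a schematic analogy, not the API of any specific product), the same customer can be represented either as a rigid relational row or as a schema-flexible document whose nested parts may vary from record to record.

    import json

    # Relational view: a fixed set of columns, one value per column
    relational_row = ("C042", "Acme Srl", "Camerino", "IT")   # (id, name, city, country)

    # Document view: a nested structure whose fields may differ between documents
    customer_doc = {
        "_id": "C042",
        "name": "Acme Srl",
        "address": {"city": "Camerino", "country": "IT"},
        "orders": [
            {"order_id": 1, "total": 120.50},
            {"order_id": 2, "total": 75.00, "discount": 0.1},  # extra field, no schema change needed
        ],
    }

    # Documents are typically exchanged and stored as JSON (or BSON, e.g. in MongoDB)
    print(json.dumps(customer_doc, indent=2))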

Figure 9 Graphical definition for graph Databases

As per the CAP theorem [Brewer 2000], there are three primary concerns you
must balance when choosing a data management system: consistency,
availability, and partition tolerance. The theorem states that a distributed
computer system can simultaneously provide only two of them:
Consistency means that each client always has the same view of the
data
Availability means that all clients can always read and write
Partition tolerance means that the system works well across physical
network partitions

Figure 10 Graphical representation of CAP Theorem with NoSQL tools


In the picture10 you can find the most popular NoSQL software products,
each placed in its intersection: PA (Partition-Availability), CP (Consistency-
Partition) and AC (Availability-Consistency).
10. "Amit Piplani." U Pick 2 Selection for NoSQL Providers. Web. December 2013. http://amitpiplani.blogspot.it/2010/05/u-pick-2-selection-for-nosql-providers.html
One thing that has become clear is that there is no single solution to Big Data
problems. Instead, there are a variety of different database models emerging
that are more specialized and suitable for handling specific types of
problems. For example, the columnar databases that have been popular
recently are designed for high speed access to data on a distributed basis, and
work well with MapReduce.
But document databases, such as MongoDB and CouchDB, work better with
documents, and incorporate features for high-speed, high-volume processing
of document objects. Graph databases are specialized for graph data, and
key-value databases are another high-speed processing format that is suitable
for large data sets with relatively simple characteristics.

2.3.3 Hadoop and MapReduce

Hadoop, an open source Java framework for running applications in parallel
across large clusters of commodity hardware, was created by Doug Cutting,
the developer of Lucene (search tool) and Nutch (distributed web crawler)
[Khare 2004]. Cutting was influenced by what he learned about Google File
System (GFS [Ghemawat 2003]) and MapReduce in 2004-2006, and the
Hadoop project grew out of his work on Nutch. In 2006 Doug was hired by
Yahoo!, got a team of engineers, and started the Hadoop project to give
Yahoo! the same type of distributed processing that Google was enjoying
with their MapReduce platform. Now Hadoop has become the primary
MapReduce platform used outside of Google.
Hadoop [White 2012] is today an open-source software project, managed by
the Apache Software Foundation, targeted at supporting the execution of
data-oriented applications on clusters of generic hardware. The Apache
Hadoop project includes a number of related open-source Apache projects
(see the Hadoop ecosystem below) such as Pig, Hive, Cassandra, HBase,
Avro, Chukwa, Mahout and ZooKeeper. The NoSQL databases HBase and
Cassandra are used as the database grounding for a significant number of
Hadoop projects. The Hadoop project is composed of three pieces: the
Hadoop Distributed File System (HDFS), the MapReduce model, and
Hadoop Common.

Figure 11 HDFS Hadoop File System schema

The Hadoop Distributed File System [Shafer 2010] is a distributed, scalable,
and portable file system written in Java for the Hadoop framework. A
Hadoop instance typically has a single NameNode, the core of the HDFS
file system, which keeps the directory tree of all files in the file system and
tracks where across the cluster the file data is kept. DataNodes store the data
in HDFS, and a functional file system has more than one DataNode, with
data replicated across them.
The file system uses the TCP/IP layer for communication, while clients use
Remote Procedure Calls (RPC) to communicate with each other. HDFS
stores large files (typically in the range of gigabytes to terabytes) across
multiple machines. It achieves reliability by replicating the data across
multiple hosts, and hence does not require RAID storage on the hosts. With
the default replication value (3), data is stored on three nodes: two on the
same rack, and one on a different rack. DataNodes can talk to each other to
re-balance data and to move copies around. HDFS's main feature is the
ability to scale to a virtually unlimited storage capacity by simply adding new
machines to the cluster at any time.
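
To make the default replication policy concrete, here is a schematic Python sketch (not Hadoop code: the cluster topology and the place_block helper are invented for illustration) of the placement just described, with three replicas per block, two on one rack and the third on a different rack.

    import random

    # Hypothetical cluster topology: rack id -> list of DataNodes
    CLUSTER = {
        "rack1": ["dn1", "dn2", "dn3"],
        "rack2": ["dn4", "dn5", "dn6"],
    }

    def place_block(block_id: str, replication: int = 3):
        """Mimic the default HDFS placement: two replicas on one rack, one on another."""
        first_rack, other_rack = random.sample(list(CLUSTER), 2)
        replicas = random.sample(CLUSTER[first_rack], 2) + [random.choice(CLUSTER[other_rack])]
        return {"block": block_id, "replicas": replicas[:replication]}

    # A 1 GB file split into 64 MB blocks (a classic HDFS default) gives 16 blocks
    placements = [place_block(f"blk_{i}") for i in range(16)]
    for p in placements[:3]:
        print(p)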
A good summary of the Hadoop way of working is given in the following image11.
11. Sri's Technology Blog. Web. December 2013. http://ecomcanada.wordpress.com/2012/11/14/storing-and-querying-big-data-in-hadoop-hdfs/

Figure 12 Hadoop way of working

In the early 2000s, some engineers at Google looked into the future and
determined that, while their current solutions for applications such as web
crawling, query frequency and so on were adequate for most existing
requirements, they were inadequate for the complexity they anticipated as the
web scaled to more and more users. These engineers defined a new
programming model in which the work is distributed across inexpensive
computers connected on the network in the form of a cluster.
Distribution alone was not a sufficient answer. This distribution of work must
be performed in parallel for the following three reasons:
The processing must be able to expand and contract automatically;
The processing must be able to proceed regardless of failures in the
network or the individual systems;
Developers must be able to create services that are easy for other
developers to use. Therefore, this approach must be independent of
where the data and computations are executed.
MapReduce [Dean 2004] is a programming model and an associated
implementation for processing and generating large data sets. The term
MapReduce refers to two separate and distinct tasks. The first is the map job,
which takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs). The
reduce job takes the output from a map as input and combines those data
tuples into a smaller set of tuples. As the sequence of the name MapReduce
implies, the reduce job is always performed after the map job.
Putting the Map and Reduce functions to work efficiently requires an
algorithm as well. The standard steps of a MapReduce workflow are roughly
the following:
1. Start with a large volume of data or records
2. Iterate over the data
3. Use the map function to extract something of interest and create an
output list
4. Organize the output list to optimize it for further processing
5. Use the reduce function to compute a set of results
6. Produce the final output.

Figure 13 Schema for MapReduce way of working
By analogy to SQL, the map function is like a GROUP BY clause and the reduce function is like
an aggregate function (e.g., SUM or COUNT) in an aggregate query.
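A minimal, self-contained sketch of the workflow above in plain Python, run locally rather than on a cluster; the sample records are invented and the example counts word occurrences.

```python
from collections import defaultdict

def map_phase(record):
    """Map: break a record into (key, value) tuples - here (word, 1)."""
    return [(word.lower(), 1) for word in record.split()]

def reduce_phase(key, values):
    """Reduce: combine all values observed for one key into a single result."""
    return key, sum(values)

records = ["Big Data in Official Statistics",
           "Official statistics and Big Data sources"]

# Steps 1-3: iterate over the data and apply the map function.
intermediate = []
for record in records:
    intermediate.extend(map_phase(record))

# Step 4: organize (shuffle) the output list, grouping values by key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Steps 5-6: apply the reduce function per key and produce the final output.
for key in sorted(groups):
    print(reduce_phase(key, groups[key]))
```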

To better understand the MapReduce way of working, you can think of map and
reduce tasks as the way a census was conducted in Roman times, when the
census bureau would dispatch its people to each city in the empire. Each
census taker in each city would be tasked to count the number of people in
that city and then return the results to the capital city. There, the results
from each city would be reduced to a single count (the sum over all cities) to
determine the overall population of the empire. This mapping of people to
cities, in parallel, and then combining of the results (reducing) is much more
efficient than sending a single person to count every person in the empire in a
serial fashion.

Figure 14 MapReduce algorithm example
Many real-world tasks can be easily expressed as MapReduce computations
[Gillick 2006]:
Distributed Grep: the map function emits a line if it matches a supplied
pattern. The reduce function is an identity function that just copies the
supplied intermediate data to the output;
Count of URL Access Frequency: the map function processes logs of
web page requests and outputs (URL; 1). The reduce function adds
together all values for the same URL and emits a (URL; total count)
pair (a streaming-style sketch of this example follows the list);
Inverted Index: the map function parses each document and emits a
sequence of (word; document ID) pairs. The reduce function accepts all
pairs for a given word, sorts the corresponding document IDs and emits
a (word; list(document ID)) pair. The set of all output pairs forms a
simple inverted index.
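As announced above, a hedged sketch of the second example (count of URL access frequency) written in the Hadoop Streaming style, where mapper and reducer read lines from standard input and write tab-separated key/value pairs. The assumed log format (URL in the first field) and the script name are hypothetical.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit (URL, 1) for every request found in the web server log."""
    for line in lines:
        fields = line.split()
        if fields:                       # assume the URL is the first field
            print(f"{fields[0]}\t1")

def reducer(lines):
    """Input arrives sorted by key; sum the counts for each URL."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for url, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{url}\t{total}")

if __name__ == "__main__":
    # Invoked as "python url_count.py map" or "python url_count.py reduce"
    # inside a Hadoop Streaming job, reading stdin and writing stdout.
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper(sys.stdin) if mode == "map" else reducer(sys.stdin)
```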

2.3.4 Hadoop ecosystem

In recent years open-source as well as commercial developers all over the
world have been building and testing tools to increase the adoption and
usability of Hadoop [Borthakur 2011]. Many groups of developers are
working to offer their enhancements back to the Apache project. Here we
give a first list of the components needed, around the Hadoop-MapReduce
foundation, to build and manage Big Data applications for the real world.
YARN [Vavilapalli 2013] (Yet Another Resource Negotiator) is essentially
a new release of MapReduce, called MapReduce 2.0 (MRv2) within the
Apache Foundation. The fundamental idea of MRv2 is to split up
the two major functionalities of the JobTracker, resource management
and job scheduling/monitoring, into separate daemons: a global
ResourceManager and a per-application ApplicationMaster.
The ResourceManager is a master service that controls a NodeManager on
each node of a Hadoop cluster. Included in the
ResourceManager is the Scheduler, whose sole task is to allocate system
resources to specific running applications (tasks);
HBase12 is a distributed, non-relational (columnar) database that uses
HDFS as its persistence store. It is modeled after Google BigTable and
is capable of hosting very large tables (billions of rows and millions of columns)
because it is layered on Hadoop clusters of commodity hardware.
HBase provides random, real-time read/write access to Big Data
(a minimal client sketch follows this list). HBase
is highly configurable, providing a great deal of flexibility to address
huge amounts of data efficiently;
Cassandra [Dede 2013] is another distributed database management
system designed to handle large amounts of data across many
commodity servers, providing high availability with no single point of
failure. Cassandra offers robust support for clusters spanning multiple
data-centers, with asynchronous master-less replication allowing low
latency operations for all clients;
Hive [Thusoo 2010] is a batch-oriented, data-warehousing layer built
on the core elements of Hadoop (HDFS and MapReduce). It provides
users who know SQL with a simple SQL-like implementation called
HiveQL without sacrificing access via mappers and reducers. With
Hive, you can get the best of both worlds: SQL-like access to structured
data and sophisticated Big Data analysis with MapReduce. Unlike most
data warehouses, Hive is not designed for quick responses to queries
and is best used for data mining and deeper analytics that do not require
real-time behaviors. Because it relies on the Hadoop foundation, it is
very extensible, scalable, and resilient, something that the average data
warehouse is not;
Pig [Olston 2008] was designed to make Hadoop more approachable
and usable by non-developers. Pig is an interactive, or script-based,
execution environment supporting Pig Latin, a language used to express
data flows. The Pig Latin language supports the loading and processing

12
"HBase - Apache HBase Home." HBase - Apache HBase Home. Web. December 2013.
http://hbase.apache.org/
of input data with a series of operators that transform the input data and
produce the desired output;
Sqoop13 (SQL-to-Hadoop) is a tool that offers the capability to extract
data from non-Hadoop data stores, transform the data into a form usable
by Hadoop, and then load the data into HDFS. This process is called
ETL, for Extract, Transform, and Load. While getting data into Hadoop
is critical for processing using MapReduce, it is also critical to get data
out of Hadoop and into an external data source for use in other kinds of
applications;
Zookeeper [Hunt 2010] is Hadoop's way of coordinating all the
elements of distributed applications, managing process synchronization,
configuration management and messaging between and
among the nodes (across racks).
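A minimal sketch of the random, real-time read/write access mentioned in the HBase item above, using the third-party Python client happybase over HBase's Thrift gateway; the host name, table name and column family are assumptions made for the example, and the table is assumed to exist already.

```python
import happybase

# Connect to a (hypothetical) HBase Thrift gateway.
connection = happybase.Connection("hbase-gateway.example.org")
table = connection.table("price_observations")

# Random, real-time write: one row keyed by product id and date.
table.put(b"prod-0042#2013-12-01",
          {b"cf:price": b"19.99", b"cf:shop": b"webshop-A"})

# Random, real-time read of the same row.
row = table.row(b"prod-0042#2013-12-01")
print(row[b"cf:price"])

# Scan all observations for one product.
for key, data in table.scan(row_prefix=b"prod-0042#"):
    print(key, data)

connection.close()
```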

2.3.5 Big Data and the Cloud

The Cloud has emerged as a principal facilitator of Big Data, both at the
infrastructure and at the analytic levels [Agrawal 2011]. The Cloud offers a
range of options for Big Data analysis in both public and private cloud
settings. On the infrastructure side, Cloud provides options for managing and
accessing very large data sets as well as for supporting powerful
infrastructure elements at relatively low cost.
The Cloud is particularly well suited to Big Data operations. The virtual,
adaptable, flexible, and powerful nature of Cloud certainly lends itself to the
enormous and shifting environment(s) of Big Data. Cloud architectures
consist of arrays of virtual machines that are ideal for the processing of very
large data sets, to the extent that processing can be segmented into numerous
parallel processes. This affinity was discovered at an early stage of Cloud
development, frequently leading directly to development of Hadoop clusters
that could be used for analytics.

13
"sqoop". Web. December 2013. http://sqoop.apache.org/
Figure 15 Google Trends: searches on "Cloud Computing" and "Big Data"

According to Agrawal, the nature of the Cloud makes it an ideal computing
environment for Big Data. Here are some examples:
IaaS (Infrastructure as a Service) in a public cloud: in this scenario,
companies use a public cloud provider's infrastructure for
their Big Data services without using their own physical infrastructure
[Horey 2012]. IaaS allows the creation of virtual machines with
almost limitless storage and compute power. Users can pick the
operating system that they want, and they have the flexibility to
dynamically scale the environment to meet their needs. An example
might be using the Amazon EC2 service to run a real-time predictive
model that requires data to be processed using massively parallel
processing (a minimal provisioning sketch follows this list). It might be a
service that processes big-box retail data, where users want to process
billions of pieces of click-stream data to target customers with the right ad in real time;
PaaS (Platform as a Service) in a private cloud: PaaS is an entire
infrastructure packaged so that it can be used to design, implement, and
deploy applications and services in a public or private cloud
environment [Paraiso 2012]. PaaS enables an organization to leverage
key middleware services without having to deal with the complexities
of managing individual hardware and software elements. PaaS vendors
are beginning to incorporate Big Data technologies such as Hadoop and
MapReduce into their PaaS offerings. For example, an organization
might want to build a specialized application to analyze vast amounts of
medical data. The application would make use of real-time as well as
non-real-time data, and it would require Hadoop and MapReduce for
storage and processing. What is great about PaaS in this scenario is how
quickly the application can be deployed: companies do not have to wait
for internal IT teams to get up to speed on the new technologies, and
they can experiment more liberally. Once a solid solution has been identified,
they can bring it in house when IT is ready to support it;
SaaS (Software as a Service) in a hybrid cloud: useful to analyze voice-of-the-customer
data from multiple channels [Bughin 2010]. Many
companies have come to realize that one of the most important data
sources is what the customer thinks and says about their company, their
products, and their services. Getting access to voice-of-the-customer
data can provide invaluable insights into behaviors and actions. The
value of the customers' input can be greatly enhanced by incorporating
this public data into companies' analyses. SaaS vendors provide the
platform for the analysis as well as the social media data. In addition,
businesses might utilize their enterprise CRM data in a private cloud
environment for inclusion in the analysis.
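A minimal sketch of the IaaS scenario above: provisioning a single virtual machine on Amazon EC2 with the boto3 library. The AMI identifier, instance type, key name and region are placeholders, and credentials are assumed to be configured in the environment.

```python
import boto3

# Client for the EC2 service; credentials and region come from the environment.
ec2 = boto3.client("ec2", region_name="eu-west-1")

# Request a single virtual machine; the image id and instance type are
# placeholders to be replaced with real values.
response = ec2.run_instances(
    ImageId="ami-00000000",          # hypothetical machine image
    InstanceType="m3.xlarge",        # sized for a parallel-processing node
    MinCount=1,
    MaxCount=1,
    KeyName="bigdata-analysis",      # hypothetical SSH key pair
)

instance_id = response["Instances"][0]["InstanceId"]
print("Launched instance:", instance_id)

# When the processing is finished, the capacity can be released again,
# which is what makes the environment dynamically scalable.
ec2.terminate_instances(InstanceIds=[instance_id])
```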
Cloud is likely to become increasingly important as an enabler of Big Data,
both for storage and access and for analytics. Development of Hybrid Clouds
capable of integrating public data and private corporate data is particularly
critical [Sotomayor 2009]. Most Big Data applications will depend upon the
capability to bring together external and corporate data to provide usable
information.
Additionally, processing of petabyte data archives will be highly dependent
on local storage capability, with local processing rather than large-scale
data movement and ETL (Extraction-Transformation-Loading)
processes. In this situation, in the opinion of Agrawal, the Cloud provides a
point of access as well as a mechanism for integrating private corporate data
warehouses with the processing of public data. The Cloud's virtualized architecture
enables the parallel processing needed for solving these problems, and there
will be an increasing number of SaaS solutions capable of performing the
processing and data integration tasks.

2.3.6 Visual analytics, Big Data and visualization14

According to Keim [Keim 2010], the driving vision of visual analytics is to
turn the information overload into an opportunity: just as information
visualization has changed our view on databases, the goal of visual analytics
is to make our way of processing data and information transparent for an
analytic discourse. The visualization of these processes will provide the
means of examining the actual processes instead of just the results. Visual
analytics will foster the evaluation, correction and rapid improvement of our
processes and models and ultimately the improvement of our knowledge and
our decisions.
Visual analytics provides technology that combines the strengths of human
and electronic data processing. Visualization becomes the medium of a semi-
automated analytical process, where humans and machines cooperate using
their respective, distinct capabilities for the most effective results. The user
has to be the ultimate authority in directing the analysis. In addition, the
system has to provide effective means of interaction to focus on their specific
task. In many applications, several people may work along the processing
path from data to decision. A visual representation will sketch this path and
provide a reference for their collaboration across different tasks and at
different levels of detail.
Visualization is crucial at each stage of the data scientist's work. For example, visualization is
key to data cleaning: if you want to find out just how bad your data is, try
plotting it. Visualization is also frequently the first step in analysis: when data
analysts get a new data set, they often start by making some scatter
14
The InfoVis:Wiki project provides a community platform and forum integrating recent developments
and news on all areas and aspects of Information Visualization "Main Page." InfoVis:Wiki. Web.
December 2013. http://www.infovis-wiki.net/index.php?title=Main_Page
plots, trying to get a sense of what might be interesting. Once they have
received some hints about what the data might be saying, they can go on with
more detailed analysis. There are many packages for plotting and presenting
data: to name a few, GnuPlot15 is very effective; R16 incorporates a fairly
comprehensive graphics package; Processing17 is suitable for
animations that follow the variation of a phenomenon over time; IBM's
Many Eyes18 has many interactive applications with graphic effects. On the
Flowingdata19 website we can find a lot of creative visualizations.
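The "plot first, then analyse" habit described above can be sketched in a few lines of Python with pandas and matplotlib; the data set here is synthetic and the variable names are invented for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical survey-like data set with two numeric variables and one derived from them.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=500),
    "income": rng.lognormal(mean=10, sigma=0.5, size=500),
})
df["spending"] = 0.4 * df["income"] + rng.normal(0, 2000, size=500)

# A scatter matrix gives a quick sense of distributions and relationships,
# and makes gross errors or outliers visible before any modelling is done.
pd.plotting.scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.savefig("exploration.png")
```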
In the following example (Figure 16), coming from the Netherlands' Central Bureau
of Statistics [Tennekes 2011], we can see the usage of new visualization
techniques. In the first image the function tableplot [Kwan 2009] is used to
express the characteristics of the data to final users. The data come from the Census
and cover about 16 million inhabitants. As you can see, with this technique it is
easy to understand how census variables (gender, marital status, position in
household, household size, level of education, activity status) vary with
age (left column). Color images can help give final users a
clear understanding of complex statistical phenomena.

15
"Gnuplot Homepage." Gnuplot Homepage. Web. December 2013. http://www.gnuplot.info/
16
"The R Project for Statistical Computing." The R Project for Statistical Computing. Web. December
2013. http://www.r-project.org/
17
Processing 2. Web. December 2013. http://processing.org/
18
"Try Our Featured Visualizations." Many Eyes. Web. December 2013. http://www-
958.ibm.com/software/analytics/manyeyes/
19
"Facebook Debunks Princeton Study." FlowingData. Web. December 2013. http://flowingdata.com/
Figure 16 Tableplot usage for Netherlands Census data

In Figure 17 we find another potential usage of tableplot, this time in the
production process of statistics. Using the tableplot function in the process of
checking and correcting data from statistical surveys, it is easy to highlight the
distribution of the statistical outliers and check the progress of the ongoing
process of data cleansing. Pictures explain the progressive results of the
data editing process better than many formulas could.

Figure 17 Tableplot usage in statistical production process

In the following we list the challenges that Big Data poses to
visualization tool technology [Zhang 2012]:
Semi- and Unstructured Data. The increasing speed of data
generation brings both opportunity and challenge. In particular, more
and more semi- or unstructured data are generated on-line or off-line. A
large number of data analysis and visualization techniques are available
for analyzing structured data, but methods for modeling and visualizing
semi- or unstructured data are still underrepresented. An effective
Visual Analytics system often needs to be able to handle both, and
ideally integrate the analysis of both types of data for supporting
decision making;
Advanced Visualization. Many commercial products seem slow to
integrate innovative visualization techniques. In particular, some big
software vendors tend to focus on only a small number of standard
visualization techniques, such as line charts, bar charts and scatter plots,
which have limited capability in handling large complex data. The
success of more advanced products demonstrates the possibility and
benefit of transferring technical advances developed by academic
research into industrial products;
Customizable Visualization. Given the same data and visualization
technique, different parameter settings may lead to totally different
visual representations and give people different visual impressions.
Designing customizable visualization functions leaves the user the
freedom to change visual parameter settings and more opportunities to
gain insight from the visualization;
Real Time Analysis. More and more data are generated in real-time on
the Internet (e.g. online news streams, twitter streams, weblogs) or by
equipment or devices (e.g. sensors, GPS, satellite cameras). If analysis
is applied appropriately, these data provide rich information resources
to many tasks. Therefore, improving analytical capability to handle
such data is a development opportunity in current commercial products.
We expect to see more functionality in this respect in the future;
Predictive Analysis. The demand for predictive modeling is increasing,
especially in the business domain, but only very few systems support
predictive analysis, and even in those that do, not many predictive
modeling methods are implemented.

2.4 Applications enabled by Big Data

Two different lists follow: the first with the business applications that can
be enabled by Big Data, and the second with possible uses of Big Data in public
administration [Cohen 2009] [LaValle 2011] [Brown 2011] [Manyika 2011].
Types of business applications that can be directly enabled by Big Data:
Marketing: revenue generation and business model development,
particularly in retail and consumer packaged goods, where there is
direct or indirect interaction with large consumer markets, moves to a
new level;
Cost containment in real-time becomes viable as electronic event
monitoring from automobiles to smartphones, fraud detection in
financial transaction data and more expands to include larger volumes
of often smaller size or value messages on ever-shorter timescales. Big
Data analysis techniques on streaming data, before or without storing it
on disk, have become the norm, enabling faster reaction to specific
problems before they escalate into major situations;
Real-time forecasting becomes possible as utilities, such as water and
electricity supply and telecommunications, move from measuring
consumption on a macro- to a micro-scale using pervasive sensor
technology and Big Data processes to handle it;
Tracking of physical items by manufacturers, producers and distributors
- everything from food items to household appliances and from parcel
post to container shipping - through distribution, use and even disposal
drives deep optimization of operational processes and enables improved
customer experiences. People, as physical entities, are also subject to
tracking for business reasons or for surveillance;
Reinventing business processes through innovative use of sensor-
generated data offers the possibility of reconstructing entire industries.
Automobile insurance, for example, can set premiums based on actual
behavior rather than statistically averaged risk. The availability of
individual genomic data and electronic medical records presents the
medical and health insurance industries with significant opportunities,
not to mention ethical dilemmas.
As far as the public sector is concerned, there are many possibilities to use Big
Data to address the mission of government:
Healthcare Quality and Efficiency: health expenditures represent a
growing component of gross domestic product and chronic diseases,
such as diabetes, are increasing in prevalence and consuming a greater
percentage of healthcare resources. The increased use of electronic
health records (EHRs) coupled with new analytics tools presents an
opportunity to mine information for the most effective outcomes across
large populations. Using carefully de-identified information,
researchers can look for statistically valid trends and provide
assessments based upon true quality of care;
Healthcare Early Detection: Big Data in health care may involve using
sensors in the hospital or home to provide continuous monitoring of key
biochemical markers, performing real-time analysis on the data as it
streams from individual high-risk patients to central systems. The
analysis system can alert specific individuals and their chosen health
care provider if the analysis detects a health anomaly requiring a visit
to their provider, or an emergency event about to happen. This has the
potential to extend and improve the quality of millions of citizens' lives;
Transportation: Big Data has the potential to transform transportation in
many ways. Traffic jams in many countries waste energy, contribute to
global warming and cost individuals time and money. Distributed
sensors on handheld devices, on vehicles, and on roads can provide
real-time traffic information that is analyzed and shared. This
information, coupled with more autonomous features in cars can allow
drivers to operate more safely and with less disruption to traffic flow;
Education: Big Data can have a profound impact on education. For
example, through in-depth tracking and analysis of on-line student
learning activities, with fine-grained analysis down to the level of
mouse clicks, researchers can ascertain how students learn and the
approaches that can be used to improve learning. This analysis can be
done across thousands of students rather than through small isolated
studies;
Fraud Detection in Healthcare Benefit Services: Big Data can transform
improper payment detection and fundamentally change the risk and
return perceptions of individuals that currently submit improper,
erroneous or fraudulent claims. This challenge is an opportunity to
explore a use case for applying Big Data technologies and techniques
to perform unstructured data analytics on medical documents to
improve efficiency in mitigating improper payments. Automating the
improper payment process and utilizing Big Data tools, techniques and
governance processes would result in greater improper payment
prevention or recovery;
Fraud Detection in Tax Collection: by increasing the ability to quickly
spot anomalies, government collection agencies can lower the tax gap
(the difference between what taxpayers owe and what they pay
voluntarily) and profoundly change the culture of those that would
consider attempting improper tax filings. Big Data offers the ability to
improve fraud detection and uncover noncompliance at the time tax
returns are initially filed, reducing the issuance of questionable refunds;
Weather: the ability to better understand changes in the frequency,
intensity, and location of weather and climate can benefit millions of
citizens and thousands of businesses that rely upon weather, including
farmers, tourism, transportation, and insurance companies. Weather and
climate-related natural disasters result in tens of billions of dollars in
losses every year and affect the lives of millions of citizens. New
sensors and analysis techniques hold the promise of developing better
long term climate models and nearer term weather forecasts;
New ways of combining information: Big Data provides an opportunity
to develop thousands of new innovative ways of developing solutions
for citizens. For example, Big Data may be useful in helping
unemployed job seekers find new opportunities by combining their job
qualifications with an analysis and mining of available job opportunities
that are posted on the Internet (e.g., company Websites, commercial
postings).

2.5 Considerations on Technology and its usage in Official Statistics

In this paragraph, considerations on the usage of Big Data technologies are
presented from the following points of view:
the use of traditional instruments (statistical software);
the necessary skills;
the problem of IT infrastructure.
Traditional statistical software and Big Data - Software for statistical
analysis is routinely used inside official statistical institutions: in this sector,
packages are available both from proprietary suppliers and from open-source
projects. The most used tools today are R, an open-source package, and
proprietary tools such as SAS and SPSS. All these software packages
have by now developed interfaces toward the Hadoop-MapReduce environment.
In the R project many packages have been developed to use R functions on Big
Data managed in the Hadoop MapReduce environment. We can cite, for
example, the rhipe [Guha 2012] package, which provides an interface between
R and Hadoop for the analysis of large complex data, the rmr [Prajapati 2013]
package by Revolution Analytics, which provides an interface between R and
Hadoop for a Map/Reduce programming framework, and other packages such as
Segue for R20, RProtoBuf [Eddelbuettel 2014] and HistogramTools
[Stokely 2013]. A different approach has been followed by the developers of
pbdR, a series of R packages which integrate mature results from the
parallel computing community into R [Ostrouchov 2013]. In many
research institutions R is now considered an indispensable tool for managing
big data, including non-statistical data.
SAS support for Hadoop can be summarized in two ideas: i) Hadoop data can
be managed using SAS software and ii) SAS Analytics functionality has
been extended to Hadoop21. Just as with other data sources, data stored in
20
"Segue for R:." Segue - An R Language Segue into Parallel Processing on Amazon's Web Services.
Includes a Parallel Lapply Function for the Elastic Map Reduce (EMR) Engine. Web. December 2013.
http://code.google.com/p/segue/
21
"Big Data Insights." Big Data. Web. 31 Dec. 2013. http://www.sas.com/en_us/insights/big-data.html
Hadoop can be consumed across the SAS software stack in a transparent way.
This means that analytic tools such as SAS Enterprise Miner, tools relating to
data management such as SAS Data Integration Studio, and also foundation
tools such as Base SAS, can be used to work with Hadoop data.
Four major software components of SPSS22 can be integrated with big data.
SPSS Modeler is a data mining workbench for analyzing data (also stored in
Hadoop HDFS) and developing predictive models. SPSS Analytic Server
manages access to Hadoop data sources and orchestrates the running of
Modeler streams in Hadoop. Modeler operations run as MapReduce jobs in
Hadoop providing high performance and scalability. SPSS Collaboration and
Deployment Services provides an interface to schedule batch jobs that use
NoSQL databases and Hadoop data sources. Finally, SPSS Analytic Catalyst
analysis runs in Hadoop: data source connectivity to existing data in Hadoop
is provided by the SPSS Analytic Server.
The availability of interfaces to Hadoop MapReduce for the most used
statistical platforms can significantly contribute to the adoption of big data
techniques within statistical institutes, providing the ability to use well-known
software tools, to re-use previously developed procedures and to reduce
the need for additional training for statisticians.
New skills needed - Statistical production from Big Data will also require a
new skill set for staff in statistical organizations. These skills will be needed
throughout the organization, from managers to methodologists to IT staff.
Currently, analysts in statistical organizations are not programmers, so they
cannot assess the validity of particular program routines; the programmers
are not methodologists or mathematicians, and so they do not have the
requisite background knowledge to code a new statistical routine or
evaluate an existing one. In short, statistical organizations have no data
scientists [Davenport 2012].
The data science associated with Big Data that is emerging in the private
sector does not seem to have connected yet with the official statistics
community. Statistical organizations should perform in-house and national

22
"Apply SPSS Analytics Technology to Big Data." Apply SPSS Analytics Technology to Big Data.
Web. 31 Dec. 2013. http://www.ibm.com/developerworks/library/bd-spss/
searches across different communities (academic, public and private) to identify
where data scientists are and connect them to the area of official statistics.
To meet these needs, statistical organizations may be interested in recruiting
people from experimental physics, researchers in physical or social sciences,
or other fields with strong data and computational focus. There are also
opportunities for statistical organizations to work with academics and
organizations that could provide the necessary expertise.
In the long term, perhaps the official statistical community could organise
data science training and lectures using existing Advanced Schools in
Statistics in connection with Big Data players (like Google, Facebook,
Apache foundation) that would lead to a kind of certification in data science.
Considerations on IT infrastructure - As far as the organization of IT in
statistical institutions is concerned, if the volume and velocity of the data are
significantly greater than those of traditional processing, consideration needs to be
given to the cost-benefit of further enhancing the IT infrastructure once an
understanding of Big Data processing has been developed. The key bottleneck
points with respect to volume and velocity can be:
the capacity of the NSI to receive the data (bandwidth)
the capacity of the NSI to catalogue and organise the data for the Big Data
processing environment
the capacity of the Big Data processing environment to process the
buckets of data in a sufficiently timely manner.
Two possible options to consider for Big Data processing are
outsourcing and/or downsampling.
Given the volumes of data involved, consideration should be given to
whether it is necessary to retain the data in any form once they are no longer
required. If there is a requirement to store the raw data, options
that could be considered to reduce the volume to be stored include
sampling or simply retaining a moving window of data (say, the most recent
weeks or months of data), as sketched below.
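A minimal sketch of the two retention strategies just mentioned, random sampling (via reservoir sampling) and a moving window of recent records, in plain Python; stream length and window size are arbitrary.

```python
import random
from collections import deque

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = item        # replace with decreasing probability
    return sample

def moving_window(stream, size):
    """Retain only the most recent `size` records (e.g. the last weeks of data)."""
    window = deque(maxlen=size)
    for item in stream:
        window.append(item)
    return list(window)

stream = range(1_000_000)                    # stand-in for an incoming Big Data feed
print(len(reservoir_sample(stream, 1000)))   # 1000 randomly retained records
print(moving_window(stream, 1000)[:3])       # start of the 1000 most recent records
```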

Chapter 3: Big Data usage in Official Statistics -
challenges

This chapter starts by analyzing the relationship between data science and
statistics. Then the main challenges that Big Data poses to Official Statistics
are listed, distinguished into legislative, privacy, financial, management,
technological and methodological challenges. Finally, we list some remarks highlighted
by the first international experiences.

3.1 Data science and statistics

In recent years, following the explosion of Big Data, many studies started
using the term Data Science, referring to an emerging area of work
concerned with the collection, preparation, analysis, visualization,
management, and preservation of large collections of information.
Data science includes a family of disciplines, one of the most important of
which is statistics. In the opinion of Kirk Borne, an influential data scientist
[Borne 2013], some Big Data users are tempted to do without the key tenets
of statistical reasoning. One reason for this may be that Big Data offers a
convenient path around statistical rigor, since there are so many possible
results and discoveries in large data collections that there is apparently no
need to use the mathematical complexity of statistics.
People could think that if they now have enough data to do 1000-fold cross-
validation or 1000-variable models with millions of training samples, then
statistics must become irrelevant.
Borne mentions four foundational statistical truisms (obvious, self-evident
truths) that are at risk in the age of Big Data:
1. Correlation does not imply Causation - Everyone knows this, but
many choose to ignore it. People may think that this fundamental tenet
of statistics is no longer an important concept when working with Big
Data, since huge numbers of correlations can be discovered now in
massive data collections, and some of these correlations must have a
causal relationship, which should be good enough. The search for
patterns, trends, correlations, and associations in data without
preconceived models is one of the major use cases of Big Data
[McAfee 2012]: correlation mining and discovery. In fact, finding the
causes of observed effects would truly be a gold mine of value for any
business, science, government, healthcare, or security group that is
analyzing Big Data;
2. Sample variance does not go to zero, even with Big Data -
Researchers are familiar with the concept of statistical noise and how
noise decreases as sample size increases. But sample variance is not the
same thing as noise [Allison 2001]. The former is a fundamental
property of the population, while the latter is a property of the
measurement process. The final error in our predictive models is likely
to be irreducible beyond a certain threshold: this is the intrinsic sample
variance. For complex multivariate models, the bigger the sample, the
more accurate your estimate of the variance in the different
parameters representing the population will be; that variance is one of the
fundamental characteristics of the population. In any domain, as you
collect more data on the various members of the population, you can
make better and better estimates of the fundamental statistical
characteristics of the population, including the variance in each
property across the different classes (a small numerical sketch follows
this list). Statistical truism #2 fulfills one of
the big promises of Big Data: obtaining the best-ever statistical
estimates of the parameters describing the data distribution of the
population;
3. Sample bias does not necessarily go to zero, even with Big Data -
The tendency to ignore this principle of statistics occurs particularly
when we have biased data collection methods or when our models are
under-fitted, which is usually a consequence of poor model design and
thus independent of the quantity of data in hand. As Albert Einstein
said, "models should be as simple as possible, but no simpler". In the
era of Big Data, it is still possible to settle for a simple predictive model
that ignores many relevant patterns in the data collection. Another
situation in which bias does not evaporate as the data sample gets larger
occurs when correlated factors are present in an analysis that incorrectly
assumes statistical independence. Statistical truism #3 warns us that just
because we have Big Data does not mean that we have properly applied
those data to our modeling efforts;
4. Absence of Evidence is not the same as Evidence of Absence - In the
era of Big Data, we easily forget that we haven't yet measured
everything. Even with the prevalence of data everywhere, we still
haven't collected all possible data on a particular subject. Consequently,
statistical analyses should be aware of and make allowances for missing
data (absence of evidence), in order to avoid biased conclusions. On the
contrary, "evidence of absence" is a very valuable piece of information,
if you can prove it. A dramatic example of a failure to appreciate this
statistical concept is the Shuttle Challenger disaster in 1986, when
engineers assumed that the lack of evidence of O-ring failures during
cold weather launches was equivalent to evidence that there would be
no O-ring failure during a cold-weather launch [Casella 1999]. This is
an extreme case, but neglect of statistical truism #4 is still an example
of fallacious reasoning in the era of Big Data that we should avoid.
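As anticipated in truism #2, the following small numerical sketch (with synthetic data) shows that the estimate of the population standard deviation stabilises around its true value as the sample grows, while only the standard error of the mean keeps shrinking.

```python
import numpy as np

rng = np.random.default_rng(42)
true_sigma = 5.0                         # population standard deviation

for n in (100, 10_000, 1_000_000):
    sample = rng.normal(loc=50.0, scale=true_sigma, size=n)
    sd = sample.std(ddof=1)              # estimate of the population sigma
    se_mean = sd / np.sqrt(n)            # standard error of the sample mean
    print(f"n={n:>9}: estimated sigma={sd:.3f}  std. error of mean={se_mean:.4f}")

# The estimated sigma converges to 5.0 (it does not vanish with more data),
# while the standard error of the mean keeps shrinking as n grows.
```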

3.2 Challenges posed to official statistics

The international statistical community started moving towards a common
vision on Big Data in 2012. At a Seminar on Streamlining Statistical
Production and Services, held in St Petersburg on 3-5 October 2012,
participants asked for "a document explaining the issues surrounding the use
of Big Data in the official statistics community". They wanted a document
with a strategic focus, aimed at heads and senior managers of statistical
organisations.
To produce the document, the High-Level Group for the Modernisation of
Statistical Production and Services (HLG23) established an informal Task
Team of experts, coordinated by the UNECE Secretariat.

23
"UNECE Statistics Wikis." High-Level Group for the Modernisation of Statistical Production and
Services. Web. December 2013. http://www1.unece.org/stat/platform/display/hlgbas/High-
The first major analysis focuses on changes to the political role of the NSOs
(National Statistical Organizations) that may result from the growing
influence of private producers of Big Data. We can read in the document24:
Big Data has the potential to produce more relevant and timely statistics
than traditional sources of official statistics. Official statistics has been based
almost exclusively on survey data collections and acquisition of
administrative data from government programs, often a prerogative of
National Statistical Organizations (NSOs) arising from legislation. But this is
not the case with Big Data where most data are readily available or with
private companies. As a result, the private sector may take advantage of the
Big Data era and produce more and more statistics that attempt to beat
official statistics on timeliness and relevance.
It is unlikely that NSOs will lose the "official statistics" trademark but they
could slowly lose their reputation and relevance unless they get on board.
One big advantage that NSOs have is the existence of infrastructures to
address the accuracy, consistency and interpretability of the statistics
produced. By incorporating relevant Big Data sources into their official
statistics process, NSOs are best positioned to measure their accuracy, ensure
the consistency of the whole system of official statistics and provide
interpretation, while constantly working on relevance and timeliness. The role
and importance of official statistics will thus be protected.
Actually, the use of Big Data as an additional source to be integrated
alongside traditional survey data and administrative data is a must for
National Statistical Institutes [Balbi 2013], not only for reasons of competition
with the private sector, but also because of the costs associated with traditional data
collection techniques and the increasing levels of non-response due to the
burden associated with the compilation of a questionnaire, even when proposed
through advanced modalities (web surveys).
Obviously, the potential use of Big Data for statistical purposes is subject to a set of
challenges, due not only to their particular characteristics (high volume,
velocity and variability), but also to the fact that their origin and generation
mode are often completely outside NSOs' control. These challenges are:

Level+Group+for+the+Modernisation+of+Statistical+Production+and+Services
24
"UNECE Statistics Wikis." What Does Big Data Mean for Official Statistics? Web. December 2013.
http://www1.unece.org/stat/platform/pages/viewpage.action?pageId=77170614
1. Legislative: are Big Data accessible to NSOs, and under what conditions?
a. Legislation in some countries provides the right to access data both from
public and from private organizations, while other countries guarantee the
right of access only to public entities. This can restrict access to
some types of Big Data;
b. As noted by the ESSNet AdminData25, the right of NSOs to access
admin data, established in principle by the law, is often not
adequately supported by specific obligations for the data holders;
c. Even if legislation grants access to all types of data, the way to
demonstrate the statistical purpose for accessing the data may be
different from country to country.
2. Privacy: in accessing and processing Big Data, what assurances exist on
the protection of the confidentiality?
a. Definitions may vary from country to country, but privacy can be
defined as "freedom from unauthorized intrusion"26. The problem
with Big Data is that the users of the services and devices generating the
data are usually unaware that they are generating data.
3. Financial: access to Big Data often has a cost, maybe lower than that of
statistical data, but sometimes considerable:
a. There are probably costs to acquire Big Data held by the private
sector, especially if legislation is silent on the financial modalities of
acquisition of external data. NSOs must therefore balance quality (i.e.
relevance, timeliness, accuracy, coherence, accessibility and
interpretability) against costs and reduction in response burden. The
potential benefits should exceed costs, because new information
coming with Big Data could increase the efficiency of government
programs;
b. A report prepared by the TechAmerica Foundation's Federal Big Data
Commission in the United States [Mills 2012] states that the success
of the transformation to Big Data lies in "understanding a specific
25
"The Use Of Administrative And Accounts Data For Business Statistics." ESSNET Admin Data
Wiktionary. Web. December 2013. http://essnet.admindata.eu/
26
"Privacy." Merriam-Webster. Web. December 2013. http://www.merriam-
webster.com/dictionary/privacy
agency's critical business imperatives and requirements, developing
the right questions to ask and understanding the art of the possible,
and taking initial steps focused on serving a set of clearly defined
use cases". This approach can certainly be valid also for NSO
environments.
4. Management: what is the impact on the organization of an NSI when Big
Data becomes an important source of data?
a. Big Data for official statistics means more information coming to
NSOs. This information must be subject to NSOs' policies and
directives on the management and protection of information;
b. The skills required of a Data Scientist [Davenport 2012] are not easy
to find inside the official statistics community. NSOs should
perform in-house and national scans (of academic, public and private
sector communities) to identify where data scientists are and try to
involve them in the area of official statistics.
5. Technological: what paradigm shift is required in Information Technology
in order to start using Big Data?
a. Collecting data [Parise 2012] in real time or near real time (often
through the use of standard Application Programming Interfaces -
API) maximizes the potential of data, opening new opportunities for
combining administrative data with high-velocity data coming from
other different sources, such as commercial data (credit card
transactions, on line transactions, sales, etc.), tracking devices
(cellular phones, GPS, apps) and physical sensors (traffic,
meteorological, pollution, energy, etc.), social media (Twitter,
Facebook, etc.) and search engines (online searches, online page
views);
b. The Big Data change of paradigm for data collection presents the
possibility to collect and integrate many types of data from many
different sources. Combining data sources to produce new
information is an additional interesting challenge in the near future.
6. Methodological: what is the impact of the use of Big Data (in
combination with, or in substitution of, statistical data) on the consolidated
methods of data collection, processing and dissemination?
a. Representativity is the fundamental issue with Big Data [Tufecki
2013]. The difficulty of defining the target population, survey
population and survey frame undermines the traditional way in which
official statisticians think. With a traditional survey, statisticians
identify a target population, build a survey frame, draw a sample,
collect the data, etc. With Big Data all these phases can be completely
skipped, and they are no longer under the responsibility of the
statistician. This requires statisticians to completely change their
way of thinking, since the characteristics of the data should be
considered exogenous;
b. Another issue is both IT and methodological. When more and more
data is being analysed, traditional statistical methods, developed for
the analysis of small samples, run into trouble; in the most simple
case they are just not fast enough. Hence the need for new methods
and tools:
i. Methods to quickly uncover information from massive amounts
of data available, such as visualisation methods and data, text
and stream mining techniques, that are able to make Big Data
small [Dunne 2013];
ii. Methods capable of integrating the information uncovered in
the statistical process, such as linking at massive scale and
statistical methods specifically suited for large datasets.
Methods need to be developed that rapidly produce reliable
results when applied to very large datasets.
c. The use of Big Data for official statistics triggers a need for new
techniques [LaValle 2011]. Methodological issues that these
techniques need to address are:
i. Measures of quality of outputs produced from external data
supply. The dependence on external sources clearly limits the
range of measures that can be reported;
ii. Limited application and value of externally-sourced data;
iii. Difficulty of integrating information from different sources to
produce high-value products.

3.3 Remarks from early experiences

We list below some issues to be addressed when using Big Data in
official statistics. Many of the following issues come from the first experiences
at CBS (the Dutch National Statistical Institute).
Data exploration: typically Big Data sets are made available to NSIs,
rather than designed by them. Thus data contents and structure need to
be understood prior to using the data for analysis. This is called data
exploration, often involving visualisation methods [Zykopoulos 2012].
Recently some visualisation methods have emerged that are particularly
suited to Big Data. Examples are tableplots [Tennekes 2013] for data
with many variables, and 3D heatmaps [Liu 2013] to study variability
in multivariate continuous data. Data exploration tries to reveal data
structure and assess data quality including exposure of errors,
anomalies and missing data;
Missing data: despite the enormous amounts of data generated each
day, data coming from sensors often suffers from missing data. Official
statisticians need to find ways to cope with the missing data problem and
simultaneously reduce the amount of data to a manageable level.
Missing data was experienced in two different case studies: traffic loop
detection data and social media data [Daas 2012]; this may be due to
server downtime and/or network outages. A possible way to overcome it
is focusing on statistical modelling able to cope with missing data, and
the development of information extraction and aggregation methods;
Volatility and resolution: data coming from sensors can fluctuate
considerably from minute to minute. These fluctuations are caused by
real changes in the phenomenon but, from a statistical point of view,
can be not very informative, as they occur at too high a resolution.
Similarly, sentiment analysis on a daily basis may suffer from volatility
that is not seen at weekly or monthly intervals [O'Connor 2010]. It is
therefore necessary to develop statistical methods able to cope with
volatile behavior. Possibilities under consideration at CBS are moving
averages and advanced filtering techniques (a minimal smoothing sketch
follows this list);
Representativity/Selectivity: the analyses in the first Dutch experiences
apply to traffic on roads equipped with traffic loop sensors, and to
sentiment analysis of people who post Dutch messages on social media
web sites. These are subpopulations of respectively all traffic on Dutch
roads, and of all people in the Netherlands. The subpopulations covered
by these Big Data sources are not target populations for official
statistics. Therefore data are likely to be selective, not representative of
a relevant target population. Representativity of Big Data could be
assessed through careful comparison of characteristics of the covered
population and the target population. This can be problematic, as often
there are no characteristics readily available to conduct such
comparison. For example, little is known about the people posting on
social media. Often only their user name is known but not their age or
gender. In situations where at least some background information is
available, the selectivity issue can be assessed, and addressed if
necessary. This could be achieved through predictive modeling, using a
wide variety of algorithms known from statistical learning and from
application of data mining methods in official statistics [Buelens 2012];
Long-term stability may be a problem when using Big Data. Typically,
statistics for policy making and evaluation are required for extended
periods of time, often covering many years. The Big Data sources
encountered so far seem subject to frequent modifications, possibly
compromising their long term use;
Privacy and data ownership are other issues that need to be addressed,
as many potential Big Data sources are collected by non-governmental
organizations, a situation that may not be covered by existing
legislation.
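As mentioned in the volatility remark above, a minimal smoothing sketch with pandas: a centred moving average applied to a synthetic daily series; the window length and the data are purely illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic daily sentiment-like series: a slow trend plus day-to-day noise.
rng = np.random.default_rng(1)
days = pd.date_range("2013-01-01", periods=365, freq="D")
raw = pd.Series(np.linspace(-1, 1, 365) + rng.normal(0, 0.5, 365), index=days)

# A centred 7-day moving average removes most of the daily volatility
# while preserving the weekly/monthly signal that is statistically relevant.
smoothed = raw.rolling(window=7, center=True).mean()

print(raw.std(), smoothed.std())             # the smoothed series is far less volatile
print(smoothed.resample("M").mean().head())  # monthly aggregates for publication
```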

Chapter 4: First applications of Big Data in
statistics.

In this chapter we analyze the first experiences of usage of Big Data in
Statistics and in Official Statistics. We start by describing the first experiments
in the Google environment and then pass through the various statistical domains
involved: consumer price index, transport statistics, economic statistics using
data from social media and phone calls, statistics on the information society.
Then we express some suggestions derived from the problems encountered in
these early experiences. These considerations range over data
collection from the web (automated or assisted data collection, the respect of
netiquette, the need to communicate with site owners, website identification),
the need for a new legal framework and methodological considerations about
the management of errors in Big Data usage, data cleaning issues and the
management of unstructured data.

4.1 First experiences in Google


Google researchers found that certain search terms are good indicators of
influenza activity. And so they implemented the website Google Flu Trends 27
that uses aggregated Google search data to estimate flu activity. But can
search query trends provide the basis for an accurate, reliable model of real-
world phenomena?

27
"Flu Trends." Google. Web. December 2013. http://www.google.org/flutrends

Figure 18 Comparison between Google Flu Trends estimate and US data on influenza

Google researchers found a close relationship between how many people
search for flu-related topics and how many people actually have flu
symptoms [Ginsberg 2008]. Not every person who searches for "flu" is
actually sick, but a pattern emerges when all the flu-related search queries are
added together. Researchers compared the query counts with traditional flu
surveillance systems and found that many search queries tend to be popular
exactly when flu season is happening. By counting how often we see these
search queries, we can estimate how much flu is circulating in different
countries and regions around the world.
Google published graphs showing historical query-based flu estimates for
different countries and regions compared against official influenza
surveillance data [Valdivia 2010]. As you can see, estimates based on Google
search queries about flu are very closely matched to traditional flu activity
indicators.
Early detection of a disease outbreak can reduce the number of people
affected. If a new strain of influenza virus emerges under certain conditions,
a pandemic could ensue with the potential to cause millions of deaths, as
happened, for example, in 1918 with Spanish flu. Google up-to-date
influenza estimates may enable public health officials and health
professionals to better respond to seasonal epidemics and pandemics.
Each season Google researchers evaluate the model's estimates against data
from traditional flu surveillance systems, looking at three indicators of
accuracy: estimation of the start of the flu season, estimation of when the flu
season peaks, and estimation of the severity of the season. Based on this
evaluation, researchers update the model to improve its performance. After
the H1N1 flu season in 2009, Google completed an important update and
shared a summary of their analysis and related changes [Cook 2011]. The
model was modified to better respond to pandemic cases, similar to
those of the 2009 H1N1 flu. In the next pictures (Figure 19) you can compare the
behavior of the previous model and the behavior of the updated model: the
new model is able to better explain the peak in 2009 due to the new pandemic.

Figure 19 Google Flu Trends updated model after H1N1 pandemic

After the experience of Flu Trends, Google released a more general website,
named Google Correlate, an online, automated method for query selection
that does not require prior knowledge of relevant search terms [Mohebbi 2012].

Using Correlate, given a temporal or spatial pattern of interest, users can
determine which queries best mimic the data. These search queries can then
be used to build an estimate of the true value of the phenomenon. This model
has also been used to produce accurate models of the home refinance rate in the
United States.
In addition, Google researchers showed that spatial patterns in real-world
activity and temporal patterns in web search query activity can both surface
interesting and useful correlations, where users can use their own data and find
search patterns which correspond with real-world trends.
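The idea behind Flu Trends and Correlate can be sketched as ranking candidate query-volume series by their Pearson correlation with an official indicator; all the series below are synthetic and the query names are invented.

```python
import numpy as np

rng = np.random.default_rng(7)
weeks = 104

# Synthetic official indicator (e.g. a weekly influenza-like-illness rate).
indicator = np.abs(np.sin(np.linspace(0, 4 * np.pi, weeks))) + rng.normal(0, 0.1, weeks)

# Synthetic normalized query volumes: one related to the indicator, two unrelated.
queries = {
    "flu symptoms": indicator * 0.8 + rng.normal(0, 0.1, weeks),
    "holiday offers": rng.normal(0, 1, weeks),
    "football results": rng.normal(0, 1, weeks),
}

# Rank the candidate queries by Pearson correlation with the indicator.
ranking = sorted(
    ((np.corrcoef(series, indicator)[0, 1], name) for name, series in queries.items()),
    reverse=True,
)
for corr, name in ranking:
    print(f"{name:20s} r = {corr:+.2f}")
```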
As an example, in the next picture we can see the correlation between winter
time and searches for the word "pizzoccheri", a typical winter dish of
northern Italy28.

Figure 20 Seasonal Correlation between Winter and the searches for "Pizzoccheri" word

28
"Pizzoccheri." Wikipedia. Wikimedia Foundation Web. December 2013.
http://en.wikipedia.org/wiki/Pizzoccheri
4.2 Prices

Official statistics have a long tradition of collecting price information for
goods and services. The consumer price index is the most popular statistical
indicator. It provides a measure of the development of prices from the
consumer's perspective and is derived from the prices of products typically
purchased by a household [Boskin 1998].
The fast development of the internet during the last decade has led to the
widespread adoption of e-commerce by producers and consumers. Nowadays,
consumers are able to buy a wide variety of goods and services from web
shops on the internet. The share of European citizens who ordered goods or
services online has more than doubled, from 15% in 2005 to 35% in 2012 29
(in Italy, over the same years, it grew from 4% to 11%).
So nowadays many of the products that are bought by a household are
offered on the Web. Products are standardized, and information is usually
given on the type of product and its price, together with a short description:
this favors the use of the internet for price collection.
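A minimal sketch of online price collection with Python's requests and BeautifulSoup libraries; the URL and the HTML structure (product name and price in elements with given CSS classes) are hypothetical, and a production scraper would also have to respect the site's terms of use and robots.txt, as discussed later in this chapter.

```python
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.example-webshop.com/groceries"   # hypothetical web shop page

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Assumed page structure: each product sits in a <div class="product">
# with the name in <span class="name"> and the price in <span class="price">.
rows = []
for product in soup.select("div.product"):
    name = product.select_one("span.name").get_text(strip=True)
    price = product.select_one("span.price").get_text(strip=True)
    rows.append({"product": name, "price": price})

# Store the structured observations for later index computation.
with open("prices.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["product", "price"])
    writer.writeheader()
    writer.writerows(rows)
```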

4.2.1 Billion Prices Project and PriceStats

In 2007, economists Roberto Rigobon and Alberto Cavallo from the MIT
(Massachusetts Institute of Technology) Sloan School of Management
started tracking prices online and storing them in a big database [Cavallo
2011]. In 2010 they released the Billion Prices Project30, a measure of
inflation based on 5 million items sold by 300 online retailers in 70 countries.
The prices are scraped directly from e-commerce websites.
The prices are scraped directly from e-commerce websites. The Billion Prices
Project's (BPP) inflation measure is clearly different from the government

29
"Eurostat E-commerce Statistics." Statistics Explained RSS. Web. December 2013.
http://epp.eurostat.ec.europa.eu/statistics_explained/index.php/E-commerce_statistics
30
"The Billion Prices Project @ MIT." The Billion Prices Project MIT RSS. Web. December 2013.
http://bpp.mit.edu/
one. The biggest difference is in the basket, which, when BPP started, was
simply composed of all the prices that could be collected online.
The online measure has the advantage that it comes out daily, allowing
researchers to examine daily price changes. The results of the BPP index are very
close to the US CPI (Consumer Price Index), collected by the US Bureau of
Labor Statistics31 at 23,000 retailers in 90 US cities, at a cost of over 200
million dollars a year. The official CPI is based on a number of prices five times lower than
that of the BPP project. The New York Times commented on the news: "Data on prices,
once monopolized by government gatekeepers, are now up for grabs."32
The project was managed as a research project at MIT. The research uses
elementary data collected from the web to study prices [Cavallo 2011b]. The
current research fields are:
Pricing Behavior: What drives price stickiness around the world? How
much can be explained by current and past inflation? How much by
competition and the structure of industries? Are prices synchronized
between commodities and between countries?
Daily Inflation and Asset Prices: How do daily inflation indexes across
countries and sectors match official statistics? What are the links
between daily inflation, asset prices, and inflation expectations?
Pass-Through: How much do prices adjust internally when the exchange
rate or the international price of commodities changes?
Green Markups: What premium is paid in stores for green or organic
products? By storing data from multinational retailers, premium
differences can be computed, for exactly the same items, in different
places.
The MIT project shows that, also in macroeconomics, the availability of (big)
data coming from the web allows scientists to open new research fields.

31 "CPI News Releases." U.S. Bureau of Labor Statistics. Web. December 2013. http://www.bls.gov/cpi/
32 "The Real-time Inflation Calculator." The New York Times online. Web. December 2013. http://www.nytimes.com/interactive/2010/12/19/magazine/ideas2010.html?pagewanted=2&src=tptw&_r=0#The_Real-time_Inflation_Calculator
The most famous usage of BPP data was an alternative inflation index for
Argentina, which is updated on a daily basis on the website Inflacion
Verdadera33 [Cavallo 2009]. The website showed that the price index provided
by the Argentine government did not represent the actual change in prices.
For the first time, data coming from National Statistical Institutes were
contested on the basis of Big Data obtained from the web. In Figure 21 you
can see the difference between the official CPI and the price index
calculated at MIT.

Figure 21 Comparison between Argentina Inflation rate and Price Index computed by
Pricestats
In 2010 Rigobon and Cavallo, together with other colleagues, founded
PriceStats34, a company with the objective of bringing the academic research
to market [deHaan 2013].
PriceStats uses the following flow to generate its indexes:

33 "The Billion Prices Project @ MIT." Web. December 2013. http://www.inflacionverdadera.com/
34 "PriceStats." PriceStats. Web. December 2013. http://www.pricestats.com/
Scraping: PriceStats uses web scraping technologies to monitor online
prices every day. Web scraping is the process of automatically
collecting information from the web by converting unstructured data
(often in HTML format) into structured datasets that can be stored and
analyzed. PriceStats uses a combination of commercial and custom
scraping solutions to address the complexities of monitoring prices
across thousands of retailers;
Online Retailers: a key step of PriceStats' approach is to identify the
best retailers to use for inflation measurement. PriceStats selects
retailers with large market shares, in relevant cities, that sell both
offline and online. In most countries, their data cover key economic
sectors such as food, clothing, electronics, furniture and energy. Even if
some categories, mainly services, are not covered, this is not a problem
for the goal of detecting the main changes in inflation trends. Services
are usually quite stable, not the main source of volatility, and can be
indirectly monitored through items with similar price behavior;
Processing: once the data collection is complete, PriceStats runs
automatic procedures to ensure that the data can be used for inflation
measurement. The data is first structured and cleaned so that it can be
used in a consistent manner. Price data is recorded in a standard format,
data is then categorized across economic sectors and a set of
performance statistics are automatically calculated to evaluate the
quality of the data;
Inflation Statistics: finally, daily inflation statistics are computed
using advanced econometric techniques, leveraging official weights as
much as possible.
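As a rough illustration of how scraped price quotes could be turned into a daily index, the sketch below chains average daily price relatives within categories and combines them with fixed expenditure weights. It is not PriceStats' actual methodology, and the data structures, category scheme and weights are hypothetical.

```python
# Toy daily chained index from scraped price observations (illustrative only).
# observations: {(item_id, day): price}; item_id encodes its category as a
# prefix, e.g. "food:rice"; weights: {category: expenditure weight}.
from collections import defaultdict

def daily_index(observations, weights, days):
    index = {days[0]: 100.0}
    for prev, day in zip(days, days[1:]):
        relatives = defaultdict(list)   # category -> list of price relatives
        for (item, d), price in observations.items():
            if d == day and (item, prev) in observations:
                relatives[item.split(":")[0]].append(price / observations[(item, prev)])
        # arithmetic mean of relatives per category, weighted across categories
        total_w = sum(weights[c] for c in relatives)
        change = sum(weights[c] * sum(r) / len(r) for c, r in relatives.items()) / total_w
        index[day] = index[prev] * change
    return index

obs = {("food:rice", "d1"): 1.00, ("food:rice", "d2"): 1.02,
       ("fuel:diesel", "d1"): 1.50, ("fuel:diesel", "d2"): 1.47}
print(daily_index(obs, {"food": 0.6, "fuel": 0.4}, ["d1", "d2"]))  # roughly {'d1': 100.0, 'd2': 100.4}
```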
PriceStats inflation indices show that online prices are a successful measure
of inflation, despite online price sources being different from those of
official inflation estimates. Although online and offline prices may have
different nominal values, price changes tend to follow similar trends; and
since inflation statistics measure price changes, online prices are a good
way to measure inflation.

The following chart shows how closely PriceStats conforms to the Consumer
Price Index data released by the US government (Figure 22).

Figure 22 Comparison of official US inflation and Pricestats computed one


On the other hand, Figure 23 compares the PriceStats calculations for
Argentina with the official data released by that government.
Today PriceStats data are also used by companies like State Street
Corporation, a large financial services holding company, listed in the
S&P 500 and the second oldest financial institution in the United States35.

35 "PriceStats." State Street Global Exchange Research Risk Indices. Web. December 2013. http://www.statestreet.com/research/pricestats/index.html
Figure 23 Comparison of official Argentina inflation and Pricestats computed one

4.2.2 Netherlands experience

The Consumer Price Index (CPI) of Statistics Netherlands is partially based
on data from the internet [Hoekstra 2012]. In the beginning statisticians
surfed to specific web pages to extract prices for various CPI articles such
as books, airline tickets and electronic goods. In recent years many markets
(e.g. holidays, (e-)books, clothes) have quickly been going online, and so
online prices are no longer proxies but the actual object of study for a
large share of the market.
Moreover, the availability of websites that compare the prices of goods from
different suppliers implies that a larger share of a certain field is covered
by a single website in a harmonized format.
CBS started exploring potential new internet sources such as: a) webpages of
airline carriers, to collect airline fares to various destinations; b)
webpages of unmanned petrol stations, to collect fuel prices; c) the on-line
catalogue of a mail-order firm; d) on-line catalogues of companies with a
large market share (such as Ikea); e) analysing and interpreting the
information on the webpages of a large Dutch website that compares the
performance and prices of electronics; f) webpages of on-line clothing
stores; g) comparing on-line prices of new and used cars; h) collecting
information from so-called aggregator sites that compare and present the
functionality and price of many goods; i) automatic interpretation of the
content of the Dutch Government Gazette (Staatscourant); j) collecting price
information on mobile phone subscriptions; k) collecting prices of dvds, cds,
and games; l) using product specifications of technology vendors for price
analysis.
In 2010 CBS started testing automated data collection from the website of a
company operating unmanned petrol stations and from four airline websites. To
extract data from these websites, CBS asked an external company to develop
dedicated robot tools. These tools were specifically developed to extract
data from internet pages, sometimes by just indicating the information target
visually plus some additional scripting. The functionality varied from
point-and-click methods (like Excel macros) to web scripting focused on robot
functionality. Many tools are available [Ferrara 2012], which vary greatly in
their technique and functionality.

Figure 24 Price for international flights by days before departure

One challenge when using internet robots is to keep them working when
websites change. These changes may vary from simple layout and style changes
to a complete makeover in terms of technology and communication structure.
CBS has also tried to quantify the time required to modify the robots as a
result of changes to the websites, concluding that this method of data
collection remained affordable, even taking into account the necessary
adjustments to the robots.
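A hedged sketch of one way such a robot could flag the need for maintenance: if the expected page element is no longer found, the site layout has probably changed and the page is raised for manual review. The URL, CSS selector and price format are hypothetical, and real robots obviously need site-specific logic.

```python
# Hypothetical robot fragment: fetch a page, extract a price, and fail loudly
# when the expected page structure has disappeared (i.e. the site changed).
import requests
from bs4 import BeautifulSoup

def collect_price(url, selector):
    response = requests.get(url, headers={"User-Agent": "nsi-price-robot"}, timeout=10)
    response.raise_for_status()
    node = BeautifulSoup(response.text, "html.parser").select_one(selector)
    if node is None:
        # Selector no longer matches: the robot needs to be adjusted.
        raise RuntimeError(f"layout change suspected: {selector!r} not found on {url}")
    return float(node.get_text(strip=True).replace(",", "."))

# price = collect_price("https://petrol.example/prices", "span.price-per-litre")
```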
After the experiences with fuel and flight prices, CBS focused on housing
market sites. Since these sites have large amounts of information presented
in harmonized ways, they are primary candidates for automated data
collection. Starting from the beginning of 2011, data on housing prices for
one province in the Netherlands have been collected from three separate
internet sites.

4.2.3 Other experiences on Prices

Similar to the BPP was the Google Price Index (GPI), an experiment to
implement a daily or instant measure of inflation using the Google database
of web shopping [Groves 2011]. The project was never officially released or
published. Hal Varian, Google's chief economist and father of the project,
said36 that the GPI showed a pretty good correlation with the traditional
Consumer Price Index for goods that often sell on the web, and less so for
things like car parts normally bought in a store.
Still on the subject of the Consumer Price Index, another interesting
experience was carried out by United Nations Global Pulse37, an initiative
launched by the Executive Office of the United Nations Secretary-General in
response to the need for more timely information to track and monitor the
impacts of global and local socio-economic crises. UN Global Pulse
researchers developed a research project38 to determine which indicators
present in social media might explain how populations cope with global
crises, such as commodity price volatility.
The analysis was limited to publicly available data from Twitter for July
2010 through October 2011 in Javanese and English. The topics of focus
included the availability of food, fuel, housing and loans. Analyzing the
number of
36 "Google to Map Inflation Using Web Data." Financial Times. Web. December 2013. http://www.ft.com/cms/s/2/deeb985e-d55f-11df-8e86-00144feabdc0.html#axzz2j7tZZdvz
37 "United Nations Global Pulse." Home. Web. December 2013. http://www.unglobalpulse.org/
38 UN Global Pulse and Crimson Hexagon, 2011. Twitter and Perceptions of Crisis-related Stress. Web. December 2013. http://www.unglobalpulse.org/projects/twitter-and-perceptions-crisis-related-stress
tweets per month commenting on the price of rice in Indonesia, researchers
found two periods, around February 2011 and September 2011, when more
conversations took place. These increases followed the official inflation of
the food basket, indicating that when prices rise, people notice and express
their concerns.

Figure 25 Comparison between Tweets on Prices and Rice Price level

In December 2013 China's National Bureau of Statistics39 said it would start
using Big Data technology to improve the collection, processing and
production of the country's consumer price index. Xian Zude, NBS chief
statistician, said in an interview40 that his bureau will use Big Data,
including data from Chinese e-commerce companies, in official statistics in
an effort to bring the CPI census to the next level.
NBS signed a strategic partnership agreement in November 2013 with 11
high-tech Chinese companies to develop Big Data technology. The 11

39 "Statistical Data." National Bureau of Statistics of China. Web. December 2013. http://www.stats.gov.cn/english/
40 "China Turns to Big Data to Gauge Inflation." Xinhua. Web. December 2013. http://news.xinhuanet.com/english/china/2013-12/06/c_132946664.htm
companies include e-commerce giant Alibaba41, leading search engine
Baidu42, China United Network Communications43, the country's second-
largest mobile operator; and the FANYA Metal Exchange44, one of the largest
spot trading and investing platforms for rare metals.
Ma Jiantang, head of NBS, said at the signing ceremony of the strategic
partnership that the era of producing, sharing and using data is coming: "Big
Data will become the foundation of government management, social
management and macro-economic control". "This is only our first step," he
said. "We will cooperate with more companies in the future."
Huang Linli, a senior analyst with Baidu, said her company has some natural
advantages in developing Big Data technology: "Netizens' search requests
exceed 5 billion a day on baidu.com. The data generated from those searches
will be very valuable to the government for making predictions on the
economy, as well as other sectors."

4.3 Traffic data

4.3.1 Netherlands
One of the first National Statistical Institutes to start experimenting with
Big Data usage in official statistics was Statistics Netherlands (CBS45).
Here we present its experience with the analysis of traffic loop detection
data [Daas 2013].
In the Netherlands there is a central authority collecting data on traffic:
the National Data Warehouse for Traffic Information (NDW46). Real-time
traffic data give a picture of the current traffic situation on the roads.
Every minute, data from more than 20,000 measuring sites in the Netherlands
is collected
41 "Manufacturers, Suppliers, Exporters & Importers from the World's Largest Online B2B Marketplace-Alibaba.com." Alibaba. Web. December 2013. http://www.alibaba.com/
42 Baidu. Web. December 2013. http://www.baidu.com/
43 "About Us." China Unicom. Web. December 2013. http://eng.chinaunicom.com/about/
44 FANYA Metal Exchange. Web. December 2013. http://www.fyme.com.cn/plus/list.php?tid=296
45 "CBS - Home." CBS - Home. Web. December 2013. http://www.cbs.nl/en-GB/menu/home/default.htm
46 "Home." Nationale Databank Wegverkeersgegevens. Web. December 2013. http://www.ndw.nu/en/
by the database and within 75 seconds distributed to users of the data. It
concerns the following data:
Traffic flow
Realised travel time
Estimated travel time
Traffic speed
Vehicle classes

Figure 26 Schema for Netherlands Data Warehouse for Traffic Information

In 2011 data on 4,500 km of roads were available, comprising national roads,
provincial roads and urban thoroughfares. In 2012 this was expanded by
approximately 1,400 km. The data are centrally stored in the NDW and managed
by a collaboration of participating government organizations (including CBS).
The National Data Warehouse contains historic traffic data collected from
2010 onwards.
The collected data are interesting for traffic and transport statistics and
potentially also for statistics on other economic phenomena related to
transport. CBS started by studying minute-level data for all locations in the
Netherlands for a single day: December 1st, 2011. The dataset extracted from
the NDW contained 76 million records, which were analyzed using the open
source software R47.
After solving some problems due to missing data caused by computer/network
failures, CBS obtained a first graph with the number of vehicles detected per
minute:
In Figure 27 the profile displays clear morning and evening rush hour peaks
around 8 am and 5 pm respectively. This result was obtained by counting about
300,000,000 vehicles detected on the given day. Maps were also created
indicating the number of vehicles for each measurement location (using
different colors), so that a movie could be created displaying the changes in
vehicle counts for all locations during the day. Such figures showed that the
traffic between the four major cities in the Netherlands was and remained
high during working hours and in the evening.

Figure 27 Dutch traffic time profile

47 "The R Project for Statistical Computing." Web. December 2013. http://www.r-project.org/
Next, the number of vehicles in various length categories (small,
medium-sized and large) was studied. Figure 28 illustrates the difference in
driving behavior. The small vehicles have clear morning and evening rush-hour
peaks at 8 am and 5 pm respectively, in line with the overall profile. The
medium-sized vehicles have both an earlier morning and an earlier evening
rush hour peak, at 7 am and 4 pm respectively. The large vehicle category has
a clear morning rush hour peak around 7 am and displays a more distributed
driving behavior during the remainder of the day. The decrease in the
relative number of medium-sized and large vehicles detected at 8 am is
remarkable. It may be caused by a deliberate attempt by the drivers of
medium-sized and large vehicles to avoid the morning rush hour peak of the
small vehicles.

Figure 28 Dutch traffic: normalized number of vehicles for three length categories
In Figure 28 the normalized number of vehicles is distinguished by class of
vehicle length, after correcting for missing data. Small (<= 5.6 meter),
medium-sized (> 5.6 and <= 12.2 meter) and large vehicles (> 12.2 meter) are
shown in black, dark grey and grey, respectively.

4.3.2 Colombia

The National Roads Institute of Colombia uses GPS data to improve traffic
circulation and to serve as input for transport statistics. With this method,
cars do not have to stop at toll stations: an electronic tracking device
installed in the vehicle is read when it enters the toll station.
The tracking device also contains all the information concerning the vehicle,
which complements that of the National Single Transit Register (Registro
Único Nacional de Tránsito48), an information system for managing centralized
and validated information regarding cars, drivers, traffic accidents, traffic
insurance and public transport companies.
This new method has already enhanced control of traffic flows and has led to
the strengthening of transport statistics.

4.4 Social media data

CBS, the Dutch National Statistical Office, studied social media messages
from two points of view: content and sentiment [Daas 2012b]. For the content,
CBS started by analyzing Dutch Twitter messages (Twitter is the dominant
social medium in the Netherlands): collecting tweets from more than 380,000
Twitter users gave CBS an archive of more than 12,000,000 tweets.
For topic identification they started by analyzing the hashtags, present in
only 15% of the tweets. In the messages containing a single hashtag, the
topics Media and Other occurred most frequently, followed by Sports and
Spare time related tweets. To get an idea of the topics discussed in the ten
million messages without hashtags, researchers started with the automated
text classification method developed for the hashtag-containing messages.
This method failed.

48 "RUNT Colombia." RUNT / Colombia. Web. December 2013. http://registronacional.com/colombia/runt.htm
CBS then manually analyzed a sample of the tweets; manual classification
showed that the majority of the non-hashtag messages belonged to the Other
group (51%) (these kinds of messages are referred to as "pointless babble" in
some studies). Apart from these kinds of messages, the tweets without
hashtags in the sample were predominantly found to be related to the themes
Spare time, Sport, Media and Work. The results of this first study reveal
that topics of potential interest for official statistics are discussed on
Twitter. Topics for which Twitter messages could provide information from an
official statistics point of view are mainly those related to work and
politics.
Another potential use of social media messages is sentiment analysis [Java
2007]. CBS started an experiment in which over 1.6 billion messages written
in Dutch, from a large number of social media sites, were obtained through
the Coosto49 infrastructure. Messages were sourced from the largest social
media sites (Twitter, Facebook, Hyves, Google+, and LinkedIn) but also from
numerous public blogs and forums. Researchers used June 2010 as the starting
date and August 2012 as the end date. With a query language and a web
interface, messages were selected from the database. The sentiment of each
message was automatically determined by counting the number of positive and
negative words, following the approach described in Golder and Macy [Golder
2011], and messages were classified as positive, negative or neutral.
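A minimal sketch of this word-counting approach (the tiny word lists below are placeholders, not the lexicon CBS actually used):

```python
# Word-count sentiment in the spirit of Golder and Macy: a message is positive
# when it contains more positive than negative words. Lexicons are placeholders.
POSITIVE = {"good", "great", "happy", "nice", "love"}
NEGATIVE = {"bad", "sad", "angry", "terrible", "hate"}

def classify(message):
    words = message.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def overall_sentiment(messages):
    """Share of positive minus share of negative messages, in [-1, 1]."""
    labels = [classify(m) for m in messages]
    n = len(labels)
    return (labels.count("positive") - labels.count("negative")) / n if n else 0.0
```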
CBS tried to link the sentiment in social media with consumer sentiment in
the Netherlands. They started by testing a wide range of words somehow
correlated with consumer sentiment, such as "buy" and "mortgage". But this
proved very difficult, because some words were hardly used and others showed
no clear or stable dependence. Then they tried another approach: using very
general terms. Interestingly, this general approach worked quite well.
Queries returned very large numbers of messages, around 600 million for the
Dutch articles and 1.2 billion for the 10 most frequently used Dutch words
for the period studied, of which the overall sentiment (sometimes called "The
Mood of the Nation") was analysed. The monthly sentiment for the period June
2010 - August 2012 derived from Dutch social media messages was found to
correlate very strongly (0.83) with the officially determined monthly

49 "Online Radar." Coosto. Web. December 2013. http://www.coosto.com/uk/
Dutch consumer confidence50 and with the sentiment on the attitude towards
the economic climate (0.88). Both official indicators are based, in the
Netherlands, on a sample survey in which 1,500 people are interviewed each
month.

Figure 29 Comparison between sentiment analysis from social media and consumer confidence

Figure 29 displays the survey-based series (grey) and the corresponding Dutch
social media sentiment findings (black). Both series relate very well except
for the month of December, where a much more positive attitude is found in
social media. Removal of all messages including the (Dutch) word Christmas
and references to New Year and New Year's Eve reduced these peaks and
increased the correlation to 0.90. This high correlation is remarkable, as
the populations from which the data are obtained are different: Dutch
consumer confidence is obtained from a random sample from the population
register, while the sentiment in Dutch social media messages shown in the
figure is derived from around 30 million messages generated each month.

50 "Consumer Confidence Survey." CBS. Web. December 2013. http://www.cbs.nl/en-GB/menu/methoden/dataverzameling/consumenten-conjunctuuronderzoek-cco.htm
4.5 Mobile phones data

4.5.1 Estonia

Since 2009 the Central Bank of Estonia has been using state-level inbound and
outbound tourism statistics (trips, nights spent) based on mobile positioning
data and calibrated with official accommodation and travel statistics [Ahas
2008]. The monthly data flow is used in the calculation of the national
balance of payments. The need arose when border surveys were cancelled due to
financial cutbacks.
Using mobile positioning data in statistics has several positive aspects,
such as the speed of data collection, the digital format of the data, the
large sample and the high penetration of phones in most societies. Mobile
data also has several shortcomings that we have to keep in mind when
interpreting the results. One of the weaknesses is that we do not know the
exact motivations and relations lying behind those visits. The most important
question is related to sampling: often mobile phones are not used by lower
income groups in foreign countries due to roaming costs. Calling can also be
connected with cultural differences, such as calling regulations and
traditions.
Another problem that arises when using mobile positioning data is its
quantitative structure: we know the locations of calls (dots), but we do not
know who is really making the calls, what kind of visit he/she is on, and
what kind of transportation he/she is using. The huge amount of quantitative
data also poses a problem for data processing and cleaning; the databases are
too large to be managed using traditional software and data preparation
options.
In the following, generic issues for mobile data usage in official statistics
are reported [Karlberg 2013]:
Privacy concerns: although there has been a cultural shift, with people
being increasingly willing to share, or even actively disseminate
(Facebook) their personal data, large-scale provision of mobile
positioning data to government agencies could be perceived as an
invasion of privacy;

Data protection legislation: there are a number of EU-level
instruments with different aims. The production of official statistics
could possibly be considered as a statutory basis for which processing
of personal data (such as mobile positioning data) could take place.
Moreover, different European countries have different national
transpositions of the EU directives;
Data provider reluctance: while mobile network operators have an
interest in this initiative, there are issues concerning (i) the
maintenance of business secrets; (ii) the direct costs of providing data;
(iii) the effects of the data extraction workload on the real-time
systems; and (iv) the opportunity cost of giving away their data for free
to National Statistical Institutes;
Technological barriers: as there are a number of providers in each
country, a technical solution concerning how to merge data from different
operators into one single analysis data set needs to be found. This
particularly concerns the step at which anonymisation should take place
(see the sketch after this list);
Standardisation: technical platforms and data formats may differ between
operators, but the data provided must be standardised in terms of format,
content and frequency (temporal granularity). The format, content and
attributes of the data should be stable over time. Also, the algorithms
used should be the same across operators, and the issue of sampling and
representativity needs to be tackled;
Provision frequency: the frequency with which operators transmit data to
NSIs (near-real time, daily, weekly, monthly etc.) should be defined.
Scalability and speed of processing are also an issue: the processing
time should be independent of the size of the operator.
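One conceivable arrangement, sketched below purely as an assumption and not as any operator's actual practice, is keyed hashing (pseudonymisation) of subscriber identifiers before records leave the operator, so that data sets can be merged and analysed without exposing raw identifiers. Note that this is pseudonymisation rather than full anonymisation, and key management remains the critical point.

```python
# Hypothetical sketch: pseudonymise subscriber IDs with a keyed hash before
# positioning records are shared; the key stays with the operator.
import hashlib, hmac

def pseudonymise(subscriber_id: str, operator_key: bytes) -> str:
    return hmac.new(operator_key, subscriber_id.encode(), hashlib.sha256).hexdigest()

record = {
    "subscriber": pseudonymise("358501234567", b"key-held-by-operator"),  # dummy data
    "cell_id": "EE-1042",
    "timestamp": "2013-07-14T09:31:00",
}
```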

4.5.2 New Zealand

Figure 30 New Zealand: usage of mobile phones for dates close to the quake

Another experience in the usage of mobile phone data comes from New Zealand.
In a report [SNZ 2012], Statistics New Zealand examines the use of cellphone
data to monitor short-term population movements. The analysis was undertaken
with data records of cellphone usage during and after a natural disaster, the
Canterbury earthquakes of 2010 and 2011.
For the purposes of the study, SNZ used a subset of cellphones called
'Christchurch cellphones', i.e. cellphones that initiated calls only from
Christchurch city during the 10 days before the quake.
In general, cellphone voice calls show a regular calling pattern across the
week: during usual working days (Monday to Friday) the likelihood of
making voice calls is high. During the weekend, when business calling is
reduced and when friends and families are more likely to spend time together,
the number of cellphone voice calls is much reduced. Minimal usage is
observed on Sundays.
The analysis focused on population movements within the region, and to other
regions, answering the following questions:
What were the movement patterns of people who relocated following the
22 February 2011 earthquake in Christchurch city?
Which areas attracted the highest percentages of people?
By when had most people re-located and by when had most people
returned?
Are results sensitive to the method used to select subsets of cellphones
for the purposes of tracking movements?
What are the quality constraints of cellphone data?
The analysis highlighted the main strengths of the data source for the
purposes of emergency management. Cellphone data allowed changes in
population movement patterns and changes in the presence of cellphone users
in urban-spatial contexts to be monitored. Further, it allowed urban
movements to be tracked continuously over time, an attribute unique to
cellphone data.
Statistics New Zealand could conclude that cellphone data can provide the
following information for emergency management:
Which regions, districts, and cities attract significantly higher inflows
of people from an area of emergency over time;
Patterns of rapid movements over space and time following an event,
including return movements;
Relative inflows of people to a broadly defined area of emergency that
are significantly different to expected levels of inflow;
Which regions, cities, and districts potentially have increased
population making calls from their area due to an emergency event in a
nearby region?
On the other hand, cellphone data could not sufficiently provide information
about residential areas involved in movements and about the actual number
of people who relocated, temporarily or permanently.

4.6 Data about Information Society

4.6.1 Eurostat

National Statistical Institutes disseminate a wide variety of Information and
Communication Technology (ICT) statistics, which are used to monitor the
progression of countries towards the Information Society. Traditionally, data
are collected by means of two different questionnaires, one for
households/individuals and one for enterprises.
An issue of particular relevance to ICT statistics is the fact that the
phenomena under study evolve comparatively rapidly. For instance, the
increased penetration of computers and broadband connections makes some
indicators on access to ICT (and to the Internet) less relevant. On the other
hand, new technology constantly becomes available, and therefore questions on
access to and use of these new technologies become relevant to policymakers.
Digital footprints are left behind in our daily lives, and can be used to
measure a wide variety of phenomena. In a recently conducted study51, the
European Commission investigated the possibility of using the Internet as a
Data Source (IaD) to complement or substitute traditional statistical
sources. Three basic types of IaD methods are identified in the study: (i)
user-centric measurements that capture changes in behaviour at the client
side (PC, smartphone) of an individual user; (ii) network-centric
measurements that focus on measuring properties of the underlying network;
(iii) site-centric measurements that obtain data from web servers.
ICT statistics are a natural candidate subject for piloting re-engineering
based on the Internet and similar sources. Eurostat has therefore
commissioned a study on the analysis of methodologies for using the Internet
for the collection of ICT and other statistics. Of the three possible IaD
methods, the methods proposed in the study fall into the user-centric and
site-centric categories.
51 "Feasibility Study on Statistical Methods on Internet as a Source of Data Gathering (SMART 2010/030)." Digital Agenda for Europe. Web. December 2013. http://ec.europa.eu/digital-agenda/en/news/feasibility-study-statistical-methods-internet-source-data-gathering-smart-2010030
For the current individuals/households variables, it is concluded that
neither access to ICT nor use of computers can be measured over the Internet,
whereas use of the Internet is, predictably, ideal for measurement over the
Internet. For use of e-government, measurement over the Internet is only
possible after quite some redefinition of the items. For e-commerce and
e-skills, collection over the Internet is considered to be feasible.
For enterprises, it is concluded that only a fraction of the current ICT
variables would be available on the enterprise website, and that most
information, albeit stored or possible to trace by means of ICT tools, would
only be available via the servers of the enterprise.
The study concluded that data collected electronically would provide ample
material for additional ICT indicators, defined ex post, and a large number
of potential indicators providing additional detail to existing items are
suggested.

4.6.2 Italy

In 2013 ISTAT (the Italian National Statistical Institute) started testing
web scraping and text mining techniques to amend the survey on Information
and Communication Technologies in enterprises52. The survey data are based on
responses provided by more than 19,000 companies with at least 10 employees,
representative of a universe of more than 200,000 businesses employing more
than 8 million workers.
The survey is a postal self-completion paper questionnaire, with the option
to answer electronically through a web site dedicated to the survey. In the
2013 survey, in response to the question "Indicate whether the company has
its own web site or one or more pages on the Internet", more than 8,600
companies indicated the address of a web site.
Based on the information available directly from the web sites indicated by a
subset of firms, two different actions can be performed:
Replace the data collected through the questionnaire with data found on the
web (Internet as Data Source, IaD), so as to reduce the respondent burden;
Integrate the data collected through the questionnaire on a sample of firms
with those acquired on the Internet for the entire population of companies,
to increase the accuracy of the estimates.

52 "Information and Communication Technologies in Enterprises." ICT in Enterprises. Web. December 2013. http://www.istat.it/en/archive/77760
The web scraping technique was used to extract data from the web sites. Note
that web scraping is usually not limited to data retrieval, but is aimed at
storing data in a repository arranged properly for the treatment of textual
objects, facilitating indexing, searching and analysis.
In 2013 about 5,600 of the 8,600 website addresses mentioned by companies
were scraped: the difference between the two numbers is due to incorrect
URLs or to the inability to access sites because of technological barriers.
Text mining techniques were then used to index the collected contents, to
test various mining algorithms and to evaluate the survey results. In the
2013 test many mining tools were used, among them MALLET (MAchine Learning
for LanguagE Toolkit)53, the R applications RTextTools54 and Rattle55, and
the proprietary tool KXEN56. The most notable result of the analysis was the
large number of false negatives found, that is, companies answering that they
did not have a website for which an active website was found on the web.
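To give an idea of what such a text-mining step can look like, here is a hedged sketch (not ISTAT's actual pipeline) that trains a simple classifier to predict a questionnaire item, for example whether a site offers online ordering, from scraped page text; the training examples are invented.

```python
# Illustrative text classification on scraped page text (hypothetical data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = ["add to cart checkout secure payment", "company history contact us",
         "order online shipping and returns", "our mission and values"]
offers_ecommerce = [1, 0, 1, 0]        # labels taken from questionnaire answers

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(pages, offers_ecommerce)
print(model.predict(["online shop with secure checkout"]))  # expected to lean towards 1 on this toy data
```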
Ongoing activities in ISTAT aim to ensure:
the development of web scraping and text/data mining techniques in order
to improve the results in terms of predictive ability (see Figure 31);
the extension of the procedure so developed to the questions relating to
the characteristics of the web site;
the extension to the questions relating to the use of social networking
sites.
53 "MALLET Homepage." Web. December 2013. http://mallet.cs.umass.edu/
54 "RTextTools: A Machine Learning Library for Text Classification." Web. December 2013. http://www.rtexttools.com/
55 "Rattle - Data Mining Toolkit in R." Google Project Hosting. Web. December 2013. http://code.google.com/p/rattle/
56 "KXEN - The Predictive Analytics Leader: KXEN, Inc." Web. December 2013. http://www.kxen.com/
Figure 31 Schema for Istat analysis of Big Data coming from the Internet

4.7 Considerations on first examples of Big Data usage in
Official Statistics

Here we list some considerations derived from the first examples of Big Data
usage by statistical organizations:
Automated or assisted data collection - Following the first experiences in
data collection from the internet, it is useful to distinguish between (i)
automated data collection, i.e. collection of data from internet sites with
many similar items, approached with internet robots that run without user
interaction, and (ii) robot-assisted data collection for collecting data from
internet data sources with only a few items. For the second category, it is
important to assist the data collector in checking for changes in data on
internet sites. Both automated and robot-assisted data collection have proven
to be viable options for official statistics. Automated data collection can
result in more detailed data compared to data collected in traditional ways,
which may be used to validate our work, to improve efficiency or to reduce
response burden. Also, this kind of collection method may be used to study
phenomena in a completely new way. Robot-assisted data collection appeared to
be useful for collecting prices from many different internet sites in an
efficient way.
Netiquette - When statistical institutions use the internet to collect data
they have to respect internet etiquette (netiquette): many sites include a
file named robots.txt in their root to specify how and whether crawlers
should treat the site, and usually all statistical institutions respect this
robot exclusion protocol. Also, in order not to negatively influence site
performance, the standard suggestion is to configure robots to run nicely,
respecting a commonly accepted waiting time of one second between requests.
In addition, to operate as transparently as possible, robots should identify
themselves as being from an official statistics agency via the user-agent
string.
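A minimal sketch of a "nice" crawler honouring these rules: it checks robots.txt, identifies itself through the user-agent string and waits about one second between requests (the agency name and URLs are placeholders).

```python
# Polite crawling sketch: robots.txt check, self-identification, ~1 s delay.
import time
import urllib.robotparser
import requests

USER_AGENT = "OfficialStatisticsBot/1.0 (+http://nsi.example/about-our-robot)"

def polite_fetch(urls, delay=1.0):
    pages, robots = {}, {}
    for url in urls:
        base = "/".join(url.split("/", 3)[:3])        # scheme://host
        if base not in robots:
            rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
            rp.read()                                  # respect robot exclusion
            robots[base] = rp
        if robots[base].can_fetch(USER_AGENT, url):
            pages[url] = requests.get(url, headers={"User-Agent": USER_AGENT}).text
            time.sleep(delay)                          # do not overload the site
    return pages
```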
Communicating with site owners - One suggestion is to study the possibilities
of internet sources for a while first, and then to start communicating with
website owners, because they know their data better than anyone else. Many
website owners may offer to send the data directly from their back-office
system. If possible, it is better to opt for the direct connection rather
than the robot solution, as it is expected to be more stable and may even
contain more interesting data, such as numbers of items sold (as in scanner
data).
Another reason to communicate with data owners is that they may already have
an API (application programming interface) available to give partners access
to their data. In some cases in the Netherlands owners opened such APIs to
Statistics Netherlands, which started using them in combination with the
generic robot framework to access the data.
Website identification - Another important issue is the identification of the
websites to use for collecting data. For instance, if you have to collect
data for the Consumer Price Index you have to know which sites would offer
reliable data, which sites would offer original content and which would
replicate the content of others, how easy it would be to read the data, which
variables are available and how comparable they are across different sites.
In addition, you have to know how the volume of data grows and how volatile
the data are. This is typical for internet data collection. Unlike more
traditional data sources (administrative sources, questionnaires), where data
characteristics are known by the delivering organisation or controlled by the
statistics office in the case of questionnaires, statistics offices do not
control internet data. It is more like observational data, where you have to
work with the data you can get.
Legal framework - Additional legal steps are needed to enable the production
of official statistics using big data. The current legislative framework for
statistics in many countries does not cover access to and use of big data,
both within government and from the private sector. So it is particularly
difficult to gain access to the big data collected and kept by other parties.
Furthermore, a privacy framework is needed that sets the ground rules for
how big data sets can be combined, protected, shared, exposed, analysed and
retained. This would address the significant issue of public trust in the
appropriate use by government of the personal data of individuals. It is
important to maintain public trust: individuals must be sure that their
personal information will be well protected. For example, in the area of
mobile telephone location data, even if identification is suppressed, people
will still be highly concerned about the transfer of such information from the
mobile providers to other parties like NSIs. Similarly, mobile device
providers need guarantees that privacy rights will not be violated when they
turn over their data to the Government.
Errors management - Big Data errors can occur at various stages. Usually when
statistical institutions manage surveys, they apply the Total Survey Error
framework [Biemer 2012]: this methodology could be reviewed to determine how
it could be applied to Big Data.
Some types of errors could be source specific; others would potentially apply
to all sources (i.e. construct validity, coverage error, measurement error,
error due to imputation, non-response error). Sampling error would apply to
the specific cases where sampling techniques are used. A few examples of the
types of errors are given below, distinguished according to the different big
data sources:
Human beings: measurement error (mode effect, respondent-related factors
involving cognitive aspects and/or the will to declare the true value);
Information systems: lack of consistency in the constructs used in
different files;
Machines/sensors: measurement errors caused by malfunction, misuse, etc.
Data cleaning - When processing Big Data, the processing steps should include
a preliminary reception function, where data are first verified and
pre-treated, followed by a more in-depth analysis where erroneous data,
outliers and missing values are flagged to be processed (a minimal sketch of
this flagging step follows the points below).
All types of big data sources can potentially suffer from partial
non-response (missing values for specific variables). So for data cleaning
you have to consider the following points:
Knowledge about the data and the associated metadata is a key factor in
the development of efficient processing methods;
Given the overall size of the data, outliers may not be influential
compared to traditional statistical processing;
While imputation methods are well known for numerical data in traditional
statistics, few experiences are available yet for imputing text strings
and, moreover, other types of unstructured data.
Unstructured data - The huge increase in the availability of unstructured
textual data requires statistical institutes to increase investment in tools
able to analyze textual data coming from the web. These tools, ranging from
web scraping to textual analysis and text mining, should become part of the
standard toolbox of statisticians and data analysts.

Chapter 5: International cooperation on Big Data in
Official Statistics

In this final chapter the initiatives taken on Big Data by international
bodies and the most advanced national statistical offices are described.
Special attention is given to the HLG project, which aims to create a shared
environment in which to experiment, in a coordinated way, with new methods
and new software tools on large statistical archives. Then we list
suggestions for the steps to be followed when a statistical organization
starts dealing with Big Data. Finally we indicate some recommendations which
we believe would be important to follow in the field of international work on
Big Data.

5.1 High Level Group

The High-Level Group for the Modernisation of Statistical Production and
Services (HLG57) was set up by the Bureau of the Conference of European
Statisticians (CES58) in 2010 to oversee and coordinate international work
relating to the development of enterprise architectures within statistical
organizations. It was originally called the High-Level Group for Strategic
Developments in Business Architecture in Statistics (HLG-BAS).
The HLG is composed of the heads of some important National Statistical
Institutes and of the heads of international statistical organizations like
Eurostat, OECD and UNECE.
Before the establishment of the HLG there were many groups concerned with
standardization in the field of statistics: Figure 32 shows an outline of the
main groups and their relationships.

57 "High-Level Group for the Modernisation of Statistical Production and Services." UNECE Statistics Wikis. Web. December 2013. http://www1.unece.org/stat/platform/display/hlgbas/High-Level%2BGroup%2Bfor%2Bthe%2BModernisation%2Bof%2BStatistical%2BProduction%2Band%2BServices
58 "Conference of European Statisticians (CES) - UNECE." Web. December 2013. http://www.unece.org/stats/ces.html
In 2011 the HLG released its strategy to implement the vision for modernizing
official statistics59, which was endorsed by the Conference of European
Statisticians (CES) in June 2011. In this document the HLG addresses the
issue of the so-called data deluge: "The main theme of the vision was that
statistical organizations are confronted with accelerating change in society
and the way that data are produced and used within the information industry.
Official statistics faces all of the opportunities and threats that accompany
a data deluge."

Figure 32 Schema of international groups working on statistical standards

In its vision the HLG then analyses the key factors that define the role of
official statistics. They include factors like Quality, Trust, User needs,
Strategic partnerships, Common standards and others. In the chapter dedicated
to Rationalising processes there is a proposal: (d) Develop new
59 "HLG Strategy." UNECE Statistics Wikis. Web. December 2013. http://www1.unece.org/stat/platform/display/hlgbas/HLG%2BStrategy
methodologies to reflect the changes in data acquisition and the dramatic
increase of the volume of data available, for example, on topics such as noise
and error reduction in large data sets, pattern recognition and other
methodological tools appropriate for "Big Data".
Following this proposal, in March 2013 the HLG published the paper "What does
Big Data mean for Official Statistics?"60. The paper starts with this
sentence: "this paper will seek to address two fundamental questions, i.e.
the What and the How: What subset of Big Data should National Statistical
Organisations (NSOs) focus on given the role of official statistics, and How
can NSOs use Big Data and overcome the challenges it presents?"
An important initial consideration compares the potential of Big Data to
produce relevant and timely statistics with that of the traditional sources
of official statistics. In the past, official statistics have been based on
survey data collections and on administrative data from government programs.
But with the great change brought about by Big Data, most data are now
available on the Web or through private companies. And so the private sector,
using Big Data, can produce statistics that are timelier and more relevant
than official ones.
On the other hand National Statistical Offices have a long experience in
managing large amounts of data, addressing their accuracy, consistency and
interpretability. Integrating Big Data sources into official statistics processes
can be the best way for NSOs to protect the role of official statistics while
working on relevance and timeliness.
In the paper we can find an analysis of Big Data definitions, a list of the
challenges posed to official statistics and a set of examples of possible uses
of Big Data in official statistics. At the end of the document we find a final
paragraph of recommendations:
- During the next two years there is a need to identify a few pilot
projects that will serve as proof of concept with the participation of a
few countries collaborating. The results could then be presented to the
HLG;
- Statistical organisations are encouraged to formally address Big Data
issues (methodological and technological challenges) in their work
60 "What Does Big Data Mean for Official Statistics?" UNECE Statistics Wikis. Web. December 2013. http://www1.unece.org/stat/platform/pages/viewpage.action?pageId=77170614
programmes by undertaking research and pilot projects in selected
areas;
- New exploration and analysis methods are required: Visualization
methods, Text mining, and High Performance Computing;
- Successful use cases on Big Data analytics tools and solutions should
be brought to the attention of the international statistical community;
- National examples of collaboration of NSOs with private data source
owners in this field, addressing some of the issues of granting NSOs
privileged access to private sourced Big Data should be part of the
priority actions;
- To use Big Data, statisticians are needed with an analytical mind-set, an
affinity for IT (e.g. programming skills) and a determination to extract
valuable knowledge from data. These so-called "data scientists" can come
from various scientific disciplines;
- In the short to medium terms, NSOs should develop the necessary
internal analytical capability through specialised training and
international collaboration;
- There's a need for the drafting of guidelines / principles for the
effective use of Big Data for purposes of official statistics;
- HLG should ensure that the outputs of several future activities on Big
Data (dedicated session, workshops and seminars) are effectively
coordinated and communicated at the strategic level.

5.2 MSIS 2013

At the 2013 MSIS (Meeting on the Management of Statistical Information
Systems), held in May in Paris and in Bangkok61, we can find many points of
discussion focused on Big Data. Here we mention some participants' comments
(from the final report of the meeting):

61 "Meeting on the Management of Statistical Information Systems." UNECE. Web. December 2013. http://www.unece.org/stats/documents/2013.04.msis.html
Based on experiments so far, it is likely that Big Data will require new
methods and infrastructures, and new ways of defining and measuring
quality;
A classification of types of Big Data is needed;
This is an ideal time to start collaborating on Big Data, as we don't yet
have systems in place. An architecture for Big Data is needed, as well
as collaboration with the wider information industry;
We have common issues in the use of Big Data, so we need
mechanisms to work together to find solutions. This should be a priority
issue for the HLG;
It is important to take a multidisciplinary approach to Big Data,
currently different groups are all looking at this issue from their own
perspectives;
Agreeing a common classification of the different types of Big Data
should be an early priority;
We need a concrete project to produce specific statistics from Big Data,
and to find real solutions;
A virtual task team should be set up to define the issues and formulate a
clear project proposal, which would be passed to the HLG.

Figure 33 Web page of MSIS 2013 meeting

In the summary of the session on Collaboration, the Session Organizer asked:
"How can we collaborate on Big Data? Many initiatives and groups are involved
in collaboration. How can the MSIS group contribute most effectively in this
new structure?"
A special panel session was also organized with the title "Plug and play
architecture and collaborative development will allow us to accelerate our
Big Data programs by ...". The panel discussion followed a Pecha Kucha
[Pink 2007] format: 20 slides for 20 seconds each.
Key points raised by panelists and other participants included:
There are several options for dealing with Big Data, ranging from
ignoring it to changing our whole business model to focus on Big Data
use. A solution somewhere between these extremes is more likely, and a
plug and play architecture would provide an essential basis for a
collaborative approach;
We have to deal with Big Data to keep our relevance, and a
collaborative approach can reduce the risk. We need to share tools,
ideas and skills, including beyond official statistics, for example with
industry and academia;
We may have to process Big Data in external environments (e.g. the
cloud), so it may not always be possible to use a plug and play
approach;
The main initial challenges in dealing with Big Data are both
methodological and technical;
Eurostat is coordinating a common architecture initiative within the
European Statistical System, based on the CSPA, which will provide a
framework for dealing with Big Data;
Big Data will require a case by case, opportunistic approach, and, at
least initially, could be seen as a complementary source rather than a
replacement for more traditional sources;

Processing time needs to be improved. Taking over a year to produce
census results is no longer acceptable. We should aim for real-time
processing, as is increasingly the case in the commercial sector;
The fact that Big Data are often not stored permanently is an issue
because resulting statistics may not be reproducible, which has
implications for quality management;
The size of Big Data should not be a major issue as storage and
processing capabilities are constantly increasing;
We need to identify and develop the knowledge and skills necessary to
use Big Data. New skill-sets will be needed;
Legal and licensing issues need to be addressed, particularly regarding
consistency of data supply;
Big Data provides an opportunity for the CSPA project to deliver useful
outputs where nothing currently exists.
Interestingly, many speakers emphasized the close link between the theme of
Big Data and standardization initiatives on architectures, based on the
so-called plug-and-play model (CSPA62).

5.3 Task Team

In May 2013 a temporary task team63 was set up, composed of members from
thirteen countries or international organizations. The task team, working
virtually through teleconferences and sharing documents on the wiki, started
to work out the key issues in using Big Data for official statistics, to
identify priority actions and to formulate a project proposal.
As preliminary activities, two tasks have been undertaken by the temporary
task team.
62 "Common Statistical Production Architecture Home." UNECE Statistics Wikis. Web. December 2013. http://www1.unece.org/stat/platform/display/CSPA/Common%2BStatistical%2BProduction%2BArchitecture%2BHome
63 "Members of the Task Team." UNECE Statistics Wikis. Web. December 2013. http://www1.unece.org/stat/platform/display/msis/Members%2Bof%2Bthe%2Btask%2Bteam
The first was the formulation of a classification scheme for Big Data sources
(see Chapter 2) and the identification of the attributes of these sources
relevant to their use in the production of official statistics.
The second was the development of an inventory64 containing structured and
searchable information about actual and planned use of Big Data in statistical
organizations. The inventory aims to include the key resources that will be
most valuable for the official statistics community. Where a resource already
exists somewhere on the Internet, the inventory holds links to resources and
documentation, to avoid duplication and reduce problems of version control.

Figure 34 Web page for Big Data Inventory


The inventory is hosted on the UNECE wiki platform. The wiki approach
makes it easier for the inventory to be maintained by the whole statistical
community, thus avoiding the burden of maintenance falling on one
individual or organization. This approach also ensures that the inventory is a
relevant and useful resource for all. Read access is open to the public, for
inventory entries for which author organizations have given permission. Edit
access is restricted to staff of national or international statistical organizations
with an interest in this topic.

64 "Big Data Inventory." UNECE Statistics Wikis. Web. December 2013. http://www1.unece.org/stat/platform/display/msis/Big%2BData%2BInventory
The "owner" of the inventory is the High-Level Group on the Modernisation
of Statistical Production and Services, on behalf of the international statistical
community.
For each Big Data source the inventory collects the following information (a
hypothetical sketch of such an entry follows the list):
Country/Organization name
Contact person for the case study (name, email, telephone)
Type of Big Data used
Project description
National or international scope of data source
Public/private source
Data access framework
Payment for data (Yes/No/Fees)
Data access
Statistical domain and use of data
Degree of progress in use of the Big Data source
Tools and methods for processing
Privacy and confidentiality issues
Links and attachments
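Purely as a hypothetical illustration of how such an entry could be represented outside the wiki (this is not an official schema), the attributes above map naturally onto a simple record structure:

```python
# Hypothetical record structure mirroring the inventory attributes above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class InventoryEntry:
    organization: str
    contact: str
    big_data_type: str            # e.g. "web scraping", "mobile positioning"
    description: str
    scope: str                    # "national" or "international"
    public_source: bool
    access_framework: str
    payment: str                  # "yes", "no" or fee details
    data_access: str
    statistical_domain: str
    progress: str                 # e.g. "planned", "pilot", "production"
    tools_and_methods: List[str] = field(default_factory=list)
    privacy_issues: str = ""
    links: List[str] = field(default_factory=list)
```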

5.3.1 Project proposed by the task team

The project proposed to the HLG by the task team has three main objectives:
To provide guidance for statistical organizations to identify the main
possibilities offered by Big Data and to act upon the main strategic and
methodological issues that Big Data poses for the official statistics
industry;

To demonstrate the feasibility of efficient production of official
statistics using Big Data sources, and the possibility to replicate these
approaches across different national contexts;
To facilitate the sharing across organizations of knowledge, expertise,
tools and methods for the production of statistics using Big Data
sources.
The work for the project will start from the major challenges listed in the
HLG paper What does Big Data mean for Official Statistics? (see 4.1), i.e.
legislative, privacy, financial, management, methodological and
technological. The task team tried to identify a variety of issues and
expressed them in the form of questions:
How can we assess the suitability of Big Data sources for the
production of official statistics?
How can we actually benefit from the increased timeliness offered by
many Big Data sources?
Can we identify best practices for the major methodological issues
relating to Big Data? E.g.:
o Methods for reducing data volume
o Methods for noise reduction
o Methods for ensuring confidentiality and avoiding disclosure
o Methods for obtaining information on statistical concepts (text
mining, classification methods, etc.)
o Methods for determination of population characteristics, e.g.
correlating words used by social media users with certain
demographic characteristics
Should Big Data be treated as homogeneous, or do they require
different treatment according to the role they play in the production of
official statistics?
o Experimental uses
o Complementing existing statistics e.g. benchmarking and validity
checking
o Supplementing existing sources, permitting the creation of entirely
new statistics
o Replacing existing sources and methods
Are there 'quick wins', applicable beyond Big Data, such as data
storage, technology, advanced analytics, methods and models which
could transform our thinking in relation to the production of official
statistics more generally?
How should statistical organizations react to the novel idea that in a Big
Data world there are no 'bad' data (they all tell us something)?
How can organizations mitigate the risk of a data source ceasing to
exist, or changing substantially, when it is outside the control of the
organization?
How can Big Data be combined with survey data? How can we manage
the transition from statistical data production based on surveys to
production based substantially on Big Data?
Do we need a research question before exploring a Big Data source, or
should we just experiment to see what is possible?
What becomes of the time series in a world where data sources and uses
may become more transient?
How will institutional structures need to change in order to support the
use of Big Data and ensure its quality and the quality of resulting
outputs?
The output will take the form of recommendations, good practices and
guidelines, developed through broad consultation of experts throughout the
official statistics community, and coordinated by expert task teams. The
material will be collated in a Web environment such as a wiki, thus allowing the guidelines to function as a 'living document', permitting timely updating.

The sandbox will form the practical element of the project, aimed at
proving concepts in two related strands:
(a) Statistics:

the possibility of producing reliable statistics from novel sources,
including the ability to produce statistics which correspond with
existing 'mainstream' products, such as price statistics;
the cross-country applicability of new analytical techniques and
sources, such as the analysis of data from social networking websites.
(b) Tools:
the efficiency of various software tools for large-scale processing and
analysis;
the applicability of the Common Statistical Production Architecture
(CSPA) to the production of statistics using Big Data sources.
A web-accessible environment for the storage and analysis of large-scale
datasets will be created and used as a 'sandbox' for collaboration across
participating institutions. Some internationally-relevant datasets will be
obtained and installed in this environment, with the goal of exploring the
tools and methods needed for statistical production and the feasibility of
producing Big Data-derived statistics. Simple configurations with tools and
data will, whenever possible, be released in 'virtual machines' that partners
will be able to download in order to test them within their own technical
environments.
More details on the Sandbox subproject are given in 5.4 below.
In such projects it is very important to ensure that the conclusions reached are shared broadly throughout the statistical world and beyond.
This will be done through a variety of means, including:
a) establishing and maintaining an online infrastructure for documentation
and information-sharing on the UNECE wiki, including detailed
documentation from work packages 1 and 2;
b) preparation of electronic demonstrations of tools and results, for
example in the form of presentations and videos which can be
disseminated widely. Identification of existing electronic resources and
online training materials is also included in this strand;

c) a workshop in which the results of the Sandbox subproject will be presented to members of the various expert groups involved in the HLG's modernisation programme.
Definition of success
Overall, this project will be successful if it results in an improved
understanding within the international statistical community of the
opportunities and issues associated with using Big Data for the production of
official statistics. More detailed success criteria are:
To reach a consistent international view of Big Data opportunities, challenges and solutions, documented and released through a public web site;
To share recommendations on appropriate tools, methods and environments for processing and analysing different types of Big Data, and a report on the feasibility of establishing a shared approach for using Big Data sources;
To exchange knowledge and ideas between interested organizations and to produce a set of standard training materials;

5.4 Sandbox subproject

5.4.1 Goals of Sandbox

The Sandbox project will be successful if it results in:


Strong, justified and internationally applicable recommendations on
tools, methods and environments for processing and analysing different
types of Big Data;
A report on the feasibility of establishing a shared approach for using Big Data sources that are multi-national, or for which similar sources are available in different countries.
The value of the subproject derives from its international nature. While
individual statistical organizations can experiment with the production of
official statistics from Big Data sources (and many are currently doing so or
have already done so), and can share their findings and methods with other
organizations, this project will be able to do the same in a more open and
collaborative setting. The Sandbox will draw on the international nature
and/or international ownership and management of many Big Data sources,
and will capitalise on the collective bargaining power of the statistical
community acting as one in relation to such large transnational entities. The
sandbox will contribute to the overall value of the project by providing a
common methodology from the outset, precluding the need for post-hoc
efforts to harmonise methodology in the future.

5.4.2 Specific objectives

The Sandbox will evaluate the feasibility of the following propositions, and it
will demonstrate and document how the actions would be achievable in
statistical organizations:
1. 'Big Data' sources can be obtained (in a stable and replicable way),
installed and manipulated with relative ease and efficiency on the chosen
platform, within technological and financial constraints that are realistic
reflections of the situation of national statistical offices;
2. The chosen sources can be processed to produce statistics which conform
to the usual quality criteria used to assess official statistics, and which are
comparable across countries;
3. The resulting statistics correspond in a systematic and predictable way with existing mainstream products, such as price statistics, household budget indicators, etc. (an illustrative price-index sketch follows this list);
4. The chosen platforms, tools, methods and datasets can be used in similar
ways to produce analogous statistics in different countries;
5. The different participating countries can share tools, methods, datasets and
results efficiently, operating on the principles established in the Common
Statistical Production Architecture.
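As a purely illustrative aid to proposition 3 (and not the project's agreed methodology), one simple way to relate scraped prices to a mainstream product such as a consumer price index is an elementary, Jevons-type index, i.e. the unweighted geometric mean of the price relatives of matched items. The sketch below, with invented data, shows the calculation in R.

# Illustrative only: an elementary Jevons-type index computed from scraped
# prices for matched items; the data frame and its values are invented.
scraped <- data.frame(
  item   = rep(c("A", "B", "C"), times = 2),
  period = rep(c(0, 1), each = 3),
  price  = c(1.00, 2.50, 4.00,   # base period prices
             1.10, 2.45, 4.20)   # current period prices
)

jevons <- function(p0, p1) exp(mean(log(p1 / p0)))  # geometric mean of relatives

p0 <- scraped$price[scraped$period == 0]
p1 <- scraped$price[scraped$period == 1]
100 * jevons(p0, p1)   # index for period 1, base period = 100 (about 104.2)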
While the first objective is to examine these propositions (the 'proof of
concept'), a second objective is to then use these findings to produce a
general model for achieving the goal of producing statistics from Big Data,
and to communicate this effectively to statistical organizations. So, all
processes, findings, lessons learned and results will be recorded for
dissemination and training activities. In particular, experiences and best
practices for obtaining data will be detailed for the benefit of other
organizations.

5.4.3 Basis for the Recommendations

The task team considered a wide range of alternative possibilities for tools,
datasets and statistics and assessed them against various criteria. These
included the following:
Tools
Whether or not the tools are open source
Ease of use for statistical office staff
Possibilities for interoperability and integration with other tools

Ease of integration into existing statistical production architectures
Cost and licences
Availability of documentation
Availability of online tutorials and support
Training requirements, including whether or not a vendor-specific
language has to be learned
The existence of an active and knowledgeable user community.
Datasets
Ease of locating and obtaining data from providers
Cost of obtaining data (if any)
Stability (or expected stability) over time
Availability of data that can be used by several countries, or data whose
format is at least broadly homogeneous across countries
The existence of identification variables which enable the merging of
Big Data sets with traditional statistical data sources
Statistics
At least one statistic that corresponds closely and in a predictable way
with a mainstream statistic produced by most statistical organizations
One or more short term indicators of specific variables or cross-sectional statistics which permit the study of the detailed relationships between variables
One or more statistics that represent a new, non-traditional output (i.e.
something that has not generally been measured by producers of official
statistics, be it a novel social phenomenon or an existing one where the
need to measure it has only recently arisen)

5.4.4 Recommendations and Resource Requirements

The following are the recommendations from the task team that will have to be followed in setting up the "sandbox":

1. Processing environment: HortonWorks65 Hadoop distribution to be
installed on a cluster provided by a volunteering statistical organization;
2. Processing tools/software: the Pentaho Business Analytics 66 Suite
Enterprise Edition will be deployed under a free trial license obtained
for the purpose of the project. Pentaho Business Analytics Suite
provides a unified, interactive, and visual environment for data
integration, data analysis, data mining, visualization, and other
capabilities. Pentaho's Data Integration67 component is fully compatible
with Hortonworks Hadoop and allows 'drag and drop' development of
MapReduce jobs. In addition to the Pentaho suite, tools such as R68 and RHadoop69 will be installed (a minimal usage sketch follows this list);
3. Datasets and statistics to be produced (or feasibility of production to be
demonstrated): one or more from each of the categories below have to
be installed in the sandbox and experimented with for the creation of
appropriate corresponding statistics:
a. Transactional sources (from banks/telecommunications
providers/retail outlets) to enable the recreation of standard official
statistics in the easiest possible way, minimizing as far as possible
potential obstacles to access;
b. Sensor data sources;
c. Social network sources, image or video-based sources, other less-
explored sources (to enable the creation of 'new' statistics);
4. Human resource requirements: a task team will be identified at the
outset of the project, composed of experts whose time is volunteered in
kind by their respective organizations for the duration of the work
package. The project manager's first task will be to identify the number
of members required, the requisite skills and the amount of time to be
65 "We Do Hadoop. On Windows Too!" Hortonworks. Web. December 2013. http://hortonworks.com/
66 "Business Analytics from Pentaho - Leader in Business Analytics." Pentaho. Web. December 2013. http://www.pentaho.com/product/business-visualization-analytics
67 "Data Integration | Pentaho Business Analytics Platform." Pentaho. Web. December 2013. http://www.pentaho.com/product/data-integration
68 "The R Project for Statistical Computing." The R Project for Statistical Computing. Web. December 2013. http://www.r-project.org/
69 "RevolutionAnalytics / RHadoop." GitHub. Web. December 2013. https://github.com/RevolutionAnalytics/RHadoop/wiki
committed by task team members to enable the work to progress.
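As a minimal sketch of what recommendation 2 implies for users, the fragment below shows the shape of a MapReduce job written in R with RHadoop's rmr2 package. It assumes rmr2 is installed; the 'local' backend is used only so that the example can be tried without a Hadoop cluster before moving to the sandbox.

# Minimal sketch of an rmr2 MapReduce job (assumes the rmr2 package is installed).
library(rmr2)
rmr.options(backend = "local")   # the 'local' backend runs without a Hadoop cluster

small.ints <- to.dfs(1:1000)     # toy input written to the (local) DFS

# Map: emit (last digit, 1); Reduce: count the occurrences of each key.
counts <- mapreduce(
  input  = small.ints,
  map    = function(k, v) keyval(v %% 10, 1),
  reduce = function(k, vv) keyval(k, sum(vv))
)

from.dfs(counts)                 # bring the (key, count) pairs back into the R session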
The sandbox will be hosted at the Irish Centre for High-End Computing (ICHEC70), whose mission is to provide High-Performance Computing resources, support, education and training for researchers in third-level institutions. ICHEC is a partner in the European High Performance
Computing (HPC) service PRACE71, Partnership for Advanced Computing in
Europe [Prace 2012].
ICHEC will assist the task team to implement the Big Data environment for
the testing and evaluation of Hadoop work-flows and associated data analysis
application software.
The hardware on which the sandbox system is to be based is a High
Performance Computing Linux cluster hosted in the National University of
Ireland, Galway and is composed of 60 compute nodes each of which has two
Xeon quad-core processors, 48GB of RAM and a 1TB local disk. Each node
is connected to two networks: one for accessing the shared Lustre filesystem and for high-performance communications, and a Gigabit Ethernet network for management tasks. In addition, a 20TB shared
filesystem is available to all nodes.
ICHEC will dedicate to the sandbox project 20 compute nodes to enable a
Hadoop cluster with 160 cores, almost 1TB of RAM and 20TB of HDFS
distributed storage. In addition, the Hortonworks Data Platform Hadoop
distribution will be installed on the cluster as well as the R and Pentaho
application suites. User accounts for up to 30 participants of the sandbox
project will be created to allow remote access to the system.
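A quick back-of-the-envelope check, using only the per-node figures stated above (two quad-core Xeons, 48GB of RAM and a 1TB local disk per node), confirms the quoted sizing of the dedicated partition:

# Sanity check of the sandbox sizing from the stated per-node figures.
nodes   <- 20
cores   <- nodes * 2 * 4        # 2 quad-core Xeons per node -> 160 cores
ram_tb  <- nodes * 48 / 1024    # 960 GB, i.e. "almost 1TB" of RAM
hdfs_tb <- nodes * 1            # 20TB of raw HDFS capacity (before replication)
c(cores = cores, ram_tb = ram_tb, hdfs_tb = hdfs_tb)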

70 "Ireland's High-Performance Computing Centre." Ireland's High-Performance Computing Centre. Web. December 2013. https://www.ichec.ie/
71 "PRACE Research Infrastructure." PRACE Research Infrastructure. Web. December 2013. http://www.prace-project.eu/?lang=en
5.5 Recommendations on Big Data coming from
international cooperation

Principle 5 of the Fundamental Principles of Official Statistics72 states that:


"Data for statistical purposes may be drawn from all types of sources, be they statistical surveys or administrative records. Statistical agencies are to choose the source with regard to quality, timeliness, costs and the burden on respondents."
This means that it is acceptable to use any type of Big Data source, as long as it meets the four requirements of quality, timeliness, cost and burden.
International cooperation on Big Data has led to the definition of the considerations that statistical organizations will need to take into account when looking at Big Data sources. These considerations can be summarized in the following steps:
1. The first step is to identify potential sources of Big Data. This may be
done using internet searches, data markets or good practices from other
organizations;
2. The second step is an assessment of the suitability of these sources. For
this, a suitable quality assessment framework is needed. The HLG
project is starting to develop such a framework, using as a starting point
the National Quality Assurance Framework73 developed by the UN
Statistical Division. This assessment should include topics such as
relevance (how close are the measured concepts and the concepts
required for statistical purposes), coverage of the target population and
representativeness. The general caveats on use and suitability should be
distinguished according to the subgroup typology of Big Data (see 2.2 above).
3. The third step is to get access to the Big Data source. Four types of
frameworks are needed to manage access:
a. Legal framework - laws and rules to guarantee, permit or restrict
access
b. Policy framework - government or organizational policies,
72 "Fundamental Principles of Official Statistics - UNECE." Web. December 2013. http://www.unece.org/stats/archive/docs.fp.e.html
73 "National Quality Assurance Framework." National Quality Assurance Framework. Web. December 2013. http://unstats.un.org/unsd/dnss/QualityNQAF/nqaf.aspx
including codes of practice and data protection
c. Organizational framework - data access arrangements with
suppliers, e.g. contracts, informal agreements etc.
d. Technical framework - data and metadata transfer formats and
mechanisms
4. The fourth step is to produce pilot outputs for more in-depth assessment
at the output stage. This may involve testing data integration methods
and routines, in the case of multiple-source outputs.
5. The fifth step is to use this assessment to decide whether to use the
source in practice.
6. Finally, institutions should repeat the third step, but with a view to
longer-term sustainability of access (assuming the source will be used
to produce a repeated statistical output).
In my opinion, further recommendations should be added for the future activities of statistical organizations in the field of Big Data:
Shared practice - Statistical organizations should jointly develop best
practices and processes related to gaining permission to access datasets,
gaining permission for different data uses, confirming 'ownership' of the data,
determining how, and how often, the data will be accessed;
Unwanted advantages - The statistical community should verify whether the use of data from specific sources in official statistics could confer a 'default seal of quality' on those sources. This 'default seal' could provide an unintended competitive advantage to the provider.
Costs and reputation - Whenever statistical organizations pay for data, they
should compare these costs with the costs of traditional collection through
surveys and/or with the costs of commercialisation of data. In addition, they should verify whether paying for data could have a negative impact on their reputation.
Bargaining power - Statistical organizations could leverage their collective
bargaining power to enter into these relationships with the multi-national
companies, e.g. Google, Facebook, Twitter, Vodafone. Statistical organizations
could influence their data streams and output to negotiate 'standardised' data
feeds. Statistical Organizations could agree on how much 'overhead' they
could ask these providers to take on.
Statistical Organizations' role - It is very important in a Big Data world that Statistical Organizations are seen as significant contributors, if not leaders, in specific methodological areas. Statistical organizations should work to be perceived as 'thinkers', and not only as data producers.

Conclusions and future actions

In Official Statistics institutions the use of Big Data is becoming an irreplaceable element that will increasingly complement the traditional data sources, namely surveys and administrative sources.
To fully integrate information coming from Big Data with traditional sources,
a number of challenges need to be addressed in the various domains:
technological, methodological and legal.
In the technology field, the challenge concerns the maturity of the platforms to be used. Current products available for Big Data management (we refer in particular to the Hadoop ecosystem and NoSQL databases) still need to evolve in order to achieve stability and a level of standardization comparable with those of Relational Database Management Systems.
Statistical methodologists still need to refine the methodological tools to
address the new challenges posed by Big Data, in particular with regard to
the quality of the data collected and the representativeness of data coming from multiple sources no longer directly managed by statistical organizations.
From a legal point of view, reference standards must be defined that are suitable for managing the confidentiality and full availability of data coming from sources external to the statistical system and collected in innovative ways that do not always make it possible to identify the ownership of, and responsibility for, the data.

Discussion and lessons learned


In the previous chapters we presented considerations on actions to be taken by Statistical Organizations when dealing with Big Data.
Recommendations arising from the use of new technologies (see 2.5) can be grouped into three parts:
How to use traditional statistical software to manage Big Data;
Which new skills, not yet available in statistical organizations, are needed;
What are some possible solutions to the impact that Big Data can have
on IT infrastructure.
Another group of considerations derives from the analysis of the first experiences in the use of Big Data for statistical purposes (see 4.7). These considerations can be grouped into:
Data collection from the web: how to identify websites that can serve as data sources, when it is better to use robots and when to use assisted human data collection, how to manage the impact of data collection on target websites, and how to communicate with site owners to improve data quality (a minimal scraping sketch is given after this list);
The need for a new legal framework capable of describing the new relations
between the different actors: end users, data owners and public
agencies;
Methodological considerations about the management of new data: new methods for error management, new processes for data cleaning and new tools to tackle unstructured, often textual, data.
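As a purely illustrative example of the first point, the fragment below sketches robot-based collection of prices from a web page using R's XML package. The URL and the XPath expression are invented placeholders, and any real collection would have to respect the target site's robots.txt, terms of use and load constraints, as discussed above.

# Illustrative only: automated collection of prices from a hypothetical page.
library(XML)

url <- "http://www.example.org/products"     # placeholder URL
doc <- htmlParse(url)                        # download and parse the HTML page

# Extract the text of all elements tagged with class "price" (assumed layout).
prices_raw <- xpathSApply(doc, "//span[@class='price']", xmlValue)
prices <- as.numeric(gsub("[^0-9.]", "", prices_raw))   # basic cleaning

summary(prices)                              # quick check of the collected values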
A final set of recommendations (see 5.6) is directly derived from international activities. These focus on the definition of a standardised process for Big Data management and on international cooperation as a source of possible improvements in national policies.

Final considerations and future actions


The use of Big Data in official statistics is posing many challenges for statistical organizations, which have to deal with disruptive innovation in technologies, methods and organization.
These innovations can be better addressed in a framework of international cooperation, joining the forces of different statistical offices to find standardized and shared solutions. Actions of this kind have already been initiated and joint work is being conducted in various fields.
For the technological aspects, a shared environment (sandbox) has been set up where tools, methods and data sources will be tested in a shared mode by many statistical institutions and academic groups.

For methodological issues, other working groups, coordinated by United Nations agencies, are working to define a quality framework that allows statisticians to obtain for Big Data the same assurance of quality available today for traditional data sources.
Finally, for organizational and legal issues, national and international agencies are defining actions through which statistical data providers will offer solutions that will give a new role to official statistics, a role that will allow them to navigate safely in the data deluge.

Bibliography

[Abadi 2009] D. Abadi, P. Boncz, S. Harizopoulos, Column-oriented database systems in Proceedings of the VLDB Endowment 2.2 (2009): 1664-1665.

[Agrawal 2011] D. Agrawal, S. Das, A. El Abbadi, Big Data and cloud computing:
current state and future opportunities in Proceedings of the 14th International
Conference on Extending Database Technology. ACM, 2011 (available at
http://www.edbt.org/Proceedings/2011-Uppsala/papers/edbt/a50-agrawal.pdf accessed
December 2013).

[Ahas 2008] R. Ahas et al. Evaluating passive mobile positioning data for tourism
surveys: An Estonian case study in Tourism Management 29.3 (2008): 469-486
(available at http://www.congress.is/11thtourismstatisticsforum/papers/Rein_Ahas.pdf
accessed December 2013).

[Allison 2001] P. Allison, ed. Missing data. No. 136. Sage, 2001 (available at
http://www.statisticalhorizons.com/wp-content/uploads/2012/01/Milsap-Allison.pdf
accessed December 2013).

[Angles 2008] R. Angles, C. Gutierrez, Survey of graph database models in ACM Computing Surveys (CSUR) 40.1 (2008): 1. (available at
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.1072&rep=rep1&type=pd
f accessed December 2013).

[Baker 1995] R. Baker, N. Bradburn, R. Johnson, Computer-assisted Personal Interviewing: An Experimental Evaluation of Data Quality and Cost in Journal of Official Statistics, Vol. 11, No. 4, 1995, pp. 413-431.

[Balbi 2013] S. Balbi et al. BLUE-ETS Enterprise and Trade Statistics, FP7 Research
Project SSH-CT-2010-244767, Deliverable 5.2: Report on analysis of existing practices
in the data collection field (2013) (available at http://www.blue-
ets.istat.it/fileadmin/deliverables/Deliverable5.2.pdf accessed December 2013).

[Bell 2009] G. Bell, T. Hey, A. Szalay, Beyond the data deluge in Science, volume 323
(2009): 1297-1298 (available at
http://www.cloudinnovation.com.au/Bell_Hey%20_Szalay_Science_March_2009.pdf
accessed December 2013).

[Bettino 2012] L. Bettino, Transforming Big Data Challenges into Opportunities in Information Management, April 2012 (available at http://www.information-
management.com/newsletters/big-data-ROI-IBM-Walmart-USPS-10022342-
1.html#Login accessed December 2013).

[Biemer 2012] P. Biemer Total survey error: Design, implementation, and evaluation in
Public Opinion Quarterly 74.5 (2010): 817-848 (available at
http://poq.oxfordjournals.org/content/74/5/817.full.pdf accessed February 2014)

[Borne 2013] K. Borne Statistical Truisms in the Age of Big Data, 2013 in Wiley
Statistics Views (available at
http://stats.cwslive.wiley.com/details/feature/4911381/Statistical-Truisms-in-the-Age-of-
Big-Data.html accessed December 2013).

[Borthakur 2011] D. Borthakur et al., Apache Hadoop goes realtime at Facebook in Proceedings of the 2011 ACM SIGMOD International Conference on Management of
data, ACM, 2011 (available at http://ebookbrowsee.net/realtimehadoopsigmod2011-pdf-
d146414448 accessed December 2013).

[Boskin 1998] M. Boskin et al. Consumer prices, the consumer price index, and the cost
of living in Journal of Economic Perspectives 12 (1998): 3-26 (available at
http://www.statistica.unipd.it/insegnamenti/temimacro/matdid/measuring_cpi_JEP.pdf
accessed December 2013).

[Brackstone 1999] G. Brackstone, Managing Data Quality in a Statistical Agency, Statistics Canada, Survey Methodology, Vol. 25 No. 2, December 1999 (available at
https://www.virtualstatisticalsystem.org/vss_uploads/Brackstone_Managing_Data_Quali
ty.pdf accessed December 2013).

[Brewer 2000] E. Brewer, Towards robust distributed systems in Symposium on Principles of Distributed Computing (PODC) 2000 (available at
https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf accessed
December 2013).

[Brown 2011] B. Brown, M. Chui, J. Manyika, Are you ready for the era of Big Data?
In McKinsey Quarterly 4 (2011): 24-35 (available at
http://unm2020.unm.edu/knowledgebase/technology-2020/14-are-you-ready-for-the-era-
of-big-data-mckinsey-quarterly-11-10.pdf accessed December 2013).

[Buelens 2012] B. Buelens et al. Shifting paradigms in official statistics, from design-
based to model-based to algorithmic inference Discussion paper 201218, Statistics
Netherlands, The Hague/Heerlen (available at
http://www.cbs.nl/NR/rdonlyres/A94F8139-3DEE-45E3-AE38-
772F8869DD8C/0/201218x10pub.pdf accessed December 2013).

[Bughin 2010] J. Bughin, M. Chui, J. Manyika, Clouds, Big Data, and smart assets: Ten
tech-enabled business trends to watch in McKinsey Quarterly 56.1 (2010): 75-86
(available at http://www.itglobal-
services.de/files/100810_McK_Clouds_big_data_and%20smart%20assets.pdf accessed
December 2013).

[Casella 1999] G. Casella, C. Robert, Monte Carlo statistical methods (1999) (available
at http://www.stat.ufl.edu/~casella/MCSM08/short07class.pdf accessed December
2013).

[Cattel 2011] R. Cattell, Scalable SQL and NoSQL data stores in ACM SIGMOD
Record 39.4 (2011): 12-27 (available at http://www.sigmod.org/publications/sigmod-
record/1012/pdfs/04.surveys.cattell.pdf accessed December 2013).

[Cavallo 2009] A. Cavallo, Scraped Data and Sticky Prices: Frequency, Hazards, and
Synchronization, Unpublished paper, Harvard University (2009) (available at
http://chicagofed.org/digital_assets/others/research/research_calendar_attachments/semi
nars_2009/sem_cavallo120209.pdf accessed December 2013).

[Cavallo 2011] A. Cavallo, Scraped Data and Sticky Prices, American Economic
Association 2011 Conference (available at
www.aeaweb.org/aea/2011conference/program/retrieve.php?pdfid=403 )

[Cavallo 2011b] A. Cavallo, R. Rigobon, The distribution of the Size of Price Changes,
No. w16760. National Bureau of Economic Research, 2011 (available at
http://www.bcrp.gob.pe/docs/Publicaciones/Documentos-de-Trabajo/2011/Documento-
de-Trabajo-11-2011.pdf accessed December 2013).

[Codd 1979] E. Codd, Extending the database relational model to capture more meaning
in ACM Transactions on Database Systems (TODS) 4.4 (1979): 397-434.

[Cohen 2009] J. Cohen et al. MAD skills: new analysis practices for Big Data in
Proceedings of the VLDB Endowment 2.2 (2009): 1481-1492 (available at
https://community.emc.com/servlet/JiveServlet/downloadBody/8370-102-1-30903/GP-
Practices%20for%20big%20data%20-mad-skills.pdf accessed December 2013).

[Cook 2011] S. Cook et al. Assessing Google flu trends performance in the United States
during the 2009 influenza virus A (H1N1) pandemic PLoS One 6.8 (2011): e23610
(available at
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0023610
accessed December 2013).

[Couper 2000] M. Couper, Review: Web surveys: A review of issues and approaches in
The Public Opinion Quarterly 64.4 (2000): 464-494 (available at
http://kursovaya.googlecode.com/svn/trunk/5kurs/doc/Web%20Surveys.%20A%20Revi
ew%20of%20Issues%20and%20Approaches.pdf accessed December 2013).

[Daas 2012] P. Daas et al. Big Data and official statistics in Sharing Advisory Board,
Software Sharing Newsletter 7 (2012): 2-3. (available at
http://www1.unece.org/stat/platform/download/attachments/22478904/issue%207.pdf
accessed December 2013).

[Daas 2012b] P. Daas et al. Twitter as a potential source for official statistics in the
Netherland, CBS Discussion Paper (2012) (Available at
http://www.cbs.nl/NR/rdonlyres/04B7DD23-5443-4F98-B466-
1C67AAA19527/0/201221x10pub.pdf accessed December 2013)

[Daas 2013] P. Daas et al. Big Data and Official Statistics in NTTS (New Techniques and
Technologies for Statistics) 2013, Bruxelles (available at http://www.cros-
portal.eu/content/big-data-and-official-statistic-piet-daas-marco-puts-and-bart-buelens-
paul-van-den-hurk accessed December 2013).

[Davenport 2012] T. Davenport, D. Patil, Data scientist: the sexiest job of the 21st
century in Harvard business review 90.10 (2012): 70-77 (available at
http://www.dima.unige.it/SMID/diconodellastatistica/Data%20Scientist_%20The%20Se
xiest%20Job.pdf accessed December 2013).

[deHaan 2013] J. de Haan, R. Hendriks, Online Data, Fixed Effects and the
Construction of High-Frequency Price Indexes (2013) (available at
http://www.asb.unsw.edu.au/research/centreforappliedeconomicresearch/newsandevents/
newsarchive/Documents/Jan-de-Haan-Online-Price-Indexes.pdf accessed December
2013).

[Dean 2004] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating System Design and Implementation,
San Francisco, 2004 (available at
https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/
dean/dean_html/ accessed December 2013).

[Dede 2013] E. Dede et al., An Evaluation of Cassandra for Hadoop in Cloud
Computing (CLOUD), 2013 IEEE Sixth International Conference on. IEEE, 2013
(available at http://cs.binghamton.edu/~mgovinda/papers/dede-ieee-cloud-13.pdf
accessed December 2013).

[Devlin 2012] B. Devlin, S. Rogers, J. Myers, Big Data Comes of Age, EMA and 9sight
Consulting Research Report, 2012 (available at
http://www.9sight.com/Big_Data_Comes_of_Age.pdf accessed December 2013).

[DeWitt 1992] D. DeWitt, J. Gray, Parallel database systems: the future of high
performance database systems in Communications of the ACM n. 35.6 (1992): 85-98
(available at ftp://ftp.cs.wisc.edu/pub/techreports/1992/TR1079.pdf accessed December
2013).

[Dunne 2013] J. Dunne, Big Data coming soon...... to an NSI near you, Proceedings 59th
ISI World Statistics Congress, 25-30 August 2013, Hong Kong (available at
http://2013.isiproceedings.org/Files/STS018-P3-S.pdf accessed December 2013).

[Eddelbuettel 2014] D. Eddelbuettel, M. Stokely and J. Ooms. RProtoBuf: Efficient Cross-Language Data Serialization in R. arXiv preprint arXiv:1401.7372 (2014)
(available at
ftp://ftp2.de.freebsd.org/pub/misc/cran/web/packages/RProtoBuf/RProtoBuf.pdf
accessed February 2014)

[Ferrara 2012] E. Ferrara et al. Web data extraction, applications and techniques: a
survey, arXiv preprint arXiv:1207.0246 (2012) (available at
http://arxiv.org/pdf/1207.0246.pdf accessed December 2013).

[Gantz 2011] J. Gantz, D. Reinsel, Extracting value from Chaos, IDC Iview June 2011
(available at
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for
_innovation accessed December 2013).

[Ghemawat 2003] S. Ghemawat, H. Gobioff, S. Leung, The Google file system, ACM
SIGOPS Operating Systems Review. Vol. 37. No. 5. ACM, 2003 (available at
http://static.googleusercontent.com/media/research.google.com/it//archive/gfs-
sosp2003.pdf accessed December 2013).

[Gillick 2006] D. Gillick, A. Faria, J. DeNero, MapReduce: Distributed computing for machine learning, Berkeley, December 18, 2006 (available at http://cs.smith.edu/dftwiki/images/6/68/MapReduceDistributedComputingMachineLearning.pdf accessed December 2013).

[Ginsberg 2008] J. Ginsberg et al. Detecting influenza epidemics using search engine
query data Nature 457.7232 (2008): 1012-1014 (available at
http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html accessed
December 2013).

[Golder 2011] S. Golder, M. Macy, Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures in Science 333.6051 (2011): 1878-1881.

[Groves 2011] R. Groves, Three eras of survey research in Public Opinion Quarterly
75.5 (2011): 861-871 (available at
http://unix.cc.wmich.edu/~lewisj/332/PDF/History.pdf accessed December 2013).

[Guha 2012] S. Guha et al. Large complex data: divide and recombine (D&R) with
RHIPE in The ISI's Journal for the Rapid Dissemination of Statistic Research Stat 1.1
(2012): 53-67 (available at
http://www.stat.purdue.edu/~xbw/research/dr.rhipe.stat.2012.pdf accessed December
2013)

[Hoekstra 2012] R. Hoekstra, O. ten Bosch, F. Harteveld, Automated data collection from web sources for official statistics: First experiences in Statistical Journal of the
IAOS: Journal of the International Association for Official Statistics 28.3 (2012): 99-111
(available at http://www.rutgerhoekstra.com/publications/2010_Hoekstra-ten%20Bosch-
Harteveld-Automated%20data%20collection.pdf accessed December 2013).

[Horey 2012] J. Horey et al. Big Data platforms as a service: challenges and approach
in Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing.
USENIX Association, 2012 (available at
https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final61.pdf
accessed December 2013).

[Hunt 2010] P. Hunt et al. ZooKeeper: wait-free coordination for internet-scale systems
Proceedings of the 2010 USENIX conference on USENIX annual technical conference.
Vol. 8. 2010 (available at
https://www.usenix.org/legacy/event/usenix10/tech/full_papers/Hunt.pdf accessed
December 2013).

[Java 2007] A. Java et al. Why we twitter: understanding microblogging usage and
communities in Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on
Web mining and social network analysis. ACM, 2007 (available at
http://aisl.umbc.edu/resources/369.pdf accessed December 2013).

[Karlberg 2013] M. Karlberg, M. Skaliotis, Big Data for Official Statistics Strategies
and some initial European applications, Working Paper for the UNECE - CES Seminar
on Statistical Data Collection (Geneva, Switzerland, September 2013). (available at
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2013/mgt1/WP30.
pdf accessed December 2013).

[Keim 2010] D. Keim et al. eds. Mastering The Information Age-Solving Problems with
Visual Analytics. Florian Mansmann, 2010, ISBN 978-3-905673-77-7 (available at
http://www.vismaster.eu/wp-content/uploads/2010/11/VisMaster-book-lowres.pdf
accessed December 2013).

[Khare 2004] R. Khare, D. Cutting, K. Sitaker, A. Rifkin, Nutch: A flexible and scalable
open-source web search engine, Oregon State University, 32 (available at
http://www.master.netseven.it/files/262-Nutch.pdf accessed December 2013).

[Kwan 2009] E. Kwan, I. Lu, M. Friendly, Tableplot in Zeitschrift für Psychologie/Journal of Psychology 217.1 (2009): 38-48 (available at
http://www.datavis.ca/papers/Tableplot-Kwan-etal2009.pdf accessed December 2013).

[Lavalle 2011] S. LaValle et al. Big Data, analytics and the path from insights to value in MIT Sloan Management Review 52.2 (2011): 21-31 (available at
http://tuping.gsm.pku.edu.cn/Teaching/Mktrch/Readings/Big%20Data,%20Analytics%2
0and%20the%20Path%20from%20Insight%20to%20Value%202011.pdf accessed
December 2013).

[Liu 2013] Z. Liu, B. Jiang, J. Heer, imMens: Real-time visual querying of Big Data
submitted for publication EuroVis 2013 (available at http://www.zcliu.org/wp-
content/uploads/2013/04/2013-imMens-EuroVis.pdf accessed December 2013).

[Lyberg 1995] L. Lyberg et al, Survey Measurement and Process Quality, Wiley Series
in Probability and Statistics, John Wiley & Sons, New York, 1997, ISBN: 978-0-471-
16559-0.

[MacNicol 2004] R. MacNicol, B. French, Sybase IQ multiplex-designed for analytics in Proceedings of the Thirtieth international conference on Very large data bases-Volume
30. VLDB Endowment, 2004 (available at http://www.vldb.org/conf/2004/IND8P3.PDF
accessed December 2013).

[Mayer-Schonberger 2013] V. Mayer-Schonberger, K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work and Think, John Murray, Houghton Mifflin Harcourt, Boston, 2013, ISBN 978-0-544-00268-2.

[Manyika 2011] J. Manyika et al., Big Data: The next frontier for innovation,
competition, and productivity, McKinsey Global Institute 2011 (available at
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for
_innovation accessed December 2013).

[McAfee 2012] A. McAfee, E. Brynjolfsson, Big Data: the management revolution in Harvard Business Review 90.10 (2012): 60-66 (available at
http://automotivedigest.com/wp-content/uploads/2013/01/BigDataR1210Cf2.pdf
accessed December 2013).

[Mills 2012] S. Mills (ed.) et al., Demystifying Big Data A Practical Guide to
Transforming The Business of Government, prepared by TechAmerica Foundations
Federal Big Data Commission (available at
http://www.techamerica.org/Docs/fileManager.cfm?f=techamerica-bigdatareport-
final.pdf accessed December 2013).

[Mohebbi 2011] M. Mohebbi et al. Google correlate whitepaper. Web document 2011
(available at http://googleproof.org/trends/correlate/whitepaper.pdf accessed December
2013).

[O'Connor 2010] B. O'Connor et al. From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series in ICWSM (International Conference on Weblogs and Social Media) 11 (2010): 122-129 (available at
http://www.cs.cmu.edu/~nasmith/papers/oconnor+balasubramanyan+routledge+smith.ic
wsm10.pdf accessed December 2013).

[Olston 2008] C. Olston et al. Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of
data. ACM, 2008 (available at
http://www.dcs.bbk.ac.uk/~dell/teaching/cc/paper/sigmod08/p1099-olston.pdf accessed
December 2013).

[Ostrouchov 2013] G. Ostrouchov et al, Combining R with Scalable Libraries to Get the
Best of Both for Big Data in Proceedings of IASC Satellite Conference for the 59th ISI
WSC (available at
http://www.researchgate.net/publication/256326530_Combining_R_with_Scalable_Libr
aries_to_Get_the_Best_of_Both_for_Big_Data/file/9c96052251d9fb604c.pdf accessed
February 2014)

[Paraiso 2012] F. Paraiso et al. A federated multi-cloud PaaS infrastructure in Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE, 2012 (available at http://hal.archives-ouvertes.fr/docs/00/69/47/00/PDF/paper.pdf accessed December 2013).

[Parise 2012] S. Parise, B. Iyer, D. Vesset, Four strategies to capture and create value
from Big Data in IveyBusinessJournal, July-August 2012 (available at
http://iveybusinessjournal.com/topics/strategy/four-strategies-to-capture-and-create-
value-from-big-data#.Uuf2x_g1g8o accessed December 2013).

[Pink 2007] D. Pink, Pecha Kucha: Get to the PowerPoint in 20 slides then sit the hell
down in Wired (2007): 15-09 (available at
http://www.wired.com/techbiz/media/magazine/15-09/st_pechakucha# accessed
December 2013).

[Prace 2012] PRACE First Implementation Project, Seventh Framework Programme (2012) (available at http://prace-project.eu/IMG/pdf/d3.1.7_1ip.pdf accessed December 2013).

[Prajapati 2013] Vignesh Prajapati, Big Data Analytics with R and Hadoop, Packt
Publishing Ltd, 2013 (available at
http://www.revolutionanalytics.com/sites/default/files/r-and-hadoop-big-data-
analytics.pdf accessed December 2013)

[Saralaguei 2013] J. Saralaguei et al. Use Of Administrative Sources To Reduce Statistical Burden And Costs In Structural Business Surveys in NTTS (New Techniques
and Technologies for Statistics) 2013, Bruxelles (available at http://www.cros-
portal.eu/content/use-administrative-sources-reduce-statistical-burden-and-costs-
structural-business-surveys accessed December 2013).

[Schafer 2010] J. Shafer, S. Rixner, A. Cox, The Hadoop distributed filesystem: Balancing portability and performance in Performance Analysis of Systems & Software
(ISPASS), 2010 IEEE International Symposium on. IEEE, 2010 (available at
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.167.3342&rep=rep1&type=pd
f accessed December 2013).

[Schaller 1997] R. Schaller, Moore's law: past, present and future in Spectrum, IEEE
Vol. 34 (1997): 52-59 (available at
http://mprc.pku.edu.cn/courses/organization/autumn2013/paper/Moore%27s%20Law/M
oore%27s%20law%20past,%20present%20and%20future.pdf accessed December
2013).

[Schindler 2012] J. Schindler, I/O characteristics of NoSQL databases in Proceedings of the VLDB Endowment 5.12 (2012): 2020-2021 (available at
http://www.vldb.org/pvldb/vol5/p2020_jirischindler_vldb2012.pdf accessed December
2013).

[Seeger 2009] M. Seeger, Ultra-Large-Sites, Key-Value stores: a practical overview in Computer Science and Media, 2009 (available at http://blog.marc-
seeger.de/assets/papers/Ultra_Large_Sites_SS09-Seeger_Key_Value_Stores.pdf
accessed December 2013).

[SNZ 2012] Statistics New Zealand, Using cellphone data to measure population
movements: experimental analysis following the 22 February 2011 Christchurch
earthquake New Zealand government. ISBN 978-0-478-37759-0, 2012 (available at
http://www.stats.govt.nz/~/media/Statistics/services/earthquake-info/using-cellphone-
data-measure-pop-movement.pdf accessed December 2013).

[Sotomoayor 2009] B. Sotomayor et al. Virtual infrastructure management in private and hybrid clouds in Internet Computing, IEEE 13.5 (2009): 14-22 (available at
http://s3.amazonaws.com/academia.edu.documents/31063850/An_Open_Source_Solutio
n_for_Virtual_Infrastructure_Management_in_Private_and_Hybrid_Clouds.pdf?AWSA
ccessKeyId=AKIAJ56TQJRTWSMTNPEA&Expires=1390762372&Signature=fnTsXt
GpXd%2FxkJ2K9vRnwDdA%2Fqo%3D&response-content-disposition=inline accessed
December 2013).

[Stokely 2013] M. Stokely, HistogramTools for Distributions of Large Data Sets (available at
http://cran.wu.ac.at/web/packages/HistogramTools/vignettes/HistogramTools.pdf
accessed February 2014)

[Stonebracker 2010] M. Stonebraker, SQL databases v. NoSQL databases in Communications of the ACM 53.4 (2010): 10-11.

[Taniar 2008] D. Taniar, C. Leung, W. Rahayu, S. Goel, Database Processing and Grid
Databases, John Wiley & Sons, New Jersey, 2008, ISBN 978-0-470-10762-1.

[Tennekes 2011] M. Tennekes, E. de Jonge, P. Daas, Visual profiling of large statistical datasets, Paper for the 2011 European New Techniques and Technology for Statistics conference, Brussels, Belgium, 2011 (available at
http://www.pietdaas.nl/beta/pubs/pubs/NTTS2011_Tableplot_paper.pdf accessed
December 2013).

[Tennekes 2013] M. Tennekes, E. de Jonge, P. Daas, Visualizing and Inspecting Large Datasets with Tableplots in Journal of Data Science, 11, 43-58 (available at
http://www.jds-online.com/file_download/379/JDS-1108.pdf accessed December 2013).

[Thusoo 2010] A. Thusoo et al. Hive-a petabyte scale data warehouse using hadoop
Data Engineering (ICDE), 2010 IEEE 26th International Conference on. IEEE, 2010
(available at http://people.cs.kuleuven.be/~bettina.berendt/teaching/2010-11-
2ndsemester/ctdb/petabyte_facebook.pdf accessed December 2013).

[Tufecki 2013] Z. Tufekci, Big Data: Pitfalls, Methods and Concepts for an Emergent
Field, SSRN (March 2013) (available at
http://www.datascienceassn.org/sites/default/files/Big%20Data%20-
%20Pitfalls,%20Methods%20and%20Concepts%20for%20an%20Emergent%20Field.p
df accessed December 2013).

[Valdivia 2010] A. Valdivia et al. Monitoring influenza activity in Europe with Google
Flu Trends: comparison with the findings of sentinel physician networks-results for
2009-10 Eurosurveillance 15.29 (2010): 2-7 (available at
http://www.eurosurveillance.eu/images/dynamic/EE/V15N29/V15N29.pdf accessed
December 2013).

[Vassiliadis 2002] P. Vassiliadis, A. Simitsis, S. Skiadopoulos, Conceptual modeling for ETL processes in Proceedings of the 5th ACM international workshop on Data
Warehousing and OLAP. ACM, 2002 (available at
http://www.cci.drexel.edu/faculty/song/dolap/dolap02/paper/p14-vassiliadis.pdf
accessed December 2013).

[Vavilapalli 2013] V. Vavilapalli, Apache hadoop yarn: Yet another resource negotiator
in Proceedings of the 4th annual Symposium on Cloud Computing. ACM, 2013
(available at https://54e57bc8-a-62cb3a1a-s-
sites.googlegroups.com/site/2013socc/home/program/a5-
vavilapalli.pdf?attachauth=ANoY7coEhSbUnw95pIYZH-
pIzIzEHiIiSaLgHIxlcBSepMKTe6zlZMeAecWIXh2b40XzA0XHpUw0E0xEuYrrWJ1o
ZWkOWwG5vKEte1aQss31uvKIbO0tqMvTj4lwOI5AflYb4Lx41PwyGII80CNEssHu_
DzY5XZbCyrgXVmDqmwwrUCDAYfcbYpAcAdHrtxg5VsFX5Pi6u1SN52jZ-
DNkyJg1jdEUfTQ9_EyaM5ureYBZmbcQRnSK-g%3D&attredirects=0 accessed
December 2013).

[Wallgren 2007] A. Wallgren, B. Wallgren, Register-based Statistics: Administrative
Data for Statistical Purposes, Wiley Series in Survey Methodology, John Wiley & Sons,
New York, 2007, ISBN: 978-0-470-02778-3.

[White 2012] T. White, Hadoop: the definitive guide, O'Reilly, 2012, ISBN 978-1-449-
31152-0.

[Zhang 2012] L. Zhang et al. Visual analytics for the Big Data era - A comparative review of state-of-the-art commercial systems, Visual Analytics Science and Technology
(VAST), 2012 IEEE Conference on. IEEE, 2012 (available at http://www.inf.uni-
konstanz.de/gk/pubsys/publishedFiles/ZhStBe12.pdf accessed December 2013).

[Zykopolous 2012] P. Zikopoulos et al. Harness the Power of Big Data: The IBM Big
Data Platform. McGraw Hill Professional, 2012, ISBN: 9780071808170 (available at
ftp://public.dhe.ibm.com/software/pdf/at/SWP10/Harness_the_Power_of_Big_Data.pdf
accessed December 2013).
