Big Data
Massive Parallelism
Huge Data Volumes Storage
Data Distribution
High-Speed Networks
High-Performance Computing
Task and Thread Management
Data Mining and Analytics
Data Retrieval
Machine Learning
Data Visualization
Operations
Analysis
Analyze a variety of machine data for improved business results
Security/Intelligence
Extension
http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017
USD: billions
NoSQL DB ==> Distributed DB, Document-Oriented DB, Graph NoSQL DB, and In-Memory NoSQL DB.
It is not uncommon for an enterprise IT organization to support multiple NoSQL DBs alongside legacy RDBMSs.
Indeed, single applications often deploy two or more NoSQL solutions, e.g., pairing a document-oriented DB with a graph DB for an analytics solution. [Dec 2013]
http://wikibon.com/hadoop-nosql-software-and-services-market-forecast-2013-2017/
Course Structure
Class Date    Number    Topics Covered
09/10/15
09/17/15
09/24/15
10/01/15
10/08/15
10/15/15
10/22/15
10/29/15
11/05/15
11/12/15      10
11/19/15      11
11/26/15                Thanksgiving Holiday
12/03/15      12
12/10/15      13
              14-15
Course Grading
3 Homeworks: 50%
-- Individual work; Language Requirement: Java, JavaScript, Python, C/C++, Perl
-- Report and source code
Data Store & Processing
Analytics
In-Memory and Graph Computing
Final Project: 50%
-- Teamwork: 2 - 3 students per team (on campus); 1+ per team for CVN
Proposal (slides)
Final Report (paper, up to 12 pages)
Workshop Presentation (Oral and Demo)
Open Source
Course Information
Website:
http://www.ee.columbia.edu/~cylin/course/bigdata/
Textbook:
-- None, but reference book(s) and/or articles/papers will be provided each lecture.
Other Issues
Professor Lin:
Office Hours:
Thursday 9:30pm - 10:00pm (SIPA 415, lecture room) (every week)
Contact: c {dot} lin {at} columbia {dot} edu (the same as <cl300>)
Telephone: 914-945-1897
TAs:
Ghazel Fazelnia <gf2293>, EE; Office Hours and Location: TBD
We are recruiting more TAs.
Apache Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable,
distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so delivering
a highly-available service on top of a cluster of computers, each of which may be prone to
failures.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
http://hadoop.apache.org
http://hortonworks.com/hadoop/hdfs/
MapReduce example
http://www.alex-hanna.com
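The original slide figure is not recoverable here, so the following is only a minimal single-process sketch of the map, shuffle, and reduce steps, using word count as the usual illustration. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this example; a real Hadoop job distributes these steps across a cluster and uses the Hadoop APIs instead.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence on a list of text records."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())


# Hypothetical input records.
counts = word_count(["big data big graphs", "graphs and data"])
print(counts)
```

Keeping the map and reduce functions free of shared state is what lets a framework run many copies of each in parallel on different machines.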
http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/
NoSQL Database
Key-Value Store
Document Store
Tabular Store
Object Database
Graph Database (property graphs, RDF graphs)
Document Store
Graph Data: Property Graphs and RDF Graphs
[Figure: example graph with nodes including Larry Page, Steve Jobs, Charles Flint, Apple, IBM, Android, iOS, Linux, XNU, OpenGL, Palo Alto, Cupertino, Mountain View, Armonk, Internet, Software, Hardware, and Services; edge labels include born, died, home, HQ, founder, board, industry, employees, developer, kernel, version, preceded, and graphics]
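As a rough illustration of the two graph data models named on this slide, the sketch below stores a few facts as a property graph (vertices with key/value properties plus labeled edges) and then flattens them into RDF-style (subject, predicate, object) triples. The identifiers `steve_jobs` and `apple`, the dictionary layout, and the helper `to_triples` are assumptions made for this example and do not correspond to any particular graph database API.

```python
# Property graph: each vertex carries its own key/value properties.
vertices = {
    "steve_jobs": {"label": "Person", "name": "Steve Jobs", "born": 1955},
    "apple": {"label": "Company", "name": "Apple", "HQ": "Cupertino"},
}
# Edges are labeled links between vertex identifiers.
edges = [
    {"src": "apple", "dst": "steve_jobs", "label": "founder"},
]


def to_triples(vertices, edges):
    """Flatten a property graph into (subject, predicate, object) triples."""
    triples = []
    for vertex_id, properties in vertices.items():
        for key, value in properties.items():
            triples.append((vertex_id, key, value))
    for edge in edges:
        triples.append((edge["src"], edge["label"], edge["dst"]))
    return triples


triples = to_triples(vertices, edges)
for triple in triples:
    print(triple)
```

In the triple form every property becomes its own statement, which is why the same data usually needs more records as RDF than as a property graph.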
[Figure: platform diagram with components App Builder; Integration & Governance; Streams; Graphs; Hadoop/Spark; Warehouse; Data Explorer; Linked Big Data; Connector Framework; CM, RM, DM; RDBMS; Feeds; Web 2.0; Web]
Volume ==> Hadoop / Spark; Velocity ==> Streams; Variety ==> Graphs
TechRadar: Enterprise DBMS, Q1 2014
Graph DB is on the "significant success" trajectory and has the highest business value among
the upcoming DBs.
E6893 Big Data Analytics Lecture 1: Overview
[Figure: graph data set sizes plotted as log2(m) vs. log2(n) (No. of nodes), both axes from about 15 to 45; data sets shown: Symbolic Network, Graph500 (Toy, Mini, Small, Medium, Large), Twitter (tweets/day), USA-road-d.NY.gr, USA-road-d.LKS.gr, USA-road-d.USA.gr; reference marks at 1 billion nodes, 1 trillion nodes, and 1 billion edges]
2015 CY Lin, Columbia University
http://www.graph500.org
July 2015: IBM Research's software powered all Top 3 winners of the Graph 500
benchmark and 9 out of the Top 10 winners (supercomputers in the US, Japan,
France, UK, and Germany; the exception is Tianhe-2 in China).
[Figure: Graph 500 results chart; systems shown: Sequoia, K computer, TSUBAME 2.5, TSUBAME-KFC, FX10, SGI UV2000, and a 4-way Xeon server; rank labels #1, #3, #4; configuration labels: CPU only, GPU]
The July 2015 winner, the K computer (83K nodes and 663K cores),
achieved graph search of up to 38,621,400,000 vertices per second.
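The Graph 500 search kernel is a breadth-first search, and results are reported as a traversal rate. The toy, single-threaded sketch below shows the idea on a five-vertex graph; it is not the benchmark's reference code, the adjacency list is a hypothetical example, and counting every scanned adjacency entry is a simplification of how the benchmark counts edges.

```python
import time
from collections import deque


def bfs(adjacency, source):
    """Breadth-first search; returns the parent map and edges scanned."""
    parent = {source: source}
    edges_scanned = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            edges_scanned += 1
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent, edges_scanned


# Hypothetical undirected graph stored as adjacency lists.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

start = time.perf_counter()
parent, edges_scanned = bfs(graph, 0)
elapsed = time.perf_counter() - start

# Traversal rate for this run; guard against a zero timer reading.
rate = edges_scanned / elapsed if elapsed > 0 else float("inf")
print(parent, edges_scanned)
```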
Enhancing:
Graph Visualizations
Communities
Graph Search
Bayesian Networks
Centralities
Graph Query
Shortest Paths
Graph Matching
Graph Sampling
Markov Networks
Shortest Paths
Centralities
Graph Search
Dynamic networks of 400,000+ IBMers:
Shortest Paths
Social Capital
On BusinessWeek four times, including being the Top Story of the Week, April 2009
Bridges
Helped IBM earn the 2012 Most Admired Knowledge Enterprise Award
Hubs
Wharton School study: $7,010 gain per user per year using the tool
Expertise Search
In 2012, contributing about 1/3 of GBS Practitioner Portal $228.5 million savings and benefits
Graph Search
APQC (WW leader in Knowledge Practice) April 2013:
Graph Recomm.
Independent experts on healthcare
A cluster of XYZ experts
My various paths to Tom. SmallBlue can show the paths to any colleague up to 6 degrees away.
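One simple way to list "paths to a colleague within six degrees" is a depth-limited search over the contact graph. The sketch below is only an illustration of that idea, not SmallBlue's actual method; the contact names other than Tom, the `contacts` dictionary, and the function `paths_within` are hypothetical.

```python
def paths_within(graph, source, target, max_hops=6):
    """Enumerate simple paths from source to target with at most max_hops edges."""
    results = []

    def extend(path):
        node = path[-1]
        if node == target:
            results.append(list(path))
            return
        if len(path) - 1 == max_hops:
            return  # hop budget used up without reaching the target
        for neighbor in graph.get(node, ()):
            if neighbor not in path:  # keep paths simple (no repeated people)
                path.append(neighbor)
                extend(path)
                path.pop()

    extend([source])
    return sorted(results, key=len)


# Hypothetical contact network; each person lists direct contacts.
contacts = {
    "me": ["ann", "bob"],
    "ann": ["me", "tom"],
    "bob": ["me", "cat"],
    "cat": ["bob", "tom"],
    "tom": ["ann", "cat"],
}

paths = paths_within(contacts, "me", "tom")
for path in paths:
    print(" -> ".join(path))
```

Exhaustive enumeration grows quickly with the hop limit, so on a network of hundreds of thousands of people a production system would need pruning or ranking of candidate paths.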
How many people in my personal networks?
Collaborative Filtering
Collaborative + Content Filtering
[Figure: bar chart of No. of people (0 to 500) with >=1, >=2, >=3, >=4, and >=5 useful recommendations; series CB and DR; reference levels: Personalized Rec. Upper Bound, Community Upper Bound, Non-Personalized Upper Bound, Global Upper Bound]
[Figure: Precision and Recall vs. number of recommended (retrieved) users for Network and Info Flow features; adopter categories: Innovators, Early adopters, Early majority, Late majority, Laggards]
Tests: 1 month, 586 new docs, 1,170 users
People with
similar tastes
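The "people with similar tastes" idea behind collaborative filtering can be sketched as follows: measure how similar two users' ratings are, then suggest items that similar users rated but the target user has not seen. This is a minimal user-based variant with cosine similarity, chosen for illustration; the ratings, user and document identifiers, and function names are hypothetical, and the source does not state which similarity measure was used.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(ratings, user, top_n=3):
    """Rank unseen items by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], other_ratings)
        if similarity <= 0:
            continue  # ignore users with no taste overlap
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical ratings: user -> {document: rating}.
ratings = {
    "u1": {"doc_a": 5, "doc_b": 4},
    "u2": {"doc_a": 5, "doc_b": 5, "doc_c": 4},
    "u3": {"doc_d": 5},
}

suggested = recommend(ratings, "u1")
print(suggested)
```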
[Figure: user activity labels login, browsing, search, comparing, and Checkout, shown with a Bayesian Network and a Latent Network]
IBM 2003
IBM 2009
Data Source:
Targets: 20 Fortune companies' normalized profits
Goal: Learn from the previous 5 years and predict the next year
Model: Support Vector Regression (RBF kernel)
[Figure: bar chart (axis 0 to 0.45) comparing feature combinations st, sp, tp, std, dp, and stdp]
Network feature: s (current year network feature), t (temporal network feature), d (delta value of network feature)
Financial feature: p (historical profits and revenues)
Monitoring filter
Dynamics in Graphs
Heterogeneous
Synchronicity
Networks
Predict
Performance
Engineer team
Delivery team
Team
Design team
Account team
Person
Sociology
CS
EE
Healthcare
SNA
Improve
Info
Sensor
Detected as top 1 anomaly in Sandy Tweets
Outperform existing approaches by up to 180% (IJCAI 13)
Motivation:
Info morph: new links keep emerging to give new meaning to existing phrases
Approach:
Compare characteristics of metapaths between nodes in heterogeneous networks
(ACL 2013)
[Figure: SentiBank pipeline; labels: Discover sentiment words, Training from 6 million tags, Select Adj-Noun Pairs, Train Classifiers, Performance Filtering, SentiBank (1200 Detectors), Sentiment Prediction; example pair: SAD EYES]
Experiment on Sentiment Detection Accuracy on Twitter
Text      0.43
Visual    0.70
T+V       0.72
Precision-Recall performance of predicting info propagation by different features
(Our proposed influence index: FLOWER)
Flow Analytics - I
Topic cluster tree shows how content sequences are related to each other
Applications
High Value Customer Identification & targeting
Personalized Advertisement
Enable Viral marketing campaign
Customer Profiles
Approach
Construct social graphs from CDRs (call detail records) based on {caller, callee, call time, call duration}
Extract customer social features (e.g., influence, communities) from the constructed social graph as customer social profiles
Build analytics applications (e.g., personalized advertisement) based on the extracted customer social profiles
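The approach above can be sketched in a few lines: aggregate call records into a weighted graph, then compute a graph feature per customer as a simple social profile. The records, names, and the choice of total talk time as edge weight and degree centrality as the feature are assumptions made for this illustration; the source does not specify how System G weights edges or which features feed each application.

```python
from collections import defaultdict

# Hypothetical CDRs: (caller, callee, call time, call duration in seconds).
cdrs = [
    ("alice", "bob", "2015-09-10T09:00", 120),
    ("alice", "carol", "2015-09-10T09:30", 60),
    ("bob", "alice", "2015-09-10T10:00", 300),
    ("dave", "alice", "2015-09-11T08:15", 45),
]


def build_social_graph(records):
    """Aggregate CDRs into an undirected graph weighted by total talk time."""
    weights = defaultdict(int)
    for caller, callee, _call_time, duration in records:
        if caller != callee:  # skip self-calls
            weights[frozenset((caller, callee))] += duration
    return weights


def degree_centrality(weights):
    """Fraction of the other customers each customer is connected to."""
    neighbors = defaultdict(set)
    for pair in weights:
        a, b = tuple(pair)
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {v: len(nbrs) / (n - 1) for v, nbrs in neighbors.items()}


graph = build_social_graph(cdrs)
profiles = degree_centrality(graph)
print(profiles)
```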
Degree Centrality
Weakly Connected Component
PageRank
Community Detection
K-core
System G Analysis
Maximal Cliques
BigInsights
CDR
Enhancing:
Huge Network Visualization
Network Propagation
I2
3D Network Visualization
Geo Network Visualization
Graphical Model Visualization
Graph Matching
[Figure: graph matching query over symptom nodes: headache, chill, high fever, migraine, stomachache, cough]
Graph Communities
Indexing time
Graph Matching
SocialHelix: Visualization of Sentiment Divergence in Social Media
http://systemg.ibm.com/apps/socialhelix/index.html
Graph Search
[Figure labels: query, index, ranking, re-ranking, Improved search results]
Interest / social network based content recommendations
Info-Socio networks
Graph analysis
query context
[Figure: System G and MapReduce compared on YouTube (6M edges), Flickr (24M edges), and LiveJournal (72M edges)]
Category 3: Security
Network Info Flow
Ego Net Features
Normal: (1) Clique-like, (2) Two-way links
Attacker: Near-Star
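One hedged way to turn "clique-like", "two-way links", and "near-star" into numbers is to compute, for each node's ego network, the density of links among its neighbors and the share of its links that are reciprocated. The two features, the function `ego_net_features`, and the example graphs below are assumptions made for this sketch; the source does not say which ego net features were actually used.

```python
def ego_net_features(out_links, ego):
    """Density among the ego's neighbors and reciprocity of the ego's links."""
    outgoing = set(out_links.get(ego, ()))
    incoming = {u for u, targets in out_links.items() if ego in targets}
    neighbors = (outgoing | incoming) - {ego}
    k = len(neighbors)
    # Undirected links between neighbors, not counting links to the ego.
    between = {
        frozenset((u, v))
        for u in neighbors
        for v in out_links.get(u, ())
        if v in neighbors and v != u
    }
    density = len(between) / (k * (k - 1) / 2) if k > 1 else 0.0
    reciprocity = len(outgoing & incoming) / k if k else 0.0
    return {"density": density, "reciprocity": reciprocity}


# Hypothetical directed graphs: node -> list of nodes it links to.
normal = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
attacker = {"x": ["p", "q", "r", "s"], "p": [], "q": [], "r": [], "s": []}

normal_features = ego_net_features(normal, "a")
attacker_features = ego_net_features(attacker, "x")
print(normal_features, attacker_features)
```

In this toy data the normal node scores 1.0 on both features (a clique with two-way links), while the star-shaped node scores 0.0 on both, which is the contrast the slide describes.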
Enterprise Information Leakage
Impacted economy and jobs
Feb 2013
Emails
Instant Messaging
Web Access
Graph analysis
Social sensors
Behavior analysis
Executed Processes
Printing
Feed subscription
Semantics analysis
Copying
Database access
Psychological analysis
Log On/Off
Multimodality Analysis
Detection, Prediction & Exploration Interface
Infrastructure + ~70 Analytics
(1) Personal stress:
(2) Job stress:
Personal event
(2) Planning: Online chat with a hacker confiding his first attempt of leaking the information
Job event
Personality
Personal stress
Job stress
(1) Attack: Brought music CD to work and downloaded/copied documents onto it with his own account
Unstable Mental status
Planning
Attack
Semantics Layer
Concept Layer
Feature Layer
Sensor Layer
Legend: observations, hidden states
Transmitted images, speech content, video content
future additions?
Latent
Network
Bayesian
Network
Each month, 3 cases were inserted (1 abnormal person per case) in the real data.
Each performer system retrieved top abnormal people out of the 5,500 people per month.
This chart shows where the 3 IBM systems (Sabotage, Espionage, and Fraud) ranked the abnormal person
in each case. "All" is a combined rank list of the 3 systems. (Oct 2013 review of 12/12 ~ 03/13 data)
12. Layoff Logic Bomb: An engineer worried about rumors of
impending layoffs feels that he needs some kind of
insurance policy in case he gets laid off or fired. He creates
a "logic bomb" which will delete all files from a number of
company Linux systems in five days, unless he resets the
timer before then.
13. Outsourcer's Apprentice: (http://www.bbc.co.uk/news/technology-21043693)
A software developer outsources his
job to China and spends his workdays surfing the web. Most
surfing occurs on a second laptop. He pays just a small
fraction of his salary to a Chinese company to do his job. The
developer provides his VPN credentials to the company and
enables Terminal Services on his workstation. The Chinese
consulting firm sends the developer PayPal invoices.
8. Anomalous Encryption: A Subject wishes to pass sensitive
information to a foreign government in exchange for that
government setting him up with his own business. The Subject
researches NSA monitoring capabilities, generates a long
random passphrase, and then tests encrypting and mailing data
to a personal account. The Subject encrypts documents and
emails the key.
Promising results. IBM's system successfully caught the bad guys of the 12 cases: 4 as Top #1, 3 in Top #2-#5, 2 in Top #6-#20, 1 in Top #21-#50, and 2 in Top #51-#100.
Performer 2 did not report results.
Performer 3 reported: 3 of the 12 cases Top #50-#100, 6 cases Top #101-#500, and 3 cases beyond Top #501.
Ponzi scheme Detection
Network Info Flow
Ego Net Features
Normal: (1) Clique-like, (2) Two-way links
Attacker: Near-Star
Network KPIs
Server KPIs
Graph Matching
Bayesian Network
Causality analyzer
Bayesian Network
* 3 timesteps
* 63 variables
* 3.9 avg states
* 4.0 avg indegree
* 16,858 CPT entries
Junction Tree
* 67 cliques
* 873,064 PT entries in cliques
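To show where a figure like "CPT entries" comes from: in a discrete Bayesian network, each variable's conditional probability table has one entry per combination of its own states and its parents' states. The sketch below counts entries for a tiny hypothetical three-variable network; the variable names and structure are assumptions for illustration and are unrelated to the 63-variable model reported above.

```python
from math import prod


def cpt_entries(num_states, parents):
    """Total conditional probability table entries of a discrete Bayesian network."""
    total = 0
    for variable, states in num_states.items():
        # One table row per joint configuration of the variable's parents.
        parent_configs = prod(num_states[p] for p in parents.get(variable, ()))
        total += states * parent_configs
    return total


# Hypothetical network: rain -> sprinkler, and {rain, sprinkler} -> wet.
num_states = {"rain": 2, "sprinkler": 2, "wet": 2}
parents = {"sprinkler": ["rain"], "wet": ["rain", "sprinkler"]}

total = cpt_entries(num_states, parents)
print(total)
```

Because table size multiplies across parents, a higher indegree or more states per variable inflates the count quickly, and clique tables in a junction tree grow the same way.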
Network load level report
Graph Analysis
Network topology
Approach
Cellular network comprises a hierarchy of network elements
Map CDR onto network topology and infer load on each network element using graph analysis
Estimate network load and localize potential problems
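A minimal sketch of the approach above, assuming the hierarchy is a tree and that load can be approximated by summed call duration: each call is charged to its serving cell and to every ancestor element, and elements above an assumed capacity are flagged. The element names, capacities, and records are hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchy: each network element maps to its parent element.
parent_of = {
    "cell_1": "bsc_a", "cell_2": "bsc_a", "cell_3": "bsc_b",
    "bsc_a": "msc", "bsc_b": "msc",
}
# Hypothetical CDRs reduced to (serving cell, call duration in seconds).
cdrs = [("cell_1", 120), ("cell_2", 60), ("cell_1", 30), ("cell_3", 300)]


def infer_load(records, parent_of):
    """Charge each call to its cell and to every ancestor in the hierarchy."""
    load = defaultdict(int)
    for cell, duration in records:
        node = cell
        while node is not None:
            load[node] += duration
            node = parent_of.get(node)
    return dict(load)


def overloaded(load, capacity):
    """Elements whose inferred load exceeds their assumed capacity."""
    return sorted(
        element for element, value in load.items()
        if value > capacity.get(element, float("inf"))
    )


load = infer_load(cdrs, parent_of)
hot = overloaded(load, {"cell_3": 200, "bsc_a": 500, "msc": 1000})
print(load, hot)
```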
CDR
Network KPIs
Server KPIs
Causality analyzer
Our approach: Probabilistic monitoring via sampling & estimation
Select KPI pairs (sampling)
Test link existence
Estimate unsampled links based on history
Overall graph
DB2 RDF
Native Store
Netezza
HBase
DB2
TinkerPop Compliant DBs
HDFS
[Figure: two stacks compared. Graph DB model: Graph application, Graph objects, Graph DB. Relational DB: Graph application, Graph objects, Convert to relational / Convert from relational, Relational DB]
transportation vehicles) and navigation of individual cars utilizing the dynamic real-time information of changing road condition and predictive analysis on the data
Predictive results
Real-time update
Graph store
Use Case 18: Graph Analysis for Image and Video Analysis
[Figure: ARG s and ARG t with Ys and Yt; labels: Vertex Correspondence, Attribute Transformation]
Ongoing discussions
Questions?
Sign List
Name (Last, First)
UNI
Department
Degree (yr)