
On Real-Time Twitter Analysis

Mikio L. Braun (@mikiobraun), TWIMPACT UG (haftungsbeschränkt)
with Matthias Jugel (@thinkberg)
Apache Hadoop Get Together, Berlin, April 28, 2012
http://blog.mikiobraun.de
http://twimpact.com


Big Data and Data Science and Social Media

There's a lot you can do with social media data:

- Trend analysis (trending topics)
- Sentiment analysis
- Impact analysis (Klout, Kred, etc.)
- More general studies (diameter of the network, distribution patterns, etc.)

Types of data:

- Event streams (Twitter stream)
- Graph data (user relationships, retweet networks)
- Text data (sentiment analysis, word clouds)
- URLs


Social Media Streaming Data

Examples:

- Twitter firehose/sprinkler
- Click-through data
- bit.ly URL resolution requests

Some numbers:

- up to a few thousand events per second
- events are small (up to a few kilobytes)


What's in a Tweet?
A tweet connects several pieces of data:

- Hashtags
- Links
- User mentions
- Keywords
- Retweeting user
- Retweeted user
- Retweeted tweet
- Timestamp
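
As a rough sketch, the fields above could be collected in a simple value class like the following (names are illustrative, not the actual Twitter API or TWIMPACT schema):

    import java.time.Instant;
    import java.util.List;

    // Rough sketch of the fields above as a value class; names are
    // illustrative, not the actual Twitter API or TWIMPACT schema.
    public class Tweet {
        String id;
        String user;                 // author
        String text;
        Instant timestamp;
        List<String> hashtags;
        List<String> links;
        List<String> userMentions;
        List<String> keywords;
        // present only for retweets:
        String retweetingUser;
        String retweetedUser;
        String retweetedTweetId;
    }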


TWIMPACT - Retweet trends


- Trending by retweet activity
- Robust matching of tweets, even if shortened or (slightly) edited (see the sketch below)
- Compute trends for links, hashtags, URLs
- Aggregate TWIMPACT score for users
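
A minimal sketch of how such robust matching could work, assuming a simple normalize-then-compare approach (an illustration only, not necessarily TWIMPACT's actual matching algorithm):

    import java.util.Locale;

    // Hypothetical sketch: reduce a tweet to a normalized key so that slightly
    // edited or shortened retweets map to the same trend entry.
    public class RetweetKey {
        public static String normalize(String text) {
            return text
                .replaceFirst("^RT @\\w+:\\s*", "")       // drop leading "RT @user:"
                .replaceAll("https?://\\S+", "")          // drop URLs (often re-shortened)
                .replaceAll("[^\\p{L}\\p{Nd}#@ ]", " ")   // keep letters, digits, #tags, @mentions
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ")
                .trim();
        }

        public static void main(String[] args) {
            String a = "RT @user: Great talk on stream mining! http://t.co/abc";
            String b = "Great talk on stream mining http://bit.ly/xyz";
            System.out.println(normalize(a).equals(normalize(b)));   // true
        }
    }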


How to scale stream processing?


History of approaches

- Started in June 2009
- Free Twitter stream (capped at 50 tweets/s)

[Table: versions 1, 2, and 3 compared by language and storage backend; version 3 uses stream mining + in-memory storage]


Putting it all in a database

- Insert millions of rows into the database
- Get reports with:


    SELECT *, COUNT(*) FROM events
    WHERE created_at > ... AND created_at < ...
    GROUP BY id
    ORDER BY COUNT(*) DESC
    LIMIT 100;

Hardly real-time. Also, the database gets slower and slower as the data grows...


NoSQL: Cassandra

- Structure: column families ("tables") → rows → key/value pairs (see the conceptual sketch below)
- Easy clustering (peer-to-peer configuration)
- Flexible consistency, read-repair, hinted handoff, etc.
- No locking; (in 0.6.x) no support for indices or counters → complete rewrite
- Operations profile: about 50:50 read/write
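
The nesting of that data model can be illustrated with plain maps (a conceptual sketch only, no Cassandra client API involved; the column-family and column names are made up):

    import java.util.Map;
    import java.util.TreeMap;

    // Conceptual sketch only (no Cassandra client API): the 0.6.x data model is
    // roughly column family -> row key -> sorted columns (name/value pairs).
    // The column family and column names below are made up for illustration.
    public class DataModelSketch {
        // columnFamilies.get("RetweetCounts").get(rowKey).get(columnName) -> value
        static Map<String, Map<String, Map<String, String>>> columnFamilies = new TreeMap<>();

        static void insert(String cf, String rowKey, String column, String value) {
            columnFamilies
                .computeIfAbsent(cf, k -> new TreeMap<>())
                .computeIfAbsent(rowKey, k -> new TreeMap<>())   // columns sorted by name
                .put(column, value);
        }

        public static void main(String[] args) {
            insert("RetweetCounts", "tweet:42", "count", "17");
            insert("RetweetCounts", "tweet:42", "last_seen", "2012-04-18T20:00:00Z");
            System.out.println(columnFamilies);
        }
    }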


Cassandra: Multithreading

Multithreading helps (but without locking support?)


[Plot: tweets per second over time (seconds), one series per thread count (2, 4, 8, 16, 32, 64); Core i7, 4 cores (2 + 2 HT)]
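
A minimal sketch of the kind of measurement behind such a plot: submit tweet-processing tasks to a fixed-size thread pool and compute the resulting tweets per second. processTweet() is a hypothetical stand-in for the real work (JSON parsing + Cassandra writes):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of a throughput measurement with N worker threads.
    public class ThroughputSketch {
        static void processTweet(String json) { /* parse + store; placeholder */ }

        public static void main(String[] args) throws InterruptedException {
            int threads = args.length > 0 ? Integer.parseInt(args[0]) : 8;
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            AtomicLong processed = new AtomicLong();

            long start = System.nanoTime();
            for (int i = 0; i < 100_000; i++) {
                final String tweet = "{\"id\":" + i + "}";   // stand-in for stream input
                pool.submit(() -> { processTweet(tweet); processed.incrementAndGet(); });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);

            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%d threads: %.0f tweets/s%n", threads, processed.get() / seconds);
        }
    }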

Cassandra: Configuration

[Plot: throughput over time, with flushes (memtables, indexes, etc.) and compaction marked; memtable size: 128 MB, JVM heap: 3 GB, #CF: 12]

Cassandra: Configuration

[Plot: tweets per second over time, with compaction and a big GC pause marked]

NoSQL/Cassandra - Summary

- Works quite well, faster than PostgreSQL (from 200 to 600 tps)
- Lack of locking/indices requires a lot of manual management
- Configuration is messy
- 4-node cluster vs. single node: the single node is consistently 1.5 to 3 times faster!
- Ultimately becomes slower and slower
- Doesn't handle deletions gracefully



Stream processing frameworks

Stream processing = scalable, actor-based concurrency. For example:


- Twitter's (BackType's) Storm: https://github.com/nathanmarz/storm
- Yahoo's S4: http://incubator.apache.org/s4/
- Esper: http://esper.codehaus.org/
- StreamBase: http://www.streambase.com


Stream processing: some thoughts


- Maximum throughput hard to estimate
- Not everything can be parallelized
- Scalable storage system still necessary
- How to deal with failure/congestion?
- Persistent messaging middleware not what you might want


The DataSift infrastructure

http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html

[Pipeline diagram: Parse → Augment Content → Custom Filters → Delivery]

Throughput: 120,000 tweets per second

- 936 CPU cores
- Analyzes 250 million tweets per day
- Peak throughput: 120,000 tweets/s
- Monitoring & accounting
- Languages: C++, PHP, Java/Scala, Ruby
- Storage: MySQL on SSDs, HBase (30 nodes, 400 TB), memcached, Redis for some queues
- Messaging: 0MQ, Kafka (from LinkedIn)

but: 120,000 / 936 = 128.2 tweets per second per core



Principles of Stream Processing

- Keep resource needs constant (see the sketch below)
- Control maximum processing rates
- Disks are too slow: keep data in RAM
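
A minimal sketch of how these principles translate into code, assuming a simple bounded in-memory buffer between ingestion and analysis (illustrative only, not TWIMPACT's actual implementation):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // A bounded in-memory queue keeps the resource footprint constant; when the
    // consumer cannot keep up, the producer must block or discard, which caps
    // the processing rate.
    public class BoundedIngestSketch {
        static void analyze(String tweet) { /* trend updates etc.; placeholder */ }

        public static void main(String[] args) {
            BlockingQueue<String> buffer = new ArrayBlockingQueue<>(10_000); // fixed RAM budget

            Thread consumer = new Thread(() -> {
                try {
                    while (true) analyze(buffer.take());   // drain at a sustainable rate
                } catch (InterruptedException e) { /* shutdown */ }
            });
            consumer.setDaemon(true);
            consumer.start();

            for (int i = 0; i < 1_000_000; i++) {
                String tweet = "{\"id\":" + i + "}";       // stand-in for stream input
                if (!buffer.offer(tweet)) {
                    // queue full: block (buffer.put) or discard under congestion
                }
            }
        }
    }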


Stream mining
[Diagram: a fixed number of slots, each holding an item and its approximate count]

- Focus on relevant data, discard the rest
- Provably approximates true counts
- Keep data in memory

Space Saving algorithm (Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005.)
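A minimal sketch of the Space-Saving counting step (a straightforward reading of Metwally et al., 2005; the paper's Stream-Summary structure, which finds the minimum slot in O(1), is replaced here by a simple O(k) scan for clarity):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the Space-Saving algorithm: a fixed number of slots
    // approximates the counts of the most frequent items in the stream.
    public class SpaceSaving<T> {
        private final int slots;
        private final Map<T, Long> counts = new HashMap<>();

        public SpaceSaving(int slots) { this.slots = slots; }

        public void offer(T item) {
            Long c = counts.get(item);
            if (c != null) {                        // already monitored: increment
                counts.put(item, c + 1);
            } else if (counts.size() < slots) {     // free slot: start monitoring
                counts.put(item, 1L);
            } else {                                // replace the current minimum
                T minItem = null;
                long min = Long.MAX_VALUE;
                for (Map.Entry<T, Long> e : counts.entrySet()) {
                    if (e.getValue() < min) { min = e.getValue(); minItem = e.getKey(); }
                }
                counts.remove(minItem);
                counts.put(item, min + 1);          // min bounds the overestimation error
            }
        }

        public Map<T, Long> estimates() { return counts; }
    }

With, say, a few hundred thousand slots, the most active retweets stay monitored while rarely seen items are continually evicted.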


TWIMPACT

Real-time Twitter Retweet Analysis

- Stream mining keeps a hot set of the few hundred thousand most active retweets in memory
- Secondary indices, bipartite graphs, object stores
- Write snapshots to disk for later analysis
- Up to several thousand tweets per second in single-threaded operation


2011 in Retweets


Our Analysis Pipeline


[Diagram: tweets go through JSON parsing in synchronized worker threads 1..k, then into single-threaded retweet matching & retweet trends; snapshots are written per day (day 1, day 2, ..., day n) and analyzed for dependent trends (links/hashtags/etc.) in a map-reduce-like step that produces the trends]
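
A schematic sketch of that pipeline, assuming a bounded queue between the parallel parsers and the single-threaded trend stage (class and method names are illustrative, not the actual TWIMPACT code):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    // k worker threads parse raw JSON in parallel and hand parsed tweets to a
    // single thread that does retweet matching and trend updates, so the trend
    // structures need no locking.
    public class PipelineSketch {
        static Object parseJson(String raw) { return raw; }           // placeholder parser
        static void updateRetweetTrends(Object tweet) { /* single-threaded stage */ }

        public static void main(String[] args) {
            int k = 4;                                                // number of parser threads
            BlockingQueue<String> raw = new LinkedBlockingQueue<>(10_000);
            BlockingQueue<Object> parsed = new LinkedBlockingQueue<>(10_000);

            ExecutorService parsers = Executors.newFixedThreadPool(k);
            for (int i = 0; i < k; i++) {
                parsers.submit(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        try { parsed.put(parseJson(raw.take())); }
                        catch (InterruptedException e) { return; }
                    }
                });
            }

            new Thread(() -> {                                        // the single-threaded stage
                while (!Thread.currentThread().isInterrupted()) {
                    try { updateRetweetTrends(parsed.take()); }
                    catch (InterruptedException e) { return; }
                }
            }).start();

            // raw.put(...) would be fed from the Twitter stream; snapshots of the
            // trend state go to disk for the map-reduce-like daily analysis.
        }
    }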


Most retweeted users


Most retweeted tweets


Social network buzz


Summary

- Many interesting challenges in social media
- Many different data types, including streams
- MapReduce doesn't really fit stream processing
- You can't just scale into real-time
- Principles of stream processing:
  - Bounded hot set of data in memory
  - Mine the stream, discard irrelevant data
- Real-world applications often include a mixture of multithreading, stream processing, map-reduce, and single-threaded stages