
Streaming 101

What is streaming?
A data processing system designed with unbounded data in mind

Cardinality of data – The size of the data set

Two cardinalities – Bounded (finite data set) and unbounded (infinite data set)

Constitution of data – The form in which data is available to be processed or interacted with

Two forms of constitution – Tables and streams

Table – A snapshot of a dataset at a specific point in time


Stream - An element-by-element view of the evolution of a dataset over time.

Correctness – Requires strong consistency: state is checkpointed so that it stays
consistent even if the machine storing it fails

Two domains of time


Event time – The time at which the event actually happened
Processing time – The time at which the event is observed by the system

Causes of skew between event time and processing time:

1. Shared resource limitations
2. Software causes such as distributed system logic
3. Features of the data themselves, e.g. data generated at a varying rate

Two types of skew are observed when event time is plotted on the x-axis and processing time
on the y-axis: vertical (processing-time lag) and horizontal (event-time skew). Both are the
same thing seen from different angles.

Processing-time lag – The difference between the ideal processing time (i.e. the event time)
and the observed processing time

Event-time skew – How far the actual pipeline lags behind the ideal one in event time
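
The relation between the two time domains can be sketched in a few lines. This is only an illustration: the `Event` record and its field names are hypothetical, and times are plain seconds.

```python
from dataclasses import dataclass

@dataclass
class Event:
    event_time: float       # when the event actually happened (seconds)
    processing_time: float  # when the system observed it (seconds)

def processing_time_lag(e: Event) -> float:
    """Delay between the event happening and the system observing it."""
    return e.processing_time - e.event_time

# Three hypothetical events; an ideal pipeline would have zero lag for all of them.
events = [Event(0.0, 1.0), Event(5.0, 12.0), Event(10.0, 10.5)]
lags = [processing_time_lag(e) for e in events]
print(lags)
```

The lag differs from event to event, which is why skew cannot be treated as a constant.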

Data Processing Patterns


Bounded Data – Mostly batch processing. Streaming systems can also be used if properly set up

Unbounded Data: Batch


1. Fixed windows – Data is sorted by time and then split into multiple batches for
processing. Not a problem when processing doesn't care about event time.
Otherwise there is a problem of completeness: how do you make sure all the data in
that time period has already arrived? Requires delaying the batch until all the
necessary data has been collected
2. Sessions – A more complex scheme for breaking data into batches. A session can
span several batches, so it requires either making the batch size bigger (a latency
concern) or combining data for one session from all previous batches.
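
The session problem can be shown with a minimal sketch. The gap-based session rule and the sample timestamps are assumptions made for the illustration; they are not taken from the notes.

```python
def sessionize(timestamps, gap):
    """Group timestamps into sessions; a pause longer than `gap` starts a new session."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)   # still within the current session
        else:
            sessions.append([t])     # inactivity gap exceeded: new session
    return sessions

# Two hypothetical batches; one session (9, 10, 11) straddles the batch boundary.
batch_1 = [1, 2, 9]
batch_2 = [10, 11, 30]

split = sessionize(batch_1, gap=3) + sessionize(batch_2, gap=3)
combined = sessionize(batch_1 + batch_2, gap=3)
print(split)
print(combined)
```

Processing each batch on its own cuts the straddling session in two. Only the combined run keeps it whole, which is the "bigger batch or stitch with previous batches" trade-off described above.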

Unbounded Data: Streaming

In the real world, data is unbounded and comes from multiple input sources. It is also:
1. Highly unordered with respect to event time
2. Of varying event-time skew, meaning you cannot assume that most of the data for a
given event time X will be seen within some constant amount of time Y

Approaches: 1. Time-agnostic 2. Approximation 3. Windowing by processing time
4. Windowing by event time

Time-agnostic – Time (mainly event time) is irrelevant to the logic, so batch systems can
also be used in such cases. Possible scenarios are filtering of records and inner joins of
multiple streams. For outer joins there is a completeness question: how do you make sure
the corresponding paired entry will ever come? In practice, both inner and outer joins
need some kind of TTL for incomplete pairs.
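
A minimal sketch of a streaming inner join with a TTL for unmatched records follows. The record layout `(arrival_time, side, key, value)` is hypothetical, and the sketch assumes arrival times are non-decreasing and each key appears at most once per side.

```python
def inner_join_with_ttl(stream, ttl):
    """Join 'L' and 'R' records on key; drop unmatched records older than ttl."""
    pending = {}  # key -> (arrival_time, side, value) waiting for its partner
    joined = []
    for now, side, key, value in stream:
        # Expire buffered records that have waited longer than the TTL.
        for k in [k for k, (t, _, _) in pending.items() if now - t > ttl]:
            del pending[k]
        match = pending.get(key)
        if match is not None and match[1] != side:
            del pending[key]
            left, right = (match[2], value) if side == "R" else (value, match[2])
            joined.append((key, left, right))
        else:
            pending[key] = (now, side, value)  # buffer until the partner arrives
    return joined

stream = [(0, "L", "a", 1), (1, "R", "b", 20), (2, "R", "a", 10), (20, "L", "b", 2)]
result = inner_join_with_ttl(stream, ttl=5)
print(result)
```

Key `a` is joined because both sides arrive within the TTL. The right-hand record for `b` expires before its partner shows up, so nothing is emitted for it and the buffer stays bounded.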

Approximation algorithms – For example approximate top-N or streaming k-means. Processing
relies on the approximation algorithm itself, but only a limited number of such algorithms
exist.
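
The notes do not name a specific algorithm. As one well-known example of this family, the sketch below uses the Misra-Gries frequent-items summary, which keeps at most `k - 1` counters no matter how long the stream is. The returned counts are lower bounds, not exact frequencies.

```python
def misra_gries(stream, k):
    """Approximate heavy hitters: any item occurring more than n/k times survives."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

summary = misra_gries("aababcaad", k=3)
print(summary)
```

Here `a` occurs 5 times out of 9, which is more than 9/3, so it is guaranteed to remain in the summary even though its stored count is an underestimate.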

Windowing – Chopping a data source into finite chunks along temporal boundaries for
processing.
1. Fixed windows (aka tumbling windows) – Slice time into segments of fixed temporal length
2. Sliding windows (aka hopping windows) – Defined by the length of the window (how long
it spans) and the period of the window (offset from the start of the previous one). If
length = period it is the fixed window case; if period < length the windows overlap.
3. Sessions – Dynamic windows made of sequences of events separated by gaps of
inactivity, commonly used for analyzing user behavior over time
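
Window assignment for the fixed and sliding cases can be sketched as follows. The function names are made up for this illustration, timestamps are plain integers, and windows are half-open intervals `[start, end)` aligned to multiples of the size or period.

```python
def fixed_window(ts, size):
    """Return the single fixed window [start, end) that contains ts."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts, size, period):
    """Return every window of length `size`, starting each `period`, that contains ts."""
    windows = []
    start = (ts // period) * period   # latest window start at or before ts
    while start > ts - size:          # window still covers ts
        windows.append((start, start + size))
        start -= period
    return sorted(windows)

print(fixed_window(7, 5))
print(sliding_windows(7, 10, 5))
print(sliding_windows(7, 5, 5))
```

With length 10 and period 5 the timestamp falls into two overlapping windows. With length equal to period the result collapses to the fixed window.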

Windowing by processing time – Buffer incoming data for a fixed period of processing time
(like 5 minutes) and send the collected data downstream for processing.

Properties:
1. Simple implementation. No need to rearrange the data, just collect it as it arrives and
forward it.
2. Judging window completeness is straightforward. The system has a perfect criterion for
closing a window (e.g. all the data that arrived in the last 5 minutes).
3. Use case – Inferring information about the source as it is observed, e.g. monitoring
systems
4. Drawback – If the data carries event times that matter for processing, it must arrive
in event-time order, which rarely happens in the real world.
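
A minimal sketch of processing-time windowing, assuming arrival times are supplied explicitly (instead of read from a wall clock) so the example is deterministic. The record layout and payload labels are hypothetical; each label shows the event time the record carries.

```python
from collections import defaultdict

def window_by_processing_time(records, size):
    """Group payloads by the fixed window their arrival time falls into."""
    windows = defaultdict(list)
    for arrival_time, payload in records:
        start = (arrival_time // size) * size
        windows[start].append(payload)   # no reordering: append as it arrives
    return dict(windows)

# (arrival_time, payload); "e@2" happened at event time 2 but arrived at time 6.
records = [(1, "e@0"), (4, "e@3"), (6, "e@2"), (11, "e@9")]
by_arrival = window_by_processing_time(records, size=5)
print(by_arrival)
```

The late record `e@2` lands in the window starting at 5 even though it belongs to the first five time units in event time, which illustrates the drawback in point 4.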

Do not rely on a source having low or no event-time skew, as this is not guaranteed.

Windowing by event time – Used when you want to observe the data in finite chunks that
reflect the times at which the events actually happened.
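
As a rough sketch, event-time windowing assigns each record to a window by the timestamp it carries, regardless of arrival order. The record layout is hypothetical, and the sketch sees the whole input at once, so it sidesteps the question of when a window may be closed.

```python
from collections import defaultdict

def window_by_event_time(records, size):
    """Group (event_time, payload) records into fixed event-time windows."""
    windows = defaultdict(list)
    for event_time, payload in records:
        start = (event_time // size) * size
        windows[start].append((event_time, payload))
    # Sort windows and their contents so output reflects event-time order.
    return {start: sorted(items) for start, items in sorted(windows.items())}

# Listed in arrival order; "d" (event time 2) arrives after "c" (event time 7).
arrivals = [(0, "a"), (3, "b"), (7, "c"), (2, "d")]
by_event = window_by_event_time(arrivals, size=5)
print(by_event)
```

Record `d` is placed in the first window although it arrived after a record for the second window, so the first window's buffer has to stay open past its own length.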

Drawbacks: windows must often live longer (in processing time) than their own length
1. Buffering – Requires storing more data, though persistent storage is generally the
cheapest resource
2. Completeness – We can never know in advance when to close a buffer as we
won’t know whether it
