What is streaming?
A streaming system is a data processing system designed for unbounded data.
Bounded data is a finite data set; unbounded data is an infinite data set.
Correctness – Comes down to consistent storage: the system checkpoints its state so that the
data survives even if the machine storing it fails (strong consistency).
Two types of skew are observed – horizontal (processing time) and vertical (event time). Both
are the same gap seen from different angles.
Processing-time skew – The difference between the ideal processing time (i.e. the event time) and
the observed processing time.
Event-time skew – How far the actual pipeline lags behind the ideal one, measured in event time.
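As a small illustration of the definitions above, the sketch below computes the processing-time skew of individual records. The record layout and the function name are assumptions made for this example only.

```python
from datetime import datetime, timedelta

def processing_time_skew(event_time: datetime, processing_time: datetime) -> timedelta:
    """Delay between when an event occurred and when the pipeline observed it."""
    return processing_time - event_time

# Hypothetical records: (event_time, processing_time)
records = [
    (datetime(2024, 1, 1, 12, 0, 0), datetime(2024, 1, 1, 12, 0, 2)),
    (datetime(2024, 1, 1, 12, 0, 5), datetime(2024, 1, 1, 12, 1, 5)),
]
skews = [processing_time_skew(e, p) for e, p in records]
```

In an ideal pipeline every skew would be zero; in practice it varies from record to record.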
Time-agnostic – Time (mainly event time) is irrelevant to the logic. Batch systems can be used in such
cases. Possible scenarios – filtering of records, or an inner join of multiple streams. Outer joins are
harder: how do you know whether the matching entry will ever arrive? In practice, both
inner and outer joins need some kind of TTL for incomplete pairs: an inner join uses it to discard
unmatched entries, an outer join uses it to decide when to emit an unpaired one.
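The TTL idea for incomplete pairs can be sketched as follows. This is a minimal illustration, not a production join: it assumes records are supplied in arrival order, that each key appears at most once per side, and the record layout and function name are made up for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record layout: (arrival_time_seconds, side, key, value); side is "L" or "R".
Record = Tuple[float, str, str, str]

def inner_join_with_ttl(records: Iterable[Record], ttl: float) -> Iterator[Tuple[str, str, str]]:
    """Yield (key, left_value, right_value) for pairs that meet within `ttl` seconds."""
    pending: Dict[Tuple[str, str], Tuple[float, str]] = {}  # (side, key) -> (arrival, value)
    for now, side, key, value in records:
        # Evict unmatched entries that have waited longer than the TTL.
        expired = [k for k, (seen, _) in pending.items() if now - seen > ttl]
        for k in expired:
            del pending[k]
        other = ("R" if side == "L" else "L", key)
        if other in pending:
            _, other_value = pending.pop(other)
            yield (key, value, other_value) if side == "L" else (key, other_value, value)
        else:
            # No partner yet: buffer this half of the pair.
            pending[(side, key)] = (now, value)
```

An outer join would differ only at eviction time: instead of silently dropping an expired entry, it would emit it with an empty partner.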
Windowing – Chopping a data source into finite chunks along temporal boundaries for
processing.
1. Fixed windows (aka tumbling windows) – Slice the data into fixed-size temporal segments
2. Sliding windows (aka hopping windows) – Defined by the length of the window (the time span it
covers) and the period of the window (offset from the start of the previous one). If length = period,
this reduces to the fixed-window case; if period < length, the windows overlap.
3. Sessions – Dynamic windows made of a run of events ended by a gap of inactivity; commonly used for analyzing user behavior over time
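The three window types above can be sketched as window-assignment functions. This is a minimal illustration under stated assumptions: timestamps are integer seconds from an arbitrary origin, sliding windows start at multiples of the period, and all function names are hypothetical.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end) in seconds

def fixed_window(ts: int, size: int) -> Window:
    """The single fixed window of `size` seconds that contains ts."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts: int, length: int, period: int) -> List[Window]:
    """All windows of `length` seconds, starting at multiples of `period`, that contain ts."""
    windows = []
    start = ts - ts % period  # latest window start that is <= ts
    while start > ts - length:
        windows.append((start, start + length))
        start -= period
    return sorted(windows)

def session_windows(timestamps: List[int], gap: int) -> List[Tuple[int, int]]:
    """Group events into sessions; returns (first_event, last_event) per session.

    A new session starts whenever two consecutive events are more than `gap` apart.
    """
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1][1] = ts  # extend the current session
        else:
            sessions.append([ts, ts])  # start a new session
    return [(first, last) for first, last in sessions]
```

Note that with length equal to period, `sliding_windows` returns exactly the one window that `fixed_window` returns.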
Windowing by processing time – Buffer the incoming data for a fixed period of processing time (like 5 minutes), then send
the collected data downstream for processing.
Properties:
1. Simple implementation. No need to rearrange the data; just collect it as it arrives and
forward it.
2. Judging window completeness is straightforward. The system knows exactly when a window is
complete (e.g. once the 5 minutes have elapsed), because records are assigned by arrival time.
3. Use case – Information about the source as it is observed, e.g. monitoring systems.
4. Downside – If event time matters for the processing, the data must arrive in event-time order
for these windows to reflect when the events happened, which rarely happens in the real world.
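The buffering behavior described above can be sketched as a small class. This is an illustration under assumptions: the caller passes the current clock reading as `now` (a real system would read it from the machine clock), a window is treated as closed when the first record of a later window arrives, and the class and callback names are hypothetical.

```python
from typing import Callable, List, Optional

class ProcessingTimeBuffer:
    """Collect records and flush them downstream once per window of arrival time."""

    def __init__(self, size: int, emit: Callable[[int, List[str]], None]) -> None:
        self.size = size                      # window size in seconds
        self.emit = emit                      # downstream callback: (window_start, records)
        self.window_start: Optional[int] = None
        self.buffer: List[str] = []

    def on_record(self, now: int, payload: str) -> None:
        start = now - now % self.size
        # A record from a later window means the current window is complete.
        if self.window_start is not None and start != self.window_start:
            self.flush()
        self.window_start = start
        self.buffer.append(payload)

    def flush(self) -> None:
        if self.buffer:
            self.emit(self.window_start, self.buffer)
        self.buffer = []

emitted = []
buf = ProcessingTimeBuffer(300, lambda start, items: emitted.append((start, items)))
for now, payload in [(10, "a"), (200, "b"), (310, "c")]:
    buf.on_record(now, payload)
buf.flush()  # emit whatever is left in the last window
```

No sorting or lookup by timestamp is needed, which is why this approach is simple to implement.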
Do not rely on a source having low or no event-time skew, as this is not always guaranteed.
Windowing by event time – Used when you need to observe the data in finite chunks that reflect when
the events actually happened, regardless of the order in which they arrived.
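A minimal sketch of the idea, assuming each record carries its own event timestamp in integer seconds; the function name and record layout are hypothetical. Records are assigned to windows by when they happened, so an out-of-order arrival still lands in the correct window.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def window_by_event_time(records: Iterable[Tuple[int, str]], size: int) -> Dict[int, List[str]]:
    """Group (event_time_seconds, payload) records into fixed windows keyed by window start."""
    windows: Dict[int, List[str]] = defaultdict(list)
    for event_time, payload in records:
        windows[event_time - event_time % size].append(payload)
    return dict(windows)

# Listed in arrival order: "late" happened during the first window but arrived last.
arrived = [(10, "a"), (310, "c"), (290, "late")]
result = window_by_event_time(arrived, 300)
```

This sketch sees the whole input at once. On a truly unbounded stream the system cannot know for certain that a window has received all of its data, which is the harder part of event-time windowing.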