
Architecture Stories for Big and Tiny Data


We are Unnati Data Labs

www.unnati.xyz
@raghothams @nischalhp
?
3 Stories
Velocity | Volume | Variety
What are we solving? | Infrastructure | Architecture | Learnings
#1
FinTech
Small Data | Early Startup | Data Driven
FinTech | What are we solving?
Evaluate college students to determine their
creditworthiness
Lack of credit history
Tiny data
Enrich data with alternate data sources
Statistical modelling to evaluate students initially.
As the user activity increases, build machine learning
models to predict creditworthiness.
FinTech | Thought process for Infrastructure

Data velocity estimation for the next 6 months


Complexity of data science algorithms
No. of calls being serviced by the data science APIs
Cost

1 AWS instance | 8 GB RAM | 4 cores


FinTech | Architecture
FinTech | Learnings

Small data problems are tricky


Go after the low-hanging fruit first
Need clever techniques
Beware of data sanity with NoSQL
Embracing data science early helps the business
grow taller, stronger & sharper
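One way to act on the NoSQL data sanity warning is to validate every document before it is written, since the store itself will not enforce a schema. The sketch below is only an illustration: the field names (`student_id`, `college`, `monthly_spend`) and the `validate_document` helper are hypothetical and do not come from the project.

```python
from typing import Any

# Hypothetical schema: field name -> expected Python type.
# These fields are illustrative assumptions, not the project's real schema.
REQUIRED_FIELDS = {
    "student_id": str,
    "college": str,
    "monthly_spend": float,
}


def validate_document(doc: dict[str, Any]) -> list[str]:
    """Return the problems found in doc; an empty list means it is clean."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(
                f"wrong type for {field}: expected {expected.__name__}, "
                f"got {type(doc[field]).__name__}"
            )
    return problems
```

Documents that return a non-empty list can be rejected or routed to a quarantine collection instead of silently polluting the training data.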
#2
Campaign Management
Medium Size Data | Established Startup
Campaign Management | What are we solving?

Predict user behavior


Business has amassed data over 2-3 years
Educate team about data science & benefits
Ideate & prioritize problems that can be solved
RoI, pricing for new plugins
Campaign Management | Thought process for Infrastructure
200+ Million rows
Parallel Analytics data warehouse
Data pipelines, automated workflows
Distributed machine learning models
Prediction as a Service
Cost

Dedicated bare metal server


32 GB RAM | 8 cores | 1 TB SSD
Campaign Management | Architecture
Campaign Management | Learnings

PostgreSQL read replicas pause long-running queries


Understand PostgreSQL WALs (write-ahead logs)
Data pipelines break. Exception handling, notifications and logging are of utmost importance
We wired Luigi exceptions to Slack for notifications
Pandas transformations are slow for large datasets
PySpark to the rescue!
Use monitoring tools like Munin for profiling
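The idea of sending pipeline failures to Slack can be sketched with the standard library alone. This is not the code used in the project and it does not use Luigi's own API; it assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The helper names (`build_slack_payload`, `post_to_slack`, `run_with_notification`) are hypothetical.

```python
import json
import urllib.request
from typing import Any, Callable


def build_slack_payload(task_name: str, error: BaseException) -> bytes:
    """Encode a failure message as the JSON body of a Slack incoming webhook."""
    text = f"Pipeline task {task_name} failed: {type(error).__name__}: {error}"
    return json.dumps({"text": text}).encode("utf-8")


def post_to_slack(webhook_url: str, payload: bytes) -> None:
    """POST the payload to a Slack incoming-webhook URL (assumed to be configured)."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10):
        pass


def run_with_notification(
    task_name: str,
    task: Callable[[], Any],
    notify: Callable[[bytes], None],
) -> Any:
    """Run task(); on failure, send a notification and re-raise the error."""
    try:
        return task()
    except Exception as error:
        notify(build_slack_payload(task_name, error))
        raise
```

Passing `notify` as a parameter keeps the wrapper testable without network access; in production it could be `lambda payload: post_to_slack(webhook_url, payload)`. The exception is re-raised so that the workflow scheduler still records the task as failed.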
#3
Unstructured Healthcare
Medium to Big Size Data | Generic Data Science Platform
Healthcare | What are we solving?
Analytics on healthcare spend
Medical claims - many data providers - no standard
Data volume 500 M rows to start + high velocity
Robust data ingestion, data cleaning system
Data Security and HIPAA compliance
Data pipeline is the heart of the platform
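With many data providers and no standard format, the ingestion step usually has to map each provider's columns onto one canonical record. The following is a minimal sketch of that idea; the provider names, column names and canonical fields are all invented for illustration and are not the platform's real schema.

```python
from typing import Any

# Hypothetical per-provider mappings: source column -> canonical field.
PROVIDER_MAPPINGS = {
    "provider_a": {"claim_no": "claim_id", "amt": "amount", "svc_date": "service_date"},
    "provider_b": {"ClaimID": "claim_id", "PaidAmount": "amount", "DateOfService": "service_date"},
}
CANONICAL_FIELDS = ("claim_id", "amount", "service_date")


def normalize_claim(provider: str, record: dict[str, Any]) -> dict[str, Any]:
    """Map one raw claim record onto the canonical schema."""
    mapping = PROVIDER_MAPPINGS[provider]
    normalized: dict[str, Any] = {"provider": provider}
    extra: dict[str, Any] = {}
    for key, value in record.items():
        if key in mapping:
            normalized[mapping[key]] = value
        else:
            extra[key] = value  # keep unknown columns rather than dropping them
    missing = [field for field in CANONICAL_FIELDS if field not in normalized]
    if missing:
        raise ValueError(f"{provider} record is missing: {', '.join(missing)}")
    normalized["amount"] = float(normalized["amount"])  # amounts often arrive as text
    normalized["extra"] = extra
    return normalized
```

Keeping unmapped columns under `extra` fits a flexible-schema store: nothing from the provider is lost, and new fields can be promoted to the canonical schema later.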

Adding more servers is easy, writing more code is hard
Healthcare | Thought process for Infrastructure
Flexible schema calling out for NoSQL
Massive ingestion & cleaning tasks
Denormalize + Wide format
100s of transformation & analytics tasks
Luigi to the rescue
Spark for transformation & analytics
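The reason a workflow tool helps with hundreds of tasks is that each task declares what it depends on and the tool works out a valid execution order. The sketch below shows only that core idea using Python's standard `graphlib` module (Python 3.9+); it is not Luigi's API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_pipeline(
    dependencies: dict[str, set[str]],
    actions: dict[str, Callable[[], None]],
) -> list[str]:
    """Run every task after the tasks it depends on; return the order used.

    dependencies maps a task name to the set of tasks that must finish first.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        actions[name]()
    return order


if __name__ == "__main__":
    # Hypothetical four-stage pipeline: ingest -> clean -> denormalize -> report.
    deps = {
        "clean": {"ingest"},
        "denormalize": {"clean"},
        "spend_report": {"denormalize"},
    }
    steps = {name: (lambda n=name: print("running", n))
             for name in ("ingest", "clean", "denormalize", "spend_report")}
    run_pipeline(deps, steps)
```

A real workflow engine adds what this sketch leaves out: skipping tasks whose output already exists, retries, and running independent tasks in parallel.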

Database Instance
32 GB RAM | 8 cores | 5 TB SSD
Application Instance
API Server 4 GB RAM | 2 cores | 500 GB SSD
Healthcare | Architecture
Healthcare | Learnings
Authorizations for databases are very important
Aim to parallelize tasks for ingestion
Data redundancy is totally fine for data science
Polyglot of services - Use the right tools
Understand business expectations & landscape
before jumping into architecture
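The advice to parallelize ingestion tasks can be illustrated with a thread pool from the standard library. This is a minimal sketch, assuming the batches are independent of each other; `parse_batch` is a placeholder for real ingestion work and the batch names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_batch(lines: list[str]) -> int:
    """Placeholder ingestion step: count the non-empty lines in one batch."""
    return sum(1 for line in lines if line.strip())


def ingest_all(batches: dict[str, list[str]], max_workers: int = 4) -> dict[str, int]:
    """Process independent batches concurrently and return a count per batch."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as its input.
        counts = pool.map(parse_batch, batches.values())
        return dict(zip(batches.keys(), counts))
```

Threads suit ingestion that mostly waits on disk, network or the database. For CPU-heavy cleaning, `ProcessPoolExecutor` from the same module offers the same interface with separate processes.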
Toolbox
Want big impact?

www.unnati.xyz
Fin.

Any questions?

tweet : @unnati_xyz
enquiry@unnati.xyz