The basic task of knowing what is going on with the business drives platform adoption and technology
buying decisions. Databricks is a managed platform for running Apache Spark that aims to provide a fast,
general-purpose GUI for large-scale data analysis. It is an implementation of Spark that reduces the complexity
of setup and operation by providing dashboards and scheduled jobs. The client does not have to learn
cluster management concepts or perform Spark cluster maintenance. It is a point-and-click interface for
data analysts and BI professionals, with options to automate data jobs and integrate with private AWS clusters.
Their core components are
My first impression, from using the Databricks Community Edition, was of a merger between visualization suites
like inCites / Exploratory.io and live-code tools like Jupyter / Zeppelin. It felt like an investigative
convenience tool that pulls in the functionality of Apache Spark and presents it in a web-based interface.
Since Spark became a top-level Apache project in 2014, it has improved tremendously in the specific areas
of data integration, ETL, machine learning and visualization. Data scientists can now use Python APIs to
run BI code, and visualization tools like QlikView / Tableau can connect directly to Spark SQL. The data
scientist responsible for driving insights is most likely already proficient in all of the aforementioned tools.
The question then arises: what value would Databricks add to the existing and rapidly evolving
infrastructure?
Databricks presents itself as a convenience tool that anyone can be trained on, offering easy cluster
management, ease of setup, collaboration, visualization, etc. Although the Databricks web-based interface saves
time on visualizations, it restricts customization of machine learning frameworks, especially deep
learning. This forces power users to keep their queries and analysis within the bounds of the web-based
system. As a data scientist, I have used similar systems previously, and I would still prefer Python and R
over a web-based tool for the heavy lifting and flexibility. Certain areas that will undergo massive change
with the use of transfer learning (a deep learning technique) are real-time processing for outlier and fraud
detection, and recommendations based on user feedback. The web-based system shows no support for
transfer learning, which remains only a vision in the company profile.
It is important to note that Databricks was founded by the creators of Apache Spark, and this has played a
huge part in their success at seed funding rounds. Rather than being swayed by the company's popularity among
venture capitalists, companies would be better off hiring another talented data scientist for the price of the
annual subscription.
Appendix: Quick review of the latest offerings in Spark
Spark is paving new ways to give data scientists easier access to big data. This is reflected in its latest
architecture and platform integrations. The most recent update is the introduction of the Dataset, i.e. a combination
of RDDs and DataFrames. Datasets let users work with typed objects as in an RDD and query as in a DataFrame.
Datasets are predicted to be the way forward in Spark data structures.
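To make the "type like an RDD, query like a DataFrame" idea concrete, here is a minimal sketch in plain Python. It does not use Spark at all: ToyDataset, Sale, filter and sum_by are hypothetical names invented for this illustration, and the data lives in local memory rather than on a cluster. It only shows the two styles of access side by side, under those assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, fields
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Sale:
    """A typed record, standing in for a row with a known schema."""
    region: str
    amount: float


class ToyDataset(Generic[T]):
    """Hypothetical stand-in for a Dataset; NOT Spark's API, just an in-memory list."""

    def __init__(self, rows: Iterable[T]):
        self.rows = list(rows)

    def filter(self, predicate: Callable[[T], bool]) -> "ToyDataset[T]":
        # RDD-style access: an ordinary function applied to whole typed objects.
        return ToyDataset(r for r in self.rows if predicate(r))

    def sum_by(self, key: str, value: str) -> dict:
        # DataFrame-style access: columns are referred to by name and
        # checked against the record's schema before the query runs.
        if not self.rows:
            return {}
        schema = {f.name for f in fields(self.rows[0])}
        missing = {key, value} - schema
        if missing:
            raise KeyError(f"unknown column(s): {sorted(missing)}")
        totals = defaultdict(float)
        for r in self.rows:
            totals[getattr(r, key)] += getattr(r, value)
        return dict(totals)


sales = ToyDataset([Sale("east", 120.0), Sale("west", 80.0), Sale("east", 40.0)])

# Typed, object-style filtering (the "RDD" half of the idea).
large = sales.filter(lambda s: s.amount > 50.0)

# Column-name aggregation (the "DataFrame" half of the idea).
totals = sales.sum_by("region", "amount")
print(large.rows)
print(totals)
```

In Spark itself the typed half of this interface is, to my knowledge, exposed through the Scala and Java APIs, while Python and R users work with DataFrames; the toy above only mirrors the concept.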
Recent versions of Spark can be run from Jupyter notebooks, Apache Zeppelin and RStudio. Spark now provides
APIs for Java, Scala, Python and R, with interactive shells for Scala, Python and R. The earlier Java-based,
batch-oriented technologies like MapReduce and its abstractions (Hive, Pig, Mahout, etc.) are being phased out
due to slow and tedious performance.
Legend
RDD: Resilient Distributed Dataset, a container of records of arbitrary data types, partitioned across the cluster
DataFrame: a distributed collection organized into named columns with a schema, built on top of RDDs
Author:
Saad Sadiq, PhD candidate
College of Engineering
University of Miami
Coral Gables, FL