
openSAP

Big Data with SAP HANA Vora


Week 01 Unit 01
00:00:10

Hi. Welcome to the openSAP course on Big Data with SAP HANA Vora. My name is Balaji
Krishna and I'm a senior director in the Vora Product Management team.

00:00:20

I'll be one of the course instructors for this course, along with Puntis Jifroodian, who is a senior
data scientist. And together we will take you through this introductory course on SAP HANA
Vora.

00:00:34

This course will run for four weeks. I'll be covering Week 1, which is an overview of SAP HANA
Vora.

00:00:40

And as part of this, I'll be covering topics like the motivation, key features, benefits, and
integration with HANA. From Week 2 onwards, Puntis is going to take over, and she'll cover
topics related to the Vora modeler

00:00:56

and how to develop on SAP HANA Vora. Week 4 will be our final exam.

00:01:02

And as usual, at the end of each week, we also have the test questions to evaluate what you
have learned during the week. We also have system exercises for some of the units.

00:01:14

So there'll be some units that have hands-on system exercises, and we deliver those as part of
our SAP CAL environment in AWS.

00:01:23

For you to go through these exercises, you'll have to set up an account on AWS and you also
have to create an SAP CAL account.

00:01:32

Apart from this, we also have an SAP education course called HA500. This is classroom
education, so interested folks can sign up for this.

00:01:44

Thank you for your interest and for staying through the duration of the course. And happy
learning. All right, so what's the background and motivation in terms of what SAP HANA Vora is
and what it provides for our customers?

00:02:00

Why did we create this? SAP HANA Vora is an in-memory query engine that is built to run
natively in the Hadoop and Spark ecosystem.

00:02:08

It provides a highly distributed SQL engine to process large volumes of data. We talk about Big
Data.

00:02:15

This is one of the key features: being able to process large volumes of data which are
distributed across the landscape, and to run varied workloads like analytical processing, time
series, and graph processing.

00:02:29

Some of the key things with Vora are that it allows you to make precise decisions much
faster. It enables you to democratize data access and simplify Big Data
ownership.

00:02:42

We provide features like drilldowns on HDFS. We'll talk about some of the features related to
hierarchies, how we build hierarchies using SAP HANA Vora,

00:02:52

and how it allows you to do analysis of your data when the data is in its unstructured
format. Typically, data that is stored in HDFS is mostly unstructured, although in the recent past
we've also seen a lot of structured data as well.

00:03:08

We provide optimizations for Apache Spark to be able to integrate with SAP HANA. We have
some of the mashup APIs for enhanced and compiled queries.

00:03:21

Some of the things that I'll be talking about in the future units are how the performance
optimization is done by processing these queries and translating the SQL code to C programs.

00:03:33

And finally, we also provide some of the HANA integration using the HANA Spark controller,
and we also enhance Apache Spark through the data source API framework.

00:03:47

So in terms of the motivation, why did SAP have to create another product? I'm sure that's one
of the questions a lot of listeners are thinking about.

00:03:56

Obviously one of the main reasons why we wanted to create Vora is that when we look at the
current data landscape across the different industries and across the different customers,

00:04:07

we see very distributed data landscapes that are already available in the customers'
environment. It could be starting from Hadoop or some distributed SQL databases,

00:04:20

SAP HANA is one of them. Some of the others come from open source, like Google F1 or
Facebook Presto,

00:04:26

or NoSQL databases like Cassandra, MongoDB, or Neo4j. A lot of these databases do a
specific task.

00:04:35

For example, if you look at Neo4j, it provides graph-specific processing. Or if you look at
GraphX or GraphFrames as part of Spark,

00:04:45

it again allows you to process graph algorithms when the data is stored in the Apache Spark
framework. When we looked at these different kinds of existing solutions,

00:04:56

and we also listened to our customers who were using HANA. At the same time, they were
looking at: "How do I extend my data platform to run petabytes of data?"

00:05:07

In some cases hundreds of terabytes, but mostly petabytes of data. We felt that the existing
portfolio of products within SAP did not allow customers to scale and run this data
processing

00:05:23

when the data has to span tens of thousands of nodes. And that is one of the
things that Vora provides.

00:05:30

So the motivation for us to build this was looking at what the other open-source products were
and some of the products that our customers might be interested in using.

00:05:40

Some of them make customers think: "Hey, it's open source, I'm not comfortable putting that in my
production system". And that's where SAP came in and provided Vora,

00:05:48

which allows you to process the data that is within those open-source Hadoop and Apache
Spark frameworks. Basically, what we're providing is a holistic, enterprise-ready, distributed,
and massive scale-out solution

00:06:03

that runs on thousands of nodes, either on premise or in the cloud. So, we're providing that in-memory data fabric for enterprises using this distributed computing framework.

00:06:15

What you see in a typical customer environment is SAP HANA on the left-hand side, which
scales to, say, tens of terabytes. Again, there are no technical limitations on how big a HANA
system can be.

00:06:31

But what we have seen from customers, depending on where they are, whether it's a small
enterprise customer, a large enterprise customer,

00:06:38

they can go anywhere between tens of terabytes, in some cases close to 30-40 terabytes. And
again, all this data in HANA is compressed data,

00:06:48

so there's no technical limitation, but from a landscape standpoint, what you will see is
customers running on tens of nodes. And on the right side, you have your Hadoop
distributed landscape,

00:07:01

where Vora is now running natively on each of the Hadoop or Spark nodes. So on the left side,
you have tens of nodes; on the right side, you could have thousands of nodes.

00:07:10

And what we are providing from SAP is how to build out the scale-out extension of your HANA
infrastructure to be able to store and process petabytes of data.

00:07:27

Hundreds of terabytes to petabytes of data. And what we provide with Vora is the ability to
federate the queries,

00:07:36

in terms of: I want to access my Hadoop data, processed through Vora, from HANA; or the other
way round, I have my HANA data. It could be my ERP data which is sitting in HANA, it could be
my BW data.

00:07:50

How do I expose that and make it available on the right side for a data scientist who is coming
in from a Hadoop or Spark landscape to be able to consume that data?

00:08:00

So we provide the ability to expose the data bidirectionally. And again, this is virtual
data access, so you're not physically moving data from one layer to another layer.

00:08:13

You have the data stored in Vora, in Hadoop, and you expose it virtually using virtual
tables and consume it through HANA.

00:08:20

Or when the data is in HANA, you want to expose that virtually to be consumed through
Vora. You use the Spark data source API that we'll be talking about in one of the future slides.

00:08:31

So typically, Vora runs on commodity hardware. Again, since it's running natively in the Hadoop
ecosystem,

00:08:39

there are no restrictions like Tailored Data Center Integration or certified hardware from SAP. Whatever
infrastructure your Hadoop is running on, Vora will continue to run on it.

00:08:51

And I'll show you the slides where we cover how the installation works in terms of being able to
take advantage of existing Hadoop administration tools,

00:08:59

like for a customer who is using MapR, Cloudera, or Hortonworks. You can use your existing
administration tools like Ambari, Cloudera Manager, or the MapR Control System

00:09:11

to be able to install Vora on all the Hadoop nodes. So we're simplifying the landscape in terms of
how you install, upgrade, manage, and administer the nodes.

00:09:23

And from a consumption standpoint, again depending on where you're consuming, whether
you're coming in from HANA, using a BI tool like SAP Lumira or BusinessObjects,

00:09:33

or you're coming in from Vora, you can use the same BI tools. Or in the case of a data scientist
persona, when the data scientist is using his or her own tools,

00:09:45

like IPython Notebook, Jupyter, or Zeppelin, you're running these data scientist
kinds of workflows.

00:09:52

If you want access to the SAP data, we expose it virtually so it can be consumed through
HANA Vora.

00:10:02

So, some of the major benefits in terms of using Vora. The key use cases that we enable as
part of Vora.

00:10:10

One of them is how you do SQL on Hadoop, or "OLAP" on this Big Data. And the reason we
have these quotes for OLAP is that it's not really like building an OLAP cube, an OLAP engine,
on top of your Hadoop.

00:10:26

But basically, what Vora provides us with is a SQL engine that allows us to process the data
that is stored in HDFS or S3 file systems. And again, most of the data that is stored in these file
systems, in the Hadoop environments, is unstructured data.

00:10:48

We'll be talking about the different kinds of data: for example, it could be sentiment data, it
could be text data, it could be CSV files. This is unstructured data, so now how do I do SQL on
top of that?
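
As a rough sketch of what this looks like in practice, here is how a CSV file in HDFS can be registered as a table and queried through Vora's Spark integration. The SapSQLContext class, the com.sap.spark.vora data source name, and the option keys follow the Vora 1.x Spark extensions, but treat them as assumptions and check your release's documentation; the file path and schema are placeholders.

```scala
import org.apache.spark.sql.SapSQLContext

// sc is an existing SparkContext (predefined in spark-shell or Zeppelin)
val vc = new SapSQLContext(sc)

// Register a raw CSV file in HDFS as a relational Vora table
vc.sql("""CREATE TEMPORARY TABLE sentiment (id INT, product STRING, score DOUBLE)
          USING com.sap.spark.vora
          OPTIONS (tableName "sentiment", paths "/user/vora/sentiment.csv")""")

// Plain SQL over what started as an unstructured file
vc.sql("SELECT product, AVG(score) FROM sentiment GROUP BY product").show()
```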

00:11:00

And that's what Vora is going to provide. So some of the things that we enable as part of that
SQL on Hadoop solution are hierarchical data storage

00:11:09

and contextual data that supports structured analysis. So how do I build hierarchies on this
unstructured data?

00:11:16

We show you how you can do that using the Vora modeler, which Puntis will be covering
during Week 2. How do you enable fast drilldown interaction for root cause analysis?

00:11:28

Again, the key point here is that I have my business users who are familiar with a specific kind
of navigation when they're going against a relational database or a data warehouse kind of
environment.

00:11:41

Now that I'm allowing them to use Hadoop, why should they have to change the way they
interact with the data? And that's where being able to do slice and dice, being able to do
parent-child hierarchies,

00:11:55

those are things that we provide as part of Vora, with hierarchy support provided as an
extension to Apache Spark.
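
To make the hierarchy support concrete, here is a hedged sketch of what a parent-child hierarchy query can look like with the Vora SQL extensions. The HIERARCHY clause and the IS_ROOT function approximate the Vora 1.x syntax; the table org_src with (pred, succ, name) columns is hypothetical, and the exact keywords may differ by release.

```scala
// vc: the SapSQLContext from the earlier sketch
// Build a hierarchy from a flat parent-child source table and query it
val orgChart = vc.sql("""
  SELECT name, IS_ROOT(node) AS isRoot
  FROM HIERARCHY (
    USING org_src AS v JOIN PARENT u ON v.pred = u.succ
    START WHERE pred IS NULL
    SET node
  ) AS h
""")
orgChart.show()
```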

00:12:04

We provide that familiar OLAP look and feel for you to create star schemas on top of that
unstructured data. So this allows experienced business analysts to derive useful insights from
contextual data.

00:12:20

Since we work natively on Hadoop, it goes without saying that we have to support the
specific file formats. So, for example, supporting HDFS.

00:12:30

And other file formats, like Parquet, ORC, or CSV files that are stored in the HDFS
layer. The other advantage that Vora provides in terms of features is the capability of
just-in-time compilation.

00:12:46

So this is a JIT compilation which is using the LLVM technology to translate SQL code into C
programs so that you can process the data much faster.

00:12:58

Obviously, when we talk about Hadoop and the open-source environment data platforms, there
is a large volume of data. So optimization in terms of performance is one of the key things.

00:13:10

And using this LLVM and Clang allows us to provide that performance optimization when
processing large volumes of data and doing things like aggregations and other kinds of
projections on top of the data.

00:13:30

So, from a strategic point of view, what does Vora provide? I talked about the benefits, so
again you have Vora here in the middle, and you have the Hadoop and Spark ecosystem on
the right.

00:13:41

We provide additional functionality for enterprise applications, for things like hierarchies and
OLAP modeling with the Vora modeler. The Vora modeler will be covered in Week 2.

00:13:51

It is a Web-based modeling tool which allows you to join data that is in your Hadoop
systems. So typically you read from a Hadoop file system, and you write that into the in-memory
snapshot of Vora,

00:14:05

where you have the base tables. And then you build star schemas on top of those base tables,

00:14:10

where you can join those, you can create your dimensions, you can create your facts, and you
can add additional calculated measures if you want.

00:14:22

You can do all those things as part of that modeler. And in the future, you will also be able to
bring in HANA data.

00:14:28

So, for example, you have a calc view in HANA. How do I bring that into the Vora modeler,
create those joins, and then make those available for visualization?

00:14:37

So those are the capabilities of the OLAP modeling within Vora. And we want to do all of this
without compromising performance.

00:14:46

So that's where boosting the SQL performance is one of the important aspects. And this is
where we now depend on Apache Spark from a Spark Core and Spark SQL standpoint.

00:14:59

But beyond that, Vora also provides that integration into HANA. So: federated access across
SAP HANA and Hadoop.

00:15:07

Like I said earlier, it's providing virtual access from Vora or Hadoop to HANA, or providing
virtual access from HANA to the Hadoop layer in terms of being able to consume that HANA
data

00:15:22

to be processed for data scientist workflows, predictive workflows, or even for doing Big Data
analytics on top of that data. And finally, we want to provide that integrated tooling, which you'll
see in the future releases,

00:15:35

where we have a seamless experience when a user is either coming in from HANA or from
Hadoop. And how do we build in other aspects like security and data governance?

00:15:46

Those are some of the things that we're looking to provide in the future versions of Vora as we
go about building this product. This completes Unit 1.

00:15:57

Thanks for your time. And in Unit 2, we'll be covering topics like "what do Hadoop and Spark
do?"

00:16:04

And we'll introduce you to the Hadoop and Spark environments. Thank you.

Week 01 Unit 02
00:00:10

Hi, welcome back to Big Data with SAP HANA Vora. This is Unit 2 of Week 1.

00:00:17

And in this session, I'll be talking about what Hadoop and Spark do. Why is it relevant for SAP
HANA Vora?

00:00:30

So, Hadoop and Spark play a major role in SAP HANA Vora. And in this unit, I'm going to be
talking about how customers use Hadoop:

00:00:42

they can get it through partners like Cloudera, Hortonworks, or MapR. Or they can even
download it from open source through the Apache Software Foundation.

00:00:56

Some of the motivations in terms of using Hadoop are the emergence of digitalization
across the different industries, and the fact that this digitalization is enabling customers across the
different industries to track new kinds of data signals.

00:01:19

And what I'm showing here is some of the new kinds of data signals that have emerged in the
past 10+ years. Things like being able to capture what the sentiment of your customer or
consumer is.

00:01:34

Most of the sentiment data is generated when you're looking at
products that you purchase:

00:01:41

you want to provide input, whether good or bad, so you post your comments on Facebook,
Twitter, LinkedIn, or any of the other social media Web sites.

00:01:51

It could just be a picture that you're uploading on Instagram. This is feeding that Big Data from
a sentiment data standpoint.

00:02:00

Or it could be other kinds of corporate data, like, for example, server logs. Think of a
data center within your own company or within any of the other customers,

00:02:12

or even in terms of the cloud providers. They have a large number of data centers, and it is
important for them to capture these logs

00:02:22

so that they can anticipate what might happen in the future and try to resolve problems before a
catastrophe actually happens in the data center, whether there is a node failure, a disk failure,
a memory failure, or even an application failure.

00:02:38

So this information is stored in logs, and there are several companies who build major
businesses out of their server logs. So this is a new kind of data that is being monitored on a
regular basis.

00:02:51

Talk about the Internet of Things, and all the sensor information that is being embedded
across the different industries. Talk about manufacturing:

00:03:02

we have sensors installed on all the manufacturing units. Talk about automotive: there are
sensors installed in your cars.

00:03:12

So they're constantly streaming new data that is important for the proper functioning of that
machine, whether it's a car, a manufacturing machine, or a refrigerator in your house.

00:03:30

All these new kinds of data are adding to that data deluge, which is leading customers to say:
"Hey, now I cannot store this data in my traditional database.

00:03:42

I'll have to look at a much simpler and more cost-efficient way of storing this data". That's
where Hadoop comes into play, in terms of providing that Hadoop file system layer where you
can store the data in its raw format.

00:03:59

This could be data coming in from sensors, or it could be clickstream data which is again being
captured as I go into a Web site, and I'm looking at, say, a shopping Web site

00:04:09

and I want to browse through four or five different products before I finally pick the product that
I want to add to my shopping basket. This is another kind of data which is being monitored to
provide recommendations

00:04:23

based on the buying or browsing patterns that consumers might have. And finally, all that
unstructured data which is stored in this Hadoop file system,

00:04:35

whether it's PDF files, text files, or Word documents. All these different kinds of files which are
stored are enabling us to look at Hadoop as a new data platform

00:04:50

within the organization to store these new kinds of data. Sorry, I forgot to mention geospatial
data. That is another important thing.

00:04:59

We all have handheld devices on us, sometimes more than one. And we are giving out
information about what we're doing and where.

00:05:10

So the "where" aspect, the latitude and longitude, is very important for a lot of companies,
whether it's retail companies, whether it's insurance companies,

00:05:19

to be able to look at what my consumer is doing and provide the best services that he or she
needs, depending on where they are at that point in time.

00:05:33

If you talk about Hadoop and Spark, and specifically on the Apache Hadoop framework, it
provides a whole range of projects.

00:05:43

So what I have here is an example from the open-source Apache ecosystem, as packaged by
Hortonworks and managed through Ambari. What you see here are the different projects that are
available as part of Hortonworks.

00:05:57

So depending on the kind of workloads that you're doing, for example, if you're doing a batch
or a SQL workload, you use things like Hive or Spark.

00:06:07

If you're doing NoSQL, that is, "not only SQL", then you use things like HBase or Accumulo. Or
if you're doing other kinds of things, like you want to stream data, then you use Storm,

00:06:19

or in Apache Spark you have Spark Streaming, so you can use that. And in certain cases, if I
want to do machine learning,

00:06:27

there are specific projects that are part of this Hadoop and Spark framework. So Hadoop
provides that extended framework where you can store the data, and you can use things like
resource management

00:06:39

from YARN, which stands for Yet Another Resource Negotiator. You use YARN to allocate
resources for your task, in terms of the JVMs that are required,

00:06:52

how much memory is required, and how much CPU is required for processing the job. And finally,
once you have this framework, which is in the middle here,

00:07:02

now you want to provide tools. For example, you could use Zeppelin, which is a developer-friendly tool for you to run Scala code or Python code on top of that data.

00:07:15

So typically, all this data is unstructured and I cannot immediately run SQL on top of it. And
that's where Vora kind of adds value. We'll talk about it later.

00:07:25

But if you want to use Zeppelin and you want to interact using Python, Scala, or Java, you can
absolutely do that. Ambari also provides user views for you to be able to deploy this
ecosystem.
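
To make the Zeppelin interaction just mentioned concrete, here is a small Scala sketch of the kind of exploration you might run in a notebook paragraph: counting errors per host in raw server logs stored in HDFS. The path and log layout are illustrative assumptions.

```scala
// sc is the SparkContext that Zeppelin predefines for %spark paragraphs
val logs = sc.textFile("hdfs:///logs/app/*.log")

logs.filter(_.contains("ERROR"))                   // keep only error lines
    .map(line => (line.split(" ")(0), 1))          // assume the first token identifies the host
    .reduceByKey(_ + _)                            // errors per host
    .top(10)(Ordering.by(_._2))                    // the ten noisiest hosts
    .foreach(println)
```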

00:07:39

So, for example, there are Ambari blueprints which allow you to deploy Hadoop nodes based
on what your ideal requirements are. Some of the other projects that you see on the left-hand
and right-hand side are core data center-specific topics.

00:07:53

For example, security is something that each and every customer needs, whether they are a
small, medium, or large enterprise. They need to have proper security in terms of roles and
authorizations

00:08:05

that are required for me to expose this Hadoop data to my business users, to my end users. Or,
how does the data-at-rest security work? How does the data-in-motion security work?

00:08:16

So all those things are provided by some of the open-source projects like Ranger. Or in terms
of Cloudera, there is a similar thing called Sentry that can be used.

00:08:26

In terms of governance and integration, for example data lifecycle management and
governance, we have projects like Falcon and Atlas.

00:08:36

SAP is one of the contributors on Atlas, where we are trying to see what we can do in terms of
data governance. Other kinds of things, like Kafka, are for being able to ingest the data from your
transaction system,

00:08:51

whether it's IoT, a sensor, or any other kind of system, to be able to ingest that data into
Hadoop, you can use things like Kafka, Flume, or Sqoop.

00:09:01

These allow you to run specific rules, run specific ETL, before you load that data into Hadoop,
although in most cases the data is loaded into Hadoop in its raw format.
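
As an illustrative sketch of such an ingestion path, here is how clickstream events could be landed from Kafka into HDFS in raw form using the Spark 1.x direct-stream API. The broker address, topic name, and output path are placeholders, and Kafka/Flume/Sqoop offer other routes; this only shows one possibility.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder

// sc is an existing SparkContext; micro-batches of one minute
val ssc = new StreamingContext(sc, Seconds(60))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("clicks"))

stream.map(_._2)                                   // keep only the message payload
      .foreachRDD { (rdd, time) =>
        if (!rdd.isEmpty())                        // one raw file batch per interval
          rdd.saveAsTextFile(s"hdfs:///data/raw/clicks/${time.milliseconds}")
      }

ssc.start()
ssc.awaitTermination()
```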

00:09:14

And finally, we'll talk about the operational aspect of it, which is the provisioning, managing,
administering and monitoring of your landscape

00:09:22

by using Ambari or some of the other cloud-based Hadoop tools, which also provide some of
this simplified manageability and these operational aspects of Hadoop.

00:09:36

So the core of Hadoop is the bottom-most part, which is HDFS; YARN, which is that
resource management layer; and then you have some of these projects,

00:09:46

like Hive for SQL, or Spark for in-memory processing, and several other ones
being built on top of this.

00:09:56

So why is this Hadoop important for us, and how does Vora simplify this landscape? Now, what
you see here is the versioning of those different projects.

00:10:07

So as you go from left to right, starting from the core components, Hadoop and
YARN, which are currently at 2.7.1, which maps to Hortonworks version 2.4.

00:10:20

Or, in the case of Cloudera, it's going to be 5.6, and I think Cloudera already has 5.7 at the
time of this recording.And MapR also has the M5 and M7 versions available.

00:10:34

So what the major Hadoop vendors do is try to hide all this complexity in terms of being able to
maintain, manage, and upgrade these different projects.

00:10:47

Imagine I'm running some streaming and SQL workloads. So in this case, I want to be able to
use Storm here for my streaming,

00:10:57

and for my SQL, I'll either use Hive or, in some cases, if I already have Spark, I'll also use
Spark. If I have to upgrade only those specific projects, it's a nightmare for the IT system
administrator.

00:11:12

So what the major Hadoop vendors provide is a package where customers can go ahead and
upgrade their entire package, and that provides all the specific components.

00:11:26

And they will have done the testing to ensure that you have the right versioning of these open-source projects. And again, the reminder here is that Apache is an open-source project,

00:11:36

so all these things that you see here are open source. So if I want to run my own open-source
Hadoop, then I'll have to manage all this upgradability,

00:11:45

which again is a complex thing. I'm not saying nobody does it.

00:11:49

We do have several start-ups who do this. But predominantly for our SAP customers, we see
that they typically go with one of these preferred vendors.

00:12:00

They could also be using some other Hadoop distributions, like it could be Hadoop on the
cloud from AWS which uses EMR (Elastic MapReduce).

00:12:09

Or they could be using other cloud vendors like Microsoft Azure, and they could be using
HDInsight.

00:12:15

Or if they are using the IBM BigInsights, then they'll be using the IBM-specific Hadoop
distribution, which again uses some of the Ambari concepts to be able to administer, deploy,
and manage the Hadoop nodes.

00:12:29

So the key idea here is to go with one of the strategic Hadoop vendors so that you personally
as a customer don't have to go through the complexity of looking at the different versions here.

00:12:44

So in terms of our overview of what a Hadoop cluster looks like, the three main components of the
Hadoop cluster are the master nodes, which primarily look after the infrastructure;

00:12:58

the worker nodes, where the data is stored. So in Hadoop, the data is replicated three
times so that it takes care of the data redundancy aspect.

00:13:08

In case one of the nodes goes down, there are still two other nodes that can serve the end
users. And the node that went down will be brought up immediately.

00:13:19

And finally, I talked about YARN, which is the resource manager. So here I'm just talking about
what a simple Hadoop cluster environment looks like,

00:13:29

with the master nodes which are running some of the projects, like Oozie Server, ZooKeeper,
or HiveServer2 to process the Hive SQL workloads.

00:13:39

The Ambari server, which again is Hortonworks-specific, but it could be similar if you're
running Cloudera and using Cloudera Manager, or if you're running MapR and using the MapR
Control System.

00:13:52

And then you have the worker nodes, where you have the NodeManager, and you can have
the RegionServer on these as well.

00:13:58

So basically, the DataNode is where all the data is stored, and Vora is using the data that is
stored in these Hadoop nodes to be able to process that.

00:14:12

So, just a quick screenshot of how Vora looks when installed through the different Hadoop
administration tools. The three main Hadoop vendors that we support today are MapR,
Cloudera, and Hortonworks.

00:14:29

And in this screenshot, you see that we already have the Vora services showing up in each
one of those. Obviously, in future releases we will try to provide much more simplified
administration.

00:14:41

So, for example, if there are more and more services that we are providing, we'll probably have
the base server which shows up as part of your admin tool.

00:14:49

And then we'll manage the services as part of a separate UI that can be launched from your
Ambari or Cloudera Manager. But the idea here is that Hadoop administrators do not have to
learn Solution Manager or any of the other SAP-specific tools to install Vora.

00:15:08

They do it using the Hadoop administration tools. Now, switching a little bit, I'm going to talk
about Apache Spark.

00:15:18

Apache Spark is a fast and general engine for large-scale data processing. It provides an
advanced DAG execution engine that allows you to process the data and store it in
memory.

00:15:37

So one of the key differentiators between Hadoop MapReduce and Spark is that Spark
provides this completely distributed framework which allows the data to be stored in memory,
so you can process the data much faster.

00:15:54

Spark has a whole stack of libraries to perform different kinds of workloads. For example, if you
want to do just SQL, you can use Spark SQL.

00:16:03

If you want to do machine learning, artificial intelligence, or predictive, then you can use MLlib or
SparkR. What Spark has done for R is take all those packages and allow them to run on
distributed landscapes.

00:16:19

So that allows you to process much larger volumes of data using SparkR. It also provides some
of the things like DataFrames and machine learning pipelines on top of these frameworks.
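
Here is a minimal Spark 1.x example of the DataFrame and Spark SQL layer mentioned above. The JSON path and field names are illustrative; the point is simply that structured operations run on distributed data without hand-written MapReduce code.

```scala
import org.apache.spark.sql.SQLContext

// sc is an existing SparkContext
val sqlContext = new SQLContext(sc)
val events = sqlContext.read.json("hdfs:///data/events.json")

// Declarative filter and aggregation over a distributed dataset
events.filter(events("status") === "ERROR")
      .groupBy("host")
      .count()
      .orderBy("host")
      .show()
```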

00:16:31

And at the core of Spark, it also provides this data source API, which is what SAP has used as
part of Vora to be able to connect to HANA as a remote data source.

00:16:43

There are open-source projects as part of Spark packages which you can go and look up. They
provide connectivity to Cassandra, MongoDB, MySQL, and other open-source systems.

00:16:53

We've taken that framework API and provided connection to HANA so that we can expose that
OLTP data or OLAP data that is in HANA to Vora using this Spark framework.
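
As a hedged sketch of what that connection looks like from the Spark side, here is a read of a HANA table through the data source API. The format name com.sap.spark.hana follows the Vora documentation pattern, but the option keys (host, instance, user, passwd, dbschema, path) are assumptions that may differ by release, and the credentials and table name are obviously placeholders.

```scala
// Reuses the sqlContext from the previous example
val salesOrders = sqlContext.read
  .format("com.sap.spark.hana")            // Vora's HANA data source (name per SAP docs)
  .options(Map(
    "host"     -> "hanahost",
    "instance" -> "00",
    "user"     -> "SPARK_USER",
    "passwd"   -> "secret",
    "dbschema" -> "SAPERP",
    "path"     -> "SALES_ORDERS"))         // the HANA table or view to expose
  .load()

salesOrders.registerTempTable("sales_orders")   // now queryable from Spark SQL
```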

00:17:08

So this ends Unit 2 of Week 1, and we'll be back with you with Unit 3 in a short while.

00:17:14

Thank you.

Week 01 Unit 03
00:00:10

Hi, welcome to Week 1 Unit 3 of Big Data with SAP HANA Vora. In this unit, we'll be talking
about the key features of SAP HANA Vora.

00:00:24

As explained in the previous units, SAP HANA Vora provides a distributed computing
framework at enterprise scale to be able to accelerate, innovate, and simplify your data.

00:00:36

It provides the business coherency that is required across enterprise data and Big Data,
especially when you are trying to build these mash-ups of your enterprise data

00:00:47

which could be sitting in an SAP system or a non-SAP system. You have your enterprise data,
and then you have these new kinds of data signals that we talked about in the previous unit,
00:00:57

which are stored in a Hadoop landscape. How do I combine these two to make sense of what's
happening in my social circles, in my consumer environment,

00:01:07

and tie that into what's happening in my point of sale or what's happening in my MRO
system? So, being able to bring these two together, Vora provides that layer on top of Hadoop.

00:01:21

It allows you to enable high-performing and flexible analytics in terms of innovation. It allows
you to deliver enterprise-grade and advanced analytic capabilities.

00:01:32

So, being able to deliver these enterprise-grade analytics. And what we mean by enterprise-grade analytics is not just doing your basic reporting SQL on top of it,
00:01:43

but bringing in complex functions. For example, I have loaded all of my data into Hadoop, and now I
want to do things like currency conversion.
00:01:52

How do I provide this currency conversion as a business function which can be called from the
Vora modeler? Or I want to do things like unit of measure conversion.

00:02:01

Or handle rounding errors. Instead of coding all of these using SQL or Scala, which can be done if
you have all those development resources at your disposal, what we want to do with Vora is
provide these enterprise-analytic features out of the box

00:02:20

that can be called as part of your Vora modeling. So we provide the Vora modeler, and as part
of your Vora modeler you can bring in these business functions.

00:02:29

You can call them as part of that, and then build fully scalable OLAP models that can then be
exposed to a BI client to be able to visualize that Big Data environment.

00:02:43

So that's where innovation and simplification go hand in hand in terms of being able to easily
combine that enterprise data with Big Data, and also provide these business functions out of
the box so that you can process the data much faster

00:03:01

now that Vora provides that performance optimization and also the SQL engine to be able to
understand this data without having to go through complex coding like Python, Scala, or Java.

00:03:18

So in terms of the Vora SQL engine, which is at the core of SAP HANA Vora, on this slide,
what we are highlighting is some of the key concepts of Vora's relational in-memory SQL
engine,

00:03:31

which is designed for efficient SQL execution in a distributed cluster. It supports hierarchical
data and OLAP modeling,

00:03:40

as well as in-memory RDBMS kinds of functions, along with a columnar store and main-memory,
cache-efficient algorithms.

00:03:50

It allows you to parallelize your operations much faster, provides fast column scans and
dictionary encoding.

00:03:58

We use things like byte compression instead of bit compression for references to the
dictionary, which allows you to parallel process this data much faster.
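
Here is a conceptual Scala sketch of dictionary encoding with byte-wide references: the column stores one byte per row pointing into a small dictionary, so a scan compares bytes instead of strings. This is plain Scala to illustrate the idea, not Vora code.

```scala
val column = Array("DE", "US", "DE", "FR", "US", "DE")
val dictionary = column.distinct.sorted               // Array("DE", "FR", "US")
val position = dictionary.zipWithIndex.toMap

// One byte per row, referencing the dictionary entry
val encoded: Array[Byte] = column.map(v => position(v).toByte)

// A scan for "US" now touches only bytes, not strings
val usCode = position("US").toByte
val usCount = encoded.count(_ == usCode)
println(s"US rows: $usCount")
```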

00:04:09

This makes the code much smaller and simpler and reduces the complexity for the code
generation.

00:04:15

We also apply a variety of cache-efficient algorithms, for example in relational operators like
aggregation,

00:04:23

or joins, which are all directly embedded in the generated code. So things like the code
generation and compressed columns allow us to store large volumes of data in that in-memory
structure,

00:04:36

in that in-memory snapshot, when we build those Vora tables. Vora was created from scratch,
which means we did not just take the source code of HANA and make it run on Hadoop.

00:04:51

It came out of a PhD thesis by one of the students, who is currently the lead architect for
Vora. And the idea of being able to provide these in-memory snapshots on the Hadoop data
allows us to process the data much faster.

00:05:11

It's available on multiple platforms, but for this conversation with Vora, as it is externally
available as a product,

00:05:18

it currently runs only on Linux. We support various flavors of Linux: it could be CentOS, Red
Hat, or SUSE.

00:05:26

But we're also looking at how this Vora SQL engine can be trimmed down to be able to run
it on, say, sensors. You hear a lot about the fog computing aspect, which today is just
collecting data from sensors.

00:05:44

But if there's a way for me to also provide some kind of reporting on the sensor, how do I
provide these capabilities?

00:05:53

So this is possible when using, for example, the Vora engine and allowing it to
provide analytics at the edge. These are, again, research topics at this point,

00:06:08

but given the nature of the source code, Vora provides multiplatform support which can be
used across different layers. But for this conversation, we're mainly going to focus on Hadoop
and Spark-specific use cases.

00:06:26

So, switching to the next slide... Here I talk about how we use Vora's LLVM technology to
translate SQL code into C programs.

00:06:40

Typically, a database engine transforms an incoming SQL query into a query plan, which
is then interpreted by an execution engine to compute the result.

00:06:51

The interpretation of the query plan can be quite expensive, and it forces the splitting into
coarse-grained independent operators. But in the case of Vora's relational in-memory engine,

00:07:03

we turn as many relational operators as possible into a single piece of generated native
code. In the best case, the whole query is translated into a single function.

00:07:13

To translate a SQL query into one piece of generated native code, we take an optimized query
plan and generate a piece of C code for every operator in the operator tree using the
produce-consume model.

00:07:28

And this results in one or multiple C functions which are then further translated into native
executable code by the LLVM compiler framework.
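
To illustrate what this fusion buys, here is a conceptual sketch of the single function that a query like SELECT SUM(amount) FROM sales WHERE region = 'EMEA' collapses into. Vora actually emits C code compiled by LLVM; this Scala version only shows the shape, with scan, filter, and aggregation fused into one tight loop and no intermediate operator results.

```scala
// regionCodes: dictionary-encoded region column; amounts: the measure column
def fusedQuery(regionCodes: Array[Byte], amounts: Array[Double], emea: Byte): Double = {
  var sum = 0.0
  var i = 0
  while (i < regionCodes.length) {
    if (regionCodes(i) == emea)   // scan + filter + aggregate in a single pass
      sum += amounts(i)
    i += 1
  }
  sum
}
```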

00:07:39

So this allows us to translate the SQL code into C code and execute it much faster when
processing large volumes of data. So, talking about the distributed in-memory computing
architecture

00:07:55

(I talked a little bit about this during Unit 2), here is another view of the benefits that Vora
provides when running side by side with HANA.

00:08:08

What you see here on the left is HANA; you can have multiple nodes of HANA, scale-out
HANA or scale-up HANA. On the right side, you have Vora, which is running natively in the
Hadoop ecosystem.

00:08:20

So it has data locality, meaning it can look at which files are stored on that specific Hadoop
node and process them. It can work with YARN to get the resources,

00:08:32

for example how much memory and how much CPU I need. And then the key thing I want to
focus on here is that we are working across all the different Hadoop distributions.

00:08:44

Today, we support Hortonworks, MapR, and Cloudera, but we will also support the other
distributions, and also the Apache Software Foundation distribution, in the future.

00:08:53

The main benefits of Vora are being able to integrate the SAP data with the data lakes or
being able to do analytics on the data lakes,

00:09:04

provide the HANA connectivity with Hadoop, provide enterprise analytics like hierarchies for
SQL on Hadoop,

00:09:11

and finally, we are looking at how to tap into some of these new use cases like archiving of
ERP data and how to use Vora to bring back the data as part of the Vora SQL engine.

00:09:28

Some of the key features that are available as part of version 1.3, which is the latest release
that was made available for customers as a generally available product in September.

00:09:43

And we have several customers who are using this. Some of the main features are things
that I talked about in the previous units,

00:09:50

like being able to install it using the Hadoop administration tools. We support the most recent
version of Spark.

00:09:58

I wrote 1.5.2, but with 1.3 we also support 1.6.x. So it could be 1.6.1 or 1.6.2, depending on
whatever your Hadoop vendor provides.

00:10:09

We do support the latest version of Spark, so my apologies for writing 1.5.2 here.

00:10:15

We leverage the Spark Core and Spark SQL APIs. I talked about the Vora modeler that allows you
to build SQL on top of Hadoop.

00:10:23

We have introduced new services like the distributed log, or DLog, and Discovery
services. The distributed log allows you to read and store all the metadata and
catalog information that is part of Vora.

00:10:38

The Discovery service allows you to see where the different Vora services are running, and this
uses the open-source components Consul and Nomad, which are part of Vora,
00:10:50

to be able to deploy the different services, the client component and the server component, and have
them talk to each other. And finally, we extend the IDE support using some of the Web
notebooks like Jupyter and Zeppelin.

00:11:07

With 1.3, one of the key features I want to mention is that we support new kinds of processing
like time series, graph, and DocStore. For example, if you have NoSQL kinds of requirements,
you can use the Vora document store.

00:11:20

We also provide a disk store, where we want to be able to utilize features like spill-to-disk if our
data cannot be completely stored in memory. Those are the 1.3-specific key features.

00:11:36

Some of the latest innovations: I talked about the graph engine. Vora embeds an in-memory graph database for real-time graph analysis.

00:11:44

The primary focus of this is complex, read-only analytical queries on very large
graphs. For time series, Vora provides a highly distributed time series engine

00:11:57

which can be used to support storing and analyzing time series data. We also enhanced the
user credential handling for the SAP HANA data source, that is, the Spark data source.

00:12:10

We've enhanced that as part of our 1.3 release. And finally, we have enhanced performance by
providing an optimizer extension to auto-detect join push-downs for co-located joins.

00:12:23

Co-located joins and partitions are things that Puntis is going to cover in Week 3. This
completes Week 1 Unit 3.

00:12:33

Thank you for staying with me, and in the next unit we'll be talking about the deep dive on HANA
and Vora integration.

Week 01 Unit 04
00:00:10

All right. Welcome back to Week 1 Unit 4 of Big Data with SAP HANA Vora. In this session, I'm
going to be talking about SDA and SAP HANA connectivity.

00:00:22

For those who are not familiar with SDA, it is Smart Data Access, which is a federation
technology that we provide as part of HANA. And as part of today's session, I'll be talking about
how Smart Data Access is enhanced using SAP HANA Vora

00:00:38

to provide much more optimized connectivity to SAP HANA. So before we go into the details of
the Vora-specific connectivity as part of SAP HANA,

00:00:51

I just want to take you on a journey back to when we started the integration between HANA and
Hadoop. And in what you see here on this slide, I want to draw your attention to these
different interface lines.

00:01:07

We started this journey back in SPS 12 of SAP HANA. I think that was some time in 2012.

00:01:15

That's when we introduced the fast connection from HANA into Hadoop. And this was provided
using the virtual tables to Hive.

00:01:24

So in Hadoop, you build a relational structure on structured data that is sitting in HDFS. And
you call those the Hive tables.

00:01:32

So what we provided with Smart Data Access in SPS 06 was allowing you to expose these
Hive tables as virtual tables to be joined with data that is in HANA.

00:01:44

And you create these calc views, which combine the data from Hadoop and HANA. So you build
these calc views and then make them available for visualizations or for your applications.

00:01:58

We enhanced this Hive connectivity with SPS 07 by providing a remote caching
feature. Remote caching allows you to materialize the data on the Hive side.

00:02:09

So one of the questions we initially got was... When you provide this join between HANA data
and Hadoop data, obviously the data in HANA is in the main memory.

00:02:19

The data in Hadoop back then was mostly on disk, so I had to execute a MapReduce job on
the fly when a join was executed as part of a BI query.

00:02:29

This was very slow and time-consuming, so what we provided was the ability to materialize the
Hadoop data as a Hive table, and use that Hive table to join with HANA when you are building
this calc view.

00:02:44

This overcomes some of the performance issues. Obviously, the performance is not the same
as one would have when the entire dataset is sitting in HANA's main memory.

00:02:55

Customers did understand that when they were connecting using remote sources, there was
a slight delay when joining this data. And most of the customers were okay with that.

00:03:07

In some cases, they had to physically ETL the data from Hadoop into HANA. In SPS 08 and
SPS 09 of HANA, we provided these MapReduce functions as part of a virtual user-defined
function.

00:03:22

So typically, as part of your calculation view, when you're building your calc view, you create a
virtual user-defined function which is going to invoke a MapReduce job on the Hadoop side.

00:03:34

As you can see, this bypasses the Hive layer, and this was roughly back in late 2013 when
Apache Spark started receiving a lot more interest.

00:03:46

Now with Spark, you can provide real-time access to the data that is in Hadoop. You don't have
to go through batch mode.

00:03:54

You can run some of the real-time functions, since Spark allows you to store the data in
memory. So we bypassed Hive, and we went directly to user-defined functions and HDFS.

00:04:07

With SPS 10, and even before SPS 10, we introduced this connectivity to Spark using
Apache Shark, which was the predecessor of Spark SQL.

00:04:18

So we from SAP provided, or rather used, an ODBC driver to connect to Shark
and access the data that is in Spark.
00:04:29

But with SPS 10, we created our own Spark controller, which is Scala code that our
development team wrote in order to provide the connectivity between HANA and Hadoop.

00:04:41

So as you can see, before that we were using third-party drivers, like the ODBC drivers from
Simba. With SPS 10 and above, we use our own native code,

00:04:51

so it eliminates the requirement to install third-party ODBC drivers on the HANA side. And it
also allows us to not have any bottlenecks

00:05:03

in terms of one of the core processes within the HANA index server being destabilized by these
non-SAP-delivered, third-party ODBC drivers. Some of the things that we provide as part of
the Spark controller:

00:05:16

It is installed on the Hadoop nodes.It interacts with the Hive metastore.

00:05:21

So whether you have data stored in the Hive metastore or in the Spark catalog, you can
expose those as virtual tables.

00:05:28

And again, from a workflow standpoint you continue to use the HANA modeler with either the
studio or the Web-based modeling tool Web IDE,

00:05:38

to model your data irrespective of whether the data is in HANA or in Hadoop, and then make
those available for visualization.

00:05:47

Some of the key benefits as part of this are obviously that it provides deep integration for both
storage and processing, and with Spark controller we went from read-only access, which is
what we had with ODBC and Hive.

00:06:02

Back in the day, Hive was still read-only; it didn't have insert/update capabilities, but now it
provides those as well.

00:06:09

But we are focusing primarily on the Spark controller. And with the Spark controller, we provide
the ability to write data into Hadoop as well.
00:06:16

And I'll talk about that as part of data lifecycle management. When customers are using
HANA and they want to offload some of their older data into low-cost storage,
00:06:30

they have an option of using this tool called data lifecycle management and displacing the data
from HANA into Hadoop using the data lifecycle management tool.

00:06:39

So that allows us to write data into Hadoop as well, and this is done using Spark controller.

00:06:45

We also provided a unified administration tool by enabling the HANA cockpit to display tiles for
your Hadoop administration, for example Ambari.

00:06:59

So what we did was: as part of your HANA cockpit, you can add a new tile, and in that tile you
can now show your Hadoop nodes,
00:07:07

and you can go in there and monitor your Hadoop nodes as well. So it's kind of a unified
administration as part of the SAP HANA cockpit,

00:07:17

to be able to manage not only your HANA assets, but also your Hadoop assets. So again, as I
mentioned earlier in this talk,

00:07:29

there were several layers, several interfaces that we used. These were multiple iterative
innovations that we provided as part of the connectivity between HANA and Hadoop.

00:07:40

And finally, we settled on the Spark-specific integration using the Spark controller. And you'll
also see how Vora takes this to the next level, in terms of where SDA stops

00:07:53

and Vora can take over to provide much more optimized connectivity between HANA and
Hadoop. So, going a little bit deeper into the Spark controller:

00:08:04

what it is and how it provides connectivity and optimization. The Spark controller is part of the SAP
HANA delivery.

00:08:12

So it is shipped as part of SAP HANA. There's an additional add-on that you have to install,
00:08:18

and this is installed on the Hadoop nodes. It allows you to manage the connection and
communication to the remote Spark controller,

00:08:26

which is an SAP-deployed component on the Hadoop cluster. It manages the access to the Spark
controller services.

00:08:34

What you see here in the graphic is that the Spark controller runs as an in-process service
within the HANA index server. So the HANA index server, which is the core process, is
managing this Spark controller,

00:08:49

and it provides the connection to the Hadoop cluster as part of the controller, which is also
running on the Hadoop side. It allows us to translate the SAP HANA requests and deserialize
the data that's received from Hadoop back into SAP HANA.

00:09:07

It simplifies access to Hadoop, as in you can browse through the Hadoop data structures: you
can go through the Hive metastore, and you can look at all the different tables that are in the Hive
metastore.

00:09:18

And we also provide optimizations as part of creating statistics histograms on those remote
datastores so that when you are doing the joins, there's a much faster join that happens

00:09:32

between the data that is in HANA and the data that is in Hadoop through the Spark
controller. Finally, it enables the creation of virtual tables through the remote source interface.

00:09:44

So you could be exposing a Hive table, you could be exposing an RDD, which is a Spark RDD,
or a DataFrame. All those will be available as virtual tables to be consumed from the HANA
side.

00:09:58

So this provides a unidirectional connection, to be consumed from HANA when the data is in
Hadoop. Now let's switch to how Vora enhances this.

00:10:11

And internally we kind of call this "SDA++", to highlight the additional functionality that Vora
enables as part of this integration. The first thing is that Vora is running natively on your
Hadoop nodes,

00:10:26

so it is a first-class citizen in that Hadoop infrastructure, if you will. What that means is that you use
the Hadoop administration tools to install and monitor Vora.

00:10:40

And Vora also has the data locality of where the individual data pieces are across the Hadoop
nodes. So it can use that Vora SQL engine to process the data faster, using some of the
optimizations which I talked about,

00:10:56

like the LLVM optimization, which is able to process the data much faster and then join
the data that's coming from the distributed nodes.

00:11:05

In some cases, we do replicate the data. For example, when you're doing joins, we replicate the
dimension tables on all the nodes.

00:11:15

And for the fact table, we allow the capability of co-located joins, which is another topic that
Puntis is going to be talking about during the development of Vora in Week 3.

00:11:30

So what you typically have on the HANA side is a data mart or a data warehouse that is
running on HANA. Or it could be an ERP application: you could be running SAP ERP on
HANA, or you could be running S/4 on HANA.

00:11:46

And the advantage with Vora is that it applies to all the assets that are running on HANA. So let's
take a perfect example, which is: my ERP system is running on HANA.

00:11:59

So I have my S/4 system running on HANA, and as part of S/4 I get access to this business
content which is called CDS (Core Data Services) views.

00:12:09

So we provide these views for different subject areas and you can expose this content, the
CDS views, as a virtual function to be consumed by a Spark user using Vora.
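
As a hedged sketch of this outside-in exposure, here is how a Spark/Vora user might register such a HANA view as a temporary table and query it. The data source name and option keys mirror the HANA data source sketch shown in Unit 2 and are assumptions modeled on the Vora 1.x documentation; the host, credentials, and view name are placeholders.

```scala
// vc: a SapSQLContext as created in the earlier sketches
vc.sql("""CREATE TEMPORARY TABLE sales_view
          USING com.sap.spark.hana
          OPTIONS (host "hanahost", instance "00",
                   user "SPARK_USER", passwd "secret",
                   path "Z_SALES_CDS_VIEW")""")

// The HANA-side view is now queryable next to Hadoop data, with no ETL
vc.sql("SELECT country, SUM(net_amount) FROM sales_view GROUP BY country").show()
```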

00:12:21

And this is the connectivity using the enhancement that we made for Apache Spark, where we
use the Spark data source API

00:12:30

and build this function where you can expose HANA data or ERP data to a Hadoop application
or a Spark user who's coming in through the Vora interface.

00:12:41

We typically name these consumption patterns by where the user is coming from. If a user is
coming in from HANA, which is our primary use case, this is the inside-out pattern, where as part
of my inside-out consumption I am bringing in data from Hadoop and exposing it to HANA.

00:12:56

The other way around is outside-in, which means my user is coming in from Hadoop, Spark, or
Vora, and they want access to the data that is sitting in my HANA system.

00:13:08

Before Vora, the only way to get access to this SAP data was to use ETL technologies to
physically move the data from HANA into Hadoop,

00:13:22

and then have your data scientists execute whatever algorithms they are building on top of
Spark. But now with Vora, you can expose this as a virtual artifact,

00:13:33

which means you don't have to worry about delta loads or data consistency, because the data
keeps changing here as and when new data is loaded.

00:13:42

If you were doing ETL, you had to do the ETL data movement, the delta loads, into Hadoop as well.

00:13:49

But providing virtualized access means you don't have to worry about all those things. My view
is always available for Vora or Spark,

00:13:57

and it can consume the data as and when the data is updated on the ERP side or on the
HANA side. So it could be an S/4HANA application, which is an ERP application, or it could be
SAP BW.

00:14:11

In a BW scenario, if I have InfoCubes that are on my HANA system, I can expose those
InfoCubes as a calc view.

00:14:21

Once you expose this as a calc view, the calc view can be consumed by Vora using the data
source API. So Vora provides that bidirectional connectivity by enhancing Smart Data Access.

00:14:36

With Smart Data Access, I could only consume the Hadoop data from HANA. But with Vora,
you can also consume the SAP data or HANA data from Hadoop or Spark, using Vora's
capabilities.

00:14:53

So again, what are the advantages of using Vora over Smart Data Access? Some of the
primary ones are that it replaces existing ODBC and JDBC connectivity with direct access to
Hadoop and Spark

00:15:09

through HANA Vora using the Spark controller. As mentioned in the past, when we were
using these third-party ODBC drivers,

00:15:17

sometimes the ODBC drivers would cause inconsistencies in the HANA index server
process. The index server is one of the core processes in HANA.

00:15:27

By using these third-party ODBC drivers, sometimes, depending on the function calls that were
made, the index server would crash.

00:15:36

Now using Spark controller, we've removed that dependency completely away from HANA and
we install the Spark controller on the Hadoop side.

00:15:46

Plus, the code that we run for the Spark controller is completely owned by SAP. So in case
there are any issues, we have control over fixing them, so that our customers are not running into
"production down" scenarios,

00:16:02

as opposed to ODBC and JDBC, where we have to wait for third-party vendors. It provides
deeper integration with Hadoop because the Vora engine runs natively on each of the Hadoop
and Spark nodes.

00:16:14

So there's tight integration between Spark and Vora, where from the Vora side we've created
our own SQL context.

00:16:24

Just like you have the Spark SQL context, we've provided an SAP SQL context, which allows
you to use the Vora SQL engine to process the data that is stored in HDFS, S3, or any of the
supported data formats.

00:16:39

Future integrations to Hadoop and Spark from SAP HANA will be driven through Vora. So what
you're seeing here are just the initial steps of how we plan to deliver this seamless integration

00:16:52

across these thousands of nodes, which are HANA plus Vora. And we want to provide a much
more seamless integration between these two layers.

00:17:02

Not just for data transfer, but along other areas like administration, manageability, and data
governance: all the data center-related aspects.

00:17:12

How do you do single sign-on when you're connecting to a Kerberos-enabled Hadoop
cluster? Those are some of the things that we're already working on, to ensure that the roles
and authorizations that are built in one layer

00:17:26

can propagate the user credentials all the way to the Hadoop layers as well. Vora delivers
features for data consumption from both Hadoop and Spark,

00:17:39

and from SAP HANA natively, using calculation views. Like I mentioned earlier, in the
scenario it could be a CDS view, or it could be a BW InfoCube which can be exposed as a calc view.

00:17:50

And this extends the platform to data scientists. Again, when you're coming in from the
outside-in scenario, the data scientists can consume SAP data
00:17:59

without having to physically load that data into the Hadoop layer. So, to summarize the SDA
and Vora connectivity:

00:18:13

for providing the connectivity to HANA, we enhanced what we already had as part of ODBC and
virtual UDFs, and now, with the Spark controller and Vora, we provide a much more optimized
connection.

00:18:25

Customers who have been using SDA to connect to Hadoop can continue using Smart Data
Access and the Spark controller. But if they're looking at deeper, more optimized integration, then
our recommendation is to look at SAP HANA Vora and what it offers.

00:18:42

Beyond HANA integration, Vora also delivers OLAP-style reporting on Hadoop by taking
advantage of the enterprise analytics features like hierarchies

00:18:52

and supporting the native Hadoop file formats like Parquet and ORC. HANA Vora also delivers
performance optimizations,

00:19:02

since it runs natively in the Hadoop ecosystem and takes advantage of technologies like the
LLVM capabilities

00:19:11

to translate or convert SQL code into C code. This is what HANA Vora integration and
connectivity looks like at a high level.

00:19:22

This concludes Week 1 Unit 4, and all the content for Week 1. In Week 2, we will have a
different speaker, Puntis, who is my colleague from Vora Product Management.

00:19:38

She'll be going into more detail on the Vora modeler and how you develop using Vora. Thanks
again for your time, and we'll see you next week.
