
openSAP

Big Data with SAP Vora - Engines and Tools


Week 01 Unit 01

00:00:05 Hello and welcome to the openSAP course “Big Data with SAP Vora: Engines and Tools". My
name is Jason Hinsperger and I am a product manager at SAP, specializing in Big Data and
SAP Vora.
00:00:19 Balaji Krishna, senior director of Product Management, and I will be your instructors for this
course. So, what should you expect over the next five weeks?
00:00:30 Let’s take a look. The goal of this course is to provide you with an overview of the SAP Vora
Tools and analysis engines.
00:00:40 SAP Vora is an in-memory, distributed computing solution that helps organizations uncover
actionable business insights from Big Data.
00:00:48 Use it to run enriched, interactive analytics on both enterprise and Hadoop data, quickly and
easily. By the end of the course, you should have a basic understanding of the various
analysis engines in Vora,
00:01:03 as well as how to leverage the Vora Tools to extract maximum value from your Big Data. We
have also included supplemental exercises based on the Vora VINE project to help you digest
the material
00:01:16 and get some hands-on experience with the tools. VINE, short for Vora INteractive Education,
is a set of data and samples that introduce you to the use of various Vora features.

00:01:30 You'll be able to complete VINE-based exercises using the Vora on-premise developer edition,
which is delivered as a preconfigured virtual machine
00:01:39 and is available for download for free from sap.com. If you're interested in further education
beyond this course,
00:01:47 we also have an SAP classroom education course "HA500 – SAP HANA Vora" that you can
sign up for to get additional instruction. Now let's preview what we will cover during each week
of the course.
00:02:01 Week 1 covers an overview and review of Vora, including installation, configuration, and
review of the various services. Week 2 covers each of the analysis engines currently available
in Vora and how to get started with them.
00:02:16 We will cover the relational, time series, document store, graph, and disk analysis engines.
Week 3 covers the Vora Tools in detail, describing how to visually interact with the various
analysis engines.
00:02:32 During Week 4, we'll talk about deployment environments for Vora, considerations for sizing,
and provide some basic starting points for troubleshooting.
00:02:41 And of course, we finish up in Week 5 with our final exam. After successfully completing the
four-week course, you will have one final week to prepare for and complete the final exam
00:02:55 and earn your record of achievement. Throughout the course your feedback, questions, and
your ideas would be very much appreciated,
00:03:02 so please feel free to leave them in our discussion forum. So how do you get the points and
successfully complete the course?
00:03:10 Well, there are four graded assignments throughout the first four weeks of instructional
content, and each assignment is worth 30 points for a total of 120 points, which is half of the
total points available in the course.
00:03:23 The other half of the available points come from a final exam. And just like every openSAP
course,
00:03:29 you will need at least half of the maximum points available to pass the course and receive your
record of achievement. Now that we have our preview done, let's get started with topic 1, an
introduction and review of SAP Vora.
00:03:46 Pentaho CTO James Dixon has generally been credited with coining the term “data lake”. He
describes a data mart, a subset of a data warehouse, as akin to a bottle of water –
00:03:59 "cleansed, packaged, and structured for easy consumption" – while a data lake is more like a
body of water in its natural state. Data flows from the streams, which are the source systems,
into the data lake.
00:04:14 Users have access to the lake to examine, take samples, or just dive right in. We see here the
key qualities of the data lake.
00:04:25 Many enterprises have massive amounts of data flowing through their systems on a daily
basis. The value of some of this data to the business is clear,
00:04:35 and so this data is routed to the enterprise RDBMS and/or data warehouse for safekeeping
and analysis. However, a significant amount of data – some people actually say the majority of
data – that flows through the enterprise on a daily basis
00:04:53 has unknown value, or is in a format that is not conducive to an RDBMS. In other words, it is
semi-structured, unstructured, and possibly dirty.
00:05:05 Audio/video, e-mail archives, documents, Web pages, and click counts are all just some
examples of this. This is a problem because ingesting this data into the enterprise RDBMS
would be time-consuming
00:05:20 because of all the transformation and cleansing that would have to occur. This in turn
increases costs and will most likely impact the performance of the analysis of data
00:05:29 that is already known to add value for the enterprise. So, until the value of the data can be
determined and extracted, enterprises need an inexpensive, scalable location to store it,
00:05:44 either in the corporate data center or in the cloud, like Amazon S3, for example. Of course, this
also means that tools to access, explore, and analyze the data are also required.
00:05:56 This is the purpose of the data lake. SAP has a known and trusted presence in the enterprise
worldwide, with 76% of transactions touching an SAP system.
00:06:10 However, some estimates indicate that SAP solutions, including applications and database
solutions like HANA, still only account for less than 5% of the data footprint.
00:06:23 If the data lake accounts for the other 95% of the data footprint, containing raw data of all
types, both structured and unstructured, at SAP we want to enable enterprises to generate
value from this data.
00:06:36 We want to provide democratized access to it, enabling people from all parts of an enterprise
to explore and analyze the data, and combine it with enterprise data sources.
00:06:47 And in tandem with this, we want to provide enterprise controls like secure access and the
ability to track lineage – and all this without creating a data silo, or a set of data silos.
00:07:02 This is where SAP Vora enters the picture. SAP Vora is a distributed in-memory analysis
engine which provides enriched interactive analytics for the data lake.
00:07:14 In this picture in green, we see the components which make up the stack of services provided
by Vora, on top of a Hadoop-based data lake. We will talk in detail about these services in future
units.
00:07:27 They provide a single enterprise-ready entry point that allows different users and applications
to easily gain insight from a variety of data sources and types.
00:07:38 At a high level, SAP Vora has three big benefits. First, it enables precision decision-making by
providing more data context and coherence than ever before.
00:07:49 For example, allowing you to seamlessly access data stored in SAP HANA and Hadoop at the
same time provides you with a fuller picture – in other words, more context for your analysis.
00:08:02 Second, Vora democratizes data access for data scientists and other investigators of Big Data.
This means the data scientists can work with all the data in an enterprise through a single,
unified interface

00:08:17 but still leverage the strengths and capabilities of their tool or programming language of
choice. Finally, Big Data ownership is simplified, reducing the complexity of working with it
00:08:30 and making it easier to manage systems with hot, warm, and cold data. Not having to rely on
different tools and learn new APIs to perform analysis simplifies the landscape for the data
lake immensely.
00:08:48 At a technical level, Vora provides the ability to create hierarchy queries, do SQL drilldown,
perform data conversions, and many other SQL functions on Hadoop data.
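To make that concrete, here is a minimal sketch of what working with Hadoop data through Vora SQL can look like. The table name, columns, file path, and option names are illustrative and follow the pattern discussed in the data sources unit later this week; check the Vora developer guide for the exact syntax.

    CREATE TABLE sales (region VARCHAR(40), product VARCHAR(40), amount DOUBLE)
    USING com.sap.spark.engines.relational
    OPTIONS (files "/user/vora/sales.csv")

    -- drill down from region to product with plain SQL
    SELECT region, product, SUM(amount)
    FROM sales
    GROUP BY region, product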
00:09:03 It provides specialized analysis engines that give high-performance access to functions specific
to the nature of the data being analyzed. We currently have four analysis engines that we will
talk about during this course.
00:09:18 The graph analysis engine embeds an in-memory graph database for real-time graph
analysis. The primary focus of the graph engine is on complex, read-only analytical queries on
very large graphs.
00:09:32 The time series analysis engine provides efficient compression and analysis of time series
data. The document store analysis engine supports querying semi-structured JSON data,
00:09:45 providing NoSQL-type access to semi-structured data in your Hadoop data lake. The disk-to-
memory accelerator, or disk engine, provides relational capabilities without loading all the data
into memory,
00:09:58 providing the ability to efficiently analyze data in those cases where it is not possible to load all
the data into memory. You can also do data mashups, combining datasets from the various
Vora engines and other sources,
00:10:11 including Spark and SAP HANA, to provide even more context to your analysis and enhance
your insights. All this is done under a unified landscape.
00:10:22 So, for example, you can work with Hadoop data and HANA data in a single simplified
environment. Vora also comes prepared to participate in secure environments.
00:10:35 It provides full support for Kerberos-enabled Hadoop landscapes, supports reading from
encrypted HDFS zones and supports the use of ACLs (Access Control Lists) on HDFS.
00:10:48 Vora cluster management components also support Transport Layer Security (TLS) for
communication between the various services in the Vora cluster.
00:10:59 In addition, the Vora components provide password-controlled access to prevent unauthorized
access to service configurations. And finally, though Vora does not function inside of YARN, it
does respect the use of CGroups
00:11:14 that allows you to control resource utilization and help you manage your Vora deployment
inside of your cluster. That concludes this unit. Thank you.

Week 01 Unit 02

00:00:06 Hello and welcome to Week 1, Unit 2, "SAP Vora Installation and Configuration". SAP Vora is
available from a variety of locations.
00:00:15 As a developer, you can access the Vora developer editions in both an on-premise and in-
cloud fashion. For the purpose of this course, we are using the Vora 1.4 on-premise developer
edition, a VM that is downloadable from sap.com.
00:00:33 When you are ready to deploy into production, the SAP Vora installation can be found on SAP
Service Marketplace or you can deploy on Amazon from the AWS Marketplace.
00:00:46 For this course, we are using the pre-configured developer virtual machine which contains all
the required components for running Vora. However, I want to take a minute to go over the
installation process at a high level
00:00:58 to give you an idea of what's involved in deploying Vora into an existing cluster. The
prerequisites for installing Vora are documented in the Vora Installation and Administration
Guide,
00:01:10 as well as on the PAM (Product Availability Matrix) Web page at sap.com. Specific versions of
these required components such as operating system, Hadoop and Spark versions, can be
found here.
00:01:23 Vora runs on-premise in the popular Hadoop cluster distributions provided by Hortonworks,
Cloudera, and MapR. In version 1.2 of Vora, the installation of Vora involved configuration of
many services inside the cluster management infrastructure.
00:01:40 A significant amount of work had to be done to support this for each Hadoop vendor we
wished to support. In Vora 1.3 and later, we decided to simplify this process by providing a
single management service for Vora
00:01:55 that contained its own UI for configuration of the Vora services. This allows us to provide a
more consistent experience regardless of the Hadoop distribution
00:02:05 and also allows us to spend more time expanding the analytic capabilities of Vora rather than
struggling with configuration issues for each environment.
00:02:14 Prior to Vora 1.4, SAP Vora also contained different installers for each of the major Hadoop
platforms. This process was simplified in Vora 1.4 via the introduction of a common command
line installer
00:02:27 which takes care of distributing the required components to the cluster. In the first step of the
install process, we deploy the Vora Manager to the cluster nodes
00:02:37 and make it available in the cluster management tools. This deployment is done in several
phases.
00:02:44 First, the installer process collects all the information about the Hadoop cluster onto which
Vora will be installed and prompts for the initial user name and password that the Vora
Manager will use.
00:02:59 This user ID and password are used by both Tools and the services configuration. In the
second phase, the Vora Installer installs the Vora Manager by installing the distribution-
specific vora-manager RPM file.
00:03:16 In the third phase, the Vora Installer distributes the RPM packages on the Hadoop cluster to
the various nodes. To do this, you can use SSH or you can leverage HDFS to make the RPM
available to any node which requires it.
00:03:30 In the fourth phase, the Cluster Manager needs to be restarted so it can pick up the Vora
Manager package for installation. And finally, after restarting the Cluster Manager, you can
configure the Vora Manager from the Cluster Manager console.
00:03:45 The Cluster Manager will use the RPMs you deployed in phase 3 to install the local software.
After that, the Cluster Manager is no longer necessary.
00:03:58 The Vora Manager is used to configure the Vora services. On the next few slides, we will
review the various Vora services and their roles in the system.

00:04:11 So the Vora Manager itself is a password-protected tool that provides the infrastructure for
managing the configuration and deployment of the services in the cluster.
00:04:20 Each service has its own set of configuration parameters which can be set, like the location for
log data, log verbosity, memory sizes, and so on.
00:04:30 The UI also provides a list of nodes in the cluster and allows you to specify which node or
nodes you will have the Vora service running on. Once the services are configured, the
Manager UI can be used to start or stop any of the Vora services in the cluster.
00:04:47 By default, the Vora Manager user interface is found at port 19000 of the node where the Vora
Manager is installed. Now let's talk about the services that make up the Vora environment.

00:05:03 In general, we can divide the set of services into three groups: First we have the control
services, which provide access control, connection and transaction management services for
Vora.
00:05:15 Usually there are only a few of these types of nodes in your system. Compute services, which
provide the distributed in-memory analytics, are often deployed in many nodes in the system.

00:05:30 This enables faster distributed query processing. And finally we have the persistency services,
which provide consistency and recoverability
00:05:39 in the case of a service or node failure in the cluster, as well as the storage for the raw data.
00:05:48 On the next few slides, we will discuss each group in a little more detail. First let's talk about
the control nodes,
00:05:56 which manage access to the Vora analysis engines and provide a consistent experience for
user connections. First we have the Transaction Coordinator, which enforces consistency for
metadata creation and modification.
00:06:12 For example, it ensures that multiple users cannot create tables with the same name and
different definitions. Next we have the Catalog service, which is the Vora metadata store,
where all of the Vora object definitions are kept,
00:06:25 for example, all your table and view definitions, partition definitions, and so on. After that we
have our Transaction Broker, which is responsible for managing user transactions.
00:06:36 And the Lock Manager, which provides a distributed read-write lock mechanism for concurrent
load statements. This allows us to avoid loading the same partition multiple times
00:06:47 and provides controls for query execution on the distributed analysis engines. Finally, we have
the Landscape Server service, which controls the partitioning and placement of data across
the analysis engines in the cluster.
00:07:02 There must be at least one instance of each of these control services running in the cluster in
order for Vora to function. Now let's take a look at the compute nodes.
00:07:14 These consist of the distributed analysis engines, which were introduced in Vora 1.3. Your
cluster can have zero or more of each of these services running.
00:07:25 For example, if you are not analyzing semi-structured data, there is no need to deploy the
document store engine into your cluster. On the other hand, if you are performing complex
analysis or analyzing a large amount of time series data,
00:07:40 you may wish to have several instances of the time series analysis engine deployed in your
cluster to partition your data and provide distributed processing for your queries.
00:07:50 Here is an overview of the various compute node services in Vora, also known as the Vora
analysis engines. All of the engines, except the disk-to-memory accelerator, operate on data
in-memory.
00:08:01 The relational engine is a columnar in-memory database supporting a wide variety of standard
SQL, and it operates best against data that can be mapped to a structured, relational schema.

00:08:12 The graph engine analyzes relationships between data that is best represented in a graph
structure. For example, analyzing social networks, building a recommendation engine, and
doing network analysis.
00:08:27 The time series engine is designed for efficient compression and analysis of data collected
over time, such as stock price data or data provided by sensors in an Internet of Things
application.
00:08:38 The document store engine is an in-memory analysis engine for semi-structured JSON
documents, providing NoSQL capabilities to your Vora system.
00:08:47 The disk-to-memory accelerator provides high-performance analysis even when all of the data
does not fit in memory. Data loaded into the disk engine is stored on the local node in a format
optimized for Vora query access.
00:09:03 In addition to the capabilities found in each engine, since the access to the data is all SQL
based, it is also possible to combine analysis across engines in a single query.
00:09:14 For example, combining relational engine queries with disk-to-memory engine queries to
enable analysis of extremely large datasets, or combining time series queries with reference
data from the relational engine to provide a more complete analysis
00:09:28 and expose additional insights without requiring intermediate storage or learning separate
APIs.
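As an illustration, such a cross-engine query is just a regular SQL statement. The sketch below assumes a large sales_history table has been loaded into the disk-to-memory accelerator and a small regions reference table into the relational engine; both table names and columns are made up for this example.

    SELECT r.region_name, SUM(s.amount) AS total_amount
    FROM sales_history s
    JOIN regions r ON s.region_id = r.region_id
    GROUP BY r.region_name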
00:09:43 Our final set of Vora services are the persistency services. Vora can access data from a
variety of back-end systems, which are independently managed
by their respective technologies. For example, the analysis engines can use HDFS or S3 as
the source for raw data loaded into memory for analysis.
00:09:58 In addition to the raw data sources, Vora also has its own persistency for things like the Vora
catalog information. This persistency is provided by the Distributed Log, or DLog, service.
00:10:10 Finally, we also have the previously mentioned disk-to-memory accelerator, which uses local
persistency to store data in an optimized format for Vora analysis.
00:10:23 In addition to the Vora services we have already discussed, there are a couple of additional
services which provide client access to the Vora system.
00:10:31 The first is the Thriftserver, which provides Hive-compatible JDBC access to the Vora analysis
engines, enabling access to Vora from third-party tools and applications, such as Apache
Zeppelin, Lumira, and Tableau.
00:10:50 And finally, we have the Vora Tools service, which can connect to Vora through the
Thriftserver and provide a browser-based GUI for interacting with Vora, including a data
browser, an ad hoc SQL tool, and an advanced data modeling UI.
00:11:04 We can see in the diagram the layer of Vora services in a typical installation. This concludes
our overview of the Vora installation and configuration process.
00:11:16 Thank you.

Week 01 Unit 03

00:00:06 Hi. My name is Balaji Krishna and I'm part of the SAP Big Data Product Management team.
And I want to start off this recording by thanking my colleague Puntis Jifroodian for helping us
prepare this content.
00:00:19 Welcome back to Week 1, Unit 3 of "Big Data with SAP Vora: Engines and Tools". We'll be
continuing the topic of distributed engines that Jason started earlier in the week.
00:00:33 So, SAP Vora distributed analysis engines – a distributed computing solution for the enterprise.
SAP Vora delivers a distributed computing solution available on Hadoop and Apache Spark
platforms.
00:00:47 So as you can see here on the slide, we support the most commonly used Hadoop
distributions, both cloud and on-premise. So we support Hortonworks, Cloudera, MapR – that's
where we see most of our customers running their Hadoop installations currently.
00:01:02 And we also have our own SAP Big Data Services, which used to be Altiscale. As part of this
SAP Vora platform, we enabled a SQL engine on top of that, and this is what we covered
during the previous openSAP course.
00:01:19 And in this openSAP course, the focus is going to be on the distributed analysis engines. So the
engines like the time series engine, graph engine, document store engine,
00:01:29 and also the disk engine that we provided as part of this platform. Optionally, you can also
have a HANA platform running,
00:01:36 where you have data exchange between your Hadoop platform and HANA platform. So there's
always this requirement to be able to join or combine this data which is coming from an SAP
system,
00:01:49 like SAP ERP system running on HANA or SAP data warehousing system running on HANA,
and other contextual data that is enabled on the Hadoop platform as part of your digitalization.

00:02:01 Vora delivers a capability where it can not only provide integration to Hadoop as a standalone,
but also deliver that enterprise integration by connecting to SAP HANA.
00:02:15 So talking a little bit more about the different distributed analysis engines, this is a slide that we
talked about during the previous openSAP course, so I just wanted to start off with this,
00:02:26 and then go deep into each one of those engines. So what we've provided with the latest
version of Vora that's in 1.3, and also enabled and enhanced in 1.4,
00:02:37 are these four different kinds of engines which allow us to do different kinds of processing
based on the engine that you're using, starting with the graph engine, then the time series
engine, specifically to address the IoT kind of data,
00:02:53 document store, where we store JSON files. This is getting more and more popular as you see
other kinds of NoSQL data stores, pure-play NoSQL data stores
00:03:04 which are being used for processing these JSON files. And finally, we also enabled a disk
engine which allows us to store larger volumes of data in terms of petabyte-scale data.
00:03:19 When we talk about Big Data, it's not limited to gigabytes, and it's not even
limited to terabytes. It's now in the petabyte range.
00:03:27 So we have to ensure that we deliver a disk-to-memory accelerator, or a disk engine which
allows customers to tier the data from a relational in-memory engine to this engine so that they
can have a single table
00:03:42 or they can have a data model which spans across multiple petabytes of data. And beyond
that we also deliver some additional OLAP features like hierarchies and currency conversion,
00:03:57 which is a unit of measure conversion. So now going into detail on each of these analysis
engines, starting with the time series engine.
00:04:07 The time series engine works with series of measurements recorded over time. For
example, it could be a temperature gauge which records temperature for every minute or every
second.

00:04:19 Or it could be a sales amount which you record every week for a period of five years. This
information is recorded back to a database or to a data file,
00:04:29 so you maybe have a Kafka set up in your IoT scenario, where you have thousands of sensors
which save the data at different points in time.
00:04:39 And the data streams are delivered through the Kafka engine and get dumped into a Hadoop
cluster in the HDFS file. Once you get all the data in there, you want to run some kind of
analysis on top of that,
00:04:52 and that's where time series analysis comes into play. You can recognize the outliers and see
if this is some sort of an error you want to keep an eye on,
00:05:02 or you can run additional machine learning algorithms on top of them to build some kinds of
predictions. You may also want to do things like standard aggregation, such as MAX sales or
MIN sales or average,
00:05:17 or do grouping and compare correlations, looking at different features and saying how
these are or are not related to each other. Finally, we also deliver things like granularization,
which is the concept of dealing with data that is not perfectly consistent.
00:05:35 Maybe you have occasionally missed a measure, or your measurements are coming in at time
intervals that are not exactly matching up,
00:05:44 so that you want to bring them in in a way that can improve the accuracy of the time series
analysis. And there are lots of other analyses, such as the Fourier transform, the regression,
binning, smoothing –
00:05:57 these are the different kinds of other analysis that can be done using the time series engine.
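For example, a standard aggregation over a series is expressed in plain SQL. The sketch below assumes a time series table named sensor_readings with a timestamp column ts and a measurement column value; both names are hypothetical and only illustrate the idea.

    SELECT MIN(value), MAX(value), AVG(value)
    FROM sensor_readings
    WHERE ts BETWEEN '2017-01-01' AND '2017-02-01'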
Now moving to the next engine in terms of the advanced analysis – the graph engine.
00:06:10 The Vora graph engine is a distributed in-memory graph engine that supports graph processing
and allows you to execute typical graph operations on data that is stored in SAP Vora.

00:06:23 The graph engine uses a node-centric graph store for high-performance analytical query
processing. It supports directed and undirected graphs and has an underlying property graph
model.
00:06:36 Properties can currently be specified on nodes only and not on edges. The SAP Vora graph
engine supports both directed and undirected graphs.
00:06:47 However, you cannot mix these two in a single graph – that's a current limitation. A graph also
consists of a set of nodes and its accompanying metadata.
00:06:57 A node has a type, a set of primitive-type properties, and a set of outgoing edges. Talking
a little bit more about the graph engine.
00:07:10 The Vora graph engine also supports both directed and undirected graphs, as I mentioned.
And these graphs consist of a set of nodes and its accompanying metadata.
00:07:24 So different kinds of usages that we see in terms of the graph analysis. It could be building
high-performance and real- time applications,
00:07:31 or it could be building these distributed engines for scale-out scenarios. For example, within
SAP we have SAP Ariba Network, which is one of the most complex graph databases, if you
will.
00:07:45 Now we're working with our SAP business data network colleagues to understand how they
can leverage the Vora graph engine and the properties that we have delivered
00:07:53 so that they can simplify their graph for processing and graph insights based on the business
data network that they have within their environment.
00:08:05 Switching to the third engine. So we talked about the time series engine and the graph engine.

00:08:11 Now switching to the third engine, which is the document store. This is primarily used to store
JSON documents.

00:08:18 It is a distributed, in-memory JSON document store that supports rich query processing over
JSON data. The document store uses a special binary JSON format, and it's highly optimized,
parallel, and NUMA-aware in terms of the execution engine
00:08:34 to be able to provide high performance on analytical workloads. It also combines JSON with
most of the regular SQL features.
00:08:42 You do not have to learn a new query language, so that's the advantage – where you can
continue to use SQL for JSON data as well.
00:08:51 The key goals of the document store are to provide very low latency for key lookups, with only
a small number of cache misses,
00:09:00 but we also deliver things like writers not blocking readers, for example, through MVCC and
lock-free capabilities. Readers should not do any memory stores on shared data structures,
00:09:14 be able to run very fast inserts in terms of the scalability, and also have a fast query execution
due to the code generation that Vora provides. It also enables ease of use through the
SQL syntax, in terms of being able to use SQL as the language to process JSON data.
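As a small illustration, assuming a collection named complaints with top-level fields state and product has already been created in the document store (the names are hypothetical), it can be queried with regular SQL:

    SELECT state, product, COUNT(*) AS num_complaints
    FROM complaints
    GROUP BY state, product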

00:09:32 So that's the advantage of the document store. So we talked about three main engines – the
time series engine, the graph engine, and the document store engine.
00:09:43 Now moving to the next engine or the next kind of capability that we provide in Vora, it's the
disk-to-memory accelerator or, simply put, it's the disk engine.
00:09:55 We have talked about the three different engines, along with the relational engine, which is used
for OLAP analysis. We always run into situations where the cluster size that the customer has
is not completely capable of storing all the data in memory,
00:10:12 which is totally understandable in a Big Data environment. And that's where I need to have
some kind of a tiering mechanism where I can tier the data from my in-memory store,
00:10:22 which is purely storing the data in memory to a disk store where I can do things like "spill to
disk", in terms of if my memory is full or for a specific query,
00:10:35 I have parts of the data in memory, parts of the data in my disk, and this can be done across
all the engines.
00:10:41 Currently we support this for the relational engine, but in the future we will also look at
enabling the disk engine for the time series, graph, and other engines as well.
00:10:51 What we do from a SQL syntax standpoint, from a programming standpoint, is use this class
called "com.sap.spark.engines.disk" to be able to interact with the disk engine.
00:11:05 And again, this disk-to-memory accelerator allows us to store the data in disk to be able to
scale to petabytes of data. Thank you.

Week 01 Unit 04

00:00:06 Hi. Welcome back to Week 1, Unit 4 of "Vora: Overview". Here we will be talking about SAP
Vora Tools, which is also known as SAP Vora Modeler.
00:00:16 This is a Web-based data modeling environment used for modeling data on Hadoop and HDFS
data stores, and also cloud storage data stores like Amazon S3 and Swift.
00:00:31 So in terms of the different capabilities that we've delivered in the latest version of Vora Tools
– that's version 1.4 – we've added new enhancements based on some of the usability testing
that we did as part of customer live feedback.
00:00:47 We've also added a visualization capability as part of 1.4. This allows data modelers to
validate whether the data models that they've created are parsing correctly.
00:01:01 So this is not to replace an existing BI tool – it's more a kind of data preview mechanism where
once you've built out a data model that's a fully fledged star schema or you're building a
specific table for a specific engine,
00:01:15 you'll have a capability to look at different kinds of charts, whether it's a pie chart or a bar
chart. And we've provided specific visualizations for each one of the engines.
00:01:25 So there is a certain way to visualize time series data. The same thing if you're building
hierarchies and want to have that hierarchical view of your information.
00:01:34 Or if it's a graph, I need to have a visualization of the graph from a node to the edge. How
many different nodes is my specific graph connecting to? What is the number of nodes?
00:01:46 So these kinds of visualizations are something we've added to the same Vora Modeler that's
available as part of that Web-based interaction. In terms of OLAP capabilities, we've added the
cube and the star joins.
00:02:01 So, typically, when you are building these cubes or star schemas on top of your Hadoop data,
you always need the ability to, say, join your fact table with multiple dimension tables,
00:02:14 bringing that same OLAP capability which was available in a relational data warehousing
concept now to the Hadoop world where my data is sitting in files or my data is sitting on the
cloud.
It's again stored in files like CSV or Hadoop-native formats. How do I build that same kind of
OLAP structure on top of this data?
00:02:38 So that is something that we are delivering as part of the OLAP modeling tool. Additional
enhancements: We've enabled modeling of the documents to a time series and graph.
00:02:50 All these different kinds of data types can be modeled using the same Modeler UI. So for a
person who is familiar with SQL, they'll be able to use this.
00:03:01 Or you can just easily use the graphical user interface to enable these capabilities. Apart from
that, we also deliver functions for the new features that we've introduced,
00:03:11 like, for example, the currency conversion, which is a type of unit of measure conversion.
Especially for a lot of enterprise customers who are global customers,
it's very important for them to deal with these kinds of unit of measure conversions. For
example, it could be currency conversions –
I want to do a conversion from euros to dollars –
00:03:31 or it could be I want to do some kinds of other conversions, like kilograms to pounds. Those
kinds of functions are really important,
00:03:40 and we deliver these capabilities out of the box as business functions that can be called from
the modeler. Another new feature that we added in 1.4 is this concept of export/import of the
views.
00:03:52 Again, when you're building these data models and you want to transport them from DEV to
QA to PROD, it's important that you don't lose out on the development effort that's already
been done.
00:04:04 That's where we've provided these basic concepts of exporting your view from an existing
environment and importing it into a new environment.

00:04:13 We also provide other features like saving it locally, doing the reference handling to
understand as part of my model if I'm using a specific object, where are the different places
that I've used this object?
00:04:26 So being able to do that reference handling, and also do a preview of the file content during
the table creation.
00:04:32 Other things like visual hints and help are some of the other things that we've added. And all
this comes as part of a new landing page for the modeler.
00:04:44 So talking about the new landing page, this is the new landing page for the Vora Modeler.
Again, it has three main components in terms of the data modeling.
00:04:56 And you also have a very basic security capability in terms of user management that we've
delivered. Users who are very familiar with SQL can use the SQL Editor.
00:05:07 So they can directly jump into the SQL Editor and start typing SQL statements to either create
tables or create views or select information,
00:05:16 so either to do a read or write from the dataset that is in the Hadoop system. You also have
the option of using the Data Browser, which allows you to navigate using a graphical
mechanism,
00:05:30 navigate to your folders and files, and using those files, build a table using Vora's relational
engine or disk engine or any of the other engines.
00:05:40 And finally, when you want to bring all these things together, put all these joins together, you
can use a modeler where you can do things like build out a star schema using your fact table,
dimension tables.
00:05:51 So you've created your dimensions, you've created your facts, and now you can bring all these
together. You can enhance this to add additional measures,
00:06:00 like you have a calculated measure, you have some kind of restricted measure, you can do all
those kinds of complex data modeling using the Modeler perspective of the Vora Modeler or
the Vora Tools UI.
00:06:16 So in terms of the visualization for the relational engine, what we're showing here is we're
highlighting a specific dataset
00:06:25 where we have the number of complaints from a financial institution per state, so we've
grouped it by state and this is using the Vora 1.4 relational engine.
00:06:36 So this is one of the charts. So once you've built out your table, you can click on these charts
and you can have different visualizations of the charts.
00:06:45 So you can use a bar chart, a line chart, or a pie chart, or you can just have a tabular view of
this data.
00:06:51 So this is for your OLAP or relational engine. Similarly, when you are working with time series
data, the visualization is going to be slightly different.
00:07:01 In this example, we are showing time series data for a national demand between the months of
January and February where you can also change the level of granularization to have a closer
look at the values in a specific timeframe.
00:07:18 This is just a snapshot of what we provide for time series, but in Week 4, we will cover each of
these engines in detail
00:07:26 where we'll also show you how you model some of this data. So in this case, we are using a
time series table and we are using the line chart to be able to visualize that time series data.
00:07:42 So the next visualization that's available is for the graph engine. In this case, we are looking at
the summary of graph statistics.
00:07:52 So on the left side, you have the types of nodes and the number of them, and on the right side,
you have the types of edges that are available, whether it's a directed or an undirected edge,

00:08:04 and the source and the target nodes for each edge. And finally, at the bottom, you can choose
the node type that shows the available values for that type,

00:08:15 in terms of being able to show the graph topology and how the nodes are connected. So this is
one view of the graph engine visualization.
00:08:27 We also have another visualization where we are showing the hierarchy and, actually, in this
case, we are showing a second kind of visualization that's available in the graph.
00:08:41 So here you're looking at the different nodes and edges, and you also have the configuration
for each one of those nodes and edges.
00:08:48 And you can do things like have a different color for your nodes, you can do a different color
for your borders. So you can do all the different levels of visualization so that it's more
interactive for an end user who's able to consume this information.
00:09:07 Finally, we have also added a visualization for hierarchies. The first image that
we are showing is the ability to visualize a hierarchy and show the total number of levels.
00:09:22 So in this case, you have the "king", which is at the root level, and then you have multiple
levels below that. So this is a visualization hierarchy that is available as part of the Modeler
tool.
00:09:33 In the previous versions, you could just see this in a tabular form, but now we've provided a
hierarchical visualization as well. So if I go to the next screenshot, here you can do things like
add aggregations at each level while visualizing them.
00:09:50 The advantage of hierarchies is not just to be able to visualize the hierarchies, but also run the
aggregations at each hierarchical level, at each aggregation level.
00:10:00 So these are things that you can provide as part of the hierarchy's visualization that's available
in the Vora Modeler. Finally, coming to the document store...
00:10:12 In the case of the document store, we've delivered specific visualizations that are available for
a JSON kind of dataset. Here, we show the keys that are derived from a first record and also
their types.
00:10:25 So here it also shows the document rows. For example, at the bottom you can see the
different document rows.
00:10:33 And finally, on the next slide we are showing the entry point. This is the welcome page for the
modeler.
00:10:43 This provides easy access to all the created tables and views, including the ones that are from
HANA. It's not just limited to Vora, but also for the HANA tables.
00:10:54 It allows easy creation of tables and views, and it also enables you to easily register all the
tables and views, and do things like import and export.
00:11:05 So this concludes Week 1, Unit 4. Thank you.

Week 01 Unit 05

00:00:04 Hi, welcome back to Week 1, Unit 5 of "Overview: SAP Vora", where we introduce you to the
different data sources that can be used in Vora.
00:00:15 These include Hadoop file formats like Parquet, ORC, and Avro, and also cloud
storages like Amazon S3 or Swift storage.
00:00:29 So starting with the different Vora data sources. The Vora data source API provides a set of
table options that can be added to the Spark SQL statements.
00:00:39 So this is the enhancement that we've made to the Spark SQL statement to provide additional
information, such as which files to load, what file formats, what memory tweaks can be done,
00:00:51 and also access information for Amazon S3 or Swift object storage. You can use a data source
in Spark by creating a table for a database object, such as a graph, time series, or a JSON document
collection,
00:01:09 using Spark SQL commands like CREATE TABLE or CREATE OBJECT, together with
other keywords, like "using" and the respective data source.
00:01:19 So note that the engine data sources are used differently from the SAP Vora and the SAP
HANA data sources. So here we talk about the relational engines, the disk engine, the different
Spark engines, and the Vora relational engine as well.
00:01:37 So in the next slide, we show some examples of how you create these different data sources,
how you create a table when connecting to these different data sources.
00:01:48 Starting off with the relational data source which was used prior to 1.4. So one of the things
that we added in the Vora 1.4 was a new relational engine
00:01:58 which allows you to have better memory management in terms of the memory footprint that is
used when storing a table in a relational engine.
00:02:09 And that's why we have an older relational engine which is typically a deprecated engine. So
you will see this in the Vora Manager as well.
00:02:16 So our recommendation for customers is to start off with using the new relational engine,
which is what we talk about here in the second example. And consider using the deprecated
engine if you require it at all.
00:02:30 The reason why we provided this is that there are customers who are coming from 1.3 – they
don't lose the data that they've already created, and they don't lose the development that they've
already done in the 1.3 version,
00:02:42 and hence we've delivered the deprecated version so that they can continue to use it. But for
new usage, they will switch to the 1.4 relational engine.
00:02:53 So here are some examples of how you can run things like a CREATE TABLE using the
different engines. Obviously the syntax changes from one to another,
00:03:02 but what you see is that between the relational data source and the disk engine, the CREATE
TABLE syntax is very similar. For example, other than the "using" option,
00:03:13 where we use "com.sap.spark.engines.relational" for a Vora 1.4 relational engine, and for a
disk engine we use "com.sap.spark.engines.disk" as the class that is used to build these tables.
00:03:27 And similarly for a graph engine or for a time series engine, we use those specific classes. For
example, for Spark we use this com.sap.spark.engines,
00:03:38 and you also want to note the source file that we're using, which is in a specific file format
called JSG. So this is a proprietary file format that we use which can be understood by the
Vora graph engine.
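As a sketch of that point, the two statements below differ only in the class named in the "using" clause. The table names, columns, file paths, and option names are illustrative; the exact options are listed in the Vora developer guide.

    CREATE TABLE products (id INTEGER, name VARCHAR(100))
    USING com.sap.spark.engines.relational
    OPTIONS (files "/user/vora/products.csv")

    CREATE TABLE products_disk (id INTEGER, name VARCHAR(100))
    USING com.sap.spark.engines.disk
    OPTIONS (files "/user/vora/products.csv")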
00:03:54 So now talking about cloud-based storage. So here we're referring to the AWS S3, which is the
Amazon Simple Storage Service,
00:04:03 which is a storage for the Internet that allows you to store and retrieve any amount of data at
any time, from anywhere on the Web. The "storagebackend" should be set to "s3", and it's not
a local file.

00:04:18 Another important thing when you're connecting to S3 is you have to get the key ID and the
key secret from the Amazon console. For security reasons, you can also configure the
parameters in the spark-defaults.conf file so that you don't expose this information.
00:04:35 And then you have the "s3endpoint" and the "s3region", depending on what region your AWS
is connecting to. So these are the parameters that need to be defined when you're connecting
to AWS S3 as a source
00:04:48 and you're creating a Vora table, whether it's a relational table or a disk table.
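A hedged sketch of such a statement is shown below. The table, columns, bucket path, endpoint, and region are placeholders, and the exact names of the access key options should be taken from the Vora documentation.

    CREATE TABLE clickstream (user_id INTEGER, url VARCHAR(200), ts TIMESTAMP)
    USING com.sap.spark.engines.relational
    OPTIONS (
      files "mybucket/clickstream.csv",
      storagebackend "s3",
      s3endpoint "s3.amazonaws.com",      -- placeholder endpoint
      s3region "us-east-1",               -- placeholder region
      s3accesskeyid "<key id>",           -- option name assumed; see the documentation
      s3secretaccesskey "<key secret>"    -- option name assumed; see the documentation
    )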
00:05:02 The other cloud storage that we support is Swift, which is an object store from the OpenStack
Object Storage server. This is used to load the data file from the
Swift server into a table in SAP Vora.
00:05:11 And here you need to specify the temporary URL key of your Swift account in the following
format. So you have the "swiftkey", which has the "X-Account-Meta-Temp-URL-Key",
00:05:23 and you can also define the Swift chunk size and the Swift expiration time as parameters. So
this is something that you will need to specify when building these tables.
00:05:35 And finally, note that as with other parameters, you can also configure the Swift parameters in
the spark-defaults.conf file. For security reasons, we recommend that you configure the Swift
secret file in this CONF file
00:05:48 to avoid having to enter this in Spark shell when you're using Spark shell. The spark-
defaults.conf file is typically located in your Spark installation folder – "<spark-
installation>/conf/ ".
00:06:00 For example, in our Vora developer edition it is under /opt/spark/conf folder. Now switching to
other data sources like the ORC or Parquet files.
00:06:16 ORC is the Optimized Row Columnar format file that provides a highly efficient way to store
data in files. It was designed to overcome some of the limitations in other Hive file formats.
00:06:31 Using this ORC file also improves performance when Hive is reading, writing, and processing
the data. Compared to the RCFile format, for example, ORC has a lot of advantages,
such as block-mode compression based on the data type, or a single file as the
output for each task, which reduces the NameNode's load.
00:06:55 In the case of Hive-type support, the datetime, decimal, and other complex types – like struct,
list, map, and union – can be used. Or in terms of the lightweight indices stored within the file.

00:07:09 Similarly with Parquet files, these are in a column-oriented format, which can be split by
organizing them into row groups. It has efficient binary encoding and also delivers efficient
compression.
00:07:22 Vora can also infer the schema for ORC and Parquet files. The typical syntax for creating
a table using an ORC or Parquet file is defined here.
00:07:36 This will also be part of the hands-on exercises that you do so that you can really touch and
feel all these things. So create a Parquet file or an ORC file and then create a Vora table on
top of that.
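For reference, a minimal sketch of such a statement follows. The table name and path are illustrative, the format option name is an assumption, and because the schema can be inferred from ORC and Parquet files, the column list can typically be omitted.

    CREATE TABLE sales_parquet
    USING com.sap.spark.engines.relational
    OPTIONS (files "/user/vora/sales.parquet", format "parquet")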
00:07:47 You'll be able to do these as part of the exercise document after the session. So next moving
to the HANA data source.
00:07:58 As we discussed early on in one of the slides, the connectivity to HANA is very important
because there's a lot of data, enterprise corporate data that's in HANA
00:08:08 that needs to be combined with the data that is sitting in the Hadoop system. For us, the HANA
data source is used to access the data in HANA.
00:08:17 It is an enhanced data source API implementation that supports predicate pushdown for all the
predicates that HANA can process. We use the class com.sap.spark.hana for the HANA data
source.
00:08:31 Here you see the connectivity between the two layers. There's an option to use Spark
controller using the native Spark connectivity.

00:08:39 There's also the option to use the Vora adapter, which is what we recommend for our
customers because more and more performance optimizations and the SQL pass-through are
being built through this bottom,
00:08:51 the Vora adapter layer, which directly connects to the Vora transaction coordinator, which
delivers the distributed query processing capabilities.
00:08:59 And that can be used to transfer the data virtually across the two layers, whether your
consumption is happening on the left side through HANA or on the right side through the
Hadoop layer.
00:09:14 Now talking a little bit more about loading data from HANA to SAP Vora. In this case, the
tables from the HANA catalog can be referenced from a Spark session catalog.
00:09:26 Essentially, the table metadata is copied between the catalogs. And to reflect an SAP HANA
table in this Spark session catalog,
00:09:35 you can use a "create table" statement that references an existing HANA table. The table
metadata is copied from the HANA catalog into the Spark session catalog.
00:09:46 The table in HANA has to exist prior to importing it into the Vora layer, into the Vora catalog.
So the syntax again is shown here.
00:09:56 What you're doing here is taking the data which is in Vora and you're pushing that into HANA
in this case where you're consuming it as part of HANA, but it could be the other way round as
well.
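A hedged sketch of a statement that reflects an existing HANA table in the Spark catalog is shown below. The host, instance, schema, user, and table are placeholders, and the exact option names should be verified against the Vora documentation.

    CREATE TABLE customers
    USING com.sap.spark.hana
    OPTIONS (
      tablepath "CUSTOMERS",     -- existing HANA table (option name assumed)
      dbschema "SPARK",
      host "hana.example.com",
      instance "00",
      user "SPARK_USER",
      passwd "<password>"
    )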
00:10:12 So just some examples of what the different things are that you can do when using the HANA
data source. You have the option to do an append or you want to do an overwrite in case
there's an error.
00:10:25 List out the error information or, when you're doing a save, ignore the statement when doing
that save, ignore the errors that are coming.
00:10:32 So we use the Spark DataFrame. So as you can see here, we use "dataFrame.write.format",
00:10:38 and then in the mode you can specify whether the "SaveMode" is "Append" or "Overwrite" –
the different options that we show here. So the other option that we provide in terms of the HANA
data source is being able to list the tables in SAP HANA.
00:10:57 The optional parameters in the "dbschema" and the "tablePattern" are to limit the returned
result. If dbschema is not specified, then the default schema, which is equal to the value of the
user parameters,
will be picked up automatically. For example, dbschema set to "spark" and tablePattern set
to "%8" will display the tables whose names end with the digit "8"
00:11:23 and that were created within the schema "spark". The default tablePattern value is "%",
which matches all the tables.
00:11:36 And it is therefore recommended that you use these parameters to avoid displaying too many
tables because when you are scrolling through the different tables in a HANA system,
especially if your HANA is running on ERP
00:11:47 or if it's using BW, there are way too many tables. So using these wildcards is good practice.

00:11:58 Another important thing is you want to register these tables, so the HANA tables have to be
registered to the current Spark context. And you do that using the "REGISTER TABLE" option.

00:12:08 Primarily you specify the host, instance number, user name, and password, so that way you're
registered to the Spark catalog
00:12:15 when you want to do the joins between your HANA data and your Hadoop or Spark data. You
can do each individual table or you can register all tables using these options.
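A sketch of what this can look like is shown below. The connection values are placeholders, and the exact syntax and option names should be verified in the Vora documentation.

    REGISTER TABLE CUSTOMERS
    USING com.sap.spark.hana
    OPTIONS (host "hana.example.com", instance "00", user "SPARK_USER", passwd "<password>")

    REGISTER ALL TABLES
    USING com.sap.spark.hana
    OPTIONS (host "hana.example.com", instance "00", user "SPARK_USER", passwd "<password>")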
00:12:28 Finally, you also have the capability to drop or expose the metadata, for example, the DROP
TABLE command drops the specified table from the Spark context.
00:12:39 If the table is a non-virtual table, it also deletes the corresponding in-memory SAP HANA table,
provided it exists and the HANA user is allowed to perform the action.

00:12:49 So basically, that user needs to have the permission, the security, to be able to drop the table.
So these are the different DROP commands that are available for the HANA data source.
00:13:03 So this concludes Week 1, Unit 5. Thank you very much.

