Table of Contents
1. Introduction
2. Prerequisites
3. Installing Spark
4. Validating Spark
   4.1. Run the Spark Pi Example
   4.2. Run the WordCount Example
5. Installing Spark with Kerberos
   5.1. Accessing the Hive Metastore in Secure Mode
6. Best Practices
   6.1. Using SQLContext and HiveContext
   6.2. Guidelines for Determining Spark Memory Allocation
   6.3. Configuring YARN Memory Allocation for Spark
7. Accessing ORC Files from Spark
8. Using Spark with HDFS
9. Troubleshooting Spark
10. Appendix A: Upgrading from the Spark Tech Preview
List of Tables
1.1. Spark Support in HDP, Ambari
1.2. Spark Feature Support by Version
2.1. Prerequisites for running Spark 1.3.1
1. Introduction
Hortonworks Data Platform supports Apache Spark 1.3.1, a fast, large-scale data processing
engine.
Deep integration of Spark with YARN allows Spark to operate as a cluster tenant alongside
other engines such as Hive, Storm, and HBase, all running simultaneously on a single data
platform. YARN allows flexibility: you can choose the right processing tool for the job.
Instead of creating and managing a set of dedicated clusters for Spark applications, you can
store data in a single location, access and analyze it with multiple processing engines, and
leverage your resources. In a modern data architecture with multiple processing engines
using YARN and accessing data in HDFS, Spark on YARN is the leading Spark deployment
mode.
Spark Features
Spark on HDP supports the following features:
Spark Core
Spark on YARN
Spark on YARN on Kerberos-enabled clusters
Spark History Server
Spark MLLib
Support for Hive 0.13.1, including the collect_list UDF
The following features are available as technical previews:
Spark DataFrame API
ORC file support
Spark SQL
Spark Streaming
SparkSQL Thrift JDBC/ODBC Server
Dynamic Executor Allocation
The following features and tools are not officially supported in this release:
ML Pipeline API
SparkR
Spark Standalone
GraphX
iPython
Zeppelin
Spark on YARN uses YARN services for resource allocation, running Spark Executors in
YARN containers. Spark on YARN supports workload management and Kerberos security
features. It has two modes:
YARN-Cluster mode, optimized for long-running production jobs.
YARN-Client mode, best for interactive use such as prototyping, testing, and debugging.
Spark Shell runs in YARN-Client mode only.
The following tables summarize Spark versions and feature support across HDP and Ambari
versions.
Table 1.1. Spark Support in HDP, Ambari

HDP      Ambari   Spark
2.2.4    2.0.1    1.2.1
2.2.6    2.1.1    1.2.1
2.2.8    2.1.1    1.3.1
2.2.9    2.1.1    1.3.1
2.3.0    2.1.1    1.3.1
Table 1.2. Spark Feature Support by Version (TP = technical preview)

Feature                                     Spark 1.2.1   Spark 1.3.1
Spark Core                                  Yes           Yes
Spark on YARN                               Yes           Yes
Spark on YARN, Kerberos-enabled clusters    Yes           Yes
Spark History Server                        Yes           Yes
Spark MLLib                                 Yes           Yes
Hive 0.13.1 support                                       Yes
DataFrame API                                             TP
ORC file support                                          TP
Spark SQL                                   TP            TP
Spark Streaming                             TP            TP
SparkSQL Thrift JDBC/ODBC Server                          TP
Dynamic Executor Allocation                               TP
SparkR                                      No            No
Spark Standalone                            No            No
GraphX                                      No            No
If you are evaluating custom Spark builds or builds from Apache, please see the
Troubleshooting Spark section.
2. Prerequisites
Before installing Spark, make sure your cluster meets the following prerequisites.

Table 2.1. Prerequisites for running Spark 1.3.1

Prerequisite             Description
HDP                      Version 2.2.6 or later
(Optional) Ambari        Version 2.1.1 or later
Software dependencies    Spark requires HDFS and YARN
Note
If you installed the tech preview, save any configuration changes you made to
the tech preview environment. Install Spark, and then update the configuration
with your changes.
3. Installing Spark
To install Spark manually, see "Installing and Configuring Apache Spark" in the Manual
Install Guide.
To install Spark on a Kerberized cluster, first read Installing Spark with Kerberos (the next
topic in this Quick Start Guide).
The remainder of this section describes how to install Spark using Ambari. (For general
information about installing HDP components using Ambari, see Adding a Service in the
Ambari Documentation Suite.)
The following steps describe the Spark installation process using Ambari.
2. On the Assign Masters screen, choose a node for the Spark History Server.
3. On the Assign Slaves and Clients screen, specify the machine(s) that will run Spark clients.
Click "Next" to continue.
4. On the Customize Services screen there are no properties that must be specified. We
recommend that you use default values for your initial configuration. Click "Next" to
continue.
5. Ambari will display the Review screen.
Important
On the Review screen, make sure all HDP components are version 2.2.6 or
later.
Click "Deploy" to continue.
6. Ambari will display the Install, Start and Test screen. The status bar and messages will
indicate progress.
7. When finished, Ambari will present a summary of results. Click "Complete" to finish
installing Spark.
Caution
Ambari will create and edit several configuration files. Do not edit these files
directly if you configure and manage your cluster using Ambari.
4. Validating Spark
To validate the Spark installation, run the following Spark jobs:
Spark Pi example
WordCount example
4.1. Run the Spark Pi Example

2. To view job status in a browser, copy the tracking URL from the job output and go to the associated URL.

3. Job output should list the estimated value of pi. In the following example, output was directed to stdout.
4.2. Run the WordCount Example

4. Submit the job. At the scala> prompt, type the following commands, replacing node names, file name, and file location with your own values:
val file = sc.textFile("/tmp/data")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("/tmp/wordcount")
c. Use the HDFS cat command to list WordCount output. For example:
hadoop fs -cat /tmp/wordcount/part*
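As a quick sanity check, the same counting logic can be reproduced locally with standard shell tools. The sample file below is hypothetical; substitute a small slice of your own input data:

```shell
# Create a tiny sample input (hypothetical data).
printf 'to be or not to be\n' > /tmp/sample.txt

# Split on spaces, then count occurrences per word -- the same
# flatMap/map/reduceByKey pipeline the Spark job performs.
tr ' ' '\n' < /tmp/sample.txt | sort | uniq -c | sort -rn
```

Comparing these counts against the part-* files written by the job confirms the installation end to end.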
5. Installing Spark with Kerberos

The following example shows user spark running the Spark Pi example in a Kerberos-enabled environment:

su spark
kinit -kt /etc/security/keytabs/spark.keytab spark/blue1@EXAMPLE.COM
cd /usr/hdp/current/spark-client/
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 512m \
    --executor-memory 512m \
    --executor-cores 1 \
    lib/spark-examples*.jar 10
6. Best Practices

This section contains recommendations and best practices for using Spark with HDP 2.3.

6.1. Using SQLContext and HiveContext
Note
In yarn-client mode on a secure cluster you can use HiveContext to access the
Hive Metastore. HiveContext is not supported for yarn-cluster mode on a secure
cluster.
Examples
The following functions work with both HiveContext and SQLContext:
Avg()
Sum()
The following functions work only with HiveContext:
variance(col)
var_pop(col)
stddev_pop(col)
stddev_samp(col)
covar_samp(col1, col2)
For more information, see the Spark Programming Guide.
6.2. Guidelines for Determining Spark Memory Allocation

To avoid memory issues, Spark uses 90% of the JVM heap by default. This percentage is controlled by spark.storage.safetyFraction.

Of this 90% of the JVM heap, Spark reserves memory for three purposes:

Storing in-memory shuffle: 20% by default (controlled by spark.shuffle.memoryFraction)

Unroll, used to serialize/deserialize Spark objects to disk when they don't fit in memory: 20% by default (controlled by spark.storage.unrollFraction)

Storing RDDs: 60% by default (controlled by spark.storage.memoryFraction)
Example
If the JVM heap is 4GB, the total memory available for RDD storage is calculated as:
4 GB x 0.9 x 0.6 = 2.16 GB
Therefore, with the default configuration approximately one half of the Executor JVM
heap is used for storing RDDs.
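The same arithmetic can be checked with a short shell snippet; the 4 GB heap is the example value above, and both fractions are the defaults described in this section:

```shell
heap_mb=4096   # example Executor JVM heap (4 GB)

# spark.storage.safetyFraction (0.9) x spark.storage.memoryFraction (0.6)
rdd_mb=$(awk -v h="$heap_mb" 'BEGIN { printf "%.0f", h * 0.9 * 0.6 }')
echo "Memory available for RDD storage: ${rdd_mb} MB"
```

The result, 2212 MB, is the 2.16 GB (roughly half the heap) from the example.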
For additional information about Spark memory use, see the Apache Spark Hardware
Provisioning recommendations.
6.3. Configuring YARN Memory Allocation for Spark

The following command starts a YARN client in yarn-cluster mode. The client starts the default Application Master; SparkPi runs as a child thread of the Application Master. The client periodically polls the Application Master for status updates and displays them in the console. The client exits when the application stops running.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
lib/spark-examples*.jar 10
Considerations
When configuring Spark on YARN, consider the following:

Executor processes are not released until the job finishes, even if they are no longer in use. Therefore, do not overallocate executors beyond your estimated requirements.

Driver memory does not need to be large if the job does not aggregate much data on the driver (as with a collect() action).

There are tradeoffs between num-executors and executor-memory. Large executor memory does not imply better performance, due to JVM garbage collection. Sometimes it is better to configure a larger number of small JVMs than a small number of large JVMs.
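As a concrete sketch of that tradeoff, the snippet below estimates how many executors of a given size fit on one node. The node capacity and executor size are assumed values for illustration only, and the overhead is hedged at the 384 MB floor of spark.yarn.executor.memoryOverhead; check your own YARN settings:

```shell
node_mem_mb=49152      # assumed yarn.nodemanager.resource.memory-mb (48 GB)
executor_mem_mb=4096   # assumed --executor-memory
overhead_mb=384        # assumed spark.yarn.executor.memoryOverhead floor

# Each executor container needs executor memory plus overhead.
per_executor_mb=$(( executor_mem_mb + overhead_mb ))
echo "Executors per node: $(( node_mem_mb / per_executor_mb ))"
```

Halving the executor memory roughly doubles the executor count; whether that helps depends on garbage-collection behavior and per-task memory needs.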
7. Accessing ORC Files from Spark
peopleSchemaRDD.registerTempTable("people")
val results = hiveContext.sql("SELECT * FROM people")
// List query results
results.map(t => "Name: " + t.toString).collect().foreach(println)
8. Using Spark with HDFS

If HADOOP_CONF_DIR is not set properly, you might see the following error:
Error from secure cluster
2015-09-04 00:27:06,046|t1.machine|INFO|1580|140672245782272|MainThread|
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.
PythonRDD.collectAndServe.
2015-09-04 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|: org.
apache.hadoop.security.AccessControlException: SIMPLE authentication is not
enabled. Available:[TOKEN, KERBEROS]
2015-09-04 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|at
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2015-09-04 00:27:06,047|t1.machine|INFO|1580|140672245782272|
MainThread|at sun.reflect.NativeConstructorAccessorImpl.
newInstance(NativeConstructorAccessorImpl.java:57)
2015-09-04 00:27:06,048|t1.machine|INFO|1580|140672245782272|MainThread|at
9. Troubleshooting Spark
When you run a Spark job, you will see a standard set of console messages.
In addition, the following information is available:
A list of running applications, where you can retrieve the application ID and check the
application log:
yarn application -list
yarn logs -applicationId <app_id>
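When many applications are listed, the application ID can be extracted with a little scripting. The line below is a hypothetical sample of the -list output format, not real output:

```shell
# Hypothetical sample row from `yarn application -list` output.
sample='application_1441062890468_0001  SparkPi  SPARK  spark  default  RUNNING'

# The application ID is the first whitespace-separated field.
app_id=$(echo "$sample" | awk '{ print $1 }')
echo "$app_id"
```

The extracted ID can then be passed to yarn logs -applicationId.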
For information about a specific job, check the Spark web UI:
http://<host>:8088/proxy/<job_id>/environment/
The following paragraphs describe specific issues and possible solutions.
Issue: Spark YARN jobs don't seem to start. YARN Resource Manager logs show an application with "bad substitution" errors in its logs.

Solution: Make sure that your $SPARK_HOME/conf/spark-defaults.conf file includes your HDP version. For example:
spark.driver.extraJavaOptions
-Dhdp.version=2.3.0.0-2557
spark.yarn.am.extraJavaOptions
-Dhdp.version=2.3.0.0-2557
Issue: Job stays in "accepted" state; it doesn't run. This can happen when a job requests
more memory or cores than available.
Solution: Assess workload to see if any resources can be released. You might need to stop
unresponsive jobs to make room for the job.
Issue: Insufficient HDFS access. This can lead to permission-related errors when the job attempts to read or write data.
Solution: Make sure the user or group running the job has sufficient HDFS privileges to the
location.
10. Appendix A: Upgrading from the Spark Tech Preview

When you install Spark 1.3.1 using Ambari or the manual installation process, Spark creates and populates the hive-site.xml file; you no longer need to create hive-site.xml manually.