https://medium.com/ymedialabs-innovation/apache-spark-on-a-multi-node-cluster-b75967c8cb2b 1/11
5/2/2019 Set up Apache Spark on a Multi-Node Cluster – Y Media Labs Innovation – Medium
Spark Architecture
Apache Spark follows a master/slave architecture with two main
daemons and a cluster manager:
• Master Daemon (the Master/Driver process)
• Worker Daemon (the Slave process)
• Cluster Manager

[Figure: Spark Architecture]
Prerequisites
Create a user of the same name on the master and all slaves to make
ssh between the machines easier, and switch to that user on the master.
Add the hostnames of the master and the slaves to the /etc/hosts file
on every node:

<MASTER-IP> master
<SLAVE01-IP> slave01
<SLAVE02-IP> slave02
Verify that Java is installed:

$ java -version
Verify that Scala is installed:

$ scala -version
Check that you can ssh into each slave:

$ ssh slave01
$ ssh slave02
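The ssh logins above only work without a password prompt once key-based authentication is in place. A minimal sketch, assuming the same user exists on all three machines and OpenSSH is installed (run on the master):

```shell
$ ssh-keygen -t rsa -b 4096        # generate a key pair; accept the default path
$ ssh-copy-id slave01              # install the public key on slave01
$ ssh-copy-id slave02              # install the public key on slave02
```

After this, `ssh slave01` and `ssh slave02` should log in directly, which the start scripts later rely on.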
Install Spark
Download the latest version of Spark
Use the following command to download the latest version of Apache
Spark:

$ wget http://www-us.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
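The later configuration steps work out of /usr/local/spark, so the downloaded archive has to be extracted and moved there first. A minimal sketch:

```shell
$ tar xzf spark-2.3.0-bin-hadoop2.7.tgz                 # extract the archive
$ sudo mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark    # move it to the install path
```

Repeat the download and extraction on every node, so the same /usr/local/spark path exists on the master and all slaves.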
Add the following line to the ~/.bashrc file. This adds the location of
the Spark binaries to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin

Now source the ~/.bashrc file to apply the change:

$ source ~/.bashrc
Edit spark-env.sh
Move to the Spark conf folder and create spark-env.sh from its
template:
$ cd /usr/local/spark/conf
$ cp spark-env.sh.template spark-env.sh
Now edit spark-env.sh and add the following lines:
export SPARK_MASTER_HOST='<MASTER-IP>'
export JAVA_HOME=<Path_of_JAVA_installation>
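Per-worker resources can also be capped in the same file. The variable names below are standard options listed in spark-env.sh.template; the values are illustrative assumptions, to be sized for your machines:

```shell
# Number of cores each worker may use on its machine (illustrative value)
export SPARK_WORKER_CORES=2

# Total memory each worker may grant to its executors (illustrative value)
export SPARK_WORKER_MEMORY=4g
```
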
Add Workers
Edit the configuration file slaves in /usr/local/spark/conf and add the
hostnames of all machines that should run a worker:
master
slave01
slave02
Start Spark Cluster
To start the Spark cluster, run the following on the master:

$ cd /usr/local/spark
$ ./sbin/start-all.sh

Stop Spark Cluster
To stop the Spark cluster, run:

$ cd /usr/local/spark
$ ./sbin/stop-all.sh
Check whether the daemons have started
Run jps on each node:

$ jps
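On the master, jps should list both a Master and a Worker daemon (the master also runs a worker here, since its hostname appears in the slaves file); on each slave, only a Worker. An illustrative transcript (process IDs will differ):

```shell
$ jps
20810 Master
21002 Worker
21153 Jps
```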
Spark Web UI
Browse the Spark UI to see the worker nodes, running applications,
and cluster resources.
Spark Master UI
http://<MASTER-IP>:8080/
Spark Application UI
http://<MASTER-IP>:4040/
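The Application UI on port 4040 is only live while an application is running. One way to bring it up is to submit the SparkPi example bundled with the distribution; the master URL below assumes the standalone master configured in this setup (7077 is the default standalone master port):

```shell
$ cd /usr/local/spark
$ ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://<MASTER-IP>:7077 \
    examples/jars/spark-examples_2.11-2.3.0.jar 100
```

While the job runs, the application appears under "Running Applications" in the master UI and its stages are visible at port 4040.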
. . .
You can use the following link to learn how to use PySpark (the Spark
Python API) on Jupyter Notebook.