Apache Spark is an open-source, multi-language, fast, unified analytics engine for big data and machine learning. It was originally developed at UC Berkeley's AMPLab, and its codebase was later donated to the Apache Software Foundation.
It distributes workloads across multiple computers in a cluster to process large datasets efficiently. Apache Spark supports various programming languages, such as Java, Scala, Python, and R.
In this article, we will discuss how to install Apache Spark on an Ubuntu system.
Prerequisites
To follow this article, you will need the following –
- A computer system with Ubuntu installed on it
- Access to a user account with sudo privileges
Installing required packages for Apache Spark
Apache Spark requires a few packages to be installed on your system before you can install it. The required packages are Java, Scala, and Git. To install them, use the following command in your terminal –
sudo apt install default-jdk scala git -y
Once the installation completes, you can verify it by using the given command –
java -version; scala -version; git --version
Download and install Apache Spark on Ubuntu
Go to the official download page of Apache Spark, choose the latest version, and download it. At the time of writing this article, Spark 3.2.0 (pre-built for Apache Hadoop 3.3 and later) is the latest version, so we will install it.
Alternatively, you can use the following command to download the latest Spark package from your terminal.
wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
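Apache also publishes a SHA-512 checksum next to each release artifact. As an optional integrity check (assuming the checksum file sits alongside the archive, as is standard for Apache download sites), fetch it and compare it with the hash of your download –
wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz.sha512
sha512sum spark-3.2.0-bin-hadoop3.2.tgz
The hash printed by sha512sum should match the one in the downloaded .sha512 file.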
Next, extract the downloaded package –
tar -xvzf spark-3.2.0-bin-hadoop3.2.tgz
Finally, move the extracted Spark directory to /opt by using –
sudo mv spark-3.2.0-bin-hadoop3.2 /opt/spark
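If you expect to upgrade Spark later, a common alternative (not required for this guide) is to keep the versioned directory and point a symlink at it, so that upgrading only means re-pointing the link –
sudo mv spark-3.2.0-bin-hadoop3.2 /opt/spark-3.2.0
sudo ln -s /opt/spark-3.2.0 /opt/spark
Everything that follows references /opt/spark, so it works the same either way.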
Configure Spark environment variables
Before you start Apache Spark on your system, you need to set up a few environment variables in the .profile file. The commands below use single quotes so that $PATH is written to the file literally and expanded each time the profile is read, rather than being frozen at the moment you run echo.
echo "export SPARK_HOME=/opt/spark" >> ~/.profile
echo "export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin" >> ~/.profile
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile
Next, source the .profile file to make the changes take effect.
source ~/.profile
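To confirm that the variables were picked up, print SPARK_HOME and check that the Spark scripts now resolve from your PATH –
echo $SPARK_HOME
which spark-shell
The first command should print /opt/spark and the second /opt/spark/bin/spark-shell.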
Start Apache Spark on Ubuntu
You have set up everything; now it's time to start the Spark master and slave servers. Use the following command to start the master server –
start-master.sh
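By default, the standalone master listens for workers on port 7077 and serves its web UI on port 8080. If another service already uses one of these ports, the script accepts options to override them; for example (a sketch using the options accepted by the standalone master) –
start-master.sh --port 7078 --webui-port 8081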
This will start the Spark master server. Now you can check its web interface by entering the given URL in your browser –
http://server_domain_or_ip:8080/
For example –
http://127.0.0.1:8080/
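If you are browsing from a different machine and Ubuntu's UFW firewall is enabled, you may first need to open the web UI port –
sudo ufw allow 8080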
This should display the Spark master's web interface in your browser, showing the master URL (in the form spark://hostname:7077) along with the status of workers and running applications.
If you want to start a slave server (worker process) and attach it to your master server, run the following command in the given format, where master is your master server's hostname or IP address and port is the master's port (7077 by default) –
start-slave.sh spark://master:port
For example –
start-slave.sh spark://acer-pc:7077
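By default, a worker offers all of the machine's cores and most of its memory to Spark. If you want to cap what this worker advertises, the script accepts resource options; for example (a sketch using the options accepted by the standalone worker) –
start-slave.sh spark://acer-pc:7077 --cores 2 --memory 2g
Also note that since Spark 3.1 the worker scripts are additionally available as start-worker.sh and stop-worker.sh, which are the preferred names in newer releases.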
Now, when you reload the master's web interface in your browser, you will see that one worker has been added.
Test Spark Shell
Once the configuration is finished, you can launch the Apache Spark shell by using –
spark-shell
Here, Scala is the default interface. If you want to use Python with Spark instead, execute the given command in your terminal –
pyspark
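Both spark-shell and pyspark run against a local master by default. To attach an interactive shell to the standalone master you started earlier, pass its URL with the --master option –
spark-shell --master spark://acer-pc:7077
As a quick end-to-end test, you can also run one of the example programs bundled with Spark; the run-example script in Spark's bin directory computes an approximation of Pi –
run-example SparkPi 10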
Now, if you want to stop the Spark master or slave servers, use one of the given commands –
To stop the master server, use –
stop-master.sh
To stop the slave server (or worker process), use –
stop-slave.sh
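If the master and workers all run on this machine, Spark's sbin directory also provides a combined script that stops everything at once –
stop-all.sh
(The matching start-all.sh starts a master together with the workers listed in conf/workers; it connects over SSH, so even on a single machine it assumes you can SSH to localhost.)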
Conclusion
This is how you can install and use Apache Spark on Ubuntu. If you have a query, write to us in the comments below.