Apache Spark is an open-source, multi-language, fast, unified analytics engine for big data and machine learning. It was originally developed at UC Berkeley's AMPLab, and its codebase was later donated to the Apache Software Foundation.
It distributes workloads across multiple computers in a cluster to process large datasets efficiently. Apache Spark supports various programming languages, such as Java, Scala, Python, and R.
In this article, we will discuss how to install Apache Spark on an Ubuntu system.
Prerequisites
To follow this article, you will need the following –
- A computer system with Ubuntu installed on it
- Access to a user account with sudo privileges
Installing required packages for Apache Spark
Apache Spark requires a few packages to be installed on your system before you can install it. The required packages are Java, Scala, and Git. To install them, use the following command in your terminal –
sudo apt install default-jdk scala git -y
Once the installation completes, you can verify it by using the given command –
java -version; scala -version; git --version
Download and install Apache Spark on Ubuntu
Go to the official download page of Apache Spark, choose the latest version, and download it. At the time of writing this article, Spark 3.2.0 (pre-built for Apache Hadoop 3.3 and later) is the latest version, so we will install it.
Alternatively, you can use the following command to download the latest Spark package from your terminal.
wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
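Apache also publishes a SHA-512 checksum next to each release artifact. As an optional integrity check (assuming the checksum file sits alongside the archive, as is standard for Apache download sites), fetch it and compare it with the hash of your download –
wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz.sha512
sha512sum spark-3.2.0-bin-hadoop3.2.tgz
The hash printed by sha512sum should match the one in the downloaded .sha512 file.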
Next, extract the downloaded package –
tar -xvzf spark-3.2.0-bin-hadoop3.2.tgz
Finally, move the extracted Spark directory to /opt by using –
sudo mv spark-3.2.0-bin-hadoop3.2 /opt/spark
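If you expect to upgrade Spark later, a common alternative (not required for this guide) is to keep the versioned directory and point a symlink at it, so that upgrading only means re-pointing the link –
sudo mv spark-3.2.0-bin-hadoop3.2 /opt/spark-3.2.0
sudo ln -s /opt/spark-3.2.0 /opt/spark
Everything that follows references /opt/spark, so it works the same either way.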
Configure Spark environment variables
Before you start Apache Spark on your system, you need to set up a few environment variables in the .profile file. The commands below use single quotes so that $PATH is written to the file literally and expanded each time the profile is read, rather than being frozen at the moment you run echo.
echo "export SPARK_HOME=/opt/spark" >> ~/.profile
echo "export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin" >> ~/.profile
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile
Next, source the .profile file to make the changes take effect.
source ~/.profile
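To confirm that the variables were picked up, print SPARK_HOME and check that the Spark scripts now resolve from your PATH –
echo $SPARK_HOME
which spark-shell
The first command should print /opt/spark and the second /opt/spark/bin/spark-shell.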
Start Apache Spark on Ubuntu
You have set up everything; now it's time to start the Spark master and slave servers. Use the following command to start the master server –
start-master.sh
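By default, the standalone master listens for workers on port 7077 and serves its web UI on port 8080. If another service already uses one of these ports, the script accepts options to override them; for example (a sketch using the options accepted by the standalone master) –
start-master.sh --port 7078 --webui-port 8081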
This will start the Spark master server. Now you can check its web interface by entering the given URL in your browser –
http://server_domain_or_ip:8080/
For example –
http://127.0.0.1:8080/
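If you are browsing from a different machine and Ubuntu's UFW firewall is enabled, you may first need to open the web UI port –
sudo ufw allow 8080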
This should display the Spark master's web interface in your browser, showing the master URL (in the form spark://hostname:7077) along with the status of workers and running applications.
If you want to start a slave server (worker process) and attach it to your master server, run the following command in the given format, where master is your master server's hostname or IP address and port is the master's port (7077 by default) –
start-slave.sh spark://master:port
For example –
start-slave.sh spark://acer-pc:7077
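By default, a worker offers all of the machine's cores and most of its memory to Spark. If you want to cap what this worker advertises, the script accepts resource options; for example (a sketch using the options accepted by the standalone worker) –
start-slave.sh spark://acer-pc:7077 --cores 2 --memory 2g
Also note that since Spark 3.1 the worker scripts are additionally available as start-worker.sh and stop-worker.sh, which are the preferred names in newer releases.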
Now, when you reload the master's web interface in your browser, you will see that one worker has been added.
Test Spark Shell
Once the configuration is finished, you can launch the Apache Spark shell by using –
spark-shell
Here, Scala is the default interface. If you want to use Python with Spark instead, execute the given command in your terminal –
pyspark
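Both spark-shell and pyspark run against a local master by default. To attach an interactive shell to the standalone master you started earlier, pass its URL with the --master option –
spark-shell --master spark://acer-pc:7077
As a quick end-to-end test, you can also run one of the example programs bundled with Spark; the run-example script in Spark's bin directory computes an approximation of Pi –
run-example SparkPi 10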
Now, if you want to stop the Spark master or slave servers, use one of the given commands –
To stop the master server, use –
stop-master.sh
To stop the slave server (or worker process), use –
stop-slave.sh
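If the master and workers all run on this machine, Spark's sbin directory also provides a combined script that stops everything at once –
stop-all.sh
(The matching start-all.sh starts a master together with the workers listed in conf/workers; it connects over SSH, so even on a single machine it assumes you can SSH to localhost.)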
Conclusion
This is how you can install and use Apache Spark on Ubuntu. If you have a query, write to us in the comments below.