Apache Hadoop is a collection of open-source software utilities that makes it easy to use clusters of computers to process large volumes of data. The core of Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model.
Running it on a single node is the best way to get started, so here we will show you the steps to install Hadoop on an Ubuntu system.
Prerequisites
To follow this article you should have the following –
- A system with Ubuntu installed on it
- Access to a user account with sudo privileges
Install Java in Ubuntu
The Hadoop framework is written in Java, so it requires Java to be installed on your system.
You can use the following command to install it on your system –
sudo apt install default-jdk -y
You can verify the installation of Java by using the following command –
java -version
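The exact output depends on your Ubuntu release and the OpenJDK build pulled in by default-jdk, but it should look roughly like this –
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)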
You can check our complete article on how to install Java on an Ubuntu system.
Create a Hadoop user
We will create a separate user for the Hadoop environment; this improves security and makes the cluster easier to manage.
So use the following command to create a new user ‘hadoop’.
sudo adduser hadoop
Provide the information it asks for and press Enter.
Install OpenSSH on Ubuntu
If SSH is not installed on your system then you can install it by using the following command –
sudo apt install openssh-server openssh-client -y
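If you want to make sure the SSH service is up after installation, you can check its status (the service is named ssh on Ubuntu) –
sudo systemctl status ssh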
Enable passwordless SSH for Hadoop user
You need to configure passwordless SSH for the hadoop user so that Hadoop can manage the nodes of a cluster (here, the local system) without password prompts.
First, change the user to hadoop by using the given command –
su - hadoop
Now generate SSH key pairs –
ssh-keygen -t rsa
It will ask you to enter a filename and a passphrase; just press Enter at each prompt to complete the process.
Now append the generated public keys from id_rsa.pub to authorized_keys –
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now set the proper permissions to authorized_keys –
chmod 640 ~/.ssh/authorized_keys
Verify the SSH authentication using the following command –
ssh localhost
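If this is the first connection, type yes when asked to confirm the host fingerprint. Once you are logged in without being asked for a password, the setup works; run the following to return to your previous session –
exit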
Download and install Hadoop
Go to the official Hadoop download page and download the latest binary by clicking the link, as shown in the image below –
Alternatively, use the wget command to download it from your terminal –
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Once downloaded, extract it using the given command –
sudo tar -xvzf hadoop-3.3.1.tar.gz
Rename the extracted directory to hadoop –
sudo mv hadoop-3.3.1 hadoop
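Because the archive was extracted with sudo, the extracted files may be owned by root. Assuming the directory now sits at /home/hadoop/hadoop (the path used in the configuration below), you may want to hand ownership to the hadoop user –
sudo chown -R hadoop:hadoop /home/hadoop/hadoop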
Configure Hadoop environment variables
We need to edit the given files in order to configure the Hadoop environment.
- bashrc
- hadoop-env.sh
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- yarn-site.xml
So let’s start configuring one by one –
Edit bashrc file
First, open the bashrc file using a text editor –
sudo nano .bashrc
Add the given lines to the end of this file –
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Save this file and exit from the editor.
Activate the environment variables by executing the following command –
source ~/.bashrc
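To confirm that the variables are in effect, you can print HADOOP_HOME and ask Hadoop for its version (the second command works only if the Hadoop bin directory is now on your PATH) –
echo $HADOOP_HOME
hadoop version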
Edit Hadoop environment variable file
Next, open the Hadoop environment variable file i.e. hadoop-env.sh
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
and set the JAVA_HOME variable as given below –
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
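If you are not sure which JDK path to use, one way to find it on your system is to resolve the java binary and strip the trailing bin/java –
readlink -f /usr/bin/java | sed "s:/bin/java::"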
Save and close the file.
Edit core-site.xml file
First, create the namenode and datanode directories inside the Hadoop home directory by using the given commands –
mkdir -p ~/hadoopdata/hdfs/namenode
mkdir -p ~/hadoopdata/hdfs/datanode
Open and edit the core-site.xml file –
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Here, set the value according to your hostname (for a single-node setup you can keep 127.0.0.1 or use localhost) –
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
</configuration>
Save and close this file.
Edit hdfs-site.xml file
Open the hdfs-site.xml file –
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
And change the namenode and datanode directory paths –
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
Save this file and exit from the editor.
Edit mapred-site.xml
Next, open and edit the mapred-site.xml file –
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Make the changes as given below –
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Save and close this file also.
Edit yarn-site.xml
Now edit yarn-site.xml file –
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
And make the given changes –
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Save the file and close the editor.
Start the Hadoop cluster
Before you start the Hadoop cluster, it is important to format the namenode. Execute the following command to format the namenode –
hdfs namenode -format
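If the format succeeds, the output should end with a message similar to the following (the exact wording and path depend on your setup) –
INFO common.Storage: Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.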
Once the namenode has been formatted successfully, use the following command to start the Hadoop cluster.
start-dfs.sh
Next, start the YARN service by using the given command –
start-yarn.sh
After starting the above services, you can check whether they are running by using –
jps
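If everything started correctly, jps should list the Hadoop daemons along with itself; the process IDs will differ on your machine –
12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
12890 Jps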
Access Hadoop from your browser
Open a browser on your system and enter the given URL to access the Hadoop web UI.
http://localhost:9870
This will provide a comprehensive overview of the entire cluster.
The default port for the datanode web UI is 9864, so to access it use –
http://localhost:9864
The YARN resource manager is available on port 8088, so to access it use –
http://localhost:8088
Here you can monitor all the processes running in your Hadoop cluster.
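As a quick sanity check, you can also create a directory in HDFS and list the root of the file system; /test here is just an arbitrary example name –
hdfs dfs -mkdir /test
hdfs dfs -ls /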
Conclusion
You have successfully set up Hadoop on your system. If you have any query, write to us in the comments below.