Apache Hadoop is a collection of open-source software utilities that makes it easy to use clusters of computers to process large volumes of data. The core of Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model.
Running it on a single node is the best way to get started, so here we will show you the steps to install Hadoop on an Ubuntu system.
Prerequisites
To follow this article you should have the following –
- A system with Ubuntu installed on it
- Access to a user account with sudo privileges
Install Java in Ubuntu
The Hadoop framework is written in Java, so it requires Java to be installed on your system.
You can use the following command to install it on your system –
sudo apt install default-jdk -y
You can verify the installation of Java by using the following command –
java -version
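The exact output depends on your Ubuntu release and the OpenJDK build pulled in by default-jdk, but it should look roughly like this –
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)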
You can check our complete article on how to install Java on an Ubuntu system.
Create a Hadoop user
We will create a separate user for the Hadoop environment; this improves security and makes the cluster easier to manage.
So use the following command to create a new user ‘hadoop’.
sudo adduser hadoop
Provide the information it asks for and press Enter.
Install OpenSSH on Ubuntu
If SSH is not installed on your system then you can install it by using the following command –
sudo apt install openssh-server openssh-client -y
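If you want to make sure the SSH service is up after installation, you can check its status (the service is named ssh on Ubuntu) –
sudo systemctl status ssh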
Enable passwordless SSH for Hadoop user
You need to configure passwordless SSH for the hadoop user so that Hadoop can manage the nodes of a cluster (here, the local system) without password prompts.
First, change the user to hadoop by using the given command –
su - hadoop
Now generate SSH key pairs –
ssh-keygen -t rsa
It will ask you to enter a filename and a passphrase; just press Enter at each prompt to complete the process.
Now append the generated public keys from id_rsa.pub to authorized_keys –
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now set the proper permissions to authorized_keys –
chmod 640 ~/.ssh/authorized_keys
Verify the SSH authentication using the following command –
ssh localhost
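If this is the first connection, type yes when asked to confirm the host fingerprint. Once you are logged in without being asked for a password, the setup works; run the following to return to your previous session –
exit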
Download and install Hadoop
Go to the official Hadoop download page and download the latest binary by clicking the link, as shown in the image below –
Alternatively, use the wget command to download it from your terminal –
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Once downloaded, extract it using the given command –
sudo tar -xvzf hadoop-3.3.1.tar.gz
Rename the extracted directory to hadoop –
sudo mv hadoop-3.3.1 hadoop
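Because the archive was extracted with sudo, the extracted files may be owned by root. Assuming the directory now sits at /home/hadoop/hadoop (the path used in the configuration below), you may want to hand ownership to the hadoop user –
sudo chown -R hadoop:hadoop /home/hadoop/hadoop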
Configure Hadoop environment variables
We need to edit the given files in order to configure the Hadoop environment.
- bashrc
- hadoop-env.sh
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- yarn-site.xml
So let’s start configuring one by one –
Edit bashrc file
First, open the bashrc file using a text editor –
sudo nano .bashrc
Add the given lines to the end of this file –
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Save this file and exit from the editor.
Activate the environment variables by executing the following command –
source ~/.bashrc
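To confirm that the variables are in effect, you can print HADOOP_HOME and ask Hadoop for its version (the second command works only if the Hadoop bin directory is now on your PATH) –
echo $HADOOP_HOME
hadoop version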
Edit Hadoop environment variable file
Next, open the Hadoop environment variable file i.e. hadoop-env.sh
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
and set the JAVA_HOME variable as given below –
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
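If you are not sure which JDK path to use, one way to find it on your system is to resolve the java binary and strip the trailing bin/java –
readlink -f /usr/bin/java | sed "s:/bin/java::"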
Save and close the file.
Edit core-site.xml file
First, create the namenode and datanode directories inside the Hadoop home directory by using the given commands –
mkdir -p ~/hadoopdata/hdfs/namenode
mkdir -p ~/hadoopdata/hdfs/datanode
Open and edit the core-site.xml file –
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Here, set the value according to your hostname (for a single-node setup you can keep 127.0.0.1 or use localhost) –
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
</configuration>
Save and close this file.
Edit hdfs-site.xml file
Open the hdfs-site.xml file –
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
And change the namenode and datanode directory paths –
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
Save this file and exit from the editor.
Edit mapred-site.xml
Next, open and edit the mapred-site.xml file –
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Make the changes as given below –
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Save and close this file also.
Edit yarn-site.xml
Now edit yarn-site.xml file –
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
And make the given changes –
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Save the file and close the editor.
Start the Hadoop cluster
Before you start the Hadoop cluster, it is important to format the namenode. Execute the following command to format the namenode –
hdfs namenode -format
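If the format succeeds, the output should end with a message similar to the following (the exact wording and path depend on your setup) –
INFO common.Storage: Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.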
Once the namenode has been formatted successfully, use the following command to start the Hadoop cluster.
start-dfs.sh
Next, start the YARN service by using the given command –
start-yarn.sh
After starting the above services, you can check whether they are running by using –
jps
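If everything started correctly, jps should list the Hadoop daemons along with itself; the process IDs will differ on your machine –
12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
12890 Jps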
Access Hadoop from your browser
Open a browser on your system and enter the given URL to access the Hadoop web UI.
http://localhost:9870
This will provide a comprehensive overview of the entire cluster.
The default port for the datanode web UI is 9864, so to access it use –
http://localhost:9864
The YARN resource manager is available on port 8088, so to access it use –
http://localhost:8088
Here you can monitor all the processes running in your Hadoop cluster.
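As a quick sanity check, you can also create a directory in HDFS and list the root of the file system; /test here is just an arbitrary example name –
hdfs dfs -mkdir /test
hdfs dfs -ls /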
Conclusion
You have successfully set up Hadoop on your system. If you have any query, write to us in the comments below.