Setting up Hadoop with 2 workers

Bernardo Rocha
Nov 4, 2020 · 7 min read

What is Hadoop?

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Architecture of a multi-node cluster

A small Hadoop cluster includes a single master and multiple worker nodes. In Hadoop 2 and later (which this guide uses), the master node runs the HDFS NameNode and the YARN ResourceManager, which replaced the older JobTracker and TaskTracker daemons. A worker (slave) node acts as both a DataNode and a NodeManager, though it is possible to have data-only and compute-only worker nodes; these are normally used only in nonstandard applications.

Hadoop 3 requires Java 8. The standard startup and shutdown scripts require that Secure Shell (SSH) be set up between the nodes in the cluster.

#1 Configure your network

In VirtualBox, go to your VM's Settings > Network and set Promiscuous Mode to "Allow All", so the virtual machines will be able to see each other on the network.

#2 Install SSH

First let’s update our machine.

sudo apt-get update

And then, for the SSH installation:

sudo apt install ssh

#3 Install PDSH

Use the following command:

sudo apt install pdsh

#4 Edit .bashrc

Just type:

nano .bashrc

And then add the following line to the end of the file:

export PDSH_RCMD_TYPE=ssh

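To make the new variable take effect in your current shell (instead of opening a new terminal), you can reload the file:

source ~/.bashrc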

#5 Generating an SSH key

To generate an SSH key, use the following command and press ENTER whenever prompted.

ssh-keygen -t rsa -P ""

#6 Authorize the SSH key

For this step, copy the public key we just created into the authorized_keys file.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

#7 Verify if SSH is working

To verify, try connecting to localhost; it should log you in without asking for a password.

ssh localhost

#8 Install Java 8

Use the following command to install Java:

sudo apt install openjdk-8-jdk
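To confirm the installation (a quick extra check, not part of the original steps), print the Java version:

java -version

It should report OpenJDK 1.8.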

#9 Download Hadoop

Download Hadoop using the following command:

sudo wget -P ~ https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

Then extract the archive:

tar xzf hadoop-3.2.1.tar.gz

And then rename the extracted folder to hadoop, to make it easier to use:

mv hadoop-3.2.1 hadoop

#10 Edit hadoop-env.sh

Use this command:

nano ~/hadoop/etc/hadoop/hadoop-env.sh

And then set JAVA_HOME by adding this line:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

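If you are not sure which Java path to use on your machine, you can list the installed JVM directories (a quick check, not part of the original steps):

ls /usr/lib/jvm

and pick the java-8-openjdk folder you find there.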

#11 Move the folder

Move the hadoop folder from your home directory to /usr/local/hadoop.

sudo mv hadoop /usr/local/hadoop

#12 Edit the environment file

Do it with the following command:

sudo nano /etc/environment

After opening it, add Hadoop's bin and sbin folders to the PATH and set JAVA_HOME, so the file contains these two lines:

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"

#13 Create user

Add a user called hadoopuser, with the following command:

sudo adduser hadoopuser

You don’t have to fill in all the information; only the password is required.

Now let’s give permissions to the new user:

sudo usermod -aG hadoopuser hadoopuser
sudo chown hadoopuser:root -R /usr/local/hadoop/
sudo chmod g+rwx -R /usr/local/hadoop/
sudo adduser hadoopuser sudo
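If you want to double-check the ownership change (just a sanity check, not part of the original steps), you can list the folder:

ls -ld /usr/local/hadoop

It should now be owned by hadoopuser.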

#14 IP addresses

In this step you will have to check your IP address.

ifconfig

Or, in case you don’t have the net-tools package installed:

ip addr

The IP address for my main machine (master) is 192.168.1.83.

So my network will look like this:

Master: 192.168.1.83

Slave 1: 192.168.1.84

Slave 2: 192.168.1.85

We will be creating the two slaves soon, but we can assume that those will be their IP addresses.

#15 Edit the hosts file

Now open the hosts file and add your network configuration, like this:

sudo nano /etc/hosts
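Using the example addresses above, the entries to add would be:

192.168.1.83 hadoop-master
192.168.1.84 hadoop-slave1
192.168.1.85 hadoop-slave2

Adjust the IP addresses to match your own network.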

#16 Create the two workers

First shut down your VM and open the VirtualBox Manager. Right-click on your VM and clone it twice.

Name them slave1 and slave2.

On “MAC Address Policy” choose “Generate new MAC addresses for all network adapters”.

#17 Naming the machines

Let’s set the name for all three machines.

On the master, open the hostname file and set its content to hadoop-master:

sudo nano /etc/hostname

Then repeat the same edit on slave1, setting the hostname to hadoop-slave1, and finally on slave2, setting it to hadoop-slave2.

Now you will have to reboot all three.

sudo reboot
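After the reboot, you can confirm that each machine picked up its new name:

hostname

It should print hadoop-master, hadoop-slave1 or hadoop-slave2, depending on the machine.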

#18 SSH configuration

On the master machine, switch user to hadoopuser, like this:

su - hadoopuser

Now let’s create an SSH key (press ENTER when needed):

ssh-keygen -t rsa

After the key is created, we have to copy it to all three machines:

ssh-copy-id hadoopuser@hadoop-master
ssh-copy-id hadoopuser@hadoop-slave1
ssh-copy-id hadoopuser@hadoop-slave2

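To verify that passwordless login works (a quick check, not part of the original steps), run a command on each slave from the master:

ssh hadoopuser@hadoop-slave1 hostname
ssh hadoopuser@hadoop-slave2 hostname

Each command should print the slave’s hostname without asking for a password.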

#19 Edit the core-site.xml file

On hadoop-master, use the following command to open the file:

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

And paste this inside the <configuration> tags, like this:

<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value>
</property>

#20 Edit the hdfs-site.xml file

Still on hadoop-master, let’s open and add information to the hdfs-site.xml file.

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add this to the file:

<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>

#21 Edit workers file

Still on hadoop-master, open the workers file:

sudo nano /usr/local/hadoop/etc/hadoop/workers

And add the name of your workers, like this:

hadoop-slave1
hadoop-slave2

#22 Copy hadoop-master configurations

Now you will have to copy the master’s configuration to both slaves. Use the following commands:

scp /usr/local/hadoop/etc/hadoop/* hadoop-slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* hadoop-slave2:/usr/local/hadoop/etc/hadoop/
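To confirm the copy worked (my own sanity check, not part of the original steps), you can read one of the files back from a slave:

ssh hadoop-slave1 cat /usr/local/hadoop/etc/hadoop/workers

It should list hadoop-slave1 and hadoop-slave2, just like on the master.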

#23 Format the HDFS file system

Run both commands:

source /etc/environment
hdfs namenode -format
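If the format succeeded, the NameNode directory configured in hdfs-site.xml was created and populated; you can check it (a sanity check, not part of the original steps):

ls /usr/local/hadoop/data/nameNode/current

It should contain a VERSION file and an initial fsimage.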

#24 Start HDFS

To start it just type:

cd /usr/local/hadoop/sbin
./start-dfs.sh

If the script reports errors instead of starting the daemons, check the troubleshooting section at the end.

Now let’s check if the resources have started:

jps

On the master, jps should list the NameNode and SecondaryNameNode processes (plus Jps itself).

On the slaves, it should list the DataNode process.

In case these processes haven’t started, check the troubleshooting section at the end.

#25 Hadoop check

Almost there. Open your browser and check that the NameNode web interface is up:

http://hadoop-master:9870

Or

http://<your-master-ip-address>:9870
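You can also check the cluster from the command line on the master (an alternative to the web interface, not part of the original steps):

hdfs dfsadmin -report

It should report two live DataNodes.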

#26 YARN configuration

Type the following commands:

export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
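These exports only last for the current shell session. If you want them to survive a logout (my addition, not part of the original steps), you can append them to hadoopuser’s ~/.bashrc:

cat >> ~/.bashrc << 'EOF'
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
EOF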

Now go to both slaves and open yarn-site.xml.

sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

And add this:

<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>


And now go back to the master to start YARN:

cd /usr/local/hadoop/sbin
./start-yarn.sh

And go to:

http://hadoop-master:8088/cluster

Or

http://<your-master-ip-address>:8088/cluster

Both nodes are up and running!
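You can also confirm this from the command line on the master (a quick check, not part of the original steps):

yarn node -list

Both hadoop-slave1 and hadoop-slave2 should appear with state RUNNING.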

Troubleshooting:

First make sure that both workers and the master are using the user hadoopuser.

su - hadoopuser

Then let’s generate new ssh keys:

ssh-keygen
cd
cd .ssh
cat id_rsa.pub >> authorized_keys
ssh localhost

Then go to the Hadoop folder:

cd /usr/local/hadoop/sbin
./stop-all.sh
./start-dfs.sh

If this also doesn’t work, do this:

./stop-all.sh
./start-all.sh

Now verify if the resources have started:

jps

Hope it worked!
