Bigdata Session 1 - Hadoop with docker

This blog series includes 4 workshop sessions I conducted at Engineering colleges and in my office.

Hadoop with docker is mainly used for easy prototype and learning purpose, also it helps to set up quickly set up a Hadoop cluster on your laptop, or on multiple machines.

  1. Bigdata Session 1 - Apache Hadoop with docker ← ( You are here now )

  2. Bigdata Session 2 - Apache Spark on Apache Hadoop

  3. Bigdata Session 3 - Apache Drill

  4. Bigdata Session 4 - Apache Spark on Kubernetes

1. Hadoop Cluster

The main motive of the session is how we can easily test the Hadoop cluster and related tools locally or on a small cluster. We will set up the Hadoop in cluster mode. Hadoop cluster mode means, the individual components are runs separately on a single machine on separate JVMs or running on multiple machines.

Here we will use the docker based setup to quickly play with the main features of hadoop.

Main Hadoop services are:-

  1. namenode ( storage )

  2. datanode ( storage )

  3. resource manager ( computation )

  4. node manager ( computation )

One Hadoop cluster can be formed by one namenode and multiple datanodes. The resource manager runs on the same machine as the namenode ( simpler setup). The responsibility of the namenode is to store metadata about the distributed filesystem (hdfs). Datanode actually stores the data in blob form, and nodemanager will be running on each datanode to handle actual computation requests from the resource manager.

2. Prerequisites

As in all cluster environments, network address resolution for each node is a key requirement for a stable setup. Ideally, a local DNS setup that permanently allocates hostnames to all the nodes in the network. Or we can manually set the hostnames without DNS.

This material is verified on Mac and Ubuntu.

2.1. JDK 1.8

Not required if you try Docker-based setup.

Ensure you have jdk 1.8+ available on your machine, oracle jdk is recommended.

2.2. Install Docker

Ensure you have the latest version of docker is set up on your laptop.

Docker Version: 18.x.x

For Ubuntu / Debian machines

Follow this link and install the docker with a correct given bellow,

For Mac

For other distros, please help yourself ;)

2.3. Get hadoop docker image

2.4. From docker hub

docker pull haridasn/hadoop-2.8.5:latest

2.5. Build the docker image locally (Optional)

git clone
cd hadoop-env/docker
docker build -t hadoop-2.8.5:latest

2.6. Set correct hostnames for multi-node cluster setup (Optional)

If you are setting the cluster on the same machine with docker, you can skip this step.

Each node in the host machine can reach each other using the hostname.

Hadoop cluster setup using multiple physical machines.

set hostnames correctly.

All the nodes place the same set of values.

cat /etc/hosts
master  <ip-address>
Set the hostname of the machine to match this address.

edit /etc/hostname edit /etc/hosts and replace any occurrence of old hostname with a new one.

Update and check hostname changed correctly
sudo hostname <hostname>

Cross-check all the machines have the correct set of hostnames before going to next step.

3. Setup cluster on a single machine using docker

We are using the docker container mainly for process isolation, for a simpler setup on a single machine we make use of the same network stack as the host machine.

3.1. Create a docker network

For clean hostname resolution under docker environment, we have to create a docker network; which will internally provide a DNS resolution on the virtual network where all the containers reside.

docker network create hadoop-nw

We will use this network to launch all our containers, which will internally allocate all the containers into this network. So we will get the hostname resolution by default. For the non-docker deployment, we have to set up all these externally.

3.2. Start namenode and resource Manager

docker run -it -d --name namenode --network hadoop-nw haridasn/hadoop-2.8.5:latest namenode

# check container is running
docker ps -a

# Check container logs
docker logs -f namenode

To get the namenode ip, attach to the namenode docker container, We need this for starting the datanodes.

docker exec -it namenode bash

3.3. Start datanode and resource manager

docker run -it -d --name datanode1 \
    --network hadoop-nw haridasn/hadoop-2.8.5:latest datanode <name-node-ip>

docker ps -a

docker logs -f datanode1

# If you want to launch more datanodes.

docker run -it -d --name datanode2 \
    --network hadoop-nw haridasn/hadoop-2.8.5:latest datanode <name-node-ip>

3.4. Get the client tools setup on another docker

The yarn, hdfs client commands used to submit jobs and see the hdfs files respectively are loaded in another docker. Let’s use that as our workbench to play with our Hadoop cluster.

# Start the docker container to test our cluster.
docker run -it --rm --name hadoop-cli --network hadoop-nw haridasn/hadoop-cli:latest

# Get the configuration from running nodes.
docker cp namenode:/opt/hadoop/etc etc
docker cp etc hadoop-cli:/opt/hadoop/

3.5. Check hdfs

./bin/hdfs dfs -ls /

# copy files into hdfs
./bin/hdfs dfs -put /var/log/supervisor /logs
./bin/hdfs dfs -put /etc/passwd /passwd

# Copy files inside hdfs
./bin/hdfs dfs -cp /passwd /passwdr

3.6. Check Resource manager works fine

./bin/yarn jar `pwd`/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar
pi 1 1

./bin/yarn jar `pwd`/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar
wordcount /logs/ /out/

4. Other Bigdata tools on hadoop environment

4.1. Pig

A simpler command-oriented interface to do the map-reduce jobs over Hadoop cluster. You can think of this as a bash scripting over hdfs and yarn map-reduce to quickly analyze data on hdfs.

Download and extract it
Setup pig and configure it with hadoop cluster.
export PIG_HOME=<path-to-pig-home>
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=<path-to-hadoop-conf-dir>

Load some data into hdfs
./bin/hdfs dfs -mkdir /pig
./bin/hdfs dfs -put pig/tutorial/data /pig/data
Pig commandline tool
$ pig

raw = LOAD '/pig/data/excite-small.log' USING PigStorage('\t') AS (user, time,query);

user = filter raw by $2=='';

dump user
Go Top