This blog series covers four workshop sessions I conducted at engineering colleges and in my office.
Hadoop with docker is mainly used for easy prototyping and learning; it also helps you quickly set up a Hadoop cluster on your laptop or on multiple machines.
Bigdata Session 1 - Apache Hadoop with docker ← ( You are here now )
Bigdata Session 2 - Apache Spark on Apache Hadoop
Bigdata Session 3 - Apache Drill
Bigdata Session 4 - Apache Spark on Kubernetes
1. Hadoop Cluster
The main aim of this session is to show how we can easily test a Hadoop cluster and related tools locally or on a small cluster. We will set up Hadoop in cluster mode, which means the individual components run in separate JVMs, either on a single machine or across multiple machines.
Here we will use a docker-based setup to quickly play with the main features of Hadoop.
The main Hadoop services are:
namenode ( storage )
datanode ( storage )
resource manager ( computation )
node manager ( computation )
A Hadoop cluster is formed by one namenode and multiple datanodes. For a simpler setup, the resource manager runs on the same machine as the namenode. The namenode stores metadata about the distributed filesystem (hdfs). The datanodes store the actual data as blocks, and a nodemanager runs on each datanode to handle computation requests from the resource manager.
As in all cluster environments, network address resolution for each node is a key requirement for a stable setup. Ideally, a local DNS permanently allocates hostnames to all the nodes in the network; otherwise we can set the hostnames manually without DNS.
|This material is verified on Mac and Ubuntu.|
2.1. JDK 1.8
|Not required if you try Docker-based setup.|
Ensure you have JDK 1.8+ available on your machine; Oracle JDK is recommended.
2.2. Install Docker
Ensure the latest version of docker is set up on your laptop.
For Ubuntu / Debian machines
Follow this link and install the correct version of docker for your machine.
For other distros, please help yourself ;)
2.3. Get hadoop docker image
2.4. From docker hub
docker pull haridasn/hadoop-2.8.5:latest
2.5. Build the docker image locally (Optional)
git clone https://github.com/haridas/hadoop-env
cd hadoop-env/docker
# The trailing dot is the build context (the current directory).
docker build -t hadoop-2.8.5:latest .
2.6. Set correct hostnames for multi-node cluster setup (Optional)
|If you are setting the cluster on the same machine with docker, you can skip this step.|
Every node in the cluster must be able to reach every other node using its hostname.
For a Hadoop cluster set up on multiple physical machines, set the hostnames correctly.
Place the same set of entries in /etc/hosts on all the nodes.
cat /etc/hosts
<ip-address> master
<ip-address> node1
<ip-address> node2
<ip-address> node3
..
Set the hostname of each machine to match its entry in this list.
In /etc/hosts, replace any occurrence of the old hostname with the new one.
Update the hostname and check that it changed correctly:
sudo hostname <hostname>
hostname
Cross-check that all the machines have the correct set of hostnames before going to the next step.
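The cross-check can be scripted. The following is a minimal sketch, assuming a hosts file in the usual "address, then names" layout; `check_hosts` is a hypothetical helper written for this post, not a Hadoop or system tool. It reports every expected hostname that has no entry in the given file.

```shell
#!/bin/sh
# Sketch: report which of the expected hostnames are missing from a hosts file.
# check_hosts is a hypothetical helper, not part of Hadoop or the OS.
check_hosts() {
    hosts_file=$1
    shift
    missing=0
    for name in "$@"; do
        # Skip comment lines, then look for the name in any hostname column (2nd onward).
        if ! awk -v h="$name" '
            /^[ \t]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }' "$hosts_file"; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_hosts /etc/hosts master node1 node2 node3` on each machine should print nothing and return 0 once the entries are in place.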
3. Setup cluster on a single machine using docker
We are using docker containers mainly for process isolation. For a simpler setup on a single machine, we make use of the same network stack as the host machine.
3.1. Create a docker network
For clean hostname resolution in a docker environment, we have to create a docker network, which internally provides DNS resolution on the virtual network where all the containers reside.
docker network create hadoop-nw
We will launch all our containers on this network, so we get hostname resolution by default. For a non-docker deployment, we have to set all of this up externally.
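Creating a network that already exists fails, so when repeating the workshop it can help to make this step re-runnable. A minimal sketch, assuming the same network name as above; `ensure_network` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: create the network only if it does not exist yet.
# ensure_network is a hypothetical helper; it relies on
# "docker network inspect" returning non-zero for an unknown network.
ensure_network() {
    name=$1
    if docker network inspect "$name" >/dev/null 2>&1; then
        echo "network $name already exists"
    else
        docker network create "$name"
    fi
}
```

Usage: `ensure_network hadoop-nw`.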
3.2. Start namenode and resource Manager
docker run -it -d --name namenode --network hadoop-nw haridasn/hadoop-2.8.5:latest namenode
# Check container is running
docker ps -a
# Check container logs
docker logs -f namenode
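`docker logs -f` keeps following the log until interrupted, so in a script it can be handier to poll the container state instead. A minimal sketch, assuming that a running container is good enough as a readiness signal (it does not prove the namenode service itself is ready); `wait_running` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: poll until a container reports Running=true, or give up after N tries.
# wait_running is a hypothetical helper; the default of 10 tries is arbitrary.
wait_running() {
    name=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        # An unknown container makes docker inspect fail; treat that as "not running".
        state=$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null) || state=""
        if [ "$state" = "true" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

Usage: `wait_running namenode 30 || echo "namenode did not start"`.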
To get the namenode IP, attach to the namenode docker container and run ifconfig inside it. We need this IP for starting the datanodes.
docker exec -it namenode bash
ifconfig
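The IP can also be read from the host without attaching to the container. A minimal sketch using `docker inspect` with a Go template; `container_ip` is a hypothetical helper name, and it prints one address per network the container is attached to (here only hadoop-nw is assumed).

```shell
#!/bin/sh
# Sketch: print the IP address of a container without attaching to it.
# container_ip is a hypothetical helper name.
container_ip() {
    docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$1"
}
```

Usage: `NAMENODE_IP=$(container_ip namenode)`, after which `$NAMENODE_IP` can stand in for `<name-node-ip>` in the datanode commands.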
3.3. Start datanode and node manager
docker run -it -d --name datanode1 \
    --network hadoop-nw haridasn/hadoop-2.8.5:latest datanode <name-node-ip>
docker ps -a
docker logs -f datanode1
# If you want to launch more datanodes.
docker run -it -d --name datanode2 \
    --network hadoop-nw haridasn/hadoop-2.8.5:latest datanode <name-node-ip>
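When more than a couple of datanodes are needed, the same command can be wrapped in a loop. A minimal sketch that mirrors the image, network and arguments used above; `start_datanodes` is a hypothetical helper, and the datanode1..datanodeN naming simply continues the convention of this post.

```shell
#!/bin/sh
# Sketch: launch datanode1..datanodeN against a given namenode IP.
# start_datanodes is a hypothetical helper, not part of the image.
start_datanodes() {
    count=$1
    namenode_ip=$2
    i=1
    while [ "$i" -le "$count" ]; do
        docker run -it -d --name "datanode$i" \
            --network hadoop-nw haridasn/hadoop-2.8.5:latest datanode "$namenode_ip"
        i=$((i + 1))
    done
}
```

Usage: `start_datanodes 3 <name-node-ip>` starts three datanode containers.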
3.4. Set up the client tools in another docker container
The yarn and hdfs client commands, used to submit jobs and to browse the hdfs files respectively, are available in a separate docker image. Let's use that container as our workbench to play with our Hadoop cluster.
# Start the docker container to test our cluster.
docker run -it --rm --name hadoop-cli --network hadoop-nw haridasn/hadoop-cli:latest
# Get the configuration from running nodes (run these from another terminal on the host).
docker cp namenode:/opt/hadoop/etc etc
docker cp etc hadoop-cli:/opt/hadoop/
3.5. Check hdfs
./bin/hdfs dfs -ls /
# Copy files into hdfs
./bin/hdfs dfs -put /var/log/supervisor /logs
./bin/hdfs dfs -put /etc/passwd /passwd
# Copy files inside hdfs
./bin/hdfs dfs -cp /passwd /passwdr
3.6. Check Resource manager works fine
./bin/yarn jar `pwd`/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar pi 1 1
./bin/yarn jar `pwd`/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /logs/ /out/
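To get a feel for what the wordcount job computes, here is a rough single-machine approximation in plain shell. It is only a sketch: it splits on whitespace and prints "word<TAB>count" sorted by word, which is the general shape of the job's output, but it does nothing distributed. `local_wordcount` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: a local approximation of what the wordcount example computes.
# Reads text on stdin, prints "word<TAB>count" sorted by word.
local_wordcount() {
    tr -s '[:space:]' '\n' |      # one token per line
        sed '/^$/d' |             # drop empty lines left by leading whitespace
        LC_ALL=C sort |           # group identical words together
        uniq -c |                 # count each group
        awk '{ print $2 "\t" $1 }'
}
```

For example, `cat /etc/passwd | local_wordcount` can be compared against the job output, which typically lands in files named like `part-r-*` under `/out/` and can be read with `./bin/hdfs dfs -cat`.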
4. Other Bigdata tools on hadoop environment
Pig provides a simpler command-oriented interface for running map-reduce jobs over a Hadoop cluster. You can think of it as bash scripting over hdfs and yarn map-reduce to quickly analyze data on hdfs.
Download and extract it
Set up Pig and configure it with the Hadoop cluster.
export PIG_HOME=<path-to-pig-home>
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=<path-to-hadoop-conf-dir>
pig
Load some data into hdfs
./bin/hdfs dfs -mkdir /pig
./bin/hdfs dfs -put pig/tutorial/data /pig/data
Pig commandline tool
$ pig
raw = LOAD '/pig/data/excite-small.log' USING PigStorage('\t') AS (user, time, query);
user = filter raw by $2 == 'powwow.com';
dump user;
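For comparison, the same filter can be sketched on a local copy of the tab-separated file with awk. Note the indexing difference: Pig's positional `$2` is zero-based, so it refers to the third field (query), while awk's fields are one-based, so the same column is `$3`. `filter_query` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: keep only the rows whose third tab-separated field equals the given query.
# Pig's $2 (zero-based) corresponds to awk's $3 (one-based).
filter_query() {
    awk -F '\t' -v q="$1" '$3 == q'
}
```

Usage, assuming the tutorial data is available locally: `filter_query powwow.com < pig/tutorial/data/excite-small.log`.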
SQL interface over the Hadoop system.