This blog series includes 4 worshop session I conducted at Engieneering colleges and in my office.
Hadoop with docker is mainly usable for easy prototype and learning purpose, also it helps to setup quickly setup hadoop cluster on your laptop, or on multiple machines.
Bigdata Session 1 - Apache Hadoop with docker ← ( You are here now )
Bigdata Session 2 - Apache Spark on Apache Hadoop
Bigdata Session 3 - Apache Drill
Bigdata Session 4 - Apache Spark on Kubernetes
1. Hadoop Cluster
Main motive of the session is how we can easily test the hadoop cluster and related tools locally or on a small cluster. We will setup the hadoop in cluster mode. Hadoop cluster mode means, the individual components are runs separately on single machine on separate JVMs or running on multiple machines.
Here we will use the docker based setup to quickly play with main features of hadoop.
Main Hadoop services are:-
namenode ( storage )
datanode ( storage )
resource manager ( computation )
node manager ( computation )
One hadoop cluster can formed by one namenode and multiple datanodes. The resource manager runs on same machine as namenode ( simpler setup). Responsibility of namenode is store metadata about distributed filesystem (hdfs). Datanode actually stores the data in blob form, and nodemanager will be running on each datanode to handle actual computation requests from resource manager.
As all cluster environments, network address resolution for each node is a key requirement for stable setup. Ideally a local DNS setup which permanently allocate a hostnames to all the nodes in the network. Or we can manually set the hostnames without DNS.
|This material is verified on Mac and Ubuntu.|
2.1. JDK 1.8
|Not required if you try Docker based setup.|
Ensure you have jdk 1.8+ available on your machine, oracle jdk is recommended.
2.2. Install Docker
Ensue you have latest version of docker is setup on your laptop.
For Ubuntu / Debian machines
Follow this link and install the docker with correct given bellow,
For other distros, please help yourself ;)
2.3. Get hadoop docker image
2.4. From docker hub
docker pull haridasn/hadoop-2.8.5:latest
2.5. Build the docker image locally (Optional)
git clone https://github.com/haridas/hadoop-env cd hadoop-env/docker docker build -t hadoop-2.8.5:latest
2.6. Set correct hostnames for multi-node cluster setup (Optional)
|If you are setting the cluster on same machine with docker, you can skip this step.|
Each node in the host machine can reach each other using the hostname.
Hadoop cluster setup using multiple physical machines.
set hostnames correctly.
All the nodes place the same set of values.
cat /etc/hosts master <ip-address> node1 node2 node3 .. ..
Set the hostname of the machine to match this address.
/etc/hosts and replace any occurrence of old hostname with new one.
Update and check hostname changed correctly
sudo hostname <hostname> hostname
Cross check all the machines have correct set of hostnames before going to next step.
3. Setup cluster on single machine using docker
We are using the docker container mainly for process isolation, for a simpler setup on single machine we make use of the same network stack as the host machine.
3.1. Create a docker network
For clean hostname resoluation under docker environment, we have to create a docker network; which will internally provide a DNS resoluation on the virtual network where all the containers reside.
docker network create hadoop-nw
We will use this network to launch all our container, which will internally allocate all the containers into this network. So we will get the hostname resoluation by default. For the non-docker deployment we have to setup all these externally.
3.2. Start namenode and resource Manager
docker run -it -d --name namenode --network hadoop-nw haridasn/hadoop-2.8.5:latest namenode # check container is running docker ps -a # Check container logs docker logs -f namenode
To get the
namenode ip, attach to the namenode docker container,
We need this for starting the datanodes.
docker exec -it namenode bash ifconfig
3.3. Start datanode and resource manager
docker run -it -d --name datanode1 \ --network hadoop-nw haridasn/hadoop-2.8.5:latest datanode <name-node-ip> docker ps -a docker logs -f datanode1 . # If you want launch more datanodes. docker run -it -d --name datanode2 \ --network hadoop-nw haridasn/hadoop-2.8.5:latest datanode <name-node-ip>
3.4. Get the client tools setup on another docker
hdfs clinet commands used to submit jobs and see the hdfs
files respectively are loaded in another docker. Lets use that as our workbench
to play with our hadoop cluster.
# Start the docker container to test our cluster. docker run -it --rm --name hadoop-cli --network hadoop-nw haridasn/hadoop-cli:latest # Get the configuration from running nodes. docker cp namenode:/opt/hadoop/etc etc docker cp etc hadoop-cli:/opt/hadoop/
3.5. Check hdfs
./bin/hdfs dfs -ls / # copy files into hdfs ./bin/hdfs dfs -put /var/log/supervisor /logs ./bin/hdfs dfs -put /etc/passwd /passwd # Copy files inside hdfs ./bin/hdfs dfs -cp /passwd /passwdr
3.6. Check Resource manager works fine
./bin/yarn jar `pwd`/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar pi 1 1 ./bin/yarn jar `pwd`/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /logs/ /out/
4. Other Bigdata tools on hadoop environment
A simpler command oriented interface to do the map-reduce jobs over hadoop cluster. You can think this as a bash scripting over hdfs and yarn map-reduce to quickly analyse data on hdfs.
Download and extract it
Setup pig and configure it with hadoop cluster.
export PIG_HOME=<path-to-pig-home> export PATH=$PATH:$PIG_HOME/bin export PIG_CLASSPATH=<path-to-hadoop-conf-dir> pig
Load some data into hdfs
./bin/hdfs dfs -mkdir /pig ./bin/hdfs dfs -put pig/tutorial/data /pig/data
Pig commandline tool
$ pig raw = LOAD '/pig/data/excite-small.log' USING PigStorage('\t') AS (user, time,query); user = filter raw by $2=='powwow.com'; dump user
SQL interface over hadoop system.