Bigdata Session 1 - Hadoop with docker
in data-science · Sun 02 June 2019
This blog series includes 4 workshop sessions I conducted at Engineering colleges and in my office.
Hadoop with docker is mainly useful for easy prototyping and learning; it also helps to quickly set up a Hadoop cluster on your laptop, or across multiple machines.
Bigdata Session 1 - Apache Hadoop with docker ← ( You are here now )
Bigdata Session 2 - Apache Spark on Apache Hadoop
Bigdata Session 3 - Apache Drill
Bigdata Session 4 - Apache Spark on Kubernetes
The main motive of the session is to show how we can easily test the Hadoop cluster and related tools locally or on a small cluster. We will set up Hadoop in cluster mode, which means the individual components run separately, either on a single machine in separate JVMs or across multiple machines.
Here we will use a docker-based setup to quickly play with the main features of Hadoop.
The main Hadoop services are:
namenode (storage)
datanode (storage)
resource manager (computation)
node manager (computation)
One Hadoop cluster is formed by one namenode and multiple datanodes. The resource manager runs on the same machine as the namenode (simpler setup). The responsibility of the namenode is to store metadata about the distributed filesystem (hdfs). The datanodes actually store the data in block form, and a nodemanager runs on each datanode to handle the actual computation requests from the resource manager.
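Once the cluster is up, a quick way to see which of these daemons a given node hosts is the jps tool that ships with the JDK. The output below is only illustrative; what you see depends on which services run on that node.

# run on any node (or inside any container) to list the Hadoop JVMs
jps

# a master node typically shows process names like NameNode and ResourceManager,
# while a worker node shows DataNode and NodeManager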
As in all cluster environments, network address resolution for each node is a key requirement for a stable setup. Ideally, a local DNS server permanently allocates hostnames to all the nodes in the network; alternatively, we can set the hostnames manually without DNS.
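Whichever approach you choose, it is worth confirming that names actually resolve before starting any Hadoop service. A quick check (the hostname here is just a placeholder):

# prints the ip if the name resolves via DNS or /etc/hosts (Linux)
getent hosts <hostname>

# basic reachability check, works on Mac and Ubuntu
ping -c 1 <hostname>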
Note: This material is verified on Mac and Ubuntu.
Note: Not required if you try the Docker-based setup.
Ensure you have JDK 1.8+ available on your machine; Oracle JDK is recommended.
Ensure you have the latest version of docker set up on your laptop.
Docker Version: 18.x.x
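You can confirm both prerequisites are in place before proceeding (the exact version strings will differ on your machine):

java -version
docker --version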
Follow this link and install docker, then pull the pre-built image given below,
docker pull haridasn/hadoop-2.8.5:latest
# Or build the image yourself
git clone https://github.com/haridas/hadoop-env
cd hadoop-env/docker
docker build -t hadoop-2.8.5:latest .
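Either way, the image should now be available locally (the exact repository name and tag depend on whether you pulled or built it):

docker images | grep hadoop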
Note: If you are setting up the cluster on the same machine with docker, you can skip this step.
Each node should be able to reach every other node using its hostname.
Hadoop cluster setup using multiple physical machines:
All the nodes should have the same set of entries in /etc/hosts.
cat /etc/hosts

<ip-address>   master
<ip-address>   node1
<ip-address>   node2
<ip-address>   node3
..
# edit /etc/hostname and /etc/hosts, and replace any occurrence of the old hostname with the new one
sudo hostname <hostname>

# confirm the change
hostname
Cross-check that all the machines have the correct hostnames before going to the next step.
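A quick way to cross-check, assuming you have ssh access to all the machines and using the example hostnames from above:

# print the hostname reported by each machine
for h in master node1 node2 node3; do ssh $h hostname; done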
We are using docker containers mainly for process isolation; for a simpler setup on a single machine, we make use of the same network stack as the host machine.
For clean hostname resolution in the docker environment, we have to create a docker network, which internally provides DNS resolution on the virtual network where all the containers reside.
docker network create hadoop-nw
We will use this network to launch all our containers, which places them all on the same virtual network, so we get hostname resolution by default. For a non-docker deployment, we have to set all of this up externally.
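You can verify the network exists and, later, see which containers have joined it (containers only show up here after they are launched):

docker network inspect hadoop-nw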
docker run -it -d --name namenode --network hadoop-nw haridasn/hadoop-2.8.5:latest namenode
# check container is running
docker ps -a
# Check container logs
docker logs -f namenode
To get the namenode ip, attach to the namenode docker container. We need this ip for starting the datanodes.
docker exec -it namenode bash
ifconfig
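Alternatively, you can read the ip straight from docker without attaching to the container, using docker's inspect templates:

# prints the container's ip on the hadoop-nw network
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' namenode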
docker run -it -d --name datanode1 \
--network hadoop-nw haridasn/hadoop-2.8.5:latest datanode <name-node-ip>
docker ps -a
docker logs -f datanode1
# If you want to launch more datanodes.
docker run -it -d --name datanode2 \
--network hadoop-nw haridasn/hadoop-2.8.5:latest datanode <name-node-ip>
The yarn and hdfs client commands, used to submit jobs and to browse hdfs files respectively, are loaded in another docker image. Let's use that as our workbench to play with our Hadoop cluster.
# Start the docker container to test our cluster.
docker run -it --rm --name hadoop-cli --network hadoop-nw haridasn/hadoop-cli:latest
# Get the configuration from running nodes.
docker cp namenode:/opt/hadoop/etc etc
docker cp etc hadoop-cli:/opt/hadoop/
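With the configuration in place, a quick sanity check from inside the hadoop-cli container confirms that the datanodes and nodemanagers have registered with the cluster:

# attach to the cli container if you are not already inside it
docker exec -it hadoop-cli bash

# list the datanodes known to the namenode
./bin/hdfs dfsadmin -report

# list the nodemanagers known to the resource manager
./bin/yarn node -list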
./bin/hdfs dfs -ls /
# copy files into hdfs
./bin/hdfs dfs -put /var/log/supervisor /logs
./bin/hdfs dfs -put /etc/passwd /passwd
# Copy files inside hdfs
./bin/hdfs dfs -cp /passwd /passwdr
# Estimate the value of pi using the bundled examples jar
./bin/yarn jar `pwd`/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar pi 1 1

# Run wordcount over the logs we copied into hdfs
./bin/yarn jar `pwd`/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /logs/ /out/
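After submitting the jobs, you can check their status and look at the wordcount output. The part file name below follows the usual MapReduce naming and may differ on your run:

# list submitted applications and their state
./bin/yarn application -list -appStates ALL

# inspect the wordcount output
./bin/hdfs dfs -ls /out
./bin/hdfs dfs -cat /out/part-r-00000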
Apache Pig provides a simpler, command-oriented interface to run map-reduce jobs over the Hadoop cluster. You can think of it as bash-like scripting over hdfs and yarn map-reduce to quickly analyze data on hdfs.
wget http://mirrors.estointernet.in/apache/pig/pig-0.17.0/pig-0.17.0.tar.gz
export PIG_HOME=<path-to-pig-home>
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=<path-to-hadoop-conf-dir>
pig
./bin/hdfs dfs -mkdir /pig
./bin/hdfs dfs -put pig/tutorial/data /pig/data
$ pig
raw = LOAD '/pig/data/excite-small.log' USING PigStorage('\t') AS (user, time, query);
user = filter raw by $2 == 'powwow.com';
dump user;
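The same statements can also be saved to a file and submitted in batch mode instead of typing them into the grunt shell. A sketch, where query.pig is just an example file name:

# run against the Hadoop cluster
pig -x mapreduce query.pig

# or run against the local filesystem for quick iteration
pig -x local query.pig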
SQL interface over the Hadoop system.