Bigdata Session 4 - Apache Spark on Kubernetes
in data-science · Sun 02 June 2019
This is a series of blog posts about the 4 sessions I conducted at different engineering colleges.
Bigdata Session 1 - Apache Hadoop
Bigdata Session 2 - Apache Spark on Apache Hadoop
Bigdata Session 3 - Apache Drill
Bigdata Session 4 - Apache Spark on Kubernetes ← ( You are here now )
Kubernetes is a Linux container manager; the idea is similar to how YARN manages JVM containers in a Hadoop environment. Kubernetes can deploy very heterogeneous workloads and can meet the requirements of an entire business, e.g. applications, dev/stage environments, offline or batch processing services, etc. While a Hadoop environment is specific to big data processing, an existing Kubernetes cluster can run the same workloads we previously ran on a Hadoop or Spark cluster.
This tutorial covers how to quickly set up a Kubernetes cluster and deploy Spark on it so that we can play with Spark. Kubernetes acts as one of Spark's cluster managers; nothing else about how Spark does its work changes.
Kubernetes can be deployed in a multi-node or single-node environment, similar to a Hadoop cluster. Here we set up Kubernetes on a VM.
We will be using the minikube tool to set up the Kubernetes cluster. Minikube provides an easy way to set up a single-node Kubernetes cluster on a VM for testing and experimentation.
Install VirtualBox
Follow this link to install minikube https://kubernetes.io/docs/tasks/tools/install-minikube/#install-minikube
Install kubectl - https://kubernetes.io/docs/tasks/tools/install-kubectl/#install-kubectl-binary-using-native-package-management
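Once both are installed, a quick sanity check (assuming minikube and kubectl are on your PATH):
minikube version
kubectl version --client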
Start minikube to launch the Kubernetes cluster.
minikube start --cpus 3 --memory 6000
On a Mac you can instead install minikube and the hyperkit driver via Homebrew and start the cluster with the hyperkit driver.
brew cask install minikube
brew install docker-machine-driver-hyperkit
minikube start --vm-driver hyperkit --cpus 3 --memory 6000
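After the start command finishes, confirm the VM and the cluster components are running:
minikube status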
The kubectl command may already be available on your Mac if you have Docker installed.
kubectl cluster-info
Copy the master URL; we need it below to submit Spark jobs.
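If you prefer to grab just the master URL programmatically, one way (a sketch, assuming a kubeconfig with minikube as the current context) is:
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'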
kubectl create namespace spark1
kubectl create serviceaccount jumppod -n spark1
kubectl create rolebinding jumppod-rb --clusterrole=admin --serviceaccount=spark1:jumppod -n spark1
kubectl run jump-1 -ti --rm=true -n spark1 --image=brainlounge/jumppod:ubuntu-18.04 --serviceaccount='jumppod'
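The jump pod above is interactive, so from another terminal you can confirm that the namespace, service account, and pod came up as expected:
kubectl get serviceaccounts -n spark1
kubectl get pods -n spark1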
Kubernetes is a container manager, or orchestrator, so we need to package Spark as a Docker image to deploy it on a Kubernetes cluster. The spark-submit command supports submitting Spark jobs to a Kubernetes cluster by just changing the --master URL.
Note: Run this after the Kubernetes cluster is up.
# Download a copy of the spark binary into your laptop and build the docker
# image from it.
wget http://mirrors.estointernet.in/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
cd spark-2.4.0-bin-hadoop2.7
./bin/docker-image-tool.sh -m -t 2.4.0 build
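The -m flag builds against minikube's Docker daemon, so the image lands inside the VM. To confirm it is there (minikube docker-env points your shell at that daemon):
eval $(minikube docker-env)
docker images | grep spark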
# Get the kubernetes master url, will be in this form 'https://<host:port>'.
kubectl cluster-info
cd spark-2.4.0-bin-hadoop2.7
./bin/spark-shell --master k8s://https://<host:port> --name spark-kube-cli --deploy-mode client \
--conf spark.kubernetes.container.image=spark:2.4.0
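While the shell is up, Spark requests executor pods from the cluster; in another terminal you can watch them come and go (they land in the default namespace unless spark.kubernetes.namespace is set):
kubectl get pods -w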
Now we can try all the Spark command features we are used to; the only changes here are the --master param and the extra --conf with the Spark image name.
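spark-submit works the same way for non-interactive jobs. Here is a sketch in cluster mode, reusing the spark1 namespace and jumppod service account created earlier and Spark's bundled SparkPi example (the executor-instance count is arbitrary):
./bin/spark-submit --master k8s://https://<host:port> --deploy-mode cluster \
    --name spark-pi --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.namespace=spark1 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=jumppod \
    --conf spark.kubernetes.container.image=spark:2.4.0 \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
Once the driver pod completes, its output (including the computed value of Pi) is available via kubectl logs -n spark1 <driver-pod-name>.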
Kubernetes can work with a wide variety of application clusters; it can host your entire application stack. It makes it easy to deploy and manage different types of applications and their different stages. If you already have a Kubernetes cluster in your infrastructure, this is the best option available for running Spark.