This is a series of blog posts about the 4 sessions I conducted at different engineering colleges.
Bigdata Session 1 - Apache Hadoop
Bigdata Session 2 - Apache Spark on Apache Hadoop
Bigdata Session 3 - Apache Drill
Bigdata Session 4 - Apache Spark on Kubernetes ← ( You are here now )
1. Spark on Kubernetes
Kubernetes is a Linux container manager; the ideas are similar to how YARN manages JVM containers in the Hadoop environment. Kubernetes can be used to deploy very heterogeneous workloads, and it can meet the requirements of an entire business, e.g. deploying applications, dev/stage environments, offline or batch processing services, etc. Whereas a Hadoop environment is specific to big data processing, an existing Kubernetes cluster can also do the things that we did on the Hadoop cluster or Spark cluster.
This tutorial covers how to quickly set up a Kubernetes cluster and deploy Spark on it so that we can then play with Spark. Kubernetes acts as one of Spark's cluster managers; nothing else changes in how Spark does its work.
2. Setup spark on kubernetes
Kubernetes can be deployed in a multi-node or single-node environment, similar to a Hadoop cluster. Here we try the Kubernetes setup on a VM.
We will be using the minikube tool to set up a Kubernetes cluster. Minikube provides an easy way to set up a Kubernetes cluster on a VM for testing and experimentation.
2.1. Gnu/Linux Environment:-
Follow this link to install minikube https://kubernetes.io/docs/tasks/tools/install-minikube/#install-minikube
Start minikube to launch the Kubernetes cluster.
minikube start --cpus 3 --memory 6000
2.2. Mac environment:-
brew cask install minikube
brew install docker-machine-driver-hyperkit
minikube start --vm-driver hyperkit --cpus 3 --memory 6000
The kubectl command is already available on a Mac if you have Docker installed.
2.3. Test kubernetes cluster is up
Copy the master URL; we need it below to submit Spark jobs.
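One way to read the master URL without copying it by hand is to ask kubectl for the API server address of the current context. This is only a minimal sketch, assuming kubectl is already pointed at the minikube cluster; the helper name spark_master_url is made up for this post and is not part of Spark or Kubernetes.

```shell
# Hypothetical helper: turn an API server URL into the form that Spark
# expects for its --master option (the URL prefixed with k8s://).
spark_master_url() {
  printf 'k8s://%s\n' "$1"
}

# Only talk to the cluster when kubectl is actually installed.
if command -v kubectl >/dev/null 2>&1; then
  # The minikube node should be listed with STATUS "Ready".
  kubectl get nodes
  # Read the API server URL of the current kubectl context.
  K8S_SERVER="$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')"
  spark_master_url "$K8S_SERVER"
fi
```

The printed value can be passed as it is to the --master option used later in this post.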
kubectl create namespace spark1
kubectl create serviceaccount jumppod -n spark1
kubectl create rolebinding jumppod-rb --clusterrole=admin --serviceaccount=spark1:jumppod -n spark1
kubectl run jump-1 -ti --rm=true -n spark1 --image=brainlounge/jumppod:ubuntu-18.04 --serviceaccount='jumppod'
2.4. Ensure you have JDK 1.8 installed on your laptop
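A quick way to check which Java is on the PATH is sketched below. The helper name is_java8 is hypothetical, and the check simply looks for the text version "1.8 in the banner that java prints, so treat it as a rough test rather than a complete one.

```shell
# Hypothetical helper: succeed only when a `java -version` banner reports 1.8.
is_java8() {
  case "$1" in
    *'version "1.8'*) return 0 ;;
    *) return 1 ;;
  esac
}

if command -v java >/dev/null 2>&1; then
  # java prints its version banner on stderr, so redirect it to stdout.
  BANNER="$(java -version 2>&1)"
  if is_java8 "$BANNER"; then
    echo "JDK 1.8 found"
  else
    echo "A different Java version is on the PATH:"
    echo "$BANNER"
  fi
else
  echo "java is not on the PATH"
fi
```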
2.5. Prepare spark container images for kubernetes cluster
Kubernetes is a container manager or orchestrator, so we need to package Spark as a Docker image to deploy it on a Kubernetes cluster.
The spark-submit command supports submitting Spark jobs to a Kubernetes cluster by just changing the --master parameter.
Note: Run this after the Kubernetes cluster is up.
# Download a copy of the Spark binary to your laptop, unpack it, and build
# the docker image from it.
wget http://mirrors.estointernet.in/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
cd spark-2.4.0-bin-hadoop2.7
# -m builds inside minikube's docker daemon; -t sets the image tag.
./bin/docker-image-tool.sh -m -t 2.4.0 build
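Because the -m flag builds the image inside the minikube VM, it will not show up in the docker daemon of your laptop. The sketch below is one way to confirm that the image exists, assuming the minikube cluster is running; spark_image_ref is a hypothetical helper that only joins the image name and the tag.

```shell
# Hypothetical helper: build the image reference that is handed to Spark later.
spark_image_ref() {
  printf '%s:%s\n' "$1" "$2"
}

if command -v minikube >/dev/null 2>&1 && command -v docker >/dev/null 2>&1; then
  # Point the local docker client at the docker daemon inside the minikube VM.
  eval "$(minikube docker-env)"
  # List the image built by docker-image-tool.sh, if it is there.
  docker images "$(spark_image_ref spark 2.4.0)"
fi
```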
2.6. Start a spark client on kubernetes cluster manager
# Get the Kubernetes master URL; it will be in the form 'https://<host:port>'.
kubectl cluster-info
cd spark-2.4.0-bin-hadoop2.7
./bin/spark-shell --master k8s://https://<host:port> --name spark-kube-cli --deploy-mode client \
  --conf spark.kubernetes.container.image=spark:2.4.0
Now we can try all the Spark features available; the only changes here are the --master parameter and an extra --conf with the Spark image name.
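The same two changes apply to spark-submit. The sketch below submits the SparkPi example that ships with the Spark distribution in cluster mode. It rests on a few assumptions: K8S_MASTER is a variable you set yourself to the k8s://https://<host:port> URL of your cluster, the executor count of 2 is arbitrary, and the jar path is where the image built from the 2.4.0 distribution keeps its examples.

```shell
# Assumed values; set K8S_MASTER to the master URL of your own cluster.
K8S_MASTER="${K8S_MASTER:-}"
SPARK_IMAGE="${SPARK_IMAGE:-spark:2.4.0}"
# The local:// scheme means the jar is already inside the container image.
EXAMPLES_JAR="local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"

# Submit the bundled SparkPi example; $1 is the k8s:// master URL.
submit_spark_pi() {
  ./bin/spark-submit \
    --master "$1" \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image="$SPARK_IMAGE" \
    "$EXAMPLES_JAR"
}

# Run only from inside the unpacked Spark directory and with a master URL set.
if [ -n "$K8S_MASTER" ] && [ -x ./bin/spark-submit ]; then
  submit_spark_pi "$K8S_MASTER"
fi
```

In cluster mode the driver itself runs in a pod and creates the executor pods, so depending on the RBAC settings of the cluster it may need a service account with permission to do that, passed through spark.kubernetes.authenticate.driver.serviceAccountName.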
3. Why spark on kubernetes
Kubernetes can work with a wide variety of application clusters; it can host your entire application stack.
It is easy to deploy and manage different types of applications and their different stages.
If you already have a Kubernetes cluster in your infrastructure, this is the best option available for running Spark.