This is a series of blog posts about the 4 session I conducted at different engieneering colleges.
Bigdata Session 1 - Apache Hadoop
Bigdata Session 2 - Apache Spark on Apache Hadoop
Bigdata Session 3 - Apache Drill
Bigdata Session 4 - Apache Spark on Kubernetes ← ( You are here now )
1. Spark on Kubernetes
Kubernetes is a linux container manager, the ideas is similar to how yarn manages the jvm containers in hadoop environment. Kubernetes can be used to deploy very hetrogenious workloads, and it can meet a requirements of an entire business. eg; Deploy applications , dev/stage environments, offline or batch processing services etc. As the hadoop environment is specific to bigdata processing, here we can use existing kubernetes cluster to do the things that we done on hadoop cluster or spark cluster.
This tutorial covers how we can quickly setup a kubernetes cluster and deploy a spark cluster on it, so that then we can play on spark. The kubernetes act as one of the spark’s cluster manager, there is no change in other aspects of how spark does it’s functionalities.
2. Setup spark on kubernetes
Kubernet can be deploy in multi-node or single node environment similar to hadoop cluster. Here we try the kubernetes setup on a VM.
We will be using
minikube tool to setup kubernetes cluster. Minikube provide an easy
way to setup kubernetes cluster on a VM for testing and experiment purpose.
2.1. Gnu/Linux Environment:-
Follow this link to install minikube https://kubernetes.io/docs/tasks/tools/install-minikube/#install-minikube
Start the minikube to launch the kubernetes cluster.
minikube start --cpus 3 --memory 6000
2.2. Mac environment:-
brew cask install minikube brew install docker-machine-driver-hyperkit minikube start --vm-driver hyperkit --cpus 3 --memory 6000
kubectl command available in mac if you already have the docker installed.
2.3. Test kubernetes cluster is up
Copy the master URL, we need it bellow to submit spark jobs.
kubectl create namespace spark1 kubectl create serviceaccount jumppod -n spark1 kubectl create rolebinding jumppod-rb --clusterrole=admin --serviceaccount=spark1:jumppod -n spark1 kubectl run jump-1 -ti --rm=true -n spark1 --image=brainlounge/jumppod:ubuntu-18.04 --serviceaccount='jumppod'
2.4. Ensure you have JDK 1.8 installed in your laptop
2.5. Prepare spark container images for kubernetes cluster
Kubernetes is a container manager or orchestrator, we need to package the spark in
docker image form to deploy them on kubernetes cluster.
spark-submit command support the spark job submission into kubernetes cluster
by just changing the
|Run this after the kubernetes cluster is up.|
# Download a copy of the spark binary into your laptop and build the docker # image from it. wget http://mirrors.estointernet.in/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz cd spark-2.4.0-bin-hadoop2.7 ./bin/docker-image-tool.sh -m -t 2.4.0 build
2.6. Start a spark client on kubernetes cluster manager
# Get the kubernetes master url, will be in this form 'https://<host:port>'. minikube cluster-info cd spark-2.4.0-bin-hadoop2.7 ./bin/spark-shell --master k8s://https://<host:port> --name spark-kube-cli --deploy-mode client \ --conf spark.kubernetes.container.image=spark:2.4.0
Now we can try with all the spark command features available, only change here
--master param and extra
--conf with image name for the spark.
3. Why spark on kubernetes
Kubernetes can work with wide variety of application cluster, your entire application stack it can host.
Easy to deploy and manager different types of applications and its different stages.
If you have a kubernetes cluster in your infrastructure, this is the best option available to run a spark.