Bigdata Session 4 - Apache Spark on Kubernetes

This post is part of a series about the 4 sessions I conducted at different engineering colleges.

  1. Bigdata Session 1 - Apache Hadoop

  2. Bigdata Session 2 - Apache Spark on Apache Hadoop

  3. Bigdata Session 3 - Apache Drill

  4. Bigdata Session 4 - Apache Spark on Kubernetes ← ( You are here now )

1. Spark on Kubernetes

Kubernetes is a Linux container manager; the idea is similar to how YARN manages JVM containers in a Hadoop environment. Kubernetes can be used to deploy very heterogeneous workloads and can meet the requirements of an entire business, e.g. application deployments, dev/stage environments, offline or batch processing services, etc. While a Hadoop environment is specific to bigdata processing, here we can use an existing Kubernetes cluster to do the same things we did on the Hadoop or Spark clusters.

This tutorial covers how to quickly set up a Kubernetes cluster and deploy Spark on it so that we can then play with Spark. Kubernetes acts as one of Spark's cluster managers; there is no change in any other aspect of how Spark works.

2. Set up Spark on Kubernetes

Kubernetes can be deployed in a multi-node or single-node environment, similar to a Hadoop cluster. Here we try a Kubernetes setup on a VM.

We will be using the minikube tool to set up a Kubernetes cluster. Minikube provides an easy way to set up a Kubernetes cluster on a VM for testing and experimentation purposes.

2.1. GNU/Linux environment

Start minikube to launch the Kubernetes cluster.

# Launch a single-node cluster with 3 CPUs and 6 GB of memory.
minikube start --cpus 3 --memory 6000
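
If minikube is not installed on the machine yet, a minimal install sketch for GNU/Linux looks like the following (assuming the upstream release binary URL and that a VM driver such as VirtualBox or KVM is already present):

# Download the minikube release binary and place it on the PATH.
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube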

2.2. macOS environment

# Install minikube and the hyperkit VM driver.
brew cask install minikube
brew install docker-machine-driver-hyperkit
# Launch the cluster on hyperkit with 3 CPUs and 6 GB of memory.
minikube start --vm-driver hyperkit --cpus 3 --memory 6000

The kubectl command is already available on a Mac if you have Docker installed.

2.3. Test that the Kubernetes cluster is up

kubectl cluster-info

Copy the master URL; we need it below to submit Spark jobs.
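
If you prefer to grab it without copy-pasting, a kubectl one-liner along these lines prints just the API server URL of the current context (a sketch against the default kubeconfig):

# Print only the API server URL of the current cluster.
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'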

# Create a dedicated namespace for the Spark workloads.
kubectl create namespace spark1
# Create a service account for the pods in that namespace.
kubectl create serviceaccount jumppod -n spark1
# Grant the service account admin rights within the spark1 namespace.
kubectl create rolebinding jumppod-rb --clusterrole=admin --serviceaccount=spark1:jumppod -n spark1
# Launch an interactive "jump" pod in the namespace; --rm deletes it when you exit.
kubectl run jump-1 -ti --rm=true -n spark1 --image=brainlounge/jumppod:ubuntu-18.04 --serviceaccount='jumppod'
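
As a quick sanity check (from another terminal, since the jump pod above is interactive), the objects just created can be listed:

# The namespace, service account and role binding should all be listed.
kubectl get namespace spark1
kubectl get serviceaccounts -n spark1
kubectl get rolebindings -n spark1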

2.4. Ensure you have JDK 1.8 installed on your laptop
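
A quick way to verify the JDK version on the machine:

# Should report a 1.8.x version.
java -version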

2.5. Prepare Spark container images for the Kubernetes cluster

Kubernetes is a container manager or orchestrator, so we need to package Spark as a Docker image to deploy it on a Kubernetes cluster. The spark-submit command supports submitting Spark jobs to a Kubernetes cluster by just changing the --master URL.
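
As a rough sketch of what such a submission looks like in cluster mode once the image from the step below is built (the SparkPi example class and the examples jar path inside the image follow the Spark 2.4 Kubernetes documentation; adjust them to match your build):

./bin/spark-submit \
    --master k8s://https://<host:port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image=spark:2.4.0 \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar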

Note: Run this after the Kubernetes cluster is up.

# Download a copy of the Spark binary onto your laptop, extract it and build the
# Docker image from it.
wget http://mirrors.estointernet.in/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
cd spark-2.4.0-bin-hadoop2.7
./bin/docker-image-tool.sh -m -t 2.4.0 build
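
The -m flag is meant to build against minikube's Docker daemon, so the freshly built images can be checked by pointing the local docker client at that daemon (a quick sanity-check sketch):

# Point the docker CLI at minikube's Docker daemon, then list the Spark images.
eval $(minikube docker-env)
docker images | grep spark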

2.6. Start a Spark client on the Kubernetes cluster manager

# Get the Kubernetes master URL; it will be in the form 'https://<host:port>'.
kubectl cluster-info

cd spark-2.4.0-bin-hadoop2.7
./bin/spark-shell --master k8s://https://<host:port> --name spark-kube-cli --deploy-mode client \
    --conf spark.kubernetes.container.image=spark:2.4.0

Now we can try all the Spark shell features as usual; the only changes here are the --master param and the extra --conf with the image name for Spark.
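
To tie the shell to the spark1 namespace and jumppod service account created earlier, the usual Spark-on-Kubernetes properties can be appended (property names as in Spark's running-on-Kubernetes documentation; the service account setting matters mainly when the driver itself runs as a pod, i.e. cluster mode):

./bin/spark-shell --master k8s://https://<host:port> --name spark-kube-cli --deploy-mode client \
    --conf spark.kubernetes.container.image=spark:2.4.0 \
    --conf spark.kubernetes.namespace=spark1 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=jumppod

While the shell is running, the executor pods it spins up can be watched from another terminal:

# Executor pods for the spark-kube-cli application should appear in Running state.
kubectl get pods -n spark1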

3. Why Spark on Kubernetes

  1. Kubernetes can work with a wide variety of application clusters; it can host your entire application stack.

  2. It is easy to deploy and manage different types of applications and their different stages.

  3. If you already have a Kubernetes cluster in your infrastructure, this is the best option available to run Spark.

4. Presentation
