Apache Toree notebook for Spark
Tweetin data-science · Sun 09 December 2018
in data-science · Sun 09 December 2018
Apache Toree is a jupyter kernel. It runs the spark application in client mode, so that we can interact with the application via console. By enabling this in standard jupyter notebook we can easily connect with any spark cluster or standalone servers, via this we get all the flexibility of the jupyter notebook. Toree notebook supports pyspark and R also with special magic constructs.
Here I’m explaining how we can install and configure toree for spark based testing environment.
# Current latest version, pick the
pip install toree==0.3.0
jupyter toree install --user --spark_home=/mnt/haridas/packages/spark-2.1.2-bin-hadoop2.7
To confirm the new kernel got installed to jupyter correctly,
jupyter kernelspec list
Available kernels:
apache_toree_scala /home/haridas/.local/share/jupyter/kernels/apache_toree_scala
python3 /home/haridas/ENV3/share/jupyter/kernels/python3
If above commands ran successful then we are good to go with running new kernel with jupyter web UI or on the console mode.
jupyter notebook --no-browser --ip 0.0.0.0 --port 8080
Go to browser on this jupyter application, when creating the new notebook, you can now have option to create toree notebook also along with python standard notebook type.
Sometime you might need to directly work on small scripts ore test few things against spark cluster or checking the scala magics ;)
Make use of the console version jupyter with the new kernel. Now you get the standard ipython like console via that we can interact with the spark cluster or write scala codes.
jupyter console --kernel=apache_toree_scala
Default settings of the spark application is very minimal, which may be not enough to test with big files or make use of the available resources in your machine. To do that you need to update the default configurations.
By default spark and driver program make uses only 1GB of heap size, and number of executors will be 1.
To change the configuration it’s pretty easy, you can check the available options
from spark-submit --help
command, and pass them on bellow environment variable
before running the toree notebook.
When picking options from spark-submit
ensure your cluster is of type standalone,
or standalone-cluster-mode or yarn cluster mode. The options are bit different between
the cluster managers.
export SPARK_OPTS="--jars /home/haridas/custom1.jar:/home/haridas/custom2.jar \
--driver-memory=5g \
--executor-memory=5g \
--num-executors 3"
jupyter console --kernel=apache_toree_scala
To save the spark options permanently for the given toree kernel, we can use the kernel configuration file to update the same.
# List available jupyter kernels
(ENV3)$ jupyter kernelspec list
Available kernels:
apache_toree_scala /home/ubuntu/.local/share/jupyter/kernels/apache_toree_scala
python3 /home/ubuntu/ENV3/share/jupyter/kernels/python3
# Edit the `apache_toree_scala` 's kernel.json file
Open /home/ubuntu/.local/share/jupyter/kernels/apache_toree_scala/kernel.json
{
...
"env": {
"__TOREE_SPARK_OPTS__": "--master spark://<ip>:7077 --executor-cores 1 ..."
}
...
}
Once the kernel.json
updated with the spark configuration, when we start a
toree notebook it automatically uses this SPARK_OPTS
.
Here only we only need to make use of the spark-submit arguments.
export SPARK_OPTS="--master spark://<host>:7077 ..."
By making use of the special magic types you can run the any of these codes that is supported by the spark execution environment.
Thats all for now, have fun with spark cluster and get the same flexibility of ipython notebook !
Here we are accessing the spark via python interface without toree kernel support,
pyspark
library provide all the necessary tooling to start the driver program
and py4j services. Use this if you want to interact with python without toree
kernel.
export SPARK_HOME=/mnt/spark-2.4.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=/home/haridas/ENV3/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='console'
pyspark
Now the pyspark uses the jupyter console interface, and we get the native console interface instead of web-console, some cases backend console is better.