Apache Toree notebook for Spark

Apache Toree is a Jupyter kernel. It runs the Spark application in client mode, so we can interact with the application from the console. By adding it to a standard Jupyter installation we can connect to any Spark cluster or standalone server and keep all the flexibility of the Jupyter notebook. A Toree notebook also supports PySpark and R through special magic constructs.

Here I explain how to install and configure Toree for a Spark-based testing environment.

1. Configure toree with jupyter notebook

# Pick the current latest version
pip install toree==0.3.0
jupyter toree install --user --spark_home=/mnt/haridas/packages/spark-2.1.2-bin-hadoop2.7

To confirm that the new kernel was installed into Jupyter correctly, list the available kernels:

jupyter kernelspec list

Available kernels:
  apache_toree_scala    /home/haridas/.local/share/jupyter/kernels/apache_toree_scala
  python3               /home/haridas/ENV3/share/jupyter/kernels/python3

If the above commands ran successfully, we are ready to run the new kernel from the Jupyter web UI or in console mode.

2. Access Toree from jupyter

2.1. Notebook mode

jupyter notebook --no-browser --ip 0.0.0.0 --port 8080

Open the Jupyter application in the browser. When creating a new notebook you now have the option to create a Toree notebook alongside the standard Python notebook type.

2.2. Console mode

Sometimes you might need to work directly on small scripts, test a few things against the Spark cluster, or try out the Scala magics ;)

Make use of the console version of Jupyter with the new kernel. It gives you a standard IPython-like console through which we can interact with the Spark cluster or write Scala code.

jupyter console --kernel=apache_toree_scala
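
For a quick sanity check from this console, you can run a couple of Scala lines against the pre-created context; Toree binds the SparkContext as sc (a minimal sketch, exact output depends on your Spark version):

// Toree binds the SparkContext as `sc`; confirm the kernel can reach Spark.
sc.version

// Run a tiny job to make sure executors are reachable.
sc.parallelize(1 to 100).sum()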

2.3. Customize toree configuration

The default settings of the Spark application are very minimal, which may not be enough to test with big files or to make use of the resources available on your machine. In that case you need to update the default configuration.

By default the driver and executor programs use only 1GB of heap each, and the number of executors is 1.

Changing the configuration is pretty easy: check the available options with the spark-submit --help command and pass them through the environment variable shown below before starting the Toree notebook.

When picking options from spark-submit, keep in mind which cluster manager you are running against (standalone, standalone cluster mode, or YARN); the options differ a bit between cluster managers.

export SPARK_OPTS="--jars /home/haridas/custom1.jar:/home/haridas/custom2.jar \
    --driver-memory=5g \
    --executor-memory=5g \
    --num-executors 3"
jupyter console --kernel=apache_toree_scala

To save the Spark options permanently for the given Toree kernel, update them in the kernel configuration file.

# List available jupyter kernels

(ENV3)$ jupyter kernelspec list
Available kernels:
  apache_toree_scala    /home/ubuntu/.local/share/jupyter/kernels/apache_toree_scala
  python3               /home/ubuntu/ENV3/share/jupyter/kernels/python3


# Edit the `apache_toree_scala` kernel's kernel.json file

Open /home/ubuntu/.local/share/jupyter/kernels/apache_toree_scala/kernel.json

{
    ...

    "env": {
        "__TOREE_SPARK_OPTS__": "--master spark://<ip>:7077 --executor-cores 1 ..."
     }
     ...
}

Once kernel.json is updated with the Spark configuration, any Toree notebook we start automatically picks up these options.
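
To confirm the options were actually applied, a quick check from a Toree cell can inspect the driver's SparkConf (a small sketch; the keys shown correspond to the --executor-memory and --master flags):

// Inspect the configuration the driver was started with.
sc.getConf.getOption("spark.executor.memory")   // e.g. Some(5g) if set above
sc.getConf.getOption("spark.master")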

3. Connect Toree with remote spark cluster

Here we only need to pass the appropriate spark-submit arguments, pointing --master at the remote cluster.

export SPARK_OPTS="--master spark://<host>:7077 ..."

4. Write SQL/python/R on the same toree notebook

By making use of the special cell magics you can run any of these languages that the Spark execution environment supports, all in the same notebook.
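
For instance, a notebook created with the Scala kernel can mix languages cell by cell. The sketch below assumes the SQL and PySpark interpreters were enabled when Toree was installed (the --interpreters flag of jupyter toree install) and that a SparkSession is bound as spark; %LsMagic lists the magics actually available in your build, and the numbers view is just an illustrative name.

// Scala cell: build a small DataFrame and expose it as a temp view.
val df = spark.range(0, 10).toDF("id")
df.createOrReplaceTempView("numbers")

%%sql
-- SQL cell: query the view registered from the Scala cell.
SELECT id FROM numbers WHERE id > 5

%%PySpark
# Python cell: the shared SparkSession sees the same view.
spark.sql("SELECT count(*) AS n FROM numbers").show()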

That's all for now; have fun with the Spark cluster while keeping the flexibility of an IPython notebook!

5. Pyspark on ipython console

Here we access Spark through the Python interface without Toree kernel support; the pyspark library provides all the tooling needed to start the driver program and the Py4J services. Use this if you want to interact with Spark from Python without the Toree kernel.

export SPARK_HOME=/mnt/spark-2.4.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=/home/haridas/ENV3/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='console'


pyspark

Now pyspark uses the Jupyter console interface, so we get a native console instead of the web UI; in some cases the plain console is more convenient.
