Walkthrough for deploying Apache Zeppelin on Kubernetes

Apache Zeppelin is a “web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more”. As a Data Scientist, I love Apache Zeppelin because of its versatility and flexibility — from a Kubernetes administrator’s perspective, Zeppelin has given me some headaches. However, in my opinion, this is not because Zeppelin’s Kubernetes support is bad. Rather, I am not convinced by the current official documentation and the available sources on this subject, which brings us to the motivation for this article.


Apache Zeppelin on Kubernetes

So, how does Zeppelin work on Kubernetes? Basically, the whole thing rests on simple principles: the Zeppelin web server runs as a Kubernetes Deployment, and while you interact with a Zeppelin note, the server lazily creates a pod for each interpreter you use. All interpreter pods use the same container image.

Building and providing Images

That said, when it comes to building and providing the required images, this implies that two images are needed: one containing the web server and another used by the interpreter pods.
Both images are typically created via multi-stage builds: a common distribution image serves as the source from which the relevant parts of Apache Zeppelin are copied into each of them. Accordingly, the first step towards the desired images is to provide this distribution image.

Building the Distribution Image

One option for providing the distribution image is to use the code from the official repository, which can be obtained by using git:

git clone https://github.com/apache/zeppelin.git

The Dockerfile suitable for building the distribution image is located directly in the root directory of the project. So it looks like you can get started right away, but before starting the build you should consider the following:

The distribution image contains the complete build of Zeppelin and is about 2GB in size. Therefore, the creation of the image will take some time. Furthermore, you have to make sure that Docker has enough resources available. If you are a Windows or Mac user, you may have to resize the virtual machine Docker is using.

With this in mind, we can change to the root directory of the project and start the build. It makes sense to use the tag that the Dockerfiles of the server and interpreter images (which inherit from the distribution image) expect:

cd zeppelin
docker build -t zeppelin-distribution:latest .

Building the Server Image

Apache Zeppelin’s GitHub project provides some useful Dockerfiles. A Dockerfile for building the server image is located in the scripts/docker/zeppelin-server folder of the project. Before the image can be built, a few adaptations may need to be made.

The first thing to do is to set the version number of the distribution, which can be determined by looking at the WAR files inside the distribution image:

docker run --rm zeppelin-distribution:latest ls /opt/zeppelin

You can set the version number for the server image by changing the default value of the version argument in the Dockerfile.
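As a sketch (the build argument is called version, matching the --build-arg used below; the default value in your checkout may differ), the relevant line looks like this:

ARG version="0.10.0-SNAPSHOT"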

Another option is to add the version number as an argument to the docker build command (see below).

If you take a closer look at the Dockerfile, the comments contain hints on how to configure the image. For example, you can specify that only certain interpreters are supported by the Zeppelin server. In the following example, only support for JDBC, Markdown, Python and Cassandra is integrated into the image.
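The lines below are a sketch along the lines of the COPY instructions in the upstream Dockerfile; the stage name (zeppelin-distribution), the paths and the ${ZEPPELIN_HOME} variable may differ in your checkout:

# copy only the interpreters that should be supported instead of the whole interpreter folder
COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/jdbc ${ZEPPELIN_HOME}/interpreter/jdbc
COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/md ${ZEPPELIN_HOME}/interpreter/md
COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/python ${ZEPPELIN_HOME}/interpreter/python
COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/cassandra ${ZEPPELIN_HOME}/interpreter/cassandra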

Since local files are involved, the build of the image has to be run from the corresponding folder. Apart from that, you should tag the image so that it refers to the registry Kubernetes will pull from later on:

cd scripts/docker/zeppelin-server
docker build -t grothesk/zeppelin-server-custom:0.10.0-SNAPSHOT \
--build-arg version=0.10.0-SNAPSHOT .

The image is then pushed to the registry like this:

docker push grothesk/zeppelin-server-custom:0.10.0-SNAPSHOT

Building the Interpreter Image

A corresponding Dockerfile for the interpreter image can be found in the scripts/docker/zeppelin-interpreter folder. The procedure is very similar to the one for the server image described above: the version has to be set, and you have to choose which interpreters should be supported.

For example, if you want to provide support for JDBC, Markdown, Python and Cassandra, the interpreter's Dockerfile should contain corresponding COPY instructions for exactly these interpreters.
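Again as a sketch (stage name, paths and variables may differ in your checkout):

# copy only the interpreters that the interpreter pods should support
COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/jdbc ${ZEPPELIN_HOME}/interpreter/jdbc
COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/md ${ZEPPELIN_HOME}/interpreter/md
COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/python ${ZEPPELIN_HOME}/interpreter/python
COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/cassandra ${ZEPPELIN_HOME}/interpreter/cassandra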

If certain Python packages are required for the interpreter, these can be specified via conda_packages.txt or pip_packages.txt located in the same folder as the Dockerfile.
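As a purely illustrative example (assuming the usual one-package-per-line requirements format; check how the Dockerfile in your checkout consumes these files), a pip_packages.txt could look like this:

pandas
numpy
matplotlib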

After configuring the Dockerfile, building and deploying the image is done as usual:

cd scripts/docker/zeppelin-interpreter
docker build -t grothesk/zeppelin-interpreter-custom:0.10.0-SNAPSHOT .
docker push grothesk/zeppelin-interpreter-custom:0.10.0-SNAPSHOT

Deploying the Kubernetes resources

For the deployment of Zeppelin, the project contains a YAML file with the relevant resource manifests: k8s/zeppelin-server.yaml.
The customized images have to be referenced within these manifests as follows.

The name of the server image is inserted into the corresponding image field of the Deployment.
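A sketch of the relevant part, assuming the structure of the upstream k8s/zeppelin-server.yaml (the container name zeppelin-server is an assumption; the pod's other containers and unrelated fields are omitted):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: zeppelin-server
spec:
  template:
    spec:
      containers:
        - name: zeppelin-server
          image: grothesk/zeppelin-server-custom:0.10.0-SNAPSHOT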

The interpreter image is set in the ConfigMap zeppelin-server-conf-map.
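Only the relevant key is sketched here; the upstream ConfigMap may contain additional keys:

apiVersion: v1
kind: ConfigMap
metadata:
  name: zeppelin-server-conf-map
data:
  ZEPPELIN_K8S_CONTAINER_IMAGE: grothesk/zeppelin-interpreter-custom:0.10.0-SNAPSHOT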

zeppelin-server-conf-map is used to inject the included values as environment variables into the server container. Consequently, a restart of the server pods is necessary if the value of ZEPPELIN_K8S_CONTAINER_IMAGE has been changed.
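Assuming the Deployment is named zeppelin-server (check the manifest for the actual name), such a restart can be triggered like this:

kubectl rollout restart deployment zeppelin-server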

Ok, let’s get to the end: after configuring k8s/zeppelin-server.yaml you can deploy Apache Zeppelin like this:

kubectl apply -f k8s/zeppelin-server.yaml

And that’s it.

Deploying Apache Zeppelin on Minikube

Last but not least, here is a test run of the deployment using a Minikube cluster. In this example, the required images are provided by my account on DockerHub. You can obtain the corresponding Dockerfiles and manifests here: https://github.com/deepshore/walkthrough-for-deploying-apache-zeppelin-on-kubernetes. These files were created according to the steps above.

First, we need a cluster that meets the performance requirements. The requirements depend strongly on the tasks that are to be performed by the interpreters. For example, a fairly generously sized cluster could be laid out as follows:

minikube start -p zeppelin --driver=virtualbox --memory=6g --cpus=4 --disk-size=20000mb

To be able to test the Apache Cassandra support of the interpreter image, you should deploy Cassandra. Manifests for deploying Cassandra on Kubernetes can be taken from the aforementioned repo like this:

kubectl apply -f https://raw.githubusercontent.com/deepshore/walkthrough-for-deploying-apache-zeppelin-on-kubernetes/main/cassandra/cassandra-service.yaml
kubectl apply -f https://raw.githubusercontent.com/deepshore/walkthrough-for-deploying-apache-zeppelin-on-kubernetes/main/cassandra/cassandra-statefulset.yaml

Depending on the size of the interpreter image, lazy loading may lead to a timeout. In this case I recommend pulling the image in advance.
On a single-node Minikube cluster, this can be done as follows:

eval $(minikube -p zeppelin docker-env)
docker pull grothesk/zeppelin-interpreter-custom:0.10.0-SNAPSHOT

As a next step, we kick off the deployment of Zeppelin:

kubectl apply -f https://raw.githubusercontent.com/deepshore/walkthrough-for-deploying-apache-zeppelin-on-kubernetes/main/k8s/zeppelin-server.yaml

After that we wait for the server to be ready by checking the pod status:

kubectl get pods -w

In order to access the web service, port-forwarding can be used like this:

kubectl port-forward zeppelin-server-54c44df9bc-m7rsk 8080:80

When port forwarding is set up, we should be able to communicate with Zeppelin via localhost:8080:

Screenshot: the “Welcome to Zeppelin” start page

The following pictures show the steps for configuring the Cassandra interpreter. Given our Cassandra deployment, the configuration changes only affect the hosts (cassandra) and cluster (K8sDemo) properties.

Screenshots: configuring hosts and cluster, saving the configuration, restarting the interpreter with the new settings
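Expressed as interpreter properties, the changes boil down to something like this (the property names shown here are assumptions about what the Cassandra interpreter settings expose and may differ between Zeppelin versions):

cassandra.hosts = cassandra
cassandra.cluster = K8sDemo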

After the Cassandra interpreter has been configured, we create a new note like this:

Screenshots: creating a new note, setting the note name and default interpreter

In order to test different interpreters, we should add and run some paragraphs:

Screenshots: creating and running paragraphs
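As an illustration, three paragraphs like the following could be used to exercise the Markdown, Python and Cassandra interpreters (the CQL statement simply reads the Cassandra version from the system keyspace):

%md
## Hello from the Markdown interpreter

%python
print("Hello from the Python interpreter")

%cassandra
SELECT release_version FROM system.local;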

Finally, if we have a look at the pods, we can see that for each interpreter used, a corresponding pod has been created:

kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
cassandra-0                        1/1     Running   0          52m
cassandra-hdwtrn                   1/1     Running   0          70s
md-gangug                          1/1     Running   0          80s
python-ocxhqu                      1/1     Running   0          76s
zeppelin-server-54c44df9bc-m7rsk   3/3     Running   0          18m

I am convinced that you can manage on your own from here on.

Additions

I have created a distribution image which you can pull from DockerHub:

docker pull grothesk/zeppelin-distribution:e0e2ca5f8087d8f47a9fba4bfe736b53a565cb11

It’s based on commit e0e2ca5f8087d8f47a9fba4bfe736b53a565cb11. Feel free to use it.
