KubeFlow Custom Jupyter Image (+ GitHub for notebook source control)

I’ve been playing around with KubeFlow a bit lately and found that a lot of the tutorials and examples of Jupyter notebooks on KubeFlow do a lot of the pip installs and other setup and config work in the notebook itself, which feels icky.

But, in reality, if you were working in Jupyter notebooks on KubeFlow for real you’d want to build a lot of this into the image used to build the notebook server. Luckily, as with most of KubeFlow, it’s pretty flexible to customize and extend as you want, in this case by adding custom Jupyter images.

Two main example use cases for this are ensuring that a custom Python package you have built (e.g. my_utils) is readily available in all your notebooks, and making external libraries you use all the time (e.g. Kubeflow Pipelines via kfp) available too.

To that end, here is a Dockerfile that illustrates this (and here is the corresponding image on Docker Hub).

ARG BASE_CONTAINER=gcr.io/kubeflow-images-public/tensorflow-1.13.1-notebook-cpu:v0.5.0
FROM $BASE_CONTAINER
LABEL maintainer="andrewm4894@gmail.com"
LABEL version="01"
RUN pip3 install git+https://github.com/andrewm4894/my_utils.git#egg=my_utils
RUN pip3 install kfp --upgrade
ENV NB_PREFIX /
CMD ["sh","-c", "jupyter notebook –notebook-dir=/home/jovyan –ip=0.0.0.0 –no-browser –allow-root –port=8888 –NotebookApp.token='' –NotebookApp.password='' –NotebookApp.allow_origin='*' –NotebookApp.base_url=${NB_PREFIX}"]

Once you have such a custom image building fine, it’s pretty easy to just point KubeFlow at it when creating a Jupyter notebook server.

Just specify your custom image

Now when you create a new notebook on that Jupyter server you have all your custom goodness ready to go.

GitHub for notebooks

As I was looking around, it seems like there are currently plans to implement some git functionality into the notebooks on KubeFlow in a bit more of a native way (see this issue).

For now I decided to just create an SSH key (help docs) for the persistent workspace volume connected to the notebook server (see step 10 here).
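If it helps, generating the key from a notebook cell might look roughly like this (a sketch only; the exact paths are up to you, I’m just assuming the workspace volume is mounted under /home/jovyan as in the notebook image above, so check the linked docs for the details):

# make sure there is a .ssh dir on the workspace volume so the key survives restarts
!mkdir -p /home/jovyan/.ssh

# generate a key pair with an empty passphrase
!ssh-keygen -t rsa -b 4096 -f /home/jovyan/.ssh/id_rsa -N ""

# print the public key so you can add it to GitHub (Settings -> SSH and GPG keys)
!cat /home/jovyan/.ssh/id_rsa.pub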

Then whenever you want to git push from your notebook server you can just hack together a little notebook like this that you can use as a poor man’s git UI 🙂
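Something along these lines, for example: a few cells of shell commands you re-run whenever you want to commit (a rough sketch, assuming the repo is already cloned onto the workspace volume and the SSH key above is registered with GitHub; the repo path and commit message are just placeholders):

# each ! line runs in its own shell, hence the cd && pattern
!cd /home/jovyan/my-repo && git status

# stage and commit everything, then push back to GitHub over SSH
!cd /home/jovyan/my-repo && git add .
!cd /home/jovyan/my-repo && git commit -m "update notebooks from kubeflow notebook server"
!cd /home/jovyan/my-repo && git push origin master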

Parallel Jupyter Notebooks

I have become master of the notebooks, they bend at my will and exist to serve my data science needs!

Ok, I might be getting a bit carried away, but I recently discovered papermill and have been finding it very useful in conjunction with Python multiprocessing to speed up a lot of data science experimental type work. So useful, in fact, that I was motivated to write a post on a Saturday night!

Note: All the code for this post is here.

One problem with notebooks

I’m generally a fan of notebooks (have swayed back and forth) but am wary of some of the downsides or costs they can impose. When doing experimental type work, if you’re not careful, you can end up with lots of duplicated code or what I think of as “notebook instances”, where you have run your notebook many times on different (but similar) datasets and with different (but similar) parameters.

Aside: Great talk and deck from @joelgrus (who is great, and whose meme game is very strong) on some drawbacks of notebooks.

Having the executed notebooks themselves become self-documenting artifacts relating to the experiment is really useful: the code you ran and its outputs in one place. But when you start building new features on top of these “notebook instances” as you iterate on the research, things can quickly get messy.

Where I’ve found papermill to be very useful is in basically templating up your notebooks in one single place and parameterizing them such that the actual living notebook code and the executed “notebook instances” have a much cleaner separation.

I’ll try to make this clearer with an example.

data_explorer Notebook

Let’s suppose you have a notebook that you often use on new datasets (in reality it’s more likely to be some more complicated ML pipeline type notebook for quickly experimenting on updated datasets while maintaining some common structure in how you go about things).

In this example it’s a simple notebook to download a dataset and just do some descriptive stats and plotting.

The main idea here is to parameterize the whole notebook as much as possible. This is done with a dictionary called “config”, so the idea is that pretty much everything the notebook needs is defined in the first cell.

config = {
    "data_url": "https://raw.githubusercontent.com/andrewm4894/papermill_dev/master/data/titanic.csv"
}

In this case, the data_explorer notebook just takes in one parameter called “data_url”. It then downloads from this url into a pandas dataframe and does some basic plotting. In reality this “config” dict can contain all the input parameters you need to define and execute your notebook. For example it could define the type of models to build against your data, what data to use, model parameters, where to store outputs etc. Anything and everything really.
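For illustration, the body of such a notebook might do something roughly like this with the config (a sketch, not the exact code from the repo; note that for papermill to inject a different config at run time, the cell defining config above should be tagged “parameters” in Jupyter so the injected values land just after it):

import pandas as pd

# everything downstream only ever looks at the "config" dict defined (or injected) above
df = pd.read_csv(config["data_url"])

# some quick descriptive stats and plotting
print(df.shape)
print(df.describe(include="all"))
df.hist(figsize=(12, 8))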

Enter Papermill

So let’s say you now have a number of different datasets that you want to run through your data_explorer notebook. You could manually update the config and then just rerun the notebook 3 times (making sure to restart the kernel and clear all outputs each time), maybe saving outputs into specific locations. Or worse, you could make 3 copies of your notebook and run them each individually (don’t do this, future you will hate it).

Much better is to let papermill kick off the execution of the notebooks, so you have a clear separation between the notebooks your code lives in (in this case, the notebooks folder of the repo) and the outputs or “notebook instances” from running the same notebooks multiple times against different data, or the same data but with slightly different parameters (in this case the papermill_outputs folder, according to a convention you can control).

Two things let us do this: a Python script (run_nb_batch.py) that uses papermill and multiprocessing to kick off parallel notebook executions, and a JSON file (configs.json) defining the notebooks to be run and the configs to run them with.

run_nb_batch.py:

import papermill as pm
import multiprocessing
import os
import argparse
import json


def run_papermill(config):
    '''Function to run notebook(s) in parallel using papermill.
    '''
    # get some variables from the config being run
    config = config['config']  # unwrap the inner config dict
    notebook = config['notebook']
    output_label = config['output_label']
    # get name of notebook
    notebook_name = notebook.split('/')[1].replace('.ipynb', '')
    output_dir = f'papermill_outputs/{notebook_name}/{output_label}'
    # print config to be run
    print("-" * 50)
    print(config)
    print("-" * 50)
    # make output dir if need to
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    output_path = f'{output_dir}/{notebook_name}_{output_label}.ipynb'
    output_path_backup = output_path.replace('.ipynb', '_backup.ipynb')
    # rename existing output file if need to
    if os.path.exists(output_path):
        # remove existing backup file if there is one
        if os.path.exists(output_path_backup):
            os.remove(output_path_backup)
        # rename existing output file
        os.rename(output_path, output_path_backup)
    # run notebook using papermill
    pm.execute_notebook(
        notebook,
        output_path,
        parameters=dict(config=config)
    )


# add args
parser = argparse.ArgumentParser(description='Batch run some notebooks.')
parser.add_argument(
    '--config_file',
    type=str,
    default='configs.json',
    help='point to the config file you want to use.'
)
parser.add_argument(
    '--run_mode',
    type=str,
    default='parallel',
    help="If set to 'parallel', then run using multiprocessing, just sequential for any other value."
)
# parse args
args = parser.parse_args()
config_file = args.config_file
run_mode = args.run_mode

# read in config_file
with open(config_file) as json_file:
    configs = json.load(json_file)

if __name__ == '__main__':
    # loop over each config
    for config in configs:
        # wrap the config in a dict with a known key for unpacking by the run_papermill function
        config_dict = {'config': configs[config]}
        if run_mode == 'parallel':
            p = multiprocessing.Process(
                target=run_papermill,
                args=(config_dict,)
            )
            p.start()
        else:
            run_papermill(config_dict)

configs.json

{
    "config_bank": {
        "notebook": "notebooks/data_explorer.ipynb",
        "data_url": "https://raw.githubusercontent.com/andrewm4894/papermill_dev/master/data/bank-full.csv",
        "output_label": "bank"
    },
    "config_adult": {
        "notebook": "notebooks/data_explorer.ipynb",
        "data_url": "https://raw.githubusercontent.com/andrewm4894/papermill_dev/master/data/adult.csv",
        "output_label": "adult"
    },
    "config_titanic": {
        "notebook": "notebooks/data_explorer.ipynb",
        "data_url": "https://raw.githubusercontent.com/andrewm4894/papermill_dev/master/data/titanic.csv",
        "output_label": "titanic"
    }
}

The idea is to loop through each config in the configs.json file and execute the specified notebook with the specified configuration. Executed notebooks then go to a predefined output file such as papermill_outputs/data_explorer/adult/data_explorer_adult.ipynb.

In this case I’ve chosen the naming convention of papermill_outputs/<notebook_name>/<output_label>/<notebook_name>_<output_label>.ipynb but obviously you can choose whatever you want.
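To make the mapping concrete, here is roughly what the script ends up doing for the “config_titanic” entry above, boiled down to a single direct papermill call (a sketch, just to show how the config and the naming convention fit together):

import papermill as pm

# what run_nb_batch.py boils down to for the "config_titanic" entry
# (the batch version: python run_nb_batch.py --config_file configs.json --run_mode parallel)
config = {
    "notebook": "notebooks/data_explorer.ipynb",
    "data_url": "https://raw.githubusercontent.com/andrewm4894/papermill_dev/master/data/titanic.csv",
    "output_label": "titanic",
}

pm.execute_notebook(
    config["notebook"],
    "papermill_outputs/data_explorer/titanic/data_explorer_titanic.ipynb",
    parameters=dict(config=config),
)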

That’s pretty much it for this one. Feel free to clone the repo and play around with it or add improvements as you like.

I’ve been finding that this sort of approach to templating up core notebooks you end up using quite a lot (albeit with slightly different params etc.), along with a standardized approach to using something like mlflow to further instrument and store artifacts of your notebook runs, can make running multiple ‘experiments’ on your data in parallel much easier and overall help make you a bit more productive.
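As a very rough sketch of what that mlflow piece might look like (not from the repo; the run name, parameter names and paths here are just illustrative), you could log each executed “notebook instance” and its config as an mlflow run:

import mlflow

# hypothetical example: log one executed "notebook instance" as an mlflow run artifact
output_path = "papermill_outputs/data_explorer/titanic/data_explorer_titanic.ipynb"

with mlflow.start_run(run_name="data_explorer_titanic"):
    mlflow.log_param("data_url", "https://raw.githubusercontent.com/andrewm4894/papermill_dev/master/data/titanic.csv")
    mlflow.log_param("output_label", "titanic")
    mlflow.log_artifact(output_path)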

Update: I decided to make a quick video as it’s sometimes easier to just see what we are doing. (Sorry the audio quality is a bit bad (and the resolution a bit poor), first time :))