( 0 – 0 ) / 0 != 0

Arrrgghh – I just wasted the best part of an afternoon chasing this one down. If i can knock out a quick post on it then at least i’ll feel i’ve gotten something out of it.

Here’s the story – somewhere in an admittedly crazy ETL type pipeline i was using pandas pct_change() as a data transformation prior to some downstream modelling. Problem was i was also using dropna() later to ensure only valid data was going into the model. I noticed a lot of rows falling out of the data pipeline (print(df.shape) is your friend!).

After a lot of debugging, chasing this dead end down a rabbit hole, cleaning and updating my conda environment (and then dealing with the tornado issues that always seems to cause) i realized the answer was staring me in the face the whole time.

It’s the 0’s!

Turns out my data has lots of zeros and i’m an idiot who assumed (0-0) / 0 = 0 and so the NaN’s i was seeing should be 0.
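To make the problem concrete, here is a small sketch on a made-up series (the values are invented for illustration, not the real pipeline data). pct_change() works out current / previous - 1, so a zero followed by a zero is 0/0, which pandas reports as NaN, and a later dropna() quietly removes those rows.

```python
import pandas as pd

# Toy data, invented for illustration: plenty of zeros.
df = pd.DataFrame({"x": [0.0, 0.0, 5.0, 0.0, 0.0, 10.0]})

# pct_change() is current / previous - 1, so 0 -> 0 becomes 0/0 = NaN
# and 0 -> 5 becomes 5/0 = inf.
df["x_pct"] = df["x"].pct_change()
print(df)

# Compare the shape before and after dropna() to see rows falling out.
print(df.shape)
print(df.dropna().shape)

n_rows_lost = len(df) - len(df.dropna())
```

Only the first NaN comes from the lag; the other two are the 0 to 0 steps.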

But am i really an idiot? I checked…

I guess so.

So let that be a lesson – watch out for 0’s when using pct_change() – don’t be like andrewm4894.

p.s. I could not just fillna() them as i had valid NaN’s at the top of my dataframe from the lagged values. So i ended up in a state where some NaN’s were valid and some i wanted to be 0 instead. Am guessing i’ll have to write a custom pct_change() type function (which seems a bit crazy). Would be great if there was a way to tell pandas pct_change() i want (0-0)/0=0. Maybe if i get brave enough i’ll make a pull request 🙂

p.p.s. Seems like i’m not the only one: https://brilliant.org/wiki/what-is-0-0/ (though still wrong)

hmmm – maybe i’ve just wasted more time on this post. Think i need to pack it in for today.


Everything is always clearer in the morning, here is a one liner that pretty much behaves as i wanted. I can get back to my life now.
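One way to get that behaviour, as a minimal sketch (not necessarily the exact one-liner, and the function name is made up): use Series.mask() to set the result to 0 wherever both the current and the previous value are 0. The first row has no previous value, so its NaN is left alone.

```python
import pandas as pd

def pct_change_zero_safe(s: pd.Series) -> pd.Series:
    """pct_change(), but a 0 -> 0 step counts as a 0% change (hypothetical helper)."""
    # shift() gives NaN for the first row, so (s.shift() == 0) is False there
    # and the "valid" leading NaN is kept as NaN.
    return s.pct_change().mask((s == 0) & (s.shift() == 0), 0.0)

s = pd.Series([0.0, 0.0, 5.0, 0.0, 0.0, 10.0])
result = pct_change_zero_safe(s)
print(result)
```

Note this only touches the 0 to 0 case. A step from 0 to a non-zero value still comes out as inf, which may need its own handling depending on the model.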

Parallel Jupyter Notebooks

I have become master of the notebooks, they bend at my will and exist to serve my data science needs!

Ok i might be getting a bit carried away, but i recently discovered papermill and have been finding it very useful in conjunction with Python multiprocessing to speed up a lot of data science experimental type work. So useful in fact, i was motivated to write a post on a Saturday night!

Note: All the code for this post is here.

One problem with notebooks

I’m generally (have swayed back and forth) a fan of notebooks but am wary of some of the downsides or costs they can impose. When doing experimental type work, if you’re not careful, you can end up with lots of duplicated code or what i think of as “notebook instances”, where you have run your notebook many times on different (but similar) datasets and with different (but similar) parameters.

Aside: Great talk and deck from @joelgrus (who is great – and whose meme game is very strong) on some drawbacks of notebooks.

Having the executed notebooks themselves become self documenting artifacts relating to the experiment is really useful – the code you ran and its outputs in one place. But when you start building new features on top of these “notebook instances” as you iterate on the research, things can quickly get messy.

Where I’ve found papermill to be very useful is in basically template-ing up your notebooks in one single place and parameterizing them such that the actual living notebook code and the executed “notebook instances” have a much cleaner separation.

I’ll try to make this clearer with an example.

data_explorer Notebook

Let’s suppose you have a notebook that you often use on new datasets (in reality it’s more likely to be some more complicated ml pipeline type notebook for quickly experimenting on updated datasets while maintaining some common structure in how you go about things).

In this example it’s a simple notebook to download a dataset and just do some descriptive stats and plotting.

The main idea here is to parameterize the whole notebook as much as possible. This is done with a json-style dictionary called “config”. So the idea is that everything the notebook needs is pretty much defined in the first cell.

config = {
    "data_url": "https://raw.githubusercontent.com/andrewm4894/papermill_dev/master/data/titanic.csv"
}

In this case, the data_explorer notebook just takes in one parameter called “data_url”. It then downloads from this url into a pandas dataframe and does some basic plotting. In reality this “config” dict can contain all the input parameters you need to define and execute your notebook. For example it could be defining the type of models to build against your data, what data to use, model parameters, where to store outputs etc. Anything and everything really.
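As a rough sketch of what the body of such a notebook might look like (the helper name and the describe() call are assumptions for illustration, not the actual notebook code), everything downstream reads from config rather than from hard-coded values:

```python
import pandas as pd

# First cell of the notebook: every input lives in this one dict.
config = {
    "data_url": "https://raw.githubusercontent.com/andrewm4894/papermill_dev/master/data/titanic.csv"
}

def load_and_describe(config: dict) -> pd.DataFrame:
    """Hypothetical helper: load the CSV named in config and summarise it."""
    # pd.read_csv accepts a URL as well as a local file path.
    df = pd.read_csv(config["data_url"])
    print(df.shape)
    return df.describe(include="all")

# Later cells in the notebook would then call:
# summary = load_and_describe(config)
```

Because nothing else in the notebook refers to a specific dataset, swapping the value of "data_url" is enough to point the whole thing at different data.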

Enter Papermill

So let’s say you now have a number of different datasets that you want to run through your data_explorer notebook. You could manually update the config and then just rerun the notebook 3 times (making sure to restart the kernel and clear all each time), maybe saving outputs into specific locations. Or worse you could make 3 copies of your notebook and just run them each individually (don’t do this, future you will hate it).

Much better is to let papermill kick off the execution of the notebooks so you have a clear separation between the notebooks your code lives in (in this case, the notebooks folder of the repo) and the outputs or “notebook instances” of running the same notebooks multiple times against different data or the same data but with slightly different parameters (in this case the papermill_outputs folder according to a convention you can control).

Two things let us do this: a python script (run_nb_batch.py) that uses papermill and multiprocessing to kick off parallel notebook executions, and a json file (configs.json) that defines the notebooks to be run and the configs to run them with.


import papermill as pm
import multiprocessing
import os
import argparse
import json


def run_papermill(config):
    ''' Function to run notebook(s) in parallel using papermill.
    '''
    # get some variables from the config being run
    config = config['config'] # a bit ugly
    notebook = config['notebook']
    output_label = config["output_label"]
    # get name of notebook
    notebook_name = notebook.split('/')[1].replace('.ipynb','')
    output_dir = f'papermill_outputs/{notebook_name}/{output_label}'
    # print config to be run
    print(config)
    # make output dir if need to
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    output_path = f'{output_dir}/{notebook_name}_{output_label}.ipynb'
    output_path_backup = output_path.replace('.ipynb','_backup.ipynb')
    # rename existing output file if need to
    if os.path.exists(output_path):
        # remove existing backup file if there is one
        if os.path.exists(output_path_backup):
            os.remove(output_path_backup)
        # rename existing output file
        os.rename(output_path, output_path_backup)
    # run notebook using papermill
    # (reconstructed call: assumes the notebook has a "parameters" cell defining config)
    pm.execute_notebook(notebook, output_path, parameters=dict(config=config))


# add args
# (the flag spellings and defaults below are reconstructed assumptions)
parser = argparse.ArgumentParser(description='Batch run some notebooks.')
parser.add_argument(
    '--config_file',
    default='configs.json',
    help='point to the config file you want to use.'
)
parser.add_argument(
    '--run_mode',
    default='parallel',
    help="If set to 'parallel', then run using multiprocessing, just sequential for any other value."
)
# parse args
args = parser.parse_args()
config_file = args.config_file
run_mode = args.run_mode
# read in config_file
with open(config_file) as json_file:
    configs = json.load(json_file)

if __name__ == '__main__':
    # loop over each config
    for config in configs:
        # pass the config keys in a dict with known name for unpacking by the run_papermill function
        config_dict = [{'config':configs[config]}]
        if run_mode == 'parallel':
            p = multiprocessing.Process(
                target=run_papermill,
                args=tuple(config_dict)
            )
            p.start()
        else:
            run_papermill(config_dict[0])

{
    "config_bank": {
        "notebook": "notebooks/data_explorer.ipynb",
        "data_url": "https://raw.githubusercontent.com/andrewm4894/papermill_dev/master/data/bank-full.csv",
        "output_label": "bank"
    },
    "config_adult": {
        "notebook": "notebooks/data_explorer.ipynb",
        "data_url": "https://raw.githubusercontent.com/andrewm4894/papermill_dev/master/data/adult.csv",
        "output_label": "adult"
    },
    "config_titanic": {
        "notebook": "notebooks/data_explorer.ipynb",
        "data_url": "https://raw.githubusercontent.com/andrewm4894/papermill_dev/master/data/titanic.csv",
        "output_label": "titanic"
    }
}

The idea is to loop through each config in the configs.json file and execute the specified notebook with the specified configuration. Executed notebooks then go to a predefined output file such as papermill_outputs/data_explorer/adult/data_explorer_adult.ipynb.
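Stripped of the papermill specifics, the loop is roughly the following sketch. The names run_all and show are made up for illustration, and show is only a stand-in for the function that would actually execute a notebook. Unlike a bare start(), this version also join()s the processes so the script waits for every run to finish, which is a design choice and not something the approach requires.

```python
import json
import multiprocessing

def run_all(configs: dict, worker, run_mode: str = "parallel") -> int:
    """Call worker(name, config) once per entry; return how many were launched."""
    processes = []
    for name, config in configs.items():
        if run_mode == "parallel":
            # One process per config, all started before any is waited on.
            p = multiprocessing.Process(target=worker, args=(name, config))
            p.start()
            processes.append(p)
        else:
            # Any other value: run one after another in this process.
            worker(name, config)
    for p in processes:
        p.join()
    return len(configs)

def show(name, config):
    # Stand-in worker: a real one would execute the notebook here.
    print(f"{name}: {config['notebook']} -> {config['output_label']}")

if __name__ == "__main__":
    configs = json.loads(
        '{"config_adult": {"notebook": "notebooks/data_explorer.ipynb", "output_label": "adult"}}'
    )
    run_all(configs, show, run_mode="sequential")
```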

In this case i’ve chosen the naming convention of papermill_outputs/<notebook_name>/<output_label>/<notebook_name>_<output_label>.ipynb but obviously you can choose whatever you want.
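Written down as a small function, the convention looks something like this (build_output_path and its root argument are hypothetical names, just to keep the rule in one place):

```python
import os

def build_output_path(notebook: str, output_label: str, root: str = "papermill_outputs") -> str:
    """Hypothetical helper mirroring the naming convention described above."""
    # "notebooks/data_explorer.ipynb" -> "data_explorer"
    notebook_name = os.path.splitext(os.path.basename(notebook))[0]
    return f"{root}/{notebook_name}/{output_label}/{notebook_name}_{output_label}.ipynb"

print(build_output_path("notebooks/data_explorer.ipynb", "adult"))
# papermill_outputs/data_explorer/adult/data_explorer_adult.ipynb
```

Keeping the rule in a single function means changing the convention later is a one-line edit.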

That’s pretty much it for this one. Feel free to clone the repo and play around with it or add improvements as you like.

I’ve been finding that this sort of approach to template-ing up the core notebooks you end up using quite a lot (albeit with slightly different params etc.), along with a standardized approach using something like mlflow to further instrument and store artifacts of your notebook runs, can make running multiple ‘experiments’ on your data in parallel much easier and overall help make you a bit more productive.

Update: I decided to make a quick video as it’s sometimes easier to just see what we are doing. (Sorry, the audio quality is a bit bad (and the resolution is poor), first time :))