Arrrgghh – I just wasted the best part of an afternoon chasing this one down. If i can knock out a quick post on it then at least i’ll feel i’ve gotten something out of it.
Here’s the story – somewhere in an admittedly crazy ETL type pipeline i was using pandas pct_change() as a data transformation prior to some downstream modelling. Problem was i was also using dropna() later to ensure only valid data going into the model. I noticed a lot of rows falling out of the data pipeline (print(df.shape) is your friend!).
After a lot of debugging, chasing this dead end down a rabbit hole, cleaning and updating my conda environment (and then dealing with the tornado issues that always seems to cause) i realized the answer was staring me in the face the whole time.
Its the 0’s!
Turns out my data has lots of zero’s and i’m an idiot who assumed (0-0) / 0 = 0 and so the NaN’s above should be 0.
But am i really an idiot? I checked…
I guess so.
So let that be a lesson – watch out for 0’s when using pct_change() – don’t be like andrewm4894.
p.s. I could not just fillna() them as i had valid NaN’s at the top of my dataframe from the lagged values. So i ended up in a state where some NaN’s were valid and some i wanted to be 0 instead. Am guessing i’ll have to write a custom pct_change() type function (which seems a bit crazy). Would be great if was a way to tell pandas pct_change() i want (0-0)/0=0. Maybe if i get brave enough i’ll make a pull request 🙂
I have become master of the notebooks, they bend at my will and exist to serve my data science needs!
Ok i might be getting a bit carried away, but i recently discovered papermill and have been finding it very useful in conjunction with Python multiprocessing to speed up a lot of data science experimental type work. So useful in fact, i was motivated to write a post on a Saturday night!
I’m generally (have swayed back and forth) a fan of notebooks but am wary of some of the downsides or costs they can impose. When doing experimental type work, if your not careful, you can end up with lots of duplicated code or what i think of as “notebook instances”, where you have ran your notebook many times on different (but similar) datasets and with different (but similar) parameters.
Aside: Great talk and deck from @joelgrus (who is great – and who’s meme game is very strong) on some drawbacks of notebooks.
Having the executed notebooks themselves become self documenting artifacts relating to the experiment is really useful – the code you ran and its outputs in one place. But when you start building new features on top of these “notebook instances” as you iterate on the research, things can quickly get messy.
Where I’ve found papermill to be very useful is in basically template-ing up your notebooks in one single place and paramaterizing them such that the actual living notebook code and the executed “notebook instances” have a much cleaner separation.
Lets suppose you have a notebook that you often use on new datasets (in reality it’s more likely to be some more complicated ml pipeline type notebook for quickly experimenting on updated datasets with while maintaining some common structure in how you go about things).
In this example its a simple notebook to download a dataset and just do some descriptive stats and plotting.
The main idea here is to paramaterize the whole notebook as much as possible. This is done with a json dictionary called “config”. So the idea is that everything the notebook needs is pretty much defined in the first cell.
In this case, the data_explorer notebook just takes in one parameter called “data_url”. It then downloads from this url into a pandas dataframe and does some basic plotting. In reality this “config” dict can contain all the input parameters you need to define and execute you notebook. For example it could be defining the type of models to build against your data, what data to use, model parameters, where to store outputs etc. anything and everything really.
So lets say you now have a number of different datasets that you want to run through your data_explorer notebook. You could manually update the config and then just rerun the notebook 3 times (making sure to restart the kernel and clear all each time), maybe saving outputs into specific locations. Or worse you could make 3 copies of your notebook and just run them each individually (don’t do this, future you will hate it).
Much better is to let papermill kick off the execution of the notebooks so you have a clear separation between the notebooks your code lives in (in this case, the notebooks folder of the repo) and the outputs or “notebook instances” of running the same notebooks multiple times against different data or the same data but with slightly different parameters (in this case the papermill_outputs folder according to a convention you can control).
Two things let us do this, a python script (run_nb_batch.py) that uses papermill and multiprocessing to kick of parallel notebook executions as defined in a json file defining the notebooks to be run and their configs to be run with configs.json.
In this case i’ve chosen the naming convention of /papermill_outputs/<notebook_name>/<output_label>/<notebook_name>_<output_label> .ipynb but obviously you can chose whatever you want.
That’s pretty much it for this one. Feel free to clone the repo and play around with it or add improvements as you like.
I’ve been finding that this sort of approach to template-ing up core notebooks you end up using quite a lot (albeit with slightly different params etc.) along with a standardized approach using something like mlflow to further instrument and store artifacts of your notebook runs can make running multiple ‘experiments’ on your data in parallel much easier and overall help make you a bit more productive.
Update: I decided to make a quick video as sometimes easier to just see what we are doing. (Sorry audio quality a bit bad (and poor resolution), first time :))
I’ve recently been learning java for machine learning related work (a long story to do with mainframes, and you know, why not).
Decided to stick down in here some resources i found very useful as i found i still needed to do a bit of Googling to find ML related Java tutorials and moocs i found useful. Will add to this post as i discover more stuff.
This was one of the hardest parts – finding my way around IDE’s like eclipse and intelij. They are super powerful but a very different experience to something like jupyter lab or visual studio with python.
I’m using intelij IDEA and they have some good videos here.