Custom Python Packages in AWS Lambda

It’s True.

I’m pretty sure i’ll be looking this up again at some stage so that passed one of my main thresholds for a blog post.

I’ve recently been porting some data and model development pipelines over to AWS Lambda and was mildly horrified to see how clunky the whole process for adding custom python packages to your Lambda was (see docs here).

This was probably the best post i found but it still did not quite cover custom python packages you might need to include beyond just the more typical pypi ones like numpy, pandas, etc. (p.s. this video was really useful if you are working in Cloud9).

So i set out to hack together a process that would automate 90% of the work in packaging up any python packages you might want to make available to your AWS Lambda including local custom python packages you might have built yourself.

The result involves a Docker container to build your packages in (i have to use this as using windows based python package local install does not work in Lambda as the install contains some windows stuff Lambda won’t like), and a jupyter notebook (of course there is some jupyter 🙂 ) to take some inputs (what packages you want, what to call the AWS Layer, etc), build local installs of the packages, add them to a zip file, load zip file to S3 and then finally use awscli to make a new layer from said S3 zip file.

Dockerfile

The first place to start is with the below Dockerfile that creates a basic conda ready docker container with jupyter installed. Note it also includes conda-build and copies over the packages/ folder into the container (required as i wanted to install my “my_utils” package and have it available to the jupyter notebook).

Build this with:

$ docker build -t my-aws-python-packages -f ./Dockerfile ./

And then run it with:

$ docker run -it --name my-aws-python-packages 
    -p 8888:8888
    --mount type=bind,source="$(pwd)/work",target=/home/jovyan/work
    --mount type=bind,source="$(pwd)/packages",target=/home/jovyan/packages 
    -e AWS_ACCESS_KEY_ID=$(aws --profile default configure get aws_access_key_id)
    -e AWS_SECRET_ACCESS_KEY=$(aws --profile default configure get aws_secret_access_key)
    my-aws-python-packages

The above runs the container, port forwards 8888 (for jupyter), mounts both the /packages and /work folders (as for these files we want changes from outside docker or inside to be reflected and vice versa), and passes in my AWS credentials as environment variables to the container (needed for the asw cli commands we will run inside the container). Its last step is to then launch jupyter lab which you then should be able to get to at http://localhost:8888/lab using the token provided by jupyter.

Notebook time – make_layer.ipynb

Once the docker container is running and you are in jupyter the make_layer notebook automates the local installation of a list of python packages, zipping them to /work/python.zip folder as expected by AWS Layers (when unzipped your root folder needs to be /python/…), loading it to an S3 location, and then using awscli to add a new layer or version (if the layer already exists).

The notebook itself is not that big so i’ve included it below.

For this example i’ve included two custom packages along with pandas into my AWS Layer. The custom packages are just two little basic hello_world() type packages (one actually creates the subprocess_execute() function used in the make_layer notebook). I’ve included pandas then as well to illustrate how to include a pypi package.

Serverless Deploy!

To round off the example we then also need to create a little AWS Lambda function to validate that the packages installed in our layer can actually be used by Lambda.

To that end, i’ve adapted the serverless example cron lamdba from here into my own little lambda using both my custom packages and pandas.

Here is the handler.py that uses my packages:

And the serverless.yml used to configure and deploy the lambda:

We then deploy this function (from here) with:

$ serverless deploy

And we can then go into the AWS console to the Lamdba function we just created. We can test it in the UI and see the expected output whereby our custom functions work as expected as does Pandas:

Success!

That’s it for this one, i’m hoping someone might find this useful as i was really surprised by how painful it was to get a simple custom package or even pypi packages for that matter available to your AWS Lambda functions.

If you wanted you could convert the ipynb notebook into a python script and automate the whole thing. Although i’m pretty sure Amazon will continue to make the whole experience a bit more seamless and easier over time.

Parallelize a wide df in Pandas

I was going to make a pretty picture.

Sometimes you end up with a very wide pandas dataframe and you are interested in doing the same types of operations (data processing, building a model etc.) but focused on subsets of the columns.

For example if we had a wide df with different time series kpi’s represented as columns then we might want to do something like look at each kpi at a time, apply some pre-processing and build something like an ARIMA time series model perhaps.

This is the situation i found myself in recently and it took me best part of an afternoon to figure out. Usually when i find myself in that situation i try and squeeze out a blog post in case might be useful for someone else or future me.

All the code in one glorious screenshot!

Note: repository with all code is here. p.s. thanks to this and this post that i built off of.

For this example i’m afraid i’m going to use the Iris dataset :0 . This example is as minimal and easy as i could throw together, basically the aim of the code is to:

  1. Build some function to take in a df, do some processing and spit out a new df.
  2. Have that function be parameterized in some way as might be needed (e.g if you wanted to do slightly different work for one subset of columns).
  3. Apply that function in parallel across the different subsets of your df that you want to process.

There are two main functions of interest here parallelize_dataframe() and do_work() both of which live in their own file called my_functions.py which can be imported into your jupyter notebook.

parallelize_dataframe() does the below things:

  1. Break out df into a list of df’s based on the col_subsets list passed in as a parameter.
  2. Wrap the function that was passed in into a partial along with the kwargs (this is how your parameters make it into the do_work() function).
  3. Use map() from multiprocessing to apply the func (along with the args you want to send it) to each subset of columns from your df in parallel.
  4. Reduce all this back into on final df by joining all the resulting df’s from the map() output into one wide df again (note the assumption here of joining back on the df indexes – they need to be stable and meaningful).

The do_work() function in this example is just a simple function to add some new columns as examples of types of pandas (or any other) goodness you might want to do. In reality in my case it would be more like a apply_model() type function that would take each subset of columns, do some feature extraction, train a model and then also score the data as needed to.

Having the ability to do this for multiple subsets of columns in your wide df can really free up your time to focus on the more important things like dickying around with model parameters and different pre-processing steps 🙂

That’s pretty much it, a productive afternoon (in the play center with kids i might add) and am quite pleased with myself.

Update: One addition i made to this as things got more complicated when i went to implement it was the ability to apply different function params to each subset df. For example if you wanted to pass in different parameters to the function for different columns. In the do_parallel_zip.ipynb and corresponding my_functions_zip.py (i’m calling them “_zip” as they use zip() to “zip” up both the df_list and the corresponding kwargs to go with it to be unpacked later by do_work_zip()).

To be concrete, if we wanted to multiply the “sepal_…” cols by 100 and the “petal_..” cols by 0.5. We could use the “zip” approach like below (notebook here):

Which is using the “zip” approach in parallelize_dataframe_zip()

Where the zipped iterable is then unpacked as needed by the do_work_zip() function:

( 0 – 0 ) / 0 != 0

Arrrgghh – I just wasted the best part of an afternoon chasing this one down. If i can knock out a quick post on it then at least i’ll feel i’ve gotten something out of it.

Here’s the story – somewhere in an admittedly crazy ETL type pipeline i was using pandas pct_change() as a data transformation prior to some downstream modelling. Problem was i was also using dropna() later to ensure only valid data going into the model. I noticed a lot of rows falling out of the data pipeline (print(df.shape) is your friend!).

After a lot of debugging, chasing this dead end down a rabbit hole, cleaning and updating my conda environment (and then dealing with the tornado issues that always seems to cause) i realized the answer was staring me in the face the whole time.

Its the 0’s!

Turns out my data has lots of zero’s and i’m an idiot who assumed (0-0) / 0 = 0 and so the NaN’s above should be 0.

But am i really an idiot? I checked…

I guess so.

So let that be a lesson – watch out for 0’s when using pct_change() – don’t be like andrewm4894.

p.s. I could not just fillna() them as i had valid NaN’s at the top of my dataframe from the lagged values. So i ended up in a state where some NaN’s were valid and some i wanted to be 0 instead. Am guessing i’ll have to write a custom pct_change() type function (which seems a bit crazy). Would be great if was a way to tell pandas pct_change() i want (0-0)/0=0. Maybe if i get brave enough i’ll make a pull request 🙂

p.p.s. Seems like i’m not the only one: https://brilliant.org/wiki/what-is-0-0/ (though still wrong)

hmmm – maybe i’ve just wasted more time on this post. Think i need to pack it in for today.

Update

Everything is always clearer in the morning, here is a one liner that pretty much behaves as i wanted. I can get back to my life now.

Java Weka API: Adding List To Instances Object

This is just a quick one to save anyone else new to the Weka api in Java spending as much time as i did figuring this one out.

Lets suppose you have a weka instances object and a new list of values you want to add into it as a new attribute (weka slang for a new column in your data).

Below gist shows a small reproducible example of this.

And you should see some output like this:

====== BEFORE ======
sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5,3.6,1.4,0.2,Iris-setosa

====== AFTER ======
sepallength,sepalwidth,petallength,petalwidth,class,randomDouble
5.1,3.5,1.4,0.2,Iris-setosa,0.173003
4.9,3,1.4,0.2,Iris-setosa,0.304703
4.7,3.2,1.3,0.2,Iris-setosa,0.925626
4.6,3.1,1.5,0.2,Iris-setosa,0.733839
5,3.6,1.4,0.2,Iris-setosa,0.710073

Above we can see the new attribute “randomDouble” has been added to the instances object.

I’m learning the weka java api at the moment so will try post little tidbits like this that could be useful for others of even myself 6 months from now 🙂

Parallel Jupyter Notebooks

I have become master of the notebooks, they bend at my will and exist to serve my data science needs!

Ok i might be getting a bit carried away, but i recently discovered papermill and have been finding it very useful in conjunction with Python multiprocessing to speed up a lot of data science experimental type work. So useful in fact, i was motivated to write a post on a Saturday night!

Note: All the code for this post is here.

One problem with notebooks

I’m generally (have swayed back and forth) a fan of notebooks but am wary of some of the downsides or costs they can impose. When doing experimental type work, if your not careful, you can end up with lots of duplicated code or what i think of as “notebook instances”, where you have ran your notebook many times on different (but similar) datasets and with different (but similar) parameters.

Aside: Great talk and deck from @joelgrus (who is great – and who’s meme game is very strong) on some drawbacks of notebooks.

Having the executed notebooks themselves become self documenting artifacts relating to the experiment is really useful – the code you ran and its outputs in one place. But when you start building new features on top of these “notebook instances” as you iterate on the research, things can quickly get messy.

Where I’ve found papermill to be very useful is in basically template-ing up your notebooks in one single place and paramaterizing them such that the actual living notebook code and the executed “notebook instances” have a much cleaner separation.

I’ll try make this clearer with an example.

data_explorer Notebook

Lets suppose you have a notebook that you often use on new datasets (in reality it’s more likely to be some more complicated ml pipeline type notebook for quickly experimenting on updated datasets with while maintaining some common structure in how you go about things).

In this example its a simple notebook to download a dataset and just do some descriptive stats and plotting.

The main idea here is to paramaterize the whole notebook as much as possible. This is done with a json dictionary called “config”. So the idea is that everything the notebook needs is pretty much defined in the first cell.

config = {        
"data_url" :"https://raw.githubusercontent.com/andrewm4894/papermill_dev/master/data/titanic.csv"
}

In this case, the data_explorer notebook just takes in one parameter called “data_url”. It then downloads from this url into a pandas dataframe and does some basic plotting. In reality this “config” dict can contain all the input parameters you need to define and execute you notebook. For example it could be defining the type of models to build against your data, what data to use, model parameters, where to store outputs etc. anything and everything really.

Enter Papermill

So lets say you now have a number of different datasets that you want to run through your data_explorer notebook. You could manually update the config and then just rerun the notebook 3 times (making sure to restart the kernel and clear all each time), maybe saving outputs into specific locations. Or worse you could make 3 copies of your notebook and just run them each individually (don’t do this, future you will hate it).

Much better is to let papermill kick off the execution of the notebooks so you have a clear separation between the notebooks your code lives in (in this case, the notebooks folder of the repo) and the outputs or “notebook instances” of running the same notebooks multiple times against different data or the same data but with slightly different parameters (in this case the papermill_outputs folder according to a convention you can control).

Two things let us do this, a python script (run_nb_batch.py) that uses papermill and multiprocessing to kick of parallel notebook executions as defined in a json file defining the notebooks to be run and their configs to be run with configs.json.

run_nb_batch.py:

configs.json

The idea is to loop through each config in the configs.json file and execute the specified notebook with the specified configuration. Executed notebooks then go to a predefined output file such as papermill_outputs/data_explorer/adult/data_explorer_adult.ipynb.

In this case i’ve chosen the naming convention of /papermill_outputs/<notebook_name>/<output_label>/<notebook_name>_<output_label> .ipynb but obviously you can chose whatever you want.

That’s pretty much it for this one. Feel free to clone the repo and play around with it or add improvements as you like.

I’ve been finding that this sort of approach to template-ing up core notebooks you end up using quite a lot (albeit with slightly different params etc.) along with a standardized approach using something like mlflow to further instrument and store artifacts of your notebook runs can make running multiple ‘experiments’ on your data in parallel much easier and overall help make you a bit more productive.

Update: I decided to make a quick video as sometimes easier to just see what we are doing. (Sorry audio quality a bit bad (and poor resolution), first time :))

Java for Machine Learning

I’ve recently been learning java for machine learning related work (a long story to do with mainframes, and you know, why not).

Decided to stick down in here some resources i found very useful as i found i still needed to do a bit of Googling to find ML related Java tutorials and moocs i found useful. Will add to this post as i discover more stuff.

General Java Intro Stuff

Machine Learning Related

Testing in Java

Found this video pretty good on junit. And this one from intelij if that’s your IDE.

Java IDE’s

This was one of the hardest parts – finding my way around IDE’s like eclipse and intelij. They are super powerful but a very different experience to something like jupyter lab or visual studio with python.

I’m using intelij IDEA and they have some good videos here.

Maven

Other Useful/Interesting Stuff