Github Webhook -> Cloud Function -> BigQuery

I have recently needed to watch and track various activities on specific Github repos i’m working on. However, the rest api from Github can sometimes be a bit limited (for example, as best i could see, if you want the most recent list of people who began watching your repo you need to make a lot of paginated api calls and do battle with rate limiting 💩).

This is where Github Webhooks can be a very useful alternative: they push events of interest to an endpoint of your choosing, where you can then handle the data as you need. The use case i was interested in was triggering an event any time someone starred, unstarred, watched or forked a specific repository. I wanted to then store that info in a table in Google BigQuery where it can be used to track repository activity over time for whatever reasons you might have (outreach to the community around the repository, or just tracking growth over time).

After the usual few hours of googling around i landed upon the idea of having the Github webhook send events to a Google Cloud Function; from there the cloud function can process the data and append it onto a BigQuery table. To make developing and maintaining the cloud function easy i used Serverless and built on this example in particular.

p.s. i also found this repository very useful as well as this one from Bloomberg. Also i think you could maybe get something similar done without any code using something like Zapier (although i don’t think they have all the Github Webhook events available).

p.p.s all the code is in this repo.

Step 1 – Serverless

We start by leveraging this Serverless example to create the bare bones structure for our cloud function.

In a folder where we want the code to live we run the below to install Serverless if needed, and pull down the google-python-simple-http-endpoint template and save it into a new Serverless project called handle-github-events.
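Something along these lines should do it (the exact flags and template url may vary a little with your serverless version):

```shell
# install serverless globally if you don't have it already
npm install -g serverless

# pull down the google-python-simple-http-endpoint template into a new
# serverless project called handle-github-events
serverless create \
  --template-url https://github.com/serverless/examples/tree/master/google-python-simple-http-endpoint \
  --path handle-github-events \
  --name handle-github-events
```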

The approach i am taking also depends on using a .env file to handle secrets and environment variables, so we also need to install the serverless-dotenv-plugin and run npm install for everything else we need.
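From inside the new handle-github-events folder that’s roughly:

```shell
# add the dotenv plugin so serverless can read our .env file
npm install --save-dev serverless-dotenv-plugin

# install everything else the template depends on
npm install
```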

Step 2 – Cloud Function

Once we have the bare bones serverless template in place we can build on it to create the function we want for handling incoming requests from the Github webhook. All the code is in this repository and i’ll walk through the main points below.

The core of what we want to do in our cloud function is in main.py. What it tries to do (sketched just after this list) is:

  1. Validate that the request is coming from a known Github ip address.
  2. Validate that the hashed secret key stored in Github when you create your webhook matches what is expected by the cloud function as pulled from the GITHUB_WEBHOOK_SECRET environment variable.
  3. Parse the json received from the Github request and append it to a table somewhere in BigQuery.
  4. Return as the response to Github some info about the event.
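In outline it looks something like the below. This is a sketch rather than the exact repo code: it assumes a Flask-style request object (as provided by Cloud Functions), the requests library for fetching Github’s published hook ip ranges, and pandas-gbq for the BigQuery append, so details will differ from main.py in the repo.

```python
import hashlib
import hmac
import ipaddress
import json
import os

import pandas as pd
import requests


def github_event(request):
    # 1. check the request comes from one of Github's published webhook ip ranges
    hook_blocks = requests.get("https://api.github.com/meta").json()["hooks"]
    source_ip = ipaddress.ip_address(
        request.headers.get("X-Forwarded-For", request.remote_addr).split(",")[0].strip()
    )
    if not any(source_ip in ipaddress.ip_network(block) for block in hook_blocks):
        return "request not from Github", 403

    # 2. check the X-Hub-Signature header matches an hmac of the payload
    #    using the shared secret from the GITHUB_WEBHOOK_SECRET env var
    secret = os.environ["GITHUB_WEBHOOK_SECRET"].encode()
    expected = "sha1=" + hmac.new(secret, request.get_data(), hashlib.sha1).hexdigest()
    if not hmac.compare_digest(request.headers.get("X-Hub-Signature", ""), expected):
        return "signature mismatch", 403

    # 3. flatten the interesting bits of the event and append them to BigQuery
    payload = request.get_json()
    row = {
        "event": request.headers.get("X-GitHub-Event"),
        "action": payload.get("action"),
        "repo": payload["repository"]["full_name"],
        "sender": payload["sender"]["login"],
        "payload": json.dumps(payload),
    }
    pd.DataFrame([row]).to_gbq(
        f"{os.environ['BQ_DATASET_NAME']}.{os.environ['BQ_TABLE_NAME']}",
        project_id=os.environ["GCP_PROJECT_NAME"],
        if_exists=os.environ.get("BQ_IF_EXISTS", "append"),
    )

    # 4. send some info about the event back to Github as the response
    return json.dumps(row)
```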

Our serverless.yml file looks like below. Note that it is pulling environment variables required for serverless to deploy from a .env file you would need to create yourself (here is an example in the repo).
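(The below is a sketch along the lines of the google-python-simple-http-endpoint template rather than a verbatim copy from the repo, so details like the runtime version and handler name may differ slightly.)

```yaml
service: handle-github-events

provider:
  name: google
  runtime: python37
  project: ${env:GCP_PROJECT_NAME}
  region: ${env:GCP_REGION_NAME}
  credentials: ${env:GCP_KEY_FILE}

plugins:
  - serverless-google-cloudfunctions
  - serverless-dotenv-plugin

functions:
  github_event:
    handler: github_event
    events:
      - http: path
```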

Step 3 – Deploy

Once we are ready we run `serverless deploy -v` and, if all goes well, see output like the below:

>serverless deploy -v
Serverless: DOTENV: Loading environment variables from .env:
Serverless:      - GITHUB_WEBHOOK_SECRET
Serverless:      - GCP_KEY_FILE
Serverless:      - GCP_PROJECT_NAME
Serverless:      - GCP_REGION_NAME
Serverless:      - BQ_DATASET_NAME
Serverless:      - BQ_TABLE_NAME
Serverless:      - BQ_IF_EXISTS
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Compiling function "github_event"...
Serverless: Uploading artifacts...
Serverless: Artifacts successfully uploaded...
Serverless: Updating deployment...
Serverless: Checking deployment update progress...
....................
Serverless: Done...
Service Information
service: handle-github-events
project: <your project name will be here>
stage: dev
region: <your region will be here>

Deployed functions
github_event
  https://<your-region>-<your-project-name>.cloudfunctions.net/github_event

Serverless: Removing old artifacts...

Now you should have a cloud function alive at some url like https://your-region-your-project-name.cloudfunctions.net/github_event.

Step 4 – Github Webhook

Once your function is deployed (or in reality you might make the Github webhook first and then iterate on the function to get it doing what you want) you can create and test the Github webhook you want to send events from.

In my case, and for this post, i’m going to add the webhook to my andrewm4894/random repository for illustration. The Payload URL is the url of the cloud function we created, and the Secret should be the same string you are storing in your .env file as “GITHUB_WEBHOOK_SECRET”.

Check whatever events you want to trigger on – in my case it was star, watch and fork events (note: the function might not work if you were to send all events or different events – you would just need to adapt it accordingly).

Fingers Crossed

Now we can see if it works by triggering some events. In this example i logged in as a second username i have and pressed some star, watch, and fork buttons to see what happened.

You can see recent triggers of the webhook in Github, which can be very useful for debugging while developing.

An example request sent to the cloud function.

And you can also see the response received from the cloud function – in this case showing that “andrewm4894netdata” (my other user) deleted a star from the “andrewm4894/random” repository 😔.

Example response back from our cloud function.

And then finally we can see the stored events in our table in BigQuery:

We have the data!!

And that’s it! We have our Github Webhook sending events to our Google Cloud Function which is in turn appending them onto a daily table in BigQuery. Go Webhooks!

Custom Python Packages in AWS Lambda

It’s True.

I’m pretty sure i’ll be looking this up again at some stage so that passed one of my main thresholds for a blog post.

I’ve recently been porting some data and model development pipelines over to AWS Lambda and was mildly horrified to see how clunky the whole process for adding custom python packages to your Lambda was (see docs here).

This was probably the best post i found but it still did not quite cover custom python packages you might need to include beyond just the more typical pypi ones like numpy, pandas, etc. (p.s. this video was really useful if you are working in Cloud9).

So i set out to hack together a process that would automate 90% of the work in packaging up any python packages you might want to make available to your AWS Lambda, including local custom python packages you might have built yourself.

The result involves a Docker container to build your packages in (i have to use this because a python package installed locally on Windows does not work in Lambda, as the install contains some Windows-specific bits Lambda won’t like), and a jupyter notebook (of course there is some jupyter 🙂 ) to take some inputs (what packages you want, what to call the AWS Layer, etc.), build local installs of the packages, add them to a zip file, load the zip file to S3, and then finally use the awscli to make a new layer from said S3 zip file.

Dockerfile

The first place to start is with the below Dockerfile, which creates a basic conda-ready docker container with jupyter installed. Note it also includes conda-build and copies the packages/ folder into the container (required as i wanted to install my “my_utils” package and have it available to the jupyter notebook).
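The actual Dockerfile lives in the repo; it is along these lines (a sketch: the /home/jovyan paths in the docker run command below suggest a jupyter docker-stacks base image, and folder names like packages/my_utils are my guess at the layout):

```dockerfile
# conda-ready base image with jupyter lab already installed
FROM jupyter/minimal-notebook

# conda-build, plus the aws cli for the notebook's layer-publishing step
RUN conda install --yes conda-build && \
    pip install awscli

# copy the local packages/ folder into the image so custom packages like
# my_utils can be installed and imported from the notebook
COPY --chown=1000:100 packages/ /home/jovyan/packages/
RUN pip install -e /home/jovyan/packages/my_utils
```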

Build this with:

$ docker build -t my-aws-python-packages -f ./Dockerfile ./

And then run it with:

$ docker run -it --name my-aws-python-packages \
    -p 8888:8888 \
    --mount type=bind,source="$(pwd)/work",target=/home/jovyan/work \
    --mount type=bind,source="$(pwd)/packages",target=/home/jovyan/packages \
    -e AWS_ACCESS_KEY_ID=$(aws --profile default configure get aws_access_key_id) \
    -e AWS_SECRET_ACCESS_KEY=$(aws --profile default configure get aws_secret_access_key) \
    my-aws-python-packages

The above runs the container, port forwards 8888 (for jupyter), mounts both the /packages and /work folders (as for these files we want changes made outside docker to be reflected inside, and vice versa), and passes in my AWS credentials as environment variables to the container (needed for the aws cli commands we will run inside the container). The container’s last step is to launch jupyter lab, which you should then be able to get to at http://localhost:8888/lab using the token provided by jupyter.

Notebook time – make_layer.ipynb

Once the docker container is running and you are in jupyter, the make_layer notebook automates the local installation of a list of python packages, zipping them into /work/python.zip in the layout expected by AWS Layers (when unzipped the root folder needs to be /python/…), loading the zip to an S3 location, and then using the awscli to add a new layer (or a new version if the layer already exists).

The notebook itself is not that big so i’ve included it below.
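In outline it does something like the below (a sketch: LAYER_NAME, S3_BUCKET and the packages list are the notebook’s inputs, and the repo’s subprocess_execute() helper is stood in for by subprocess.run()):

```python
import shutil
import subprocess

# notebook inputs (illustrative values)
LAYER_NAME = "my-python-packages"
S3_BUCKET = "my-layer-bucket"
PACKAGES = ["pandas", "../packages/my_utils", "../packages/my_other_utils"]

# 1. pip install everything into a local python/ folder
#    (the folder layout AWS Layers expect once zipped)
subprocess.run(["pip", "install", "--target", "./python", *PACKAGES], check=True)

# 2. zip the python/ folder into python.zip
shutil.make_archive("python", "zip", root_dir=".", base_dir="python")

# 3. upload the zip to S3
subprocess.run(
    ["aws", "s3", "cp", "python.zip", f"s3://{S3_BUCKET}/{LAYER_NAME}/python.zip"],
    check=True,
)

# 4. publish a new layer (or new layer version) from the S3 object
subprocess.run(
    [
        "aws", "lambda", "publish-layer-version",
        "--layer-name", LAYER_NAME,
        "--content", f"S3Bucket={S3_BUCKET},S3Key={LAYER_NAME}/python.zip",
        "--compatible-runtimes", "python3.7",
    ],
    check=True,
)
```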

For this example i’ve included two custom packages along with pandas in my AWS Layer. The custom packages are just two little basic hello_world() type packages (one actually provides the subprocess_execute() function used in the make_layer notebook). I’ve included pandas as well to illustrate how to include a typical pypi package.

Serverless Deploy!

To round off the example we then also need to create a little AWS Lambda function to validate that the packages installed in our layer can actually be used by Lambda.

To that end, i’ve adapted the serverless example cron lambda from here into my own little lambda using both my custom packages and pandas.

Here is the handler.py that uses my packages:
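It is along these lines (a sketch: the custom package names and their hello_world() functions are placeholders for whatever you built, and the logging is just so the output shows up in the Lambda console):

```python
import datetime
import logging

import pandas as pd

# the two custom packages baked into the layer (names are illustrative)
import my_utils
import my_other_utils

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def run(event, context):
    # prove the pypi package from the layer imports and works
    df = pd.DataFrame({"ran_at": [datetime.datetime.now()]})
    logger.info(f"pandas says:\n{df}")

    # prove the custom packages from the layer import and work
    logger.info(my_utils.hello_world())
    logger.info(my_other_utils.hello_world())
```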

And the serverless.yml used to configure and deploy the lambda:
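Something like the below (sketched from the cron example this is adapted from; the layer arn is whatever `aws lambda publish-layer-version` returned, and the placeholders need your own values):

```yaml
service: my-custom-packages-example

provider:
  name: aws
  runtime: python3.7

functions:
  cron:
    handler: handler.run
    events:
      # run on a schedule, as in the cron example
      - schedule: rate(1 hour)
    layers:
      # the layer published by the make_layer notebook
      - arn:aws:lambda:<your-region>:<your-account-id>:layer:my-python-packages:1
```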

We then deploy this function (from here) with:

$ serverless deploy

And we can then go into the AWS console to the Lambda function we just created. We can test it in the UI and see the expected output, whereby our custom functions work as expected, as does pandas:

Success!

That’s it for this one. I’m hoping someone might find this useful, as i was really surprised by how painful it was to get a simple custom package (or even pypi packages for that matter) available to your AWS Lambda functions.

If you wanted, you could convert the ipynb notebook into a python script and automate the whole thing, although i’m pretty sure Amazon will continue to make the whole experience a bit more seamless and easier over time.