Github Webhook -> Cloud Function -> BigQuery

I have recently needed to watch and track various activities on specific GitHub repos I'm working on. However, the REST API from GitHub can sometimes be a bit limited (for example, as best I could see, if you want the most recent list of people who began watching your repo you need to make a lot of paginated API calls and do battle with rate limiting 💩).

This is where GitHub Webhooks can be a very useful alternative: they let you send certain events of interest to an endpoint where you can then handle the data however you need. The use case I was interested in was triggering an event any time someone starred, unstarred, watched or forked a specific repository. I wanted to then store that info in a table in Google BigQuery where it could be used to track repository activity over time, for whatever reasons you might want (outreach to the community around the repository, or just tracking growth over time).

After the usual few hours of googling around I landed on the idea of having the GitHub webhook send events to a Google Cloud Function, from where my cloud function can process and append the data onto a BigQuery table. To make developing and maintaining the cloud function easy I used Serverless, building on this example in particular.

p.s. I also found this repository very useful, as well as this one from Bloomberg. I also think you could maybe get something similar done without any code using something like Zapier (although I don't think they have all the GitHub webhook events available).

p.p.s. All the code is in this repo.

Step 1 – Serverless

We start by leveraging this Serverless example to create the bare-bones structure for our cloud function.

In a folder where we want the code to live, we run the commands below to install Serverless (if needed) and pull down the google-python-simple-http-endpoint template, saving it into a new Serverless project called handle-github-events.

npm install serverless -g
serverless install -u https://github.com/serverless/examples/tree/master/google-python-simple-http-endpoint -n handle-github-events

The approach I am taking also depends on using a .env file to handle secrets and environment variables, so we also need to install the serverless-dotenv-plugin and run npm install for everything else we need.

cd handle-github-events
npm i -D serverless-dotenv-plugin
npm install

Step 2 – Cloud Function

Once we have the bare-bones serverless template in place, we can build on it to create the function we want for handling incoming requests from the GitHub webhook. All the code is in this repository and I'll walk through the main points below.

The core of what we want to do in our Cloud Function is in main.py. What it tries to do is:

  1. Validate that the request is coming from a known GitHub IP address.
  2. Validate that the hashed secret key stored in GitHub when you create your webhook matches what the cloud function expects, as pulled from the GITHUB_WEBHOOK_SECRET environment variable.
  3. Parse the JSON received from the GitHub request and append it to a table in BigQuery.
  4. Return some info about the event as the response to GitHub.
"""
GCP HTTP Cloud Function to handle github webhook events.
Some code stolen from here: https://github.com/carlos-jenkins/python-github-webhooks/blob/master/webhooks.py
"""
# -*- coding: utf-8 -*-
import hmac
import json
import datetime
import logging
import os
from ipaddress import ip_address, ip_network
import pandas as pd
import requests
def validate_request_ip(request):
"""Function to validate that request comes from a known github ip"""
# get ip of request
request_ip_address = ip_address(u'{}'.format(request.access_route[0]))
# get whitelist of valid ip's from github
github_ip_whitelist = requests.get('https://api.github.com/meta').json()['hooks']
# check if ip is a valid one from github
for valid_ip in github_ip_whitelist:
if request_ip_address in ip_network(valid_ip):
break
else:
error_msg = 'IP {} not allowed.'.format(request_ip_address)
logging.error(error_msg)
raise ValueError(error_msg)
def validate_request_signature(request):
"""Validate that request signature and function signature match"""
# get signature from header
sha_name, request_signature = request.headers.get('X-Hub-Signature').split('=')
# create matching signature
function_signature = hmac.new(
str.encode(os.environ.get('GITHUB_WEBHOOK_SECRET', 'Specified environment variable is not set.')),
msg=request.data,
digestmod='sha1').hexdigest()
# check if signatures match
if str(request_signature) != str(function_signature):
error_msg = 'Signatures do not match.'
logging.error(error_msg)
raise ValueError(error_msg)
def validate_event_type(event_type):
"""Function to error out if event type is of a type not yet implemented for handling by this function"""
if event_type not in ['star', 'watch', 'fork']:
error_msg = f"Event Type '{event_type}' not yet implemented by this function."
logging.error(error_msg)
raise NotImplementedError()
def github_event(request):
"""Function to handle incoming event from github webhook and save event data to BigQuery."""
# validate request ip
validate_request_ip(request)
# validate request signature
validate_request_signature(request)
# request_timestamp
request_timestamp = str(datetime.datetime.now())
# github_request_type
github_event_type = request.headers.get('X-GitHub-Event')
validate_event_type(github_event_type)
# get relevant env vars
gcp_project_id = os.environ.get('GCP_PROJECT_NAME')
bq_dataset_name = os.environ.get('BQ_DATASET_NAME')
bq_table_name = os.environ.get('BQ_TABLE_NAME')
bq_if_exists = os.environ.get('BQ_IF_EXISTS')
bq_table_suffix = request_timestamp.replace('-', '')[0:8]
# get json from request
request_json = request.get_json()
# create response body
response_body = {
"request_method": str(request.method),
"timestamp": request_timestamp,
"event_type": github_event_type,
"action": request_json.get("action", github_event_type),
"starred_at": request_json.get("starred_at", ""),
"repository_full_name": request_json.get("repository")["full_name"],
"sender_username": request_json.get("sender")["login"]
}
# build response
response = {
"statusCode": 200,
"body": response_body
}
# logging response
logging.info(response)
# make pandas df
data = [response_body['timestamp'], response_body['repository_full_name'], response_body['event_type'],
response_body['action'], response_body['sender_username']]
columns = ['timestamp', 'repo', 'event_type', 'action', 'username']
df = pd.DataFrame(data=[data], columns=columns)
# display df.head() in logs
logging.info(df.head())
# save to big query
df.to_gbq(
destination_table=f'{bq_dataset_name}.{bq_table_name}_{bq_table_suffix}',
project_id=gcp_project_id, if_exists=bq_if_exists
)
return json.dumps(response, indent=4)
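One thing to note: for the function to build on GCP, its Python dependencies need to be declared in a requirements.txt alongside main.py. I won't pin exact versions here, but at a minimum it needs something like:

pandas
pandas-gbq
requests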

Our serverless.yml file looks like the below. Note that it pulls the environment variables required for serverless to deploy from a .env file you need to create yourself (here is an example in the repo).

service: handle-github-events
frameworkVersion: ">=1.2.0 <2.0.0"

package:
  exclude:
    - node_modules/**
    - .gitignore
    - .git/**

plugins:
  - serverless-google-cloudfunctions
  - serverless-dotenv-plugin

provider:
  name: google
  runtime: python37
  project: ${env:GCP_PROJECT_NAME}
  region: ${env:GCP_REGION_NAME}
  credentials: ${env:GCP_KEY_FILE}

functions:
  github_event:
    handler: github_event
    events:
      - http: path
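And for reference, a sketch of what that .env file might contain. The variable names come from the code and deploy output in this post, but every value below is a placeholder you would swap for your own:

GITHUB_WEBHOOK_SECRET=my-webhook-secret
GCP_KEY_FILE=.gcp-keys/my-service-account-key.json
GCP_PROJECT_NAME=my-gcp-project
GCP_REGION_NAME=us-central1
BQ_DATASET_NAME=github_events
BQ_TABLE_NAME=github_events
BQ_IF_EXISTS=append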

Step 3 – Deploy

Once we are ready, we run `serverless deploy` and, if all goes well, see output like the below:

>serverless deploy -v
Serverless: DOTENV: Loading environment variables from .env:
Serverless:      - GITHUB_WEBHOOK_SECRET
Serverless:      - GCP_KEY_FILE
Serverless:      - GCP_PROJECT_NAME
Serverless:      - GCP_REGION_NAME
Serverless:      - BQ_DATASET_NAME
Serverless:      - BQ_TABLE_NAME
Serverless:      - BQ_IF_EXISTS
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Compiling function "github_event"...
Serverless: Uploading artifacts...
Serverless: Artifacts successfully uploaded...
Serverless: Updating deployment...
Serverless: Checking deployment update progress...
....................
Serverless: Done...
Service Information
service: handle-github-events
project: <your project name will be here>
stage: dev
region: <your region will be here>

Deployed functions
github_event
  https://<your-region>-<your-project-name>.cloudfunctions.net/github_event

Serverless: Removing old artifacts...

Now you should have a cloud function alive at a URL like https://your-region-your-project-name.cloudfunctions.net/github_event.
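As an aside, if you want to sanity-check the signature logic before pointing GitHub at the function, you can reproduce what GitHub does to build the X-Hub-Signature header with a few lines of Python. The secret and payload below are made up for illustration (and note the IP validation in the function would still reject anything not coming from GitHub's published ranges):

import hmac
import json

# hypothetical values - the secret must match your GITHUB_WEBHOOK_SECRET
secret = b'my-webhook-secret'
payload = json.dumps({'action': 'started'}).encode()

# github sends 'sha1=<hexdigest>' in the X-Hub-Signature header
signature = 'sha1=' + hmac.new(secret, msg=payload, digestmod='sha1').hexdigest()
print(signature)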

Step 4 – Github Webhook

Once your function is deployed (or, in reality, you might make the GitHub webhook first and then iterate on the function to get it doing what you want) you can create and test the GitHub webhook you want to send events from.

In my case, and for this post, I'm going to add the webhook to my andrewm4894/random repository for illustration. The Payload URL is the URL of the cloud function we created, and Secret should be the same string you are storing in your .env file as "GITHUB_WEBHOOK_SECRET".

Check whatever events you want to trigger on – in my case it was star, watch and fork events. (Note: the function might not work if you were to send all events or different events – you would just need to adapt it accordingly.)
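If you prefer, the webhook can also be created programmatically via the GitHub REST API instead of clicking through the UI. A sketch below, where the token, secret and URLs are all placeholders:

import requests

# placeholders - use your own token (with repo scope), repo and urls
token = 'ghp_your_personal_access_token'
repo = 'andrewm4894/random'

payload = {
    'name': 'web',  # repository webhooks must be named 'web'
    'active': True,
    'events': ['star', 'watch', 'fork'],
    'config': {
        'url': 'https://your-region-your-project-name.cloudfunctions.net/github_event',
        'content_type': 'json',
        'secret': 'my-webhook-secret',  # same string as GITHUB_WEBHOOK_SECRET
    },
}

response = requests.post(
    'https://api.github.com/repos/{}/hooks'.format(repo),
    json=payload,
    headers={'Accept': 'application/vnd.github.v3+json',
             'Authorization': 'token {}'.format(token)},
)
print(response.status_code, response.json())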

Fingers Crossed

Now we can try to see if it works by triggering some events. In this example I logged on as a second username I have and pressed some star, watch, and fork buttons to see what happened.
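If you'd rather not click buttons, starring and unstarring can also be done via the GitHub REST API, which should fire 'star' events at the webhook. Again, the token below is a placeholder:

import requests

token = 'ghp_your_personal_access_token'  # placeholder
repo = 'andrewm4894/random'
headers = {'Accept': 'application/vnd.github.v3+json',
           'Authorization': 'token {}'.format(token)}

# star the repo (a 'star' event with action 'created') ...
requests.put('https://api.github.com/user/starred/{}'.format(repo), headers=headers)
# ... and unstar it again (a 'star' event with action 'deleted')
requests.delete('https://api.github.com/user/starred/{}'.format(repo), headers=headers)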

You can see recent triggers of the webhook in GitHub, which can be very useful for debugging things while developing.

An example request sent to the cloud function.

And you can also see the response received from the cloud function. In this case it shows that "andrewm4894netdata" (my other user) deleted a star from the "andrewm4894/random" repository 😔.

Example response back from our cloud function.

And then finally we can see the stored events in our table in BigQuery:

We have the data!!
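If you want to pull the events back out (for example to plot growth over time), pandas-gbq works in the other direction too. A quick sketch, where the project, dataset and table names are the placeholders from the .env example above:

import pandas as pd

# placeholders - match these to GCP_PROJECT_NAME, BQ_DATASET_NAME and BQ_TABLE_NAME
query = '''
SELECT timestamp, repo, event_type, action, username
FROM `my-gcp-project.github_events.github_events_*`
ORDER BY timestamp DESC
LIMIT 10
'''
df = pd.read_gbq(query, project_id='my-gcp-project', dialect='standard')
print(df)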

And that's it! We have our GitHub webhook sending events to our Google Cloud Function, which is in turn appending them onto a daily table in BigQuery. Go webhooks!

2 thoughts on "Github Webhook -> Cloud Function -> BigQuery"

  1. Hi Andrew! Thank you for doing this blog, it's been really useful. I'm trying to do something similar but from Microsoft Teams to Cloud Function to BigQuery and I have some questions.
    1. What's the importance of using Serverless?
    2. In this line:
    sha_name, request_signature = request.headers.get('X-Hub-Signature').split('=')
    How did you know that the signature is in the headers as 'X-Hub-Signature'? Because in Teams they require the same HMAC token, but I don't know how I can obtain that (the name in the headers) from the request to match it with my cloud function. (I have the token, of course.)

    Thank you so much again!

    Karen.

    1. Oops, sorry for the big delay in responding!

      1. Serverless is really useful if you are sort of a one-person team and don't have the time to manage infra as well as making sure the pipelines work as expected. If you just have a GCP function that handles the events, then all you need to worry about is the business logic and that the events get into BQ, and you don't need to think much about any of the infra that this all runs on.
      2. Not sure – it might be a bit different now, or in Teams. I have seen HMAC tokens for GCP buckets recently, so maybe there is something useful in here: https://cloud.google.com/storage/docs/authentication/managing-hmackeys. You could get the token from Teams, store it in GCP secrets and then make it available to the function like this: https://cloud.google.com/functions/docs/configuring/secrets
