← Back to Home

Terraform is Magic + r/MachineLearning Links

Terraform is magic, i may be a little late to the game on this one and i’m sure it has it’s fair share of haters (i’ve seen some have a love hate relationship with it, maybe i’m still in my honeymoon period).

But from my point of view as a Data Scientist/ML Engineer playing around with various services in a multi-cloud environment (mainly GCP and AWS) its one of the things that keeps me sane in a sea of changing features and web ui’s that the cloud providers love to throw at us.

When it comes to cloud projects i’m working with, if its not under source control and managed by terraform then i don’t want to know about it.

So as I’ve been on my voyage of discovery with terraform i decided to put it to use in a little side project I’ve been wanting to do for a while.

(Note: All code is available in my reddit-links Github repo.)

The Set Up

The idea is to have a cloud function that runs each day, looks at top posts on r/MachineLearning, pulls out all links, and some metadata, and then saves those links somewhere useful.

My original goal (and something i might get to at some stage) was to pull all YouTube links shared on r/MachineLearning that have a decent score into an automatic playlist on YouTube or maybe automatically add them to my own watch later playlist. I have the data now so this might be an easy enough next project.

So given the goal above here are the various moving parts I’ve picked to use:

Terraform

Once you work with and set up a Terraform project once then its pretty straight forward. I have used Serverless for some cloud functions before but i like the way terraform gives you everything GCP or AWS or Azure etc have to offer at your fingertips once you invest that little bit of learning up front.

Here is a list of the .tf files i’m using and what they all do. There are of course many ways to set things up in Terraform but this seemed like a straightforward enough way and works for me for smaller projects like this.

The function used to pull from reddit is below. It might be a little verbose and could probably be refactored a little but hey - it works. Mainly using the PRAW library to pull from reddit, some usual data wrangling libraries, and then the airtable-python-wrapper library to insert/update records in Airtable.

https://gist.github.com/andrewm4894/dbea54c122d8f3dd4e2695887e337567

End Results - Airtable

The end results of all this being a cloud function that runs once a day to update or insert records into this Airtable. Then from that i have made some views for links from specific domains.

(Note: If you wanted to do this but for a different subreddit then you should need to make minimal changes once you have your terraform variables all set up - primarily changing or adding another cloud scheduler job to here with the relevant params, as well as making sure all keys etc you need are available in the right places).

← Back to Home