All Posts

  • Malloy seems pretty cool…

    July 1, 2023 by

    I discovered Malloy recently in this great talk. It seems like a really interesting idea (a higher-level abstraction or DSL on top of SQL) with some great people behind it (the Looker founder, who seems to really know his stuff). So I decided to try to get going with it in as minimal a way as… Read more

  • Painless Anomaly Detection with Apache Airflow

    May 18, 2023 by

    Data observability is so hot right now…but do you know what’s also hot? Using some tried and tested ingredients like Apache Airflow and PyOD to perform painless anomaly detection on your key business metrics. You don’t need to run off and buy an (expensive!) subscription for the latest hot data observability SaaS offering (there is… Read more

  • Stripe Webhook + GCP Functions Framework (Python)

    December 22, 2022 by

    This took a couple of days of messing around so decided to make a post out of it. Here is a minimal enough example repo using Terraform and GCP Functions Framework to build a GCP Python function that will receive a Stripe webhook event, perform signature verification, and then just print the event. You can… Read more

  • Colab to just run some curl

    October 20, 2022 by

    Here is a little Google Colab notebook to just paste in some curl command and get the response back into a JSON dictionary. This can be handy when working with backend and frontend engineers who might be using a different language than you and just send you some curl commands that you want to explore… Read more

  • Hugging Face Text Classification Quickstart

    August 18, 2022 by

    I have been working a bit lately with some text classification stuff using Hugging Face – it’s great n all but their docs can actually be a bit overwhelming. So here is a minimal text classification example, using Hugging Face and either PyTorch or TensorFlow (you decide). Will try to update and maintain the colab here:… Read more

  • Airflow “Trigger Dags” Python Script

    July 1, 2022 by

    You have some DAG that runs multiple times a day but you need to do a manual backfill of the last 30 days. It’s 2022 and this is still surprisingly painful with Airflow. The “new” REST API helps and means all the building blocks are there but, as I found out today, there can often still… Read more

  • streamlit multi-page app minimal example

    May 27, 2021 by

    Recently i had a need to assess streamlit for some internal DS/ML/Data apps i wanted to build in my job. By “i had a need” i mean i heard it was the new cool thing so i wanted to play with it and feel better about myself. Anyway, as part of that i built a… Read more

  • Some asyncio fun/pain

    March 25, 2021 by

    You have a list of API endpoints you want to pull data from and collect results into some results list or dataframe for further processing. You could just loop over that list and make a load of requests.get() calls, right? But no, you are a sexy ML engineer, this is a great excuse to spend… Read more

  • Anomaly Detection using the Matrix Profile

    February 16, 2021 by

    I like an excuse to play with fancy things, so when i first learned about the Matrix Profile for time series analysis, particularly around anomaly detection, i was intrigued. When i learned there was a nice Python package (STUMPY) i could just pip install, i was outright excited, as one thing i like more than… Read more

  • “Machine Learning Ireland” slack community

    January 29, 2021 by

    I’ve been thinking for a while that there really should be some go-to online community for people interested in ML and Ireland in some way. I was fairly surprised to find that none really existed when i went to find one to join. So, being the genius that i am, i decided to just make… Read more

  • Time series anomaly detection in Go using GoLearn

    January 14, 2021 by

    I’ve posted recently about learning just enough Go to be dangerous over the Christmas break; well, here is an update on my adventures so far. The below script (which is probably horrible in places if you know Go properly – tips welcome) uses goroutines to pull data from some REST API endpoints and then use… Read more

  • Anomaly Detection Resources

    January 3, 2021 by

    A list of useful Anomaly Detection resources; as I find interesting material I will add to this list. – “Awesome” time series anomaly detection list on GitHub. – YouTube playlist of interesting anomaly detection videos I maintain. – Creator of PyOD “yzhao062/anomaly-detection-resources” list of useful stuff. – An interesting r/machinelearning thread. – Public Mendeley group… Read more

  • First stab at some Go (so hot right now)

    December 28, 2020 by

    It may be a combination of starting to go stir crazy over the Christmas break and some self loathing at the amount of FIFA i’ve been playing that’s driven me to finally start learning some Go for a few data science and machine learning related projects i’m working on where it offers unique advantages. (In… Read more

  • Anomaly Detection Tutorial

    December 2, 2020 by

    Here is an anomaly detection tutorial that i created for my boss and the open source community where i work. It’s part of some work i have been doing around adding some anomaly detection functionality into our open source monitoring project. Like most ML projects the trickiest, but maybe most important, part can often be… Read more

  • Numpy Feature Engineering – 2x Speed Up Over Pandas!

    October 15, 2020 by

    The Setup: This is a little one I was surprised to see. Recently I had a need to do some pretty basic feature engineering to a pandas dataframe prior to training some models. Basically I needed to take differences of each column, apply some smoothing, and then add a number of lagged columns for each… Read more

  • Market basket analysis in Python

    September 29, 2020 by

    tl;dr: if you find yourself doing some association rule mining using mlxtend but finding it a bit slow then check out PyFIM – here is a colab I made to get you started. I have recently been looking to do some market basket analysis (“Association rule learning”) on event log type data (to find insights… Read more

  • Time series clustering with tslearn

    September 3, 2020 by

    I’ve recently been playing around with some time series clustering tasks and came across the tslearn library. I was interested in seeing how easy it would be to get up and running with some of the clustering functionality that is already built into tslearn. Turns out it was quite easy and straightforward, perfect blog post… Read more

  • Premature Optimization

    June 5, 2020 by

    I’ve been doing some work that necessitated using the same statistical test from SciPy lots of times on a fairly wide pandas dataframe with lots of columns. I spent a bit too much time googling around for the most efficient ways to do this, and even more time re-writing things various ways before realizing i… Read more

  • Terraform is Magic + r/MachineLearning Links

    April 29, 2020 by

    Terraform is magic. I may be a little late to the game on this one and i’m sure it has its fair share of haters (i’ve seen some have a love-hate relationship with it, maybe i’m still in my honeymoon period). But from my point of view as a Data Scientist/ML Engineer playing around… Read more

  • Ireland Covid19 Data

    March 23, 2020 by

    I was looking around a bit and could not really find any datasets behind the daily updates from the Irish government that get posted here. In particular i was thinking the breakout tables of numbers by different dimensions might be of use for anyone looking to analyse the data. So here is a Python… Read more

  • A little brainteaser (or i’m an idiot)

    February 27, 2020 by

    This took me waaay too long to work out today and i was thinking it could make a nice little interview coding type question (which i’d probably fail). Suppose you have 10,000 rows of data and need to continually train and retrain a model training on at most 1,000 rows at a time and retraining… Read more

  • Papers i’m reading #2

    February 20, 2020 by

    Continuation from this post. An unsupervised spatiotemporal graphical modeling approach to anomaly detection in distributed CPS (Cyber Physical Systems). Link My Summary: Really interesting paper – PGMs, HMMs and all that good stuff. Quite complicated though and no clear route to implementation. Also I would wonder how well it scales beyond 10s of time series.… Read more

  • Github Webhook -> Cloud Function -> BigQuery

    January 28, 2020 by

    I have recently needed to watch and track various activities on specific GitHub repos i’m working on, however the REST API from GitHub can sometimes be a bit limited (for example, best i could see, if you want to get the most recent list of people who began watching your repo you need to make… Read more

  • Papers i’m reading #1

    December 30, 2019 by

    I’ve recently set myself the goal of reading one academic paper a week relating to the ML/AI things i’m working on in my current role. To try to help keep me honest and diligent in this regard, I’ve decided to get into the habit of jotting down some quick notes on each paper and every now… Read more

  • My First PyPI Package

    November 26, 2019 by

    I’ve been threatening to myself to do this for a long time and recently got around to it, so as usual i’m going to try to milk it for a blog post (Note: i’m not talking about getting into a box like the below picture, it’s something much less impressive). Confession – I don’t know matplotlib… Read more

  • Multi-Variate, Multi-Step, LSTM for Anomaly Detection

    September 9, 2019 by

    This post will walk through a synthetic example illustrating one way to use a multi-variate, multi-step LSTM for anomaly detection. Imagine you have a matrix of k time series data coming at you at regular intervals and you look at the last n observations for each metric. One approach to doing anomaly detection in such a setting is to… Read more

  • Custom Python Packages in AWS Lambda

    September 4, 2019 by

    I’m pretty sure i’ll be looking this up again at some stage so that passed one of my main thresholds for a blog post. I’ve recently been porting some data and model development pipelines over to AWS Lambda and was mildly horrified to see how clunky the whole process for adding custom Python packages to… Read more

  • Clustering Cell Tower Usage Data

    July 20, 2019 by

    Below is a little mini project i did a while back looking at cell tower usage data. Main takeaway really was a nice example of how subjective clustering can be, especially the more features and variables you feed into your distance metric.

  • Parallelize a wide df in Pandas

    July 4, 2019 by

    Sometimes you end up with a very wide pandas dataframe and you are interested in doing the same types of operations (data processing, building a model etc.) but focused on subsets of the columns. For example, if we had a wide df with different time series KPIs represented as columns then we might want to… Read more

  • ( 0 – 0 ) / 0 != 0

    June 11, 2019 by

    Arrrgghh – I just wasted the best part of an afternoon chasing this one down. If i can knock out a quick post on it then at least i’ll feel i’ve gotten something out of it. Here’s the story – somewhere in an admittedly crazy ETL type pipeline i was using pandas pct_change() as a… Read more

  • Java Hello World cli using args4j

    May 1, 2019 by

    A simple example of how to use args4j to add command line args to a simple “Hello World” type Java application. For example, the command line execution below would print “Hello arg4j!” instead of the default “Hello World!” that you get if you don’t pass any args. $ java -jar helloWorldParamaterized --msg='Hello arg4j!'

  • Java Weka API: Adding List To Instances Object

    April 30, 2019 by

    This is just a quick one to save anyone else new to the Weka API in Java spending as much time as i did figuring this one out. Let’s suppose you have a Weka Instances object and a new list of values you want to add into it as a new attribute (Weka slang for… Read more

  • Parallel Jupyter Notebooks

    April 27, 2019 by

    I have become master of the notebooks, they bend to my will and exist to serve my data science needs! Ok i might be getting a bit carried away, but i recently discovered papermill and have been finding it very useful in conjunction with Python multiprocessing to speed up a lot of data science experimental… Read more

  • Java for Machine Learning

    April 25, 2019 by

    I’ve recently been learning Java for machine learning related work (a long story to do with mainframes, and you know, why not). Decided to stick down in here some resources i found very useful, as i found i still needed to do a bit of Googling to find ML related Java tutorials and MOOCs i… Read more

  • Previous blog posts

    April 25, 2019 by

    List of blog posts i’ve done before this site: Multi-Variate, Multi-Step, LSTM for Anomaly Detection; A Docker Data Science Recipe; Celebrity Word Vectors; Playing around with Apache Airflow & BigQuery; Content Lifecycle Clustering; Interview with; One potential pitfall with referral source tracking and AMP in Google Analytics

  • My Site…

    April 25, 2019 by

    Was doing some blogs on Medium but don’t trust them to not all end up behind some sort of paywall, so decided to shell out a couple quid a month for my own domain.
