All Posts

  • Time series anomaly detection in Go using GoLearn

    January 14, 2021 by

    I’ve posted recently about learning just enough Go to be dangerous over the Christmas break; well, here is an update on my adventures so far. The script below (which is probably horrible in places if you know Go properly – tips welcome) uses goroutines to pull data from some REST API endpoints and then use… Read more

  • Anomaly detection resources

    January 3, 2021 by

    A list of useful anomaly detection resources; as I find interesting material I will add to this list. – “Awesome” time series anomaly detection list on GitHub. – YouTube playlist of interesting anomaly detection videos I maintain. – An interesting r/machinelearning thread. – Public Mendeley group for interesting anomaly detection papers. – KDD 2020 Tutorial… Read more

  • First stab at some Go (so hot right now)

    December 28, 2020 by

    It may be a combination of starting to go stir crazy over the Christmas break and some self-loathing at the amount of FIFA I’ve been playing that’s driven me to finally start learning some Go for a few data science and machine learning related projects I’m working on where it offers unique advantages. (In… Read more

  • Anomaly Detection Tutorial

    December 2, 2020 by

    Here is an anomaly detection tutorial that I created for my boss and the open source community where I work. It’s part of some work I have been doing around adding some anomaly detection functionality into our open source monitoring project. Like most ML projects, the trickiest, but maybe most important, part can often be… Read more

  • Good Places For Datasets

    November 10, 2020 by

  • Numpy Feature Engineering – 2x Speed Up Over Pandas!

    October 15, 2020 by

    The Setup. This is a little one I was surprised to see. Recently I needed to do some pretty basic feature engineering on a pandas dataframe prior to training some models. Basically I needed to take differences of each column, apply some smoothing, and then add a number of lagged columns for each… Read more

  • Market basket analysis in Python

    September 29, 2020 by

    tl;dr: if you find yourself doing some association rule mining using mlxtend but finding it a bit slow then check out PyFIM – here is a colab I made to get you started. I have recently been looking to do some market basket analysis (“Association rule learning”) on event log type data (to find insights… Read more

  • Time series clustering with tslearn

    September 3, 2020 by

    I’ve recently been playing around with some time series clustering tasks and came across the tslearn library. I was interested in seeing how easy it would be to get up and running with some of the clustering functionality that is already built into tslearn. Turns out it was quite easy and straightforward, perfect blog post… Read more

  • Premature Optimization

    June 5, 2020 by

    I’ve been doing some work that necessitated using the same statistical test from scipy lots of times on a fairly wide pandas dataframe with lots of columns. I spent a bit too much time googling around for the most efficient ways to do this, and even more time re-writing things various ways before realizing I… Read more

  • Terraform is Magic + r/MachineLearning Links

    April 29, 2020 by

    Terraform is magic. I may be a little late to the game on this one and I’m sure it has its fair share of haters (I’ve seen some have a love-hate relationship with it; maybe I’m still in my honeymoon period). But from my point of view as a Data Scientist/ML Engineer playing around… Read more

  • Ireland Covid19 Data

    March 23, 2020 by

    I was looking around a bit and could not really find any datasets behind the daily updates from the Irish government that get posted here. In particular, I was thinking the breakout tables of numbers by different dimensions might be of use for anyone looking to analyse the data. So here is a Python… Read more

  • A little brainteaser (or i’m an idiot)

    February 27, 2020 by

    This took me waaay too long to work out today and I was thinking it could make a nice little interview coding type question (which I’d probably fail). Suppose you have 10,000 rows of data and need to continually train and retrain a model, training on at most 1,000 rows at a time and retraining… Read more

  • Papers i’m reading #2

    February 20, 2020 by

    Continuation from this post. An unsupervised spatiotemporal graphical modeling approach to anomaly detection in distributed CPS (Cyber Physical Systems). Link. My Summary: Really interesting paper – PGMs, HMMs and all that good stuff. Quite complicated though, with no clear route to implementation. Also, I wonder how well it scales beyond tens of time series.… Read more

  • Github Webhook -> Cloud Function -> BigQuery

    January 28, 2020 by

    I have recently needed to watch and track various activities on specific GitHub repos I’m working on; however, the REST API from GitHub can sometimes be a bit limited (for example, as best I could see, if you want to get the most recent list of people who began watching your repo you need to make… Read more

  • Papers i’m reading #1

    December 30, 2019 by

    I’ve recently set myself the goal of reading one academic paper a week relating to the ML/AI things I’m working on in my current role. To try to help keep me honest and diligent in this regard, I’ve decided to get into the habit of jotting down some quick notes on each paper and every now… Read more

  • My First PyPI Package

    November 26, 2019 by

    I’ve been threatening to myself to do this for a long time and recently got around to it, so as usual I’m going to try to milk it for a blog post (Note: I’m not talking about getting into a box like the below picture, it’s something much less impressive). Confession – I don’t know matplotlib… Read more

  • Multi-Variate, Multi-Step, LSTM for Anomaly Detection

    September 9, 2019 by

    This post will walk through a synthetic example illustrating one way to use a multi-variate, multi-step LSTM for anomaly detection. Imagine you have a matrix of k time series data coming at you at regular intervals and you look at the last n observations for each metric. One approach to doing anomaly detection in such a setting is to… Read more

  • Custom Python Packages in AWS Lambda

    September 4, 2019 by

    I’m pretty sure I’ll be looking this up again at some stage so that passed one of my main thresholds for a blog post. I’ve recently been porting some data and model development pipelines over to AWS Lambda and was mildly horrified to see how clunky the whole process for adding custom Python packages to… Read more

  • Clustering Cell Tower Usage Data

    July 20, 2019 by

    Below is a little mini project I did a while back looking at cell tower usage data. The main takeaway really was a nice example of how subjective clustering can be, especially the more features and variables you feed into your distance metric.

  • Parallelize a wide df in Pandas

    July 4, 2019 by

    Sometimes you end up with a very wide pandas dataframe and you are interested in doing the same types of operations (data processing, building a model etc.) but focused on subsets of the columns. For example, if we had a wide df with different time series KPIs represented as columns then we might want to… Read more

  • ( 0 – 0 ) / 0 != 0

    June 11, 2019 by

    Arrrgghh – I just wasted the best part of an afternoon chasing this one down. If I can knock out a quick post on it then at least I’ll feel I’ve gotten something out of it. Here’s the story – somewhere in an admittedly crazy ETL type pipeline I was using pandas pct_change() as a… Read more
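
The arithmetic in the title can be sketched in plain Python. This is a minimal illustration, not the code from the post: the `pct_change` function below is a hypothetical hand-rolled helper (not the pandas method), and it assumes the usual IEEE-754 float behaviour, where 0/0 is NaN rather than 0.

```python
import math


def pct_change(values):
    """Hand-rolled percent change using (current - previous) / previous.

    Illustrative stand-in, not the pandas implementation. It mimics IEEE-754
    float behaviour, where 0/0 is NaN and a non-zero value over 0 is +/-inf.
    """
    out = [math.nan]  # the first element has no previous value
    for prev, curr in zip(values, values[1:]):
        diff = curr - prev
        if prev != 0:
            out.append(diff / prev)
        elif diff == 0:
            out.append(math.nan)  # (0 - 0) / 0 is undefined, not 0
        else:
            out.append(math.copysign(math.inf, diff))  # non-zero / 0
    return out


changes = pct_change([0.0, 0.0, 5.0, 10.0])
print(changes)  # [nan, nan, inf, 1.0]
```

A flat run of zeros therefore shows up as NaN rather than "0% change", so any downstream step that expects 0 there would need to handle those values explicitly.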

  • Java Hello World cli using args4j

    May 1, 2019 by

    A simple example of how to use args4j to add command line args to a simple “Hello World” type Java application. For example, the command line execution below would print “Hello arg4j!” instead of the default “Hello World!” that you get if you don’t pass any args. $ java -jar helloWorldParamaterized --msg='Hello arg4j!'

  • Parallel Jupyter Notebooks

    April 27, 2019 by

    I have become master of the notebooks, they bend to my will and exist to serve my data science needs! OK, I might be getting a bit carried away, but I recently discovered papermill and have been finding it very useful in conjunction with Python multiprocessing to speed up a lot of data science experimental… Read more

  • Java for Machine Learning

    April 25, 2019 by

    I’ve recently been learning Java for machine learning related work (a long story to do with mainframes, and you know, why not). Decided to stick some resources I found very useful down in here, as I still needed to do a bit of Googling to find ML related Java tutorials and MOOCs I… Read more

  • Previous blog posts

    April 25, 2019 by

    List of blog posts I’ve done before this site.
    – Multi-Variate, Multi-Step, LSTM for Anomaly Detection
    – A Docker Data Science Recipe
    – Celebrity Word Vectors
    – Playing around with Apache Airflow & BigQuery
    – Content Lifecycle Clustering
    – Interview with
    – One potential pitfall with referral source tracking and AMP in Google Analytics

  • My Site…

    April 25, 2019 by

    Was doing some blogs on Medium but don’t trust them to not all end up behind some sort of paywall, so decided to shell out a couple quid a month for my own domain.
