-
Malloy seems pretty cool…
I discovered Malloy recently in this great talk. It seems like a really interesting idea (a higher-level abstraction or DSL on top of SQL) with some great people behind it (the Looker founder, who seems to really know his stuff). So I decided to try get going with it in as minimal a way as… Read more
-
Painless Anomaly Detection with Apache Airflow
Data observability is so hot right now…but do you know what’s also hot? Using some tried and tested ingredients like Apache Airflow and PyOD to perform painless anomaly detection on your key business metrics. You don’t need to run off and buy an (expensive!) subscription for the latest hot data observability SaaS offering (there is… Read more
-
Stripe Webhook + GCP Functions Framework (Python)
This took a couple of days of messing around so decided to make a post out of it. Here is a minimal enough example repo using Terraform and GCP Functions Framework to build a GCP Python function that will receive a Stripe webhook event, perform signature verification, and then just print the event. You can… Read more
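For context on what the signature verification step involves, here is a minimal standard-library sketch of Stripe's documented scheme: the `Stripe-Signature` header carries a timestamp `t` and one or more `v1` signatures, each an HMAC-SHA256 of `<t>.<raw request body>` keyed with the endpoint's signing secret. This is not the code from the repo (in practice the official `stripe` library can do this for you). The function name, the 300 second tolerance and the `now` parameter are assumptions for illustration.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(payload: bytes, sig_header: str, secret: str,
                            tolerance: int = 300,
                            now: Optional[float] = None) -> bool:
    """Hypothetical helper: check a Stripe-Signature header against the raw body."""
    # Header looks like "t=1700000000,v1=abc...,v1=def..."
    timestamp = None
    candidates = []
    for part in sig_header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False

    # Reject events whose timestamp is too far from "now" (limits replays).
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False

    # The signed payload is "<timestamp>.<raw request body>".
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 signature in the header.
    return any(hmac.compare_digest(expected.encode(), c.encode())
               for c in candidates)
```

The important detail is to verify against the raw request bytes, not a re-serialized JSON body, since any change in whitespace or key order changes the HMAC.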
-
Colab to just run some curl
Here is a little google colab notebook to just paste in some curl command and get the response back into a json dictionary. This can be handy when working with backend and frontend engineers who might be using a different language than you and just send you some curl commands that you want to explore… Read more
-
Some sort-of livecoding
-
Explaining kmeans clustering for unsupervised anomaly detection
Here is a video I did in work explaining how our anomaly detection works. https://www.netdata.cloud/blog/how-netdatas-machine-learning-works
-
Hugging Face Text Classification Quickstart
I have been working a bit lately with some text classification stuff using Hugging Face – it’s great n all but their docs can actually be a bit overwhelming. So here is a minimal text classification example, using Hugging Face and either PyTorch or TensorFlow (you decide). Will try to update and maintain the colab here:… Read more
-
Airflow “Trigger Dags” Python Script
You have some dag that runs multiple times a day but you need to do a manual backfill of the last 30 days. It’s 2022 and this is still surprisingly painful with Airflow. The “new” REST API helps and means all the building blocks are there but, as I found out today, there can often still… Read more
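To give a feel for the building blocks, here is a rough sketch (not the script from the post) that only builds the request bodies for Airflow's stable REST API endpoint `POST /api/v1/dags/{dag_id}/dagRuns`, one per day, without sending anything. The helper name, the base URL, the `my_dag` id, the run id format and the one-run-per-day step are all assumptions; a dag that runs several times a day would need a finer step. The `logical_date` field applies to Airflow 2.2+, while older 2.x versions used `execution_date`.

```python
from datetime import datetime, timedelta, timezone


def build_backfill_requests(base_url, dag_id, end, days=30):
    """Hypothetical helper: one dagRuns POST body per day, oldest first."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    requests_to_send = []
    for offset in range(days, 0, -1):
        logical_date = end - timedelta(days=offset)
        payload = {
            # Run id format is made up for this example.
            "dag_run_id": f"manual_backfill__{logical_date:%Y%m%dT%H%M%S}",
            "logical_date": logical_date.isoformat(),
            "conf": {},
        }
        requests_to_send.append((url, payload))
    return requests_to_send


end = datetime(2022, 6, 1, tzinfo=timezone.utc)
for url, payload in build_backfill_requests("http://localhost:8080", "my_dag", end)[:2]:
    print(url, payload["logical_date"])
```

Each `(url, payload)` pair could then be sent with whatever HTTP client and authentication your Airflow deployment is configured for.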
-
Some ML hot takes
Here is a blog post I made in work with some thoughts on the approach to ML that I want us to try take: https://www.netdata.cloud/blog/our-approach-to-machine-learning
-
Time series anomaly detection using PCA
Here is a little recipe for using good old PCA to do some fast and efficient time series anomaly detection.
-
streamlit multi-page app minimal example
Recently i had a need to assess streamlit for some internal DS/ML/Data apps i wanted to build in my job. By “i had a need” i mean i heard it was the new cool thing so i wanted to play with it and feel better about myself. Anyway, as part of that i built a… Read more
-
Some asyncio fun/pain
You have a list of api endpoints you want to pull data from and collect results into some results list or dataframe for further processing. You could just loop over that list and make a load of requests.get() calls right. But no, you are a sexy ML engineer, this is a great excuse to spend… Read more
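As a taste of the pattern, here is a minimal sketch of fetching a list of endpoints concurrently and collecting the results into a list. It is not the code from the post: `fetch` is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the example URLs and the `max_concurrency` cap are assumptions.

```python
import asyncio
import random


async def fetch(endpoint: str) -> dict:
    """Stand-in for a real HTTP call (hypothetical); sleeps instead of using the network."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return {"endpoint": endpoint, "status": "ok"}


async def fetch_all(endpoints, max_concurrency=5):
    """Fetch every endpoint concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(endpoint):
        async with semaphore:
            return await fetch(endpoint)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded_fetch(e) for e in endpoints))


endpoints = [f"https://example.com/api/v1/data?chart={i}" for i in range(10)]
results = asyncio.run(fetch_all(endpoints))
print(len(results), results[0]["status"])
```

The semaphore is there so a long endpoint list does not hammer the API with every request at once.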
-
Anomaly Detection using the Matrix Profile
I like an excuse to play with fancy things, so when i first learned about the Matrix Profile for time series analysis, particularly around anomaly detection, i was intrigued. When i learned there was a nice python package (STUMPY) i could just pip install i was outright excited, as one thing i like more than… Read more
-
“Machine Learning Ireland” slack community
I’ve been thinking for a while that there really should be some go-to online community for people interested in ML and ireland in some way. I was fairly surprised to find that none really existed when i went to find one to join. So, being the genius that i am, i decided to just make… Read more
-
Time series anomaly detection in Go using GoLearn
I’ve posted recently about learning just enough Go to be dangerous over the Christmas break. Well, here is an update on my adventures so far. The below script (which is probably horrible in places if you know Go properly – tips welcome) uses goroutines to pull data from some REST API endpoints and then use… Read more
-
Anomaly Detection Resources
A list of useful Anomaly Detection resources. As I find interesting material I will add to this list. – “Awesome” time series anomaly detection list on GitHub. – YouTube playlist of interesting anomaly detection videos I maintain. – Creator of PyOD “yzhao062/anomaly-detection-resources” list of useful stuff. – An interesting r/machinelearning thread. – Public Mendeley group… Read more
-
First stab at some Go (so hot right now)
It may be a combination of starting to go stir crazy over the Christmas break and some self loathing at the amount of FIFA i’ve been playing that’s driven me to finally start learning some Go for a few data science and machine learning related projects i’m working on where it offers unique advantages. (In… Read more
-
Anomaly Detection Tutorial
Here is an anomaly detection tutorial that i created for my boss and the open source community where i work. It’s part of some work i have been doing around adding some anomaly detection functionality into our open source monitoring project. Like most ML projects the trickiest, but maybe most important, part can often be… Read more
-
Good Places For Datasets
I get asked this a lot by students so decided to make a little list in here that I can add to and point people towards. https://github.com/awesomedata/awesome-public-datasets https://www.kaggle.com/datasets https://datasetsearch.research.google.com/ https://cloud.google.com/bigquery/public-data https://cloud.google.com/public-datasets https://registry.opendata.aws/ https://data.world/data
-
Different types of time series anomalies
I find myself having to refer to this enough times that I decided to make a little infographic I can just more easily link to 🙂
-
Numpy Feature Engineering – 2x Speed Up Over Pandas!
The Setup This is a little one I was surprised to see. Recently I had a need to do some pretty basic feature engineering to a pandas dataframe prior to training some models. Basically I needed to take differences of each column, apply some smoothing, and then add a number of lagged columns for each… Read more
-
Market basket analysis in Python
tl; dr; if you find yourself doing some association rule mining using mlxtend but finding it a bit slow then check out PyFIM – here is a colab I made to get you started. I have recently been looking to do some market basket analysis (“Association rule learning“) on event log type data (to find insights… Read more
-
I helped build a thing!
Here is a thing i helped build in work that i’m fairly happy with: https://www.linkedin.com/posts/andrewm4894_netdata-introducing-our-first-netdata-cloud-activity-6712008465574887424-SlIr Now, onto the next thing!
-
Time series clustering with tslearn
I’ve recently been playing around with some time series clustering tasks and came across the tslearn library. I was interested in seeing how easy it would be to get up and running with some of the clustering functionality that is already built into tslearn. Turns out it was quite easy and straightforward, perfect blog post… Read more
-
Premature Optimization
I’ve been doing some work that necessitated using the same statistical test from SciPy lots of times on a fairly wide pandas dataframe with lots of columns. I spent a bit too much time googling around for the most efficient ways to do this, and even more time re-writing things in various ways before realizing i… Read more
-
Terraform is Magic + r/MachineLearning Links
Terraform is magic, i may be a little late to the game on this one and i’m sure it has its fair share of haters (i’ve seen some have a love hate relationship with it, maybe i’m still in my honeymoon period). But from my point of view as a Data Scientist/ML Engineer playing around… Read more
-
Ireland Covid19 Data
I was looking around a bit and could not really find any datasets behind the daily updates from the Irish government that get posted here. In particular i was thinking the break out tables of numbers by different dimensions might be of use for anyone looking to analyse the data. So here is a python… Read more
-
A little brainteaser (or i’m an idiot)
This took me waaay too long to work out today and i was thinking it could make a nice little interview coding type question (which i’d probably fail). Suppose you have 10,000 rows of data and need to continually train and retrain a model training on at most 1,000 rows at a time and retraining… Read more
-
Papers i’m reading #2
Continuation from this post. An unsupervised spatiotemporal graphical modeling approach to anomaly detection in distributed CPS (Cyber Physical Systems). Link My Summary: Really interesting paper – PGM’s, HMM’s and all that good stuff. Quite complicated though and no clear route to implementation. Also I would wonder how well it scales beyond 10’s of time series.… Read more
-
Github Webhook -> Cloud Function -> BigQuery
I have recently needed to watch and track various activities on specific github repos i’m working on, however the REST API from GitHub can sometimes be a bit limited (for example, best i could see, if you want to get the most recent list of people who began watching your repo you need to make… Read more
-
Papers i’m reading #1
I’ve recently set myself the goal of reading one academic paper a week relating to the ML/AI things i’m working on in my current role. To try help keep me honest and diligent in this regard, I’ve decided to get into the habit of jotting down some quick notes on each paper and every now… Read more
-
My First PyPI Package
I’ve been threatening to myself to do this for a long time and recently got around to it, so as usual i’m going to try milk it for a blog post (Note: i’m not talking about getting into a box like the below picture, it’s something much less impressive). Confession – I don’t know matplotlib… Read more
-
KubeFlow Custom Jupyter Image (+ github for notebook source control)
I’ve been playing around with KubeFlow a bit lately and found that a lot of the tutorials and examples of Jupyter notebooks on KubeFlow do a lot of the pip install and other sort of setup and config stuff in the notebook itself which feels icky. But, in reality, if you were working… Read more
-
Multi-Variate, Multi-Step, LSTM for Anomaly Detection
This post will walk through a synthetic example illustrating one way to use a multi-variate, multi-step LSTM for anomaly detection. Imagine you have a matrix of k time series data coming at you at regular intervals and you look at the last n observations for each metric. One approach to doing anomaly detection in such a setting is to… Read more
-
Custom Python Packages in AWS Lambda
I’m pretty sure i’ll be looking this up again at some stage so that passed one of my main thresholds for a blog post. I’ve recently been porting some data and model development pipelines over to AWS Lambda and was mildly horrified to see how clunky the whole process for adding custom python packages to… Read more
-
Clustering Cell Tower Usage Data
Below is a little mini project i did a while back looking at cell tower usage data. Main takeaway really was a nice example of how subjective clustering can be, especially the more features and variables you feed into your distance metric.
-
Parallelize a wide df in Pandas
Sometimes you end up with a very wide pandas dataframe and you are interested in doing the same types of operations (data processing, building a model etc.) but focused on subsets of the columns. For example if we had a wide df with different time series kpi’s represented as columns then we might want to… Read more
-
( 0 - 0 ) / 0 != 0
Arrrgghh – I just wasted the best part of an afternoon chasing this one down. If i can knock out a quick post on it then at least i’ll feel i’ve gotten something out of it. Here’s the story – somewhere in an admittedly crazy ETL type pipeline i was using pandas pct_change() as a… Read more
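The gotcha in the title can be reproduced without pandas. The sketch below is a plain-Python stand-in for the percent-change arithmetic, following the IEEE 754 float behaviour that NumPy-backed pandas uses (0/0 is NaN and x/0 is infinite, rather than raising `ZeroDivisionError`). The function names and the "treat 0 -> 0 as no change" fix are assumptions for illustration, not necessarily what the post ends up doing.

```python
import math


def pct_change(values):
    """Plain-Python stand-in for percent-change arithmetic (not pandas itself)."""
    out = [math.nan]  # the first element has no previous value
    for prev, curr in zip(values, values[1:]):
        diff = curr - prev
        if prev != 0:
            out.append(diff / prev)
        elif diff == 0:
            out.append(math.nan)  # (0 - 0) / 0 is undefined, not 0
        else:
            out.append(math.copysign(math.inf, diff))  # x / 0 is +/-inf
    return out


def pct_change_zero_safe(values):
    """One possible fix (an assumption): treat 0 -> 0 as 'no change'."""
    return [0.0 if i > 0 and values[i] == values[i - 1] == 0 else v
            for i, v in enumerate(pct_change(values))]


print(pct_change([0, 0, 5, 10]))            # [nan, nan, inf, 1.0]
print(pct_change_zero_safe([0, 0, 5, 10]))  # [nan, 0.0, inf, 1.0]
```

The second element is the trap: a metric that sits at zero looks like "no change", but the division says NaN, and that NaN then flows silently through whatever comes next in the pipeline.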
-
Java Hello World cli using args4j
A simple example of how to use args4j to add command line args to a simple “Hello World” type java application. For example, the command line execution below would print “Hello arg4j!” instead of the default “Hello World!” that you get if you don’t pass any args. $ java -jar helloWorldParamaterized --msg='Hello arg4j!'
-
Java Weka API: Adding List To Instances Object
This is just a quick one to save anyone else new to the Weka api in Java spending as much time as i did figuring this one out. Let’s suppose you have a weka instances object and a new list of values you want to add into it as a new attribute (weka slang for… Read more
-
Parallel Jupyter Notebooks
I have become master of the notebooks, they bend at my will and exist to serve my data science needs! Ok i might be getting a bit carried away, but i recently discovered papermill and have been finding it very useful in conjunction with Python multiprocessing to speed up a lot of data science experimental… Read more
-
Java for Machine Learning
I’ve recently been learning java for machine learning related work (a long story to do with mainframes, and you know, why not). Decided to stick down in here some resources i found very useful as i found i still needed to do a bit of Googling to find ML related Java tutorials and moocs i… Read more
-
Previous blog posts
List of blog posts i’ve done before this site. Multi-Variate, Multi-Step, LSTM for Anomaly Detection A Docker Data Science Recipe Celebrity Word Vectors Playing around with Apache Airflow & BigQuery Content Lifecycle Clustering Interview with BigData-MadeSimple.com One potential pitfall with referral source tracking and AMP in Google Analytics
-
My Site…
Was doing some blogs on medium but don’t trust them to not all end up behind some sort of paywall so decided to shell out a couple quid a month for my own domain.