Terraform is Magic + r/MachineLearning Links

Terraform is magic. I may be a little late to the game on this one, and I'm sure it has its fair share of haters (I've seen some people have a love-hate relationship with it; maybe I'm still in my honeymoon period).

But from my point of view as a Data Scientist/ML Engineer playing around with various services in a multi-cloud environment (mainly GCP and AWS), it's one of the things that keeps me sane in a sea of changing features and web UIs that the cloud providers love to throw at us.

When it comes to cloud projects I'm working on, if it's not under source control and managed by Terraform then I don't want to know about it.

So, as I've been on my voyage of discovery with Terraform, I decided to put it to use in a little side project I've been wanting to do for a while.

(Note: All code is available in my reddit-links GitHub repo.)

The Setup

The idea is to have a cloud function that runs each day, looks at the top posts on r/MachineLearning, pulls out all the links and some metadata, and then saves those links somewhere useful.

My original goal (and something I might get to at some stage) was to pull all YouTube links shared on r/MachineLearning that have a decent score into an automatic playlist on YouTube, or maybe automatically add them to my own watch-later playlist. I have the data now, so this might be an easy enough next project.

So, given the goal above, here are the various moving parts I've picked to use:

  • Terraform: To manage all the GCP stuff used.
  • GCP Cloud Function: This will be the code that pulls from r/MachineLearning, wrangles the text and HTML data from Reddit, pulls out the links, and then either inserts or updates Airtable accordingly.
  • GCP PubSub Topic: This will be the trigger for the cloud function.
  • GCP Cloud Scheduler: This will run a sort of cron job in GCP each day to push a message to the Pub/Sub topic, which will in turn trigger the cloud function.
  • Airtable: This is where the data will be stored and published from. I had considered a GCP bucket, but Airtable is much easier to share and a bit more user-friendly for anyone who might want to use the links pulled from Reddit.

Terraform

Once you've worked with and set up a Terraform project once, it's pretty straightforward. I have used Serverless for some cloud functions before, but I like the way Terraform gives you everything GCP or AWS or Azure etc. have to offer at your fingertips once you invest that little bit of learning up front.

Here is a list of the .tf files I'm using and what they all do (see the sketch after this list for a flavour of what goes in them). There are of course many ways to set things up in Terraform, but this seemed like a straightforward enough way and works for me for smaller projects like this.

  • backend.tf – this is optional and used to have a remote backend for the state of your project, as opposed to somewhere on your laptop.
  • conf.tf – this is a file I'm using to define any sensitive variables in Terraform that I don't want to go into source control (make sure you add it to .gitignore) but need to make available to Terraform to do its stuff. I have included a conf_example.tf file in the repo to show how this looks.
  • variables.tf – used to define other variables used by Terraform.
  • provider.tf – conventional file to define the cloud providers you want to be able to use.
  • gcp-cloud-functions.tf – this is where we will define all the things we need related to our Cloud Functions.
  • gcp-cloud-scheduler.tf – used to define the Cloud Scheduler cron jobs we need.
  • gcp-pubsub-topics.tf – used to define the Pub/Sub topics that will trigger the cloud function.
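To give a flavour of what goes in those last two files, here is a minimal sketch (resource names and the schedule are illustrative, not the exact repo code) of a Pub/Sub topic plus a Cloud Scheduler job that publishes to it once a day:

resource "google_pubsub_topic" "redditlinks" {
  name = "redditlinks"
}

resource "google_cloud_scheduler_job" "redditlinks_daily" {
  name     = "redditlinks-daily"
  schedule = "0 7 * * *" # once a day at 07:00

  pubsub_target {
    topic_name = google_pubsub_topic.redditlinks.id
    # Cloud Scheduler wants the message body base64 encoded
    data = base64encode("{\"subreddit\": \"MachineLearning\"}")
  }
}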

GCP Function – redditlinks

The function used to pull from Reddit is below. It might be a little verbose and could probably be refactored a little, but hey – it works. It mainly uses the PRAW library to pull from Reddit, some usual data-wrangling libraries, and then the airtable-python-wrapper library to insert/update records in Airtable.
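(The full version is in the repo – below is just a minimal sketch of the shape of it. The environment variable names and the url-keyed upsert are illustrative assumptions, and the real function also wrangles links out of the post text/html rather than just taking the post url:)

import os

import praw
from airtable import Airtable

def redditlinks(event, context):
    # entry point for the Pub/Sub-triggered cloud function
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="reddit-links",
    )
    airtable = Airtable(
        os.environ["AIRTABLE_BASE_ID"], "links",
        api_key=os.environ["AIRTABLE_API_KEY"],
    )
    for post in reddit.subreddit("MachineLearning").top(time_filter="day"):
        record = {"title": post.title, "url": post.url, "score": post.score}
        # insert or update, keyed on the link url
        existing = airtable.search("url", post.url)
        if existing:
            airtable.update(existing[0]["id"], record)
        else:
            airtable.insert(record)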

End Results – Airtable

The end result of all this is a cloud function that runs once a day to update or insert records into this Airtable. From that, I have made some views for links from specific domains.

(Note: If you wanted to do this for a different subreddit, you should only need to make minimal changes once you have your Terraform variables all set up – primarily changing or adding another Cloud Scheduler job here with the relevant params, as well as making sure all the keys etc. you need are available in the right places.)

r/MachineLearning – YouTube Links

r/MachineLearning – Arxiv Links

A little brainteaser (or I'm an idiot)

This took me waaay too long to work out today, and I was thinking it could make a nice little interview coding-type question (which I'd probably fail).

Suppose you have 10,000 rows of data and need to continually train and retrain a model, training on at most 1,000 rows at a time and retraining the model every 500 rows. Can you tell me how many "batches" of data this will create, and the start and end index of each batch?

So that's:

n = 10000
train_max = 1000
train_every = 500

And we want a dictionary like this:

{
  1: {"start": 1, "end": 1000},
  2: {"start": 500, "end": 1000}, 
  ...
  ?: {"start": ?, "end": ?},
}

After doing some crazy loops in Python for a while, I decided to go back to basics and do it Jeremy Howard style in Excel (well, gsheets – I'm not a savage) – gsheet.

And here is my Python solution:
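(Something along these lines – a minimal sketch assuming 1-indexed, inclusive windows; depending on your off-by-one convention you might prefer 500 over 501 as the second start:)

def make_batches(n=10000, train_max=1000, train_every=500):
    # roll a window of at most train_max rows over n rows,
    # emitting a new batch every train_every rows
    batches = {}
    batch, end = 1, train_max
    while end <= n:
        batches[batch] = {"start": max(1, end - train_max + 1), "end": end}
        batch += 1
        end += train_every
    return batches

batches = make_batches()
print(len(batches))  # 19
print(batches[1])    # {'start': 1, 'end': 1000}
print(batches[2])    # {'start': 501, 'end': 1500}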

…I'm pretty sure someone will come along with a super Pythonic one-liner that shows maybe I am an idiot after all.

Ok now back to work.

Update: Actually, I think what I want is more something like the below, where you can define a minimum and a maximum size of your training data and then roll that over your data.
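(A sketch of that idea – train_min is an assumed extra parameter for the minimum training size:)

def make_batches_min_max(n=10000, train_min=500, train_max=1000, train_every=500):
    # emit a first batch as soon as train_min rows exist, grow the window
    # up to train_max rows, then slide it along every train_every rows
    batches = {}
    batch, end = 1, train_min
    while end <= n:
        batches[batch] = {"start": max(1, end - train_max + 1), "end": end}
        batch += 1
        end += train_every
    return batches

# make_batches_min_max()[1] == {'start': 1, 'end': 500}
# make_batches_min_max()[2] == {'start': 1, 'end': 1000}
# make_batches_min_max()[3] == {'start': 501, 'end': 1500}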

Papers I'm reading #1

I've recently set myself the goal of reading one academic paper a week relating to the ML/AI things I'm working on in my current role.

To try to help keep me honest and diligent in this regard, I've decided to get into the habit of jotting down some quick notes on each paper and, every now and then as I get through a batch of them, sticking them into a blog post (because I like to try to squeeze everything and anything into a blog post if I can get away with it, even better if it's minimal extra effort on my part 🙂).


Anomaly Detection in Streaming Non-stationary Temporal Data

Link

My Summary: Really interesting paper and application; it considers a lot of different design aspects. A nice example of a different approach, leveraging feature extraction and statistical techniques to get the job done.

Notes:

  • Leverages EVT (extreme value theory) approaches, forecasting a boundary for typical behavior in relation to the extremes.
  • Leverages a feature-vector and dimension-reduction approach too, which is interesting and somewhat independent of the AD algo. 
  • It is multivariate, but the data they use are all sensor data measuring the same thing, so it's not quite the same as multivariate measures measuring different things – there are still questions about how one would normalize accordingly for this approach.
  • Some lovely pictures. 
  • It is online but does have a sort of offline or training phase where it fits to the ‘representative example’ of the data – and this may need to change/evolve over time. 
  • So it is streaming and unsupervised but with some small caveats. 
  • Interesting discussion on differences between density based and distance based approaches to anomaly detection.
    • “In contrast, defining an anomaly in terms of the density of the observations means that an anomaly is an observation (or cluster of observations) that has a very low chance of occurrence”.
  • Offline phase – estimate the properties of the typical dataset which will be used in the online phase of anomaly detection.
  • HDOutliers is another approach worth looking into.  
  • Interesting choice of 14 features, which they then do PCA on. Worth looking into these specific features.
  • The offline phase is implemented as just a burn-in window on the streaming data, so this is not too bad.  
  • Feature extraction and dimension reduction are a big part of the preprocessing – an interesting approach that could be applied to other algos.
  • Just using the first 2 components of the PCA – found that interesting.
  • There are quite a few steps in the algos – quite involved. 
  • A sliding window with concept drift detection is used to determine when the model needs refitting – an interesting approach, as opposed to just refitting at regular intervals. Pros and cons to each, potentially. 
  • The output at each timestep is a list of time series flagged as anomalous within that sliding window, so there is not really an anomaly score as such. 
  • They suggest that having the 'informed', concept-drift-based approach is more efficient overall as it avoids wasteful refits.
  • Unclear to me how this would apply to multivariate TS data with many different types of measurements. The paper does not really discuss this – maybe worth a question on the repo if playing around with it. 
  • There are still some probably-important params, like window size and things like that.

A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data

Link

My Summary: Very good reference paper for more traditional methods as opposed to deep learning based approaches. Good discussion on complexity and online setting too. Primarily concerned with traditional tabular data as opposed to time series but still some good ideas to pursue.

Notes:

  • A 2016 paper, so maybe pre-deep-learning hype, which is nice. 
  • Time series data and settings are not really any part of this paper, so bear that in mind. 
  • The authors use a similar taxonomy of AD settings as I have seen in other papers. 
  • Mainly, if we 'flatten' our time series data into feature vectors (tabular data) then we would be in a similar setting to this paper. 
  • Scores are much more useful than binary labels as outputs. 
  • AD Settings (similar to elsewhere):
    • Point AD.
    • Collective AD.
    • Contextual AD.
  • Normalization suggested as important, but no mention of the difficulty of normalizing streaming data. 
  • Another reference to the NASA shuttle data – must look into this. 
  • 4 Groups of unsupervised AD algos:
    • Nearest neighbor.
    • Clustering based.
    • Statistical.
    • Subspace techniques.
  • KNN-based approaches: 10 < k < 50 as a rule of thumb.
  • KNN can miss local outliers as it relies on neighbours.
  • LOF is nice in that it can give a score between 0 and 1 ← this is a nice property to have.
  • They go through various extensions of LOF.
  • LoOP – Local Outlier Probability – makes some small changes such that you get back a probability. But still, that probability can be very specific to the particular model. It's not like it's really a probability you can compare across models – more useful for within-model observation comparisons.
  • Some extensions of LOF that first use clustering to reduce complexity. 
  • Clustering based approaches can be very sensitive to the choice of K. 
  • HBOS – a histogram-based statistical approach. Simple and fast, surprisingly performant.
  • One-class SVM – a range of ways to implement; does not really lend itself well to the online setting. 
  • PCA – get components and then use them in some way to get an anomaly score (AS). PCA can reduce to clustering equivalence under certain conditions. 
  • PCA can be fast if D (the number of dimensions) is not too large.
  • Metrics@TopN is a good way to evaluate AD systems, e.g. so long as some anomalies appear at the top of the pile, that can be progress (similar evaluation methods to information retrieval). 
  • Rank comparison approaches can be useful too (we should make sure any data we capture lends itself to this approach also).
  • Local vs global anomalies are a big consideration in this paper. It is not quite clear what this would mean in our setting. It's probably true that we are more interested in global anomalies than local ones. But it's also hard to know which setting you are in, especially in higher dimensions. 
  • The choice of k has a big impact on computation time for clustering algos, as does the size of the dataset.
  • HBOS is fast!
  • All algos in this paper are available via a rapidminer extension if we wanted to play with them.
  • Recommendation to start with global-based algos, as they can also work somewhat on local anomalies. 
  • Clustering approaches are also sensitive to the random start, so good to restart a few times. 
  • Nearest-neighbour approaches can be more robust to the choice of parameters.
  • But clustering approaches can be faster than KNN approaches.

Deep Learning for Anomaly Detection: A Survey

Link

My Summary: A looot of references, and I got some good ideas out of it. Not much else to it. 

Notes:

  • I like the general taxonomy they use for types of AD problems/framings.
    • Point, contextual, vs collective.
  • Good point about maybe an over-focus on autoencoders; not clear what is driving that.
  • Interesting discussion around one class neural networks. 
  • Labels are not as big an area in practice, for practical reasons (they are hard to collect) and in cases where anomalous patterns may change. 
  • Hybrid approaches could be interesting if they provide efficiency at runtime: use a DL model for the feature representation and then some other model for the scoring.
    • One problem is that this is not end-to-end, but it could still be something to keep in mind.
    • One-class NNs as a better option here.
  • Little discussion in the paper of the considerations in productionising any of it, or of specific considerations involving streaming data.  
  • Could convert a time series problem into a sequence problem and use things like language models or event-based approaches.
  • Adaptivity of your model as a design param you need to think about and decide on.
  • The part about interconnectedness of IoT nodes resonates with some use cases for us.
  • Deep attention based models as useful in helping to explain and locate the anomaly in addition to just detecting it. 
  • GANs as an approach worth looking into. 
  • No clear theory or guidance on what choices to make in network architecture and hyper params.
  • Transfer learning based approaches as an open and active area of research. 
  • Hilbert transform and other DSP based approaches mentioned.

Time2Vec: Learning a Vector Representation of Time

Link

My Summary: Nice little paper; the idea of learning the frequency functions and representations seems really interesting. 

Notes:

  • Key idea is that time2vec gives you a "general purpose model agnostic representation of time that can be potentially used in any architecture" (see the little sketch after these notes).
  • Basically it’s trying to extend the notion of feature embeddings to the time series domain.
  • Time2vec as a featurizer essentially.
  • Talk of asynchronous time/sequence-based models is interesting. Perhaps this could be a class of models we could explore that could run on irregularly sampled data.
  • A large focus here on capturing various periodicity and time varying effects. It could be that 1 second monitoring data is not a great candidate for this by its nature.
  • Could use time2vec type approach to get a more universal feature representation?
  • Unclear if this is all univariate or multivariate.
  • Worth looking around to see if any time2vec implementations to play with.
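To make the key idea concrete, here is a tiny sketch of the transform as I understand it from the paper – one linear term plus k periodic (sine) terms; in the paper the frequencies and phases are learned, here they are just random numbers:

import numpy as np

def time2vec(tau, omega, phi):
    # t2v(tau)[0] = omega_0 * tau + phi_0        (linear trend term)
    # t2v(tau)[i] = sin(omega_i * tau + phi_i)   for 1 <= i <= k
    linear = omega[0] * tau + phi[0]
    periodic = np.sin(omega[1:] * tau + phi[1:])
    return np.concatenate([[linear], periodic])

# a (1 + 3)-dimensional representation of t = 42
rng = np.random.default_rng(0)
print(time2vec(42.0, rng.normal(size=4), rng.normal(size=4)))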

catch22: CAnonical Time-series CHaracteristics

Link

My Rating: 8/10

My Summary: Well done paper, limited application potentially to an online setting but great food for thought on the range of ts feature transformations literature already out there.

Notes:

  • Idea to compress time series into useful ‘feature vectors’ that can be used for downstream ML tasks, mainly classification and clustering in this paper.
  • The starting point is the hctsa MATLAB package's feature space of ~5k features; catch22 is a project to empirically discover the most useful (and computationally reasonable) of these.
  • Builds on a lot of literature and hand crafted feature engineering in the time series space. 
  • Catch22 is implemented in C with Python, R, and MATLAB wrappers (see the snippet after these notes). This could be useful for netdata core C-based stuff. 
  • Many features here may require the full ts to be available prior to calculation so not suitable for online streaming setting. Although could implement windowed versions potentially. 
  • E.g. how do you z-score normalise a stream of data?
  • They just used a decision tree as the classification model. I wonder how sensitive the results are to this. I guess it makes sense, as they wanted to test the usefulness of the features themselves. Curious why no linear models. 
  • Clustered the ‘performance vectors’ to try to reduce redundancy and overlap of features. That was nice. 
  • Check out the tsfeatures package from Hyndman mentioned in this paper. 
  • It is interesting to look at some of the ts features themselves – don’t reinvent the wheel when all this already exists!
  • Look into compengine.
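For reference, here is roughly what using the Python wrapper looks like – a sketch based on the pycatch22 package's documented catch22_all interface:

import numpy as np
import pycatch22

# compute all 22 features for a random example series
ts = np.random.default_rng(0).normal(size=500).tolist()
res = pycatch22.catch22_all(ts)
for name, value in zip(res["names"], res["values"]):
    print(f"{name}: {value:.4f}")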

Time Series Anomaly Detection; Detection of anomalous drops with limited features and sparse examples in noisy highly periodic data

Link

My rating: 6/10

My summary: Good example of a simple regression-based approach; not very generalisable, and the data and results are not really powerful. 

Notes:

  • Typical 'Expected Value' regression-based approach. 
  • Focus on sustained anomalies as opposed to single timesteps.  
  • No semantic or domain-based understanding – just independent time series, all treated separately. 
  • The data in this paper is "periodic but noisy" 5-minute-level byte counts.
  • Shout out to Numenta approach to anomaly likelihood, must revisit this.
  • Use of simulated data. 
  • Data normalization to a 0–1 scale. Unclear how this is implemented without data leakage, or in an online manner.
  • Simple threshold-based approach to detection given: Yhat – Y = AS (anomaly score).  
  • Use of dummy data for model/approach comparison.
  • Pretty small dataset for DL approaches.
  • In the absence of labeled data, leveraging multiple approaches and comparing anomalies raised by each approach and their profiles could be a useful way to iterate towards golden datasets.
  • The DL models they used looked quite big and deep for the nature and size of the data – it's not really motivated why they chose a more complex architecture.
  • LSTM or RNN did not do better than vanilla DNN.
  • Not the most convincing set up and approach. Very limited in terms of data and depth of the research.

Time-series anomaly detection service at Microsoft

Link

My rating: 8/10

My summary: Good walkthrough of an end-to-end service, an interesting computer vision application, and some good leads to follow up. 

Notes:

  • Interesting to see they use InfluxDB (and Kafka, and some Flink).
  • Deployed as service on kubernetes.
  • Stuck with needing to treat each individual time series separately in the model.
  • They build the model to label each point in a window as anomalous or not – seems potentially limiting if you are just interested in whether each window of data is anomalous or not. 
  • SR (Spectral Residual) sounds interesting; I have not seen that before and it's worth looking into, although their SR example looks very similar to what you'd get by looking at pct change or something, so it feels maybe over-engineered. 
  • Converting the problem into a computer vision friendly setting is interesting and not uncommon. In the multivariate setting we could encode the data visually, e.g. fft and wavelet frequency distributions etc. Heatmaps or even some custom encoding into a visual space based on specific characteristics of the data.  
  • Some of the windowed feature engineering stuff seemed interesting, as well as then layering ratios of those windowed features on top of each other. 
  • There is a way for users to label data, which seems to help build golden datasets used for later validation and experimentation.
  • Need to look into SPOT and DSPOT as EVT-based approaches I've only previously superficially looked at.
  • Need to look into DONUT.
  • Some other good references to follow up on.
  • Seems like this is indeed a technical paper relating to this Azure API.

KubeFlow Custom Jupyter Image (+ github for notebook source control)

I've been playing around with KubeFlow a bit lately and found that a lot of the tutorials and examples of Jupyter notebooks on KubeFlow do a lot of the pip install and other sorts of setup and config stuff in the notebook itself, which feels icky.

But, in reality, if you were working in Jupyter notebooks on KubeFlow for real, you'd want to build a lot of this into the image used to build the notebook server. Luckily, as with most of KubeFlow, it's pretty flexible to customize and extend as you want – in this case by adding custom Jupyter images.

Two main example use cases for doing this are ensuring some custom Python package you have built (e.g. my_utils) is readily available in all your notebooks, and making sure other external libraries that you use all the time are also available – e.g. Kubeflow Pipelines.

To that end, here is a Dockerfile that illustrates this (and here is the corresponding image on Docker Hub).

Once you have such a custom image building fine, it's pretty easy to just point KubeFlow at it when creating a Jupyter notebook server.

Just specify your custom image

Now when you create a new notebook on that Jupyter server, you have all your custom goodness ready to go.

GitHub for notebooks

As I was looking around, it seems like there are currently plans to implement some git functionality into the notebooks on KubeFlow in a bit more of a native way (see this issue).

For now, I decided to just create an ssh key (help docs) for the persistent workspace volume connected to the notebook server (see step 10 here).

Then, when you want to git push from your notebook server, you can just hack together a little notebook like this that you can use as a poor man's git ui 🙂
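(For a flavour of what that notebook amounts to, here is a sketch of one cell that stages, commits, and pushes – the commit message and branch are illustrative:)

import subprocess

# stage, commit, and push the notebook server's workspace volume
for cmd in (
    ["git", "add", "-A"],
    ["git", "commit", "-m", "update notebooks"],
    ["git", "push", "origin", "master"],
):
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout or result.stderr)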

Multi-Variate, Multi-Step, LSTM for Anomaly Detection

This post will walk through a synthetic example illustrating one way to use a multi-variate, multi-step LSTM for anomaly detection.

Imagine you have a matrix of k time series coming at you at regular intervals, and you look at the last n observations for each metric.

A matrix of 5 metrics from period t to t-n

One approach to doing anomaly detection in such a setting is to build a model to predict each metric over each time step in your forecast horizon; when you notice your prediction errors start to change significantly, this can be a sign of some anomalies in your incoming data.

This is essentially an unsupervised problem that can be converted into a supervised one: you train the model to predict its own training data. Then once it gets good at this (assuming your training data is relatively typical of the normal behavior of your data), if you see some new data for which your prediction error is much higher than expected, that can be a sign that your new data is anomalous in some way.

Note: This example is adapted and built off of this tutorial, which I found a very useful starting point. All the code for this post is in this notebook. The rest of this post will essentially walk through the code.

Imports & Parameters

Below are the imports and all the parameters for this example; you should be able to play with them and see what different results you get.
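(The real values live in the notebook; here is a sketch of the sort of thing meant – N_TIMESTEPS and N_FEATURES are the names used later in the post, the values are illustrative:)

import numpy as np
import pandas as pd

N_FEATURES = 5    # number of time series (metrics)
N_TIMESTEPS = 5   # length of the window the model sees and predicts
N_ROWS = 1000     # length of the generated series
EPOCHS = 25       # training epochs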

Note: There is a Pipfile here that shows the Python libraries needed. If you are not familiar with it, you should really check out pipenv – it's really useful once you play with it a bit.

Fake Data!

We will generate some random data, and then smooth it out to look realistic. This will be our ‘normal’ data that we will use to train the model.

I couldn’t help myself.

Then we will make a copy of this normal data and inject some random noise at a certain point, and for a period of time. This will be our 'broken' data.
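(A sketch of both steps under the parameter assumptions above – the notebook has the real version:)

np.random.seed(42)

# random walks, smoothed a little to look realistic – our 'normal' data
data = pd.DataFrame(np.random.randn(N_ROWS, N_FEATURES)).cumsum()
data = data.rolling(10, min_periods=1).mean()

# copy the normal data and mess up a segment of it with random noise
data_new = data.copy()
data_new.iloc[400:600] += np.random.randn(200, N_FEATURES) * 5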

So this 'broken' data is the data that we should see the model struggle with in terms of prediction error. It's this error (aggregated and summarized in some way, e.g. turned into a z-score) that you could then use to drive an anomaly score. (You could also use the loss from continually retraining on new data, whereby the training loss should initially spike once the broken data comes into the system, but over time the training would then adapt the model to the new data.)

This gives us our normal-ish, real-world-looking data that we will use to train the model.

5 random time series that have been smoothed a bit to look realistic.

To make our 'broken' data (called data_new in the code), I lazily just copy the 'normal' data but mess up a segment of it with some random noise.

And so below we can see our 'broken' data. I've set the broken segment to be quite wide here, and it's very obvious the broken data is totally different. The hope is that in reality the model, once trained, would be good at picking up much more nuanced changes in the data that are less obvious to the human eye.

For example, if all metrics were to suddenly become more or less correlated than normal, but each still moved by a typical amount individually, then this is the sort of change you'd like the model to highlight. (This is probably something I should have tried to do when making the 'broken' data to make the whole example more realistic – feel free to try it yourself and let me know how you get on.)

Same as the "normal" data but I've messed up a huge chunk of it.

Some Helper Functions

I’ve built some helper functions to make life easier in the example notebook. I’ll share the code below and talk a little about each.

  • data_reshape_for_model() : This function basically takes in a typical dataframe-type array, loops through that data, and reshapes it all into a numpy array of the shape expected by the keras LSTM model for both training and prediction. Figuring out how to reshape the data based on N_TIMESTEPS, N_FEATURES, and the length of the data was actually probably the trickiest part of this whole example. I've noticed that many tutorials online just reshape the data, but do so in an incomplete way by essentially just pairing off rows. What you really want to do is step through all the rows to make sure you roll your N_TIMESTEPS window properly over the data, so as to include all possible windows in your training (see the sketch after this list).
  • train() : This is just a simple wrapper for the keras train function. There is no real need for it.
  • predict() : Similar to train(), this is just a wrapper function that does not really do much.
  • model_data_to_df_long() : This function takes in a data array as used by the keras model and unrolls it into one big, long pandas dataframe (numpy arrays freak me out a bit sometimes, so I always try to fall back to pandas when I can get away with it 😉).
  • model_df_long_to_wide() : This function then takes the long-format dataframe created by model_data_to_df_long() and converts it into a wide format that is closer to the original dataset – one row per observation and one column for each input feature (plus lots more columns for predictions for each feature at each timestep).
  • df_out_add_errors() : This function adds errors and error aggregation columns to the main df_out dataframe which stores all the predictions and errors for each original row of data.
  • yhat_to_df_out() : This function takes in the model-formatted training data and model-formatted prediction outputs, and wraps all the above functions to make a nice little "df_out" dataframe that has everything we want in it and is one row per observation, so it lines up more naturally with the original data.
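(Here is a sketch of that windowing logic – not the notebook's exact code – stepping through every row so that all possible N_TIMESTEPS windows end up in the output:)

def data_reshape_for_model(data, n_timesteps, n_features):
    # (rows, features) -> (rows - n_timesteps + 1, n_timesteps, features)
    data = np.asarray(data)
    assert data.shape[1] == n_features
    windows = [
        data[i : i + n_timesteps]
        for i in range(len(data) - n_timesteps + 1)
    ]
    return np.stack(windows)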

Build & Train The Model

The code below builds the model, trains it, and also calls predict on all the training data, to be able to get errors on the original 'normal' training data.
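(A sketch of the sort of model meant here, continuing the snippets above – the layer sizes and exact architecture are illustrative, not the notebook's:)

from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

# an encoder-decoder LSTM that is trained to reproduce its own input windows
model = Sequential([
    LSTM(32, input_shape=(N_TIMESTEPS, N_FEATURES)),
    RepeatVector(N_TIMESTEPS),
    LSTM(32, return_sequences=True),
    TimeDistributed(Dense(N_FEATURES)),
])
model.compile(optimizer="adam", loss="mae")

X = data_reshape_for_model(data, N_TIMESTEPS, N_FEATURES)
model.fit(X, X, epochs=EPOCHS, batch_size=32, verbose=0)
yhat = model.predict(X)  # predictions on the 'normal' training data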

We then call our “do everything” yhat_to_df_out() function on the training data and the predictions from the model.

Now we can plot lots of things from df_out. For example, here are the errors averaged across all five features at each timestep prediction horizon.

In the above plot we can see the averaged error of the model on its training data. Each line represents a different forecast horizon. We can see that the lines are sort of 'stacked' on top of each other, which makes sense, as you'd generally expect the error 5 timesteps out (red line "t4_error_avg") to be higher than the one-step-ahead forecast (greeny/orangy line "t0_error_avg").

If we look at the standard deviation of our errors in a similar way, we can see that it generally tends to increase at times when our 5 original features are diverging from each other – as you can imagine, these are the hardest parts of our time series for this model to predict.

Let's Break It

So now that we have our model trained on our 'normal' data, we can use it to see how well it does at predicting our new 'broken' data.
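(Continuing the sketch, scoring the 'broken' data with the trained model and looking at the per-window mean absolute error – this is the number that should explode:)

X_new = data_reshape_for_model(data_new, N_TIMESTEPS, N_FEATURES)
yhat_new = model.predict(X_new)

# one aggregate error per window
errors_new = np.abs(yhat_new - X_new).mean(axis=(1, 2))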

Here we can see that as soon as we hit the broken data the prediction errors go through the roof.

From the above we can see that as soon as the random broken data comes into the time series the model prediction errors explode.

As mentioned, this is a very obvious and synthetic use case, just for learning on, but the main idea is that if your data changed in a more complicated and harder-to-spot way, then your error rates would reflect this change everywhere. These error rates could then be used as input into a more global anomaly score for your system.

That’s it, thanks for reading and feel free to add any comments or questions below. I may add some more complicated or real world examples building on this approach at a later stage.

UPDATE: Here is a Google Colab notebook that's a bit better, as I've worked a bit more on this since the original blog post.