Time series clustering with tslearn

I’ve recently been playing around with some time series clustering tasks and came across the tslearn library. I was interested in seeing how easy it would be to get up and running with some of the clustering functionality that is already built into tslearn. It turns out it was quite easy and straightforward. Perfect blog post fodder 🙂

tl;dr here is a Google Colab notebook you can just copy, run and play with for yourself.

First, let's import the libraries we will need:

import pandas as pd
import numpy as np
from tslearn.clustering import TimeSeriesKMeans, KShape, KernelKMeans
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from netdata_pandas.data import get_data, get_chart_list
from am4894plots.plots import plot_lines, plot_lines_grid

netdata_pandas is a helper library I created to pull time series data from servers running Netdata (where I work) into a pandas dataframe. In this example we will use a demo server, http://london.my-netdata.io/, so as to have somewhat realistic (messy) time series data.

am4894plots is another library I made for myself, where I add common plotting functionality I find myself returning to time and time again (because I never really ‘learned’ matplotlib and at this stage refuse to!).

Let's define our inputs; basically, anything we can play with and change is worth adding as an input at the top of the notebook:

# inputs
host = 'london.my-netdata.io' # host running netdata that we want to pull data from
n = 60*5 # how many seconds of most recent data to pull
n_charts = None # If None then pull data for all charts otherwise sample n_charts randomly
n_clusters = 50 # number of clusters to fit
diff = False # take diffs of the data or not
preprocessing_meanvar = False # True to use TimeSeriesScalerMeanVariance preprocessing
smooth_n = 15 # n observations to smooth over
smooth_func = 'mean' # one of ['mean','min','max','sum']
norm = True # normalize the data to 0-1 range
model = 'kmeans' # one of ['kmeans','kshape','kernelkmeans','dtw']

Next we will get our data and do some fairly standard pre-processing:

# get charts
if n_charts:
    charts = np.random.choice(get_chart_list(host), n_charts).tolist()
    print(charts)
else:
    charts = get_chart_list(host)
# get data
df = get_data(host, charts, after=-n, before=0)
# remove duplicate columns that we might get from get_data()
df = df.loc[:,~df.columns.duplicated()]
# drop any empty columns (it can happen)
df = df.dropna(axis=1, how='all')
# forward fill and backward fill to try remove any N/A values
df = df.ffill().bfill()
# take differences if specified
if diff:
    df = df.diff()
# do any smoothing as specified
if smooth_n > 0:
    if smooth_func == 'mean':
        df = df.rolling(smooth_n).mean().dropna(how='all')
    elif smooth_func == 'max':
        df = df.rolling(smooth_n).max().dropna(how='all')
    elif smooth_func == 'min':
        df = df.rolling(smooth_n).min().dropna(how='all')
    elif smooth_func == 'sum':
        df = df.rolling(smooth_n).sum().dropna(how='all')
    else:
        df = df.rolling(smooth_n).mean().dropna(how='all')
# normalize the data if specified
if norm:
    df = (df-df.min())/(df.max()-df.min())
# drop any empty columns that may remain
df = df.dropna(axis=1, how='all')
# set index to be a datetime for better plotting later
df = df.set_index(pd.to_datetime(df.index, unit='s'))

# look at our data
print(df.shape)
df.head()

Now it's time to build our clustering model using tslearn (there are a few more parameters here that we probably should have added as separate inputs, but not to worry):

# get values to cluster on
X = df.transpose().values
if preprocessing_meanvar:
    X = TimeSeriesScalerMeanVariance().fit_transform(X)
    df = pd.DataFrame(X.reshape(df.shape), columns=df.columns, index=df.index)
if model == 'kshape':
    model = KShape(n_clusters=n_clusters, max_iter=10, n_init=2).fit(X)
elif model == 'kmeans':
    model = TimeSeriesKMeans(n_clusters=n_clusters, metric="euclidean", max_iter=10, n_init=2).fit(X)
elif model == 'dtw':
    model = TimeSeriesKMeans(n_clusters=n_clusters, metric="dtw", max_iter=5, n_init=2).fit(X)
elif model == 'kernelkmeans':
    model = KernelKMeans(n_clusters=n_clusters, kernel="gak", max_iter=5, n_init=2).fit(X)
else:
    model = TimeSeriesKMeans(n_clusters=n_clusters, metric="euclidean", max_iter=10, n_init=2).fit(X)

Once we have our clusters we can make some helper objects to use later:

# build helper df to map metrics to their cluster labels
df_cluster = pd.DataFrame(list(zip(df.columns, model.labels_)), columns=['metric', 'cluster'])

# make some helper dictionaries and lists
cluster_metrics_dict = df_cluster.groupby(['cluster'])['metric'].apply(lambda x: [x for x in x]).to_dict()
cluster_len_dict = df_cluster['cluster'].value_counts().to_dict()
clusters_dropped = [cluster for cluster in cluster_len_dict if cluster_len_dict[cluster]==1]
clusters_final = [cluster for cluster in cluster_len_dict if cluster_len_dict[cluster]>1]
clusters_final.sort()

df_cluster.head()

Finally, the fun part: let's plot each cluster separately and see what we have:

for cluster_number in clusters_final:
    # get a rough quality score based on the correlation between metrics in the cluster
    x_corr = df[cluster_metrics_dict[cluster_number]].corr().abs().values
    x_corr_mean = round(x_corr[np.triu_indices(x_corr.shape[0],1)].mean(),2)
    # plot each cluster
    plot_title = f'cluster {cluster_number} (quality={x_corr_mean}, n={cluster_len_dict[cluster_number]})'
    plot_lines(df, cols=cluster_metrics_dict[cluster_number], renderer='colab', theme=None, title=plot_title)

Here are some good examples:

And some not so good ones:

As is typical with clustering, you are always going to get some pretty bad, random-looking ones, especially since I have really just picked a lot of the parameters above off the top of my head. Most important is k, the number of clusters, which I set to 50 given the high number of metrics we had (over 700).
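
If you wanted to be a bit more principled about picking k, a rough elbow check over the cluster inertia is one option. I didn't do this in the notebook, but a sketch of it, reusing the X from above, could look something like this:

# rough elbow check: fit a few values of k and compare the within-cluster inertia
inertias = {}
for k in [10, 25, 50, 75, 100]:
    km = TimeSeriesKMeans(n_clusters=k, metric="euclidean", max_iter=10, n_init=2).fit(X)
    inertias[k] = km.inertia_
print(inertias)  # look for where the drop in inertia starts to flatten out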

All in all, I found the tslearn library very useful, as it saved me quite a bit of time in getting a quick working prototype up and running, so I'm looking forward to playing with some of the other time series related functionality it offers too.

Premature Optimization

I’ve been doing some work that necessitated using the same statistical test from scipy lots of times on a fairly wide pandas dataframe with lots of columns. I spent a bit too much time googling around for the most efficient ways to do this, and even more time re-writing things various ways, before realizing I should have RTFM a bit more in the first place. Yep, I've gone about a week down a path of premature optimization – but hey, *blog post* 🙂

The Set Up

I have a wide pandas dataframe of lots of time series metrics (one for each column), and I have a ‘focus’ window of time during which I am interested to know which metrics look like they may have changed in some way in reference to a ‘baseline’ window just before the focus window.

A rough first idea (before getting too fancy and building models; not there yet for various reasons) is to break out our old friend the KS test and, for each metric, do a statistical test to see if the ‘focus’ distribution looks statistically significantly different from the ‘baseline’ distribution. The idea being that those metrics that do look to have ‘changed’ in this sense between the two windows might be worth looking at first.

So a pretty simple set up and application. The tricky part was doing this as quickly as possible on a dataframe with around 500-1000 columns and anywhere between 1000-10000 rows of data as a rough typical usage scenario.
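
For reference, the snippets below assume a dataframe shaped roughly like that, with a ‘window’ column marking which rows are ‘base’ and which are ‘focus’. My real data obviously isn't reproduced here, so a random stand-in like the below will do (the 80/20 split and the starting ks_mode are just assumptions for illustration):

import numpy as np
import pandas as pd

# dummy stand-in for the real data: 500 metric columns, 1000 rows,
# first 800 rows are the 'base' window, last 200 are the 'focus' window
n_rows, n_cols = 1000, 500
df = pd.DataFrame(np.random.randn(n_rows, n_cols), columns=[f'metric_{i}' for i in range(n_cols)])
df['window'] = np.where(df.index < int(n_rows * 0.8), 'base', 'focus')
ks_mode = 'exact'  # what I was effectively running with before the fix further down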

Dumb Approach

So my first approach, as usual, is to do the dumbest thing I can and just get something that works and go from there. So here is my ks_df_dumb() function.

import time

from scipy.stats import ks_2samp


def ks_df_dumb(df, ks_mode):
    """
    Take in a df, loop over each column, split into base and focus, and apply the test.
    """
    results = []
    for col in df._get_numeric_data():
        base = df[df['window'] == 'base'][col].values
        focus = df[df['window'] == 'focus'][col].values
        ks_stat, p_value = ks_2samp(base, focus, mode=ks_mode)
        results.append((ks_stat, p_value))
    return results

If I run this on my test dataframe of 500 columns * 1000 rows I see the below timings.

%%timeit -n 5 -r 5
results = ks_df_dumb(df, ks_mode)
# 3.77 s ± 57.4 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
print('ks_df_dumb')
start_time = time.time()
results = ks_df_dumb(df, ks_mode)
end_time = time.time()
print(f'{round(end_time-start_time,2)} seconds')
# ks_df_dumb
# 3.55 seconds

So about 3-4 seconds, which is not great for what I need (it may end up being something a user clicks to trigger, so I want them to wait as little as possible for the results).

Vectorize it?

So now I start messing around with super cool tricks to try and be a hero. I know better than to be looping over stuff in python and pandas, so I know, I'll try to vectorize it!

def ks_df_vec(df, ks_mode):
    """Take in a df, and use np.vectorize to avoid the pandas loop.
    """

    def my_ks_2samp(a, b):
        """Wrapper function so the vectorized function picks up the ks_mode arg.
        """
        return ks_2samp(a, b, mode=ks_mode)

    base = df[df['window'] == 'base']._get_numeric_data().transpose().values
    focus = df[df['window'] == 'focus']._get_numeric_data().transpose().values
    ks_2samp_vec = np.vectorize(my_ks_2samp, signature='(n),(m)->(),()')
    results = ks_2samp_vec(base, focus)
    results = list(zip(results[0], results[1]))
    return results

Now I see:

%%timeit -n 5 -r 5
results = ks_df_vec(df, ks_mode)
# 2.22 s ± 35.5 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
print('ks_df_vec')
start_time = time.time()
results = ks_df_vec(df, ks_mode)
end_time = time.time()
print(f'{round(end_time-start_time,2)} seconds')
# ks_df_vec
# 2.16 seconds

So a bit better at just over 2 seconds but still not great given this is still only 1000 rows of data.

Numpy?

Time to break out numpy! (Confession: I never really learned numpy properly and find it very painful to work with and reason about the data and shapes etc. as I do stuff to them. It all just feels so unnatural to me in some way, and I find it hard to keep track of things without any indexes or keys; I just don't trust myself with it. I know I'm not supposed to say this out loud, but hey.)

So my approach now will be to just get the data into two separate numpy arrays and work solely with them.
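
The arr_base and arr_focus arrays used below aren't built in the earlier snippets, so for clarity this is roughly how they come about (rows are observations, columns are metrics):

# split the two windows out into plain numpy arrays
arr_base = df[df['window'] == 'base']._get_numeric_data().values
arr_focus = df[df['window'] == 'focus']._get_numeric_data().values
print(arr_base.shape, arr_focus.shape)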

def ks_np_dumb(arr_a, arr_b, ks_mode):
    results = []
    for n in range(arr_a.shape[1]):        
        ks_stat, p_value = ks_2samp(arr_a[:,n],arr_b[:,n], mode=ks_mode)
        results.append((ks_stat, p_value))
    return results
%%timeit -n 5 -r 5
results = ks_np_dumb(arr_base, arr_focus, ks_mode)
# 2.43 s ± 200 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
print('ks_np_dumb')
start_time = time.time()
results = ks_np_dumb(arr_base, arr_focus, ks_mode)
end_time = time.time()
print(f'{round(end_time-start_time,2)} seconds')
# ks_np_dumb
# 2.22 seconds
def ks_np_vec(arr_a, arr_b, ks_mode):
    
    def my_ks_2samp(a,b):
        return ks_2samp(a,b,mode=ks_mode)
    
    ks_2samp_vec = np.vectorize(my_ks_2samp, signature='(n),(m)->(),()')
    results = ks_2samp_vec(arr_a.T, arr_b.T)
    results = list(zip(results[0], results[1]))
    return results
%%timeit -n 5 -r 5
results = ks_np_vec(arr_base, arr_focus, ks_mode)
# 2.2 s ± 38.7 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
print('ks_np_vec')
start_time = time.time()
results = ks_np_vec(arr_base, arr_focus, ks_mode)
end_time = time.time()
print(f'{round(end_time-start_time,2)} seconds')
# ks_np_vec
# 2.29 seconds

Hmm – that did not seem to add too much, which I guess is kinda reassuring. It makes sense that the dumb numpy approach would be a little bit faster than the dumb pandas one, but it is comforting that it's not orders of magnitude different.

And it makes sense that the numpy dumb and numpy vectorize approaches are not that different, as the docs for np.vectorize state that it's really still just a loop (so to properly vectorize this I'd probably have to do a lot more work to figure it out).

Feck this, time for Cython!!!

Hell yeah, I'm going to cythonize the shit out of this! Let's be honest, this is what I've wanted to do the whole time: do something with cython so I can boast to all my friends about the awesome speedup I got, even just by adding some typing information.

Let's go.

%%cython

import numpy as np
cimport numpy as np
cimport cython
from scipy.stats import ks_2samp

DTYPE = np.double


cpdef cy_ks_np(double[:, :] arr_a, double[:, :] arr_b, str ks_mode):

    cdef double k, p
    cdef Py_ssize_t i
    cdef Py_ssize_t m = arr_a.shape[1]
    
    result = np.zeros((m, 2), dtype=DTYPE)
    cdef double[:, :] result_view = result

    for i in range(m):
        k, p = ks_2samp(arr_a[:,i], arr_b[:,i], mode=ks_mode)
        result_view[i,0] = k
        result_view[i,1] = p

    return result

Ahhh, look at it. Very pleased with it, if I do say so myself. I managed to wrangle this tutorial to fit my needs. Went to bed that night very pleased with myself.

But… what's this…

%%timeit -n 5 -r 5
results = cy_ks_np(arr_base, arr_focus, ks_mode)
# 2.28 s ± 54 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
print('cy_ks_np')
start_time = time.time()
results = cy_ks_np(arr_base, arr_focus, ks_mode)
end_time = time.time()
print(f'{round(end_time-start_time,2)} seconds')
# cy_ks_np
# 2.1 seconds

2.2 seconds!!! What the heck, I was expecting some magical voodoo that would speed me up by at least 10x. Come on cython, don't do this to me; I was going to be a hero, they were going to chant my name in the office.

So I did the logical next step – made a reproducible example and asked StackOverflow to do it for me 🙂

Bro, do you even Profile!

So while I waited on SO to do its thing, I asked a few real engineers in my company what they thought. And their first response was – did you profile your code?

I began to panic. I've been found out, oh no, it's happening. So I quickly looked up the jupyter cell magic to profile my functions.
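
If you haven't used it, the cell magic in question is %%prun; a run like the below is roughly what produced the output I'm describing:

%%prun -s cumulative -l 10
# profile the dumb pandas approach and show the 10 most expensive calls by cumulative time
results = ks_df_dumb(df, ks_mode)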

Well, would ya look at that – 500 calls to stats.py:5187(_compute_prob_inside_method) taking up ~1.8 seconds of my ~2 seconds.

Turns out this whole exercise has been a bit of a waste of time so far. So I went back and dug into the ks_2samp() docs and the code on github to see if anything could be done.

Wait, what's this mode parameter? Maybe I can play with that a bit. Oh, one option is "'asymp': use asymptotic distribution of test statistic"; that sounds like it could be faster than 'exact'.

So with ks_mode='asymp' I ran things again and held my breath.
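
For completeness, that re-run is just the same function with the faster mode passed through, no other changes:

%%timeit -n 5 -r 5
# same dumb loop as before, just asking scipy for the asymptotic p-value instead of the exact one
results = ks_df_dumb(df, 'asymp')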

Lo and behold, the solution was staring me in the face all along; obviously someone had already provided some knobs and parameters to get a faster implementation under the hood.

As per usual, I should have stuck to my number 1 rule of writing as little code as possible and trying to use other people's good work to make myself look better 🙂

p.s. all the code is in a notebook here.

Terraform is Magic + r/MachineLearning Links

Terraform is magic. I may be a little late to the game on this one, and I'm sure it has its fair share of haters (I've seen some have a love-hate relationship with it; maybe I'm still in my honeymoon period).

But from my point of view as a Data Scientist/ML Engineer playing around with various services in a multi-cloud environment (mainly GCP and AWS), it's one of the things that keeps me sane in a sea of changing features and web UIs that the cloud providers love to throw at us.

When it comes to the cloud projects I'm working on, if it's not under source control and managed by terraform then I don't want to know about it.

So, as I've been on my voyage of discovery with terraform, I decided to put it to use in a little side project I've been wanting to do for a while.

(Note: All code is available in my reddit-links Github repo.)

The Set Up

The idea is to have a cloud function that runs each day, looks at top posts on r/MachineLearning, pulls out all links, and some metadata, and then saves those links somewhere useful.

My original goal (and something I might get to at some stage) was to pull all YouTube links shared on r/MachineLearning that have a decent score into an automatic playlist on YouTube, or maybe automatically add them to my own watch later playlist. I have the data now, so this might be an easy enough next project.

So given the goal above here are the various moving parts I’ve picked to use:

  • Terraform: To manage all the GCP stuff used.
  • GCP Cloud Function: This will be the code that pulls from r/MachineLearning, wrangles the text and html data from reddit, pulls out the links and then either inserts or updates Airtable accordingly.
  • GCP PubSub Topic: This will be the trigger for the cloud function.
  • GCP Cloud Scheduler: This will run a sort of cron job in GCP each day to push a message to the pubsub topic which will in turn trigger the cloud function.
  • Airtable: This is where the data will be stored and published from. I had considered a GCP bucket but Airtable is much easier to share and a bit more user friendly for anyone who might want to use the links pulled from reddit.

Terraform

Once you have worked with and set up a Terraform project once, it's pretty straightforward. I have used Serverless for some cloud functions before, but I like the way terraform gives you everything GCP or AWS or Azure etc. have to offer at your fingertips once you invest that little bit of learning up front.

Here is a list of the .tf files I'm using and what they all do. There are of course many ways to set things up in Terraform, but this seemed like a straightforward enough way and works for me for smaller projects like this.

  • backend.tf – this is optional and used to have a remote backend for the state of your project as opposed to somewhere on your laptop.
  • conf.tf – this is a file I'm using to define any sensitive variables in terraform that I don't want going into source control (make sure you add it to .gitignore) but need to make available to terraform to do its stuff. I have included a conf_example.tf file in the repo to show how this looks.
  • variables.tf – used to define other variables used by terraform.
  • provider.tf – conventional file to define the cloud providers you want to be able to use.
  • gcp-cloud-functions.tf – this is where we will define all the things we need related to our Cloud Functions.
  • gcp-cloud-scheduler.tf – used to define the Cloud Scheduler cron jobs we need.
  • gcp-pubsub-topics.tf – used to define the PubSub topics that will trigger the cloud function.

GCP Function – redditlinks

The function used to pull from reddit is below. It might be a little verbose and could probably be refactored a little but hey – it works. Mainly using the PRAW library to pull from reddit, some usual data wrangling libraries, and then the airtable-python-wrapper library to insert/update records in Airtable.

(The full function is in reddit-links.py in the repo.)
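
As a rough idea of its shape, a stripped down version of the reddit side of things looks something like the below (hypothetical credentials, and without the html parsing and Airtable upsert logic of the real function):

import praw

# pull yesterday's top posts from r/MachineLearning and collect any outbound links
reddit = praw.Reddit(client_id='...', client_secret='...', user_agent='reddit-links')
links = []
for submission in reddit.subreddit('MachineLearning').top(time_filter='day', limit=100):
    if not submission.is_self:
        links.append({'title': submission.title, 'score': submission.score, 'url': submission.url})
print(f'found {len(links)} links')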

End Results – Airtable

The end result of all this is a cloud function that runs once a day to update or insert records into this Airtable. Then, from that, I have made some views for links from specific domains.

(Note: if you wanted to do this for a different subreddit, you should only need to make minimal changes once you have your terraform variables all set up – primarily changing or adding another cloud scheduler job here with the relevant params, as well as making sure all the keys etc. you need are available in the right places.)

r/MachineLearning – YouTube Links

r/MachineLearning – Arxiv Links

Ireland Covid19 Data

I was looking around a bit and could not really find any datasets behind the daily updates from the Irish government that get posted here. In particular, I was thinking the breakout tables of numbers by different dimensions might be of use to anyone looking to analyse the data.

So here is a python script to grab all press release links from the updates page, pull the html tables into pandas dataframes, do some ugly/gnarly data wrangling, and save the results into csv files here.
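
The core of the table scraping is just pandas doing the heavy lifting; something like the below (the real script first loops over every press release link it finds on the updates page, and the url here is just a placeholder):

import pandas as pd
import requests

# pull any html tables from a single press release page into a list of dataframes
url = '...'  # one of the press release urls scraped from the updates page
tables = pd.read_html(requests.get(url).text)
print(f'found {len(tables)} tables')
tables[0].head()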

As an example, I've stuck some of the headline figures and stats in a Tableau dashboard here.

Update1: This looks like a nice dashboard using similar data for Ireland.

Update2: IrishDataViz is a great twitter account with some analysis of the Irish numbers.

A little brainteaser (or i’m an idiot)

This took me waaay too long to work out today, and I was thinking it could make a nice little interview coding type question (which I'd probably fail).

Suppose you have 10,000 rows of data and need to continually train and retrain a model, training on at most 1,000 rows at a time and retraining the model every 500 rows. Can you tell me how many “batches” of data this will create, and the start and end index of each batch?

So that's:

n = 10000
train_max = 1000
train_every = 500

And we want a dictionary like this:

{
  1: {"start": 1, "end": 1000},
  2: {"start": 500, "end": 1000}, 
  ...
  ?: {"start": ?, "end": ?},
}

After doing some crazy loops in python for a while, I decided to go back to basics and do it Jeremy Howard style in excel (well, gsheets, I'm not a savage) – gsheet.

And here is my Python solution:

def calc_batches(train_max: int, train_every: int, n: int) -> dict:
    batches = dict()
    # loop over up to as many records as you have
    for batch in range(n):
        # work out the start of the batch, with a max() to handle the first batch
        start = max(train_every * batch, 1)
        # work out the end of the batch, with a min() to handle the last batch
        end = min(train_max + (train_every * batch), n)
        # add batch info to the dictionary
        batches[batch + 1] = {"start": start, "end": end}
        # break out once you have assigned all rows to a batch
        if end == n:
            break
    return batches


calc_batches(train_max=1000, train_every=500, n=10000)
'''
{1: {'start': 1, 'end': 1000},
2: {'start': 500, 'end': 1500},
3: {'start': 1000, 'end': 2000},
4: {'start': 1500, 'end': 2500},
5: {'start': 2000, 'end': 3000},
6: {'start': 2500, 'end': 3500},
7: {'start': 3000, 'end': 4000},
8: {'start': 3500, 'end': 4500},
9: {'start': 4000, 'end': 5000},
10: {'start': 4500, 'end': 5500},
11: {'start': 5000, 'end': 6000},
12: {'start': 5500, 'end': 6500},
13: {'start': 6000, 'end': 7000},
14: {'start': 6500, 'end': 7500},
15: {'start': 7000, 'end': 8000},
16: {'start': 7500, 'end': 8500},
17: {'start': 8000, 'end': 9000},
18: {'start': 8500, 'end': 9500},
19: {'start': 9000, 'end': 10000}}
'''

…I'm pretty sure someone will come along with a super pythonic one-liner that shows maybe I am an idiot after all.

Ok now back to work.

Update: actually, I think what I want is more something like the below, where you can define a minimum and maximum size for your training data and then roll that over your data.

def calc_batches(train_min: int, train_max: int, train_every: int, n: int) -> dict:
    batches = dict()
    batch = 0
    for row in range(1, n + 1):
        if row < train_min:
            pass
        elif row == train_min:
            batches[batch] = dict(start=0, end=row)
        elif row % train_every == 0:
            batch += 1
            batches[batch] = dict(start=max(0, row - train_max), end=row)
    return batches
calc_batches(train_min=1000, train_max=5000, train_every=500, n=10000)
"""
{0: {'start': 0, 'end': 1000},
1: {'start': 0, 'end': 1500},
2: {'start': 0, 'end': 2000},
3: {'start': 0, 'end': 2500},
4: {'start': 0, 'end': 3000},
5: {'start': 0, 'end': 3500},
6: {'start': 0, 'end': 4000},
7: {'start': 0, 'end': 4500},
8: {'start': 0, 'end': 5000},
9: {'start': 500, 'end': 5500},
10: {'start': 1000, 'end': 6000},
11: {'start': 1500, 'end': 6500},
12: {'start': 2000, 'end': 7000},
13: {'start': 2500, 'end': 7500},
14: {'start': 3000, 'end': 8000},
15: {'start': 3500, 'end': 8500},
16: {'start': 4000, 'end': 9000},
17: {'start': 4500, 'end': 9500},
18: {'start': 5000, 'end': 10000}}
"""

Papers i’m reading #2

Continuation from this post.

An unsupervised spatiotemporal graphical modeling approach to anomaly detection in distributed CPS (Cyber Physical Systems).

Link

My Summary: Really interesting paper – PGMs, HMMs and all that good stuff. Quite complicated though, and no clear route to implementation. Also, I would wonder how well it scales beyond tens of time series. More useful for learning about some different approaches than as anything that could be implemented easily. 

Notes:

  • “Spatiotemporal” feature extraction, “symbolic dynamics”, “causal interactions”… some fancy sounding stuff. It seems to try to get at causal relationships, which might be a bit too strong a claim, but definitely interesting stuff that I had not come across before.
  • CPS = power plants, power grids, transport systems etc. 
  • Symbolic dynamic filtering – look this up.
  • General setting in this paper is a probabilistic graphical model type set up.
  • State, Alphabet, Symbol generation matrix, Importance Score – similar sort of set up to HMM type models.
  • Partitioning → States → Spatiotemporal Pattern Network (STPN).
  • Energy based models another way to frame and think about these models.
  • I wonder how well this model scales to many time series? My gut says probably ok for a handful of time series as opposed to 100’s.
  • Some nice pictures of how it all hangs together. 
  • All focused on short subsequences as the core inputs it learns on. Can be overlapping windows. 
  • Lots of maths and symbols in this paper! Very complicated and hard to follow – would need to read 10 times.
  • Does not look like any code or implementations out there so not sure if easily implementable. 
  • Nice results on synthetic data with multiple modalities. 
  • Interesting results on real world smart home (HVAC and underfloor heating) data too. Just a handful of time series and a nice and clear case study with obvious causal chains. 
  • Interesting to see global and local anomalies picked up.
  • Still seems like some parameters to tune/pick – also need to threshold the importance score.   

Recent Advances in Anomaly Detection Methods applied to Aviation.

Link

My Summary: Interesting paper focusing specifically on aviation, but in a broad sense and in an up to date manner, covering many newer techniques too. Still, it seems that in reality a lot of domain specific and traditional methods are what is actually used, as opposed to what's in the research.

Notes:

  • Good food for thought on differences and similarities between novelty detection and anomaly detection. 
  • They use the usual typical taxonomy of point, contextual, and collective.
  • Lots of references to explainability as well, which makes sense in an aviation setting.  
  • “Temporal Logic Based Models” – sounds interesting need to look into  this more.
  • “Model Driven” (rules from experts) vs “Data Driven” (Machine learned).
  • Really nice taxonomy below.
  • 3 types of clustering based assumptions:
    • Anomaly as outside any cluster.
    • Anomaly as far away from centroids.
    • Density based whereby cluster can be a sparse cluster that are all anomalies. 
  • Nice discussion and motivation for the Isolation Forest based approach: “Anomalies should be easier to isolate and so should have a shorter path length on average.”
  • Need to look up the reference on online isolation forest: “An Anomaly Detection Approach Based on Isolation Forest Algorithm for Streaming Data using Sliding Window”.
  • Statistical methods – estimation of probability densities. Can be quicker at inference time than clustering. 
  • GMM’s as example approach here. But you still need to pick the number of gaussians. Bayesian GMM’s as option.
  • ICA application to AD is worth looking at. 
  • NASA MKAD seems to get a lot of mention and looks like maybe SOTA in this setting.  
  • PCA based motivation – if the less important components have bigger values then that's a sign of an anomaly. 
  • Autoencoders motivation – anomalies cannot be compressed well. 
  • Extreme Learning Machines (ELM) – some references to this – need to look it up.
  • Overlapping sliding windows as feature transformers. 
  • MSCRED encdec+conv+lstm approach. Worth looking into. 
  • A section and discussion around cases when interpretability matters and “temporal logic based learning” that can learn signal temporal logic (STL) predicates that human domain experts can then understand. No idea what this stuff is, need to look into it. Maybe start here
  • Good discussion on a range of different applications to the aviation sector. 
  • Still, what's actually in production a lot of the time is domain expert systems with manual thresholds.
  • OC-SVM seems to come up quite a bit in the paper. 
  • Whole section on anomaly detection as an input into predictive maintenance.  
  • Once you have flagged anomalies maybe you can build models to look for precursor events before the anomaly. Part of discussion on temporal logic based learning. 

Anomaly Detection in Flight Recorder Data: A Dynamic Data-driven Approach (NASA).

Link

My Summary: A nice look at the different systems and approaches used in aviation. Interesting type of feature engineering proposed: Symbolic Dynamic Filtering (SDF). 

Notes:

  • Idea: What about using SDF feature extraction and then applying point based AD algos.
  • SDF > PCA for feature based extraction and dimensionality reduction in some of the examples they looked at. 
  • Seems like a lot more focus on unusual patterns as opposed to just point detection in flight setting. 
  • Lots of good references on more traditional approaches and systems used in aviation.
  • SDF – the notion of fast scale vs slow scale time features is an interesting idea. 
  • They used simple enough correlation based preprocessing to throw away and reduce redundant features.
  • Looks like they compared normalised zscores to sdf based scores. 
  • Big question is if SDF features can efficiently be calculated in an online setting. 

Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm

Link

My Summary: Very quick and crisp paper, big focus on computational efficiency and linear time of HBOS. Main downside is HBOS seems mainly univariate. 

Notes:

  • Linear scoring time.
  • But does perform poorly on local outlier problems as opposed to global ones.
  • Fast.
  • 3 main categories of unsupervised AD:
    • Nearest Neighbour
    • Clustering 
    • Statistical
      • Parametric
        • GMM
      • Non-Parametric
        • Hist
        • KDE
  • Main idea – just use histograms as density estimators! (A toy sketch of this idea is included after these notes.)
  • Dynamic bins recommended.
  • One problem is that it’s still just a point anomaly detector – scores a point at a time.
    • Maybe could get around this with post scoring smoothing or something but still a limitation.
  • When and how to update the reference histogram?
  • Maybe with careful feature processing or grouping it could be done in multivariate setting.
  • Would be interesting to see how hbos compares to traditional zscore based approach.  
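
To make the “histograms as density estimators” idea concrete, here is a toy numpy sketch of the core scoring step (not the paper's exact algorithm, which uses things like dynamic bin widths and normalises the histograms differently):

import numpy as np

def hbos_scores(X, bins=10):
    """Toy HBOS: per feature, estimate density with a histogram and sum -log(density) across features."""
    scores = np.zeros(X.shape[0])
    for j in range(X.shape[1]):
        hist, edges = np.histogram(X[:, j], bins=bins, density=True)
        # map each point to its histogram bin via the interior bin edges
        idx = np.digitize(X[:, j], edges[1:-1])
        density = np.maximum(hist[idx], 1e-12)  # avoid log(0) for empty bins
        scores += -np.log(density)
    return scores

X = np.random.randn(1000, 3)
X[0] = [8.0, 8.0, 8.0]  # an obvious global outlier
print(hbos_scores(X).argmax())  # higher score = more anomalous, so this should usually be 0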

Visualizing Big Data Outliers through Distributed Aggregation (HDOutliers)

Link

My Summary: Seems to have some nice properties, but it's not clear if it is suitable at all for an online setting.

Notes:

  • Nice plots and discussion around boxen plots and letter-value-boxplot or dixons plot. 
    • IQR and traditional box plots flag way too much as outliers when the data size is large; due to the way percentiles are calculated, this is an inherent property.
  • Good discussion on underlying assumptions, their typical violations etc.
  • Typical zscore based approaches assume normality but use estimated mean and stdev from data which are very sensitive to the presence of outliers – so there is a circularity here to be careful of. Some robust extensions to this are possible.
  • The gaps based approach to univariate anomaly detection sounds interesting. Not really applicable to multivariate data.
  • All seems to revolve around a normalized X vector. 
  • Unclear on the efficiency of the inference step if any.
  • Mentions LOF as one of the most popular algos and does some comparisons to it.
  • Interesting use of parallel coords chart to explore anomalies once detected.
  • Non-parametric smoothers are often used in the time series setting: the smoother as the model and the residuals as the anomaly score. 
  • HDOutliers does give a probability based threshold which is a nice property.    

Anomaly Detection for Discrete Sequences: A Survey

Link

My Summary: Interesting enough survey of a totally different way of potentially framing time series AD. Paper is from 2012 so a little old, but a good overview of higher level approaches in a more traditional sense.

Notes:

  • We can sometimes map our anomaly detection problem into a sequence detection problem with a predefined alphabet. 
  • There is a design choice around full sequence vs subsequence approach. 
  • Computational complexity can be a concern if focusing on subsequences within long sequences (potential parallels to genomics in this regard).
  • Wide range of domains and applications – this paper tries to give a more universal overview and approach that is agnostic of domain and application specifics.
  • Three broad problem formulations
    • Entire sequence anomaly detection
    • Subsequence anomaly detection
    • Frequency lookup/reference based anomaly detection given a specific query sequence.
  • Formulation 1 and 2 can be considered special cases of each other in certain circumstances. 
  • For online time series AD we are either in 1 or 2 formulation.
  • If anomaly length is known in advance then this can impact design a lot (e.g. known mutation lengths in DNA for example).
  • Looks to me like this approach is fundamentally univariate based which could be one big drawback. 
  • There are similarities to something like a language model and if probability for the ‘sentence’ is very small then it might be an anomaly. 
  • Rolling window based approaches can better find localised anomalies. 
  • Markovian and ‘model’ based approaches.
  • Sparse markovian approaches allow for more flexible and inexact anomaly pattern matching. 
  • HMM approaches can be very sensitive to underlying assumptions and params.
  • Seems like these sequence based approaches could be useful to explore as a totally different way to attack the problem. They might not be strong enough on their own, but could be useful in a more ensemble based approach. 
  • Anomalous ‘discord’ detection within a sequence.
  • Mentions Hot Sax paper – must look into this. 
  • Some talk of bitmap based representations – need to look into this. 
  • Third approach is to ask “What is the expected frequency of this sequence pattern?”.

Anomaly Detection in Streams with Extreme Value Theory (SPOT)

Link

My Summary: Very interesting paper; SPOT seems to have lots of advantages, being fast and making few assumptions. Very much focused on the point anomaly setting.

Notes:

  • Makes no assumptions about underlying distributions.
  • Risk is only input parameter. 
  • Can be used to automatically set dynamic thresholds.
  • Idea: would it be possible to convert contextual or collective into a point detection setting by using clever feature transformations (e.g Catch22 etc).
  • EVT is like a CLT type result but for extreme values.
  • DSPOT for non-stationary case (D for Drift).
  • They are working in a univariate and unimodal setting <- potential limitation. 
  • Builds on Peak Over Thresholds (POT) approach to EVT estimation.
  • DSPOT reacts quickly to change in stream properties. 
  • Me: DSPOT is really comparable to zscore based approach for spike detection type setting. 
  • They use typical taxonomy of distance based, nearest neighbour, or clustering. 
  • Lots of interesting discussion and motivation around EVT and the nice properties it has. 
  • Some detailed discussion on estimation approaches to EVT. 
  • MLE as best way to estimate EVD properties. 
  • Idea: potential to slice windows for feeding into DSPOT in non standard ways. 
  • There is a burn in or ‘calibration’ phase that you can kind of think of as an initial training step. N>=1000 is their recommendation. Just not too small. 
  • Initial threshold t = 0.98, determines initialisation. 
  • Python3 implementation here: https://gforge.inria.fr/scm/browser.php?group_id=9388
  • SPOT is robust.
  • Nice example application with some feature engineering around network attack setting. 
  • Idea: using DSPOT and smoothing/preprocessing to tune various properties of the detector. 
  • q param as a false positive regulator. Interesting ROC curve showing TPR vs FPR for different q values. 
  • Another reference to https://www.comp-engine.org/ was made.   
  • Curious why not preprocess to look at abs(diffs) in some of the examples they discuss. They also show evidence that this is slightly faster as opposed to bi-DSPOT.
  • Well worth playing around with this algo to compare to traditional zscore based approach. 
  • Maybe implementing it at different granularities could be useful/interesting. 

Github Webhook -> Cloud Function -> BigQuery

I have recently needed to watch and track various activities on specific github repos I'm working on; however, the REST API from Github can sometimes be a bit limited (for example, as best I could see, if you want to get the most recent list of people who began watching your repo you need to make a lot of paginated api calls and do battle with rate limiting 💩).

This is where Github Webhooks can be a very useful alternative way to trigger certain events of interest to some endpoint where you can then handle the data as you need. The use case I was interested in was triggering an event any time someone starred, unstarred, watched or forked a specific repository. I wanted to then store that info in a table in Google BigQuery, where it can be used to track repository activity over time for whatever reasons you might want (outreach to the community around the repository, or just tracking growth over time).

After the usual few hours of googling around, I landed upon the idea of having the Github webhook send events to a Google Cloud Function; from there my cloud function can process and append the data onto a BigQuery table. To make developing and maintaining the cloud function easy I used Serverless, and built on this example in particular.

p.s. I also found this repository very useful, as well as this one from Bloomberg. Also, I think you could maybe get something similar done without any code using something like Zapier (although I don't think they have all the Github Webhook events available).

p.p.s all the code is in this repo.

Step 1 – Serverless

We start by leveraging this Serverless example to create the bare bones structure for our cloud function.

In a folder where we want the code to live we run the below to install Serverless if needed, and pull down the google-python-simple-http-endpoint template and save it into a new Serverless project called handle-github-events.

npm install serverless -g
serverless install -u https://github.com/serverless/examples/tree/master/google-python-simple-http-endpoint -n handle-github-events

The approach I am taking also depends on using a .env file to handle secrets and environment variables, so we also need to install the serverless-dotenv-plugin and run npm install for everything else we need.

cd handle-github-events
npm i -D serverless-dotenv-plugin
npm install

Step 2 – Cloud Function

Once we have the bare bones serverless template in place, we can build on it to create the function we want for handling incoming requests from the Github webhook. All the code is in this repository and I'll walk through the main points below.

The core of what we want to do in our Cloud function is in main.py. What it tries to do is:

  1. Validate that the request is coming from a known Github ip address.
  2. Validate that the hashed secret key stored in Github when you create your webhook matches what is expected by the cloud function as pulled from the GITHUB_WEBHOOK_SECRET environment variable.
  3. Parse the json received from the Github request and append it to a table somewhere in BigQuery.
  4. Return as the response to Github some info about the event.
"""
GCP HTTP Cloud Function to handle github webhook events.
Some code stolen from here: https://github.com/carlos-jenkins/python-github-webhooks/blob/master/webhooks.py
"""
# -*- coding: utf-8 -*-
import hmac
import json
import datetime
import logging
import os
from ipaddress import ip_address, ip_network
import pandas as pd
import requests
def validate_request_ip(request):
"""Function to validate that request comes from a known github ip"""
# get ip of request
request_ip_address = ip_address(u'{}'.format(request.access_route[0]))
# get whitelist of valid ip's from github
github_ip_whitelist = requests.get('https://api.github.com/meta&#39;).json()['hooks']
# check if ip is a valid one from github
for valid_ip in github_ip_whitelist:
if request_ip_address in ip_network(valid_ip):
break
else:
error_msg = 'IP {} not allowed.'.format(request_ip_address)
logging.error(error_msg)
raise ValueError(error_msg)
def validate_request_signature(request):
"""Validate that request signature and function signature match"""
# get signature from header
sha_name, request_signature = request.headers.get('X-Hub-Signature').split('=')
# create matching signature
function_signature = hmac.new(
str.encode(os.environ.get('GITHUB_WEBHOOK_SECRET', 'Specified environment variable is not set.')),
msg=request.data,
digestmod='sha1').hexdigest()
# check if signatures match
if str(request_signature) != str(function_signature):
error_msg = 'Signatures do not match.'
logging.error(error_msg)
raise ValueError(error_msg)
def validate_event_type(event_type):
"""Function to error out if event type is of a type not yet implemented for handling by this function"""
if event_type not in ['star', 'watch', 'fork']:
error_msg = f"Event Type '{event_type}' not yet implemented by this function."
logging.error(error_msg)
raise NotImplementedError()
def github_event(request):
"""Function to handle incoming event from github webhook and save event data to BigQuery."""
# validate request ip
validate_request_ip(request)
# validate request signature
validate_request_signature(request)
# request_timestamp
request_timestamp = str(datetime.datetime.now())
# github_request_type
github_event_type = request.headers.get('X-GitHub-Event')
validate_event_type(github_event_type)
# get relevant env vars
gcp_project_id = os.environ.get('GCP_PROJECT_NAME')
bq_dataset_name = os.environ.get('BQ_DATASET_NAME')
bq_table_name = os.environ.get('BQ_TABLE_NAME')
bq_if_exists = os.environ.get('BQ_IF_EXISTS')
bq_table_suffix = request_timestamp.replace('-', '')[0:8]
# get json from request
request_json = request.get_json()
# create response body
response_body = {
"request_method": str(request.method),
"timestamp": request_timestamp,
"event_type": github_event_type,
"action": request_json.get("action", github_event_type),
"starred_at": request_json.get("starred_at", ""),
"repository_full_name": request_json.get("repository")["full_name"],
"sender_username": request_json.get("sender")["login"]
}
# build response
response = {
"statusCode": 200,
"body": response_body
}
# logging response
logging.info(response)
# make pandas df
data = [response_body['timestamp'], response_body['repository_full_name'], response_body['event_type'],
response_body['action'], response_body['sender_username']]
columns = ['timestamp', 'repo', 'event_type', 'action', 'username']
df = pd.DataFrame(data=[data], columns=columns)
# display df.head() in logs
logging.info(df.head())
# save to big query
df.to_gbq(
destination_table=f'{bq_dataset_name}.{bq_table_name}_{bq_table_suffix}',
project_id=gcp_project_id, if_exists=bq_if_exists
)
return json.dumps(response, indent=4)

Our serverless.yml file looks like below. Note that it is pulling environment variables required for serverless to deploy from a .env file you would need to create yourself (here is an example in the repo).

service: handle-github-events

frameworkVersion: ">=1.2.0 <2.0.0"

package:
  exclude:
    - node_modules/**
    - .gitignore
    - .git/**

plugins:
  - serverless-google-cloudfunctions
  - serverless-dotenv-plugin

provider:
  name: google
  runtime: python37
  project: ${env:GCP_PROJECT_NAME}
  region: ${env:GCP_REGION_NAME}
  credentials: ${env:GCP_KEY_FILE}

functions:
  github_event:
    handler: github_event
    events:
      - http: path

Step 3 – Deploy

Once we are ready, we run `serverless deploy` and, if all goes well, we should see output like the below:

>serverless deploy -v
Serverless: DOTENV: Loading environment variables from .env:
Serverless:      - GITHUB_WEBHOOK_SECRET
Serverless:      - GCP_KEY_FILE
Serverless:      - GCP_PROJECT_NAME
Serverless:      - GCP_REGION_NAME
Serverless:      - BQ_DATASET_NAME
Serverless:      - BQ_TABLE_NAME
Serverless:      - BQ_IF_EXISTS
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Compiling function "github_event"...
Serverless: Uploading artifacts...
Serverless: Artifacts successfully uploaded...
Serverless: Updating deployment...
Serverless: Checking deployment update progress...
....................
Serverless: Done...
Service Information
service: handle-github-events
project: <your project name will be here>
stage: dev
region: <your region will be here>

Deployed functions
github_event
  https://<your-region>-<your-project-name>.cloudfunctions.net/github_event

Serverless: Removing old artifacts...

Now you should have a cloud function alive at some url like https://your-region-your-project-name.cloudfunctions.net/github_event.
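
If you want to poke the deployed function by hand before wiring up Github, something like the below computes a valid X-Hub-Signature header. Note this is just a sketch: the IP whitelist check in main.py will reject anything not coming from Github's addresses, so you'd have to relax that for this kind of manual test, and the payload fields here are just the minimum the function reads.

import hmac
import json

import requests

secret = b'my-webhook-secret'  # same value as GITHUB_WEBHOOK_SECRET in .env
payload = json.dumps({
    "action": "started",
    "repository": {"full_name": "andrewm4894/random"},
    "sender": {"login": "someuser"},
}).encode()
# github signs the raw payload with a sha1 hmac and sends it as 'sha1=<hexdigest>'
signature = 'sha1=' + hmac.new(secret, msg=payload, digestmod='sha1').hexdigest()
resp = requests.post(
    'https://<your-region>-<your-project-name>.cloudfunctions.net/github_event',
    data=payload,
    headers={
        'Content-Type': 'application/json',
        'X-GitHub-Event': 'star',
        'X-Hub-Signature': signature,
    },
)
print(resp.status_code, resp.text)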

Step 4 – Github Webhook

Once your function is deployed (or, in reality, you might make the Github webhook first and then iterate on the function to get it doing what you want), you can create and test the Github webhook you want to send events from.

In my case, and for this post, I'm going to add the webhook to my andrewm4894/random repository for illustration. Payload URL is the url of the cloud function we created, and Secret should be the same string you are storing in your .env file as “GITHUB_WEBHOOK_SECRET”.

Check whatever events you want to trigger on – in my case it was star, watch and fork events (Note: the function might not work if you were to send all events or different events – you would just need to adapt it accordingly).

Fingers Crossed

Now we can try to see if it works by triggering some events. In this example I logged on as a second username I have and pressed some star, watch, and fork buttons to see what happened.

You can see recent triggers of the webhook in Github, and this can be very useful for debugging things while developing.

An example request sent to the cloud function.

And you can also see the response received from the cloud function. In this case showing that “andrewm4894netdata” (my other user) deleted a star from the “andrewm4894/random” repository 😔.

Example response back from our cloud function.

And then finally we can see the stored events in our table in BigQuery:

We have the data!!

And that’s it! We have our Github Webhook sending events to our Google Cloud Function which is in turn appending them onto a daily table in BigQuery. Go Webhooks!

Papers i’m reading #1

I’ve recently set myself the goal of reading one academic paper a week relating to the ML/AI things I'm working on in my current role.

To try to keep myself honest and diligent in this regard, I've decided to get into the habit of jotting down some quick notes on each paper and, every now and then as I get through a batch of them, sticking them into a blog post (because I like to try to squeeze anything and everything into a blog post if I can get away with it, even better if it's minimal extra effort on my part 🙂 ).


Anomaly Detection in Streaming Non-stationary Temporal Data

Link

My Summary: Really interesting paper and application, considers a lot of different design aspects in it. Nice example of a different approach leveraging feature extraction and statistical techniques to get the job done.

Notes:

  • Leverages EVT approaches, forecasts boundary for typical behavior in relation to the extremes.
  • Leverages a feature vector and dimensional reduction approach too which is interesting and somewhat independent of the AD algo. 
  • It is multivariate but the data they use are all sensor data so measuring the same thing, so not quite the same as multivariate measures measuring different things – so still questions on how one would normalize accordingly for this approach.
  • Some lovely pictures. 
  • It is online but does have a sort of offline or training phase where it fits to the ‘representative example’ of the data – and this may need to change/evolve over time. 
  • So it is streaming and unsupervised but with some small caveats. 
  • Interesting discussion on differences between density based and distance based approaches to anomaly detection.
    • “In contrast, defining an anomaly in terms of the density of the observations means that an anomaly is an observation (or cluster of observations) that has a very low chance of occurrence”.
  • Offline phase – estimate the properties of the typical dataset which will be used in the online phase of anomaly detection.
  • HDOutliers is another approach worth looking into.  
  • Interesting choice of 14 features which they then do pca on. Worth looking into these specific features.
  • Offline phase is implemented as just a burn in window on the streaming data so this is not too bad.  
  • Feature extraction and dimension reduction a big part of the preprocessing, interesting approach that could be applied to other algos.
  • Just using first 2 components of the PCA – found that interesting.
  • There is quite a few steps in the algos – quite involved. 
  • Sliding window with concept drift detection used to determine when need to refit the data – interesting approach as opposed to just refitting at regular intervals. Pros and cons to each potentially. 
  • The output at each timestep is a list of time series flagged as anomalous within that sliding window. So there is not really an anomaly score as such. 
  • They suggest that having the ‘informed’ concept drift based approach is more efficient overall as avoids wasteful refits.
  • Unclear to me how this would apply to multivariate ts data with many different types of measurements. Does not really discuss this in the paper – maybe worth a question on the repo if playing around with it. 
  • There still are some probably important params like window size and things like that.

A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data

Link

My Summary: Very good reference paper for more traditional methods as opposed to deep learning based approaches. Good discussion on complexity and online setting too. Primarily concerned with traditional tabular data as opposed to time series but still some good ideas to pursue.

Notes:

  • 2016 paper so maybe pre-deep learning hype which is nice. 
  • Time series data and setting not really any part of this paper so bear that in mind. 
  • Authors use similar taxonomy in terms of types of AD setting as I have seen in other papers. 
  • Mainly if we ‘flatten’ our time series data into feature vectors (tabular data) then we would be in a similar setting as this paper. 
  • Scores are much more useful than binary labels as outputs. 
  • AD Settings (similar to elsewhere):
    • Point AD.
    • Collective AD.
    • Contextual AD.
  • Normalization suggested as important but no mention of difficulty in normalizing streaming data. 
  • Another reference to the NASA shuttle data – must look into this. 
  • 4 Groups of unsupervised AD algos:
    • Nearest neighbor.
    • Clustering based.
    • Statistical.
    • Subspace techniques.
  • KNN based approaches, 10 < k <50 as rule of thumb.
  • KNN can miss local outliers as relies on neighbours.
  • LOF nice in that can give a score between 0 and 1 ← this is a nice property to have.
  • They go through various extensions of LOF.
  • LoOP – Local Outlier Probability – makes some small changes such that you get back a probability. But still, that probability can be very specific to the particular model. It's not like it's really a probability you can compare across models. More useful for within-model observation comparisons.
  • Some extensions of LOF that first use clustering to reduce complexity. 
  • Clustering based approaches can be very sensitive to the choice of K. 
  • HBOS – histogram statistical based approach. Simple and fast, surprisingly performant.
  • One class SVM – a range of ways to implement, not really lends itself well to online setting. 
  • PCA – get components and then use them in some way to get AS. PCA can reduce to clustering equivalence under certain conditions. 
  • PCA can be fast if D not too large.
  • Metrics@TopN is a good way to evaluate AD systems. E.g so long as some anomalies appear in the top of the pile that can be progress (similar evaluation methods to information retrieval). 
  • Rank comparison approaches can be useful too (we should make sure any data we capture lends itself to this approach also).
  • Local vs Global anomalies are a big consideration in this paper. Is not quite clear what this would mean in our setting. It’s probably true that we are more interested in global anomalies than local ones. But also hard to know which setting you are in, especially in higher dimensions. 
  • #k has a big impact on computation time for clustering algos, as does the size of dataset.
  • HBOS is fast!
  • All algos in this paper are available via a rapidminer extension if we wanted to play with them.
  • Recommendation to start with global based algos as they can also work somewhat on local anomalies. 
  • Clustering approaches also sensitive to random start so good to restart a few times. 
  • Nearest neighbour approaches can be more robust to choice of parameters.
  • But clustering approaches can be faster than knn approaches.

Deep Learning for Anomaly Detection: A Survey

Link

My Summary: A looot of references and got some good ideas out of it. Not much else to it. 

Notes:

  • I like the general taxonomy they use for types of AD problems/framings.
    • Point, contextual, vs collective.
  • Good point about maybe an over focus on autoencoders, not clear what is driving that.
  • Interesting discussion around one class neural networks. 
  • Labels in practice not as big of an area for practical reasons (hard to collect) and in cases where anomalous patterns may change. 
  • Hybrid approaches could be interesting if provide efficiency at runtime. Use DL model for feature representation and then some other model for the scoring.
    • One problem is this is not end to end but could still be something to keep in mind.
    • One class NN as better option here.
  • Little discussion in the paper around considerations in productionising any of it or dealing with specific considerations involving streaming data.  
  • Could convert time series problem into a sequence problem and use things like language models or event based approaches.
  • Adaptivity of your model as a design param you need to think about and decide on.
  • The part about interconnectedness of IoT nodes resonates with some use cases for us.
  • Deep attention based models as useful in helping to explain and locate the anomaly in addition to just detecting it. 
  • GAN’s as an approach worth looking into. 
  • No clear theory or guidance on what choices to make in network architecture and hyper params.
  • Transfer learning based approaches as an open and active area of research. 
  • Hilbert transform and other DSP based approaches mentioned.

Time2Vec: Learning a Vector Representation of Time

Link

My Summary: Nice little paper and idea of learning the frequency functions and representations seems really interesting. 

Notes:

  • Key idea is that time2vec gives you a “general purpose model agnostic representation of time that can be potentially used in any architecture”.
  • Basically it’s trying to extend the notion of feature embeddings to the time series domain.
  • Time2vec as a featurizer essentially (a toy sketch of the transform is included after these notes).
  • Talk of asynchronous time/sequence based models is interesting. Perhaps could be a class of models we could explore that could run on irregularly sampled data.
  • A large focus here on capturing various periodicity and time varying effects. It could be that 1 second monitoring data is not a great candidate for this by its nature.
  • Could use time2vec type approach to get a more universal feature representation?
  • Unclear if this is all univariate or multivariate.
  • Worth looking around to see if there are any time2vec implementations to play with.
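
As a rough illustration of the core idea (one linear component plus a set of sin() components whose frequencies and phases would be learned), here is a toy numpy sketch; in the paper omega and phi are learned parameters, here they are just random numbers for illustration:

import numpy as np

def time2vec(tau, omega, phi):
    # element 0 is a linear function of time, the rest are periodic sin() features
    linear = omega[0] * tau + phi[0]
    periodic = np.sin(omega[1:] * tau + phi[1:])
    return np.concatenate([[linear], periodic])

k = 8  # embedding dimension
omega = np.random.randn(k)  # learned in the paper, random here just for illustration
phi = np.random.randn(k)

print(time2vec(10.0, omega, phi))  # an 8-dim vector representation of t=10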

catch22: CAnonical Time-series CHaracteristics

Link

My Rating: 8/10

My Summary: Well done paper; potentially limited application to an online setting, but great food for thought on the range of ts feature transformation literature already out there.

Notes:

  • Idea to compress time series into useful ‘feature vectors’ that can be used for downstream ML tasks, mainly classification and clustering in this paper.
  • Starting point is the hctsa Matlab package’s feature space of ~5k features; catch22 is a project to empirically discover the most useful (and computationally reasonable) of these.
  • Builds on a lot of literature and hand crafted feature engineering in the time series space. 
  • Catch22 is implemented in C with Python, R, and Matlab wrappers (see the quick sketch after this list). This could be useful for Netdata’s core C-based stuff.
  • Many features here may require the full ts to be available prior to calculation so not suitable for online streaming setting. Although could implement windowed versions potentially. 
  • E.g. how do you z-score normalise a stream of data?
  • They just used a decision tree as the classification model. I wonder how sensitive the results are to this. I guess it makes sense as they wanted to test the usefulness of the features themselves. Curious why no linear models.
  • Clustered the ‘performance vectors’ to try to reduce redundancy and overlap of features. That was nice. 
  • Check out the tsfeatures package from Hyndman mentioned in this paper. 
  • It is interesting to look at some of the ts features themselves – don’t reinvent the wheel when all this already exists!
  • Look into compengine.
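
For reference, here is a quick sketch of what using the Python wrapper might look like; i’m assuming the pycatch22 package and its catch22_all() helper here, so check the docs for the exact API:

import numpy as np
import pycatch22  # assumption: the Python wrapper around the C implementation

# toy time series
ts = np.random.randn(500).tolist()

# assuming catch22_all() returns the feature names and values
features = pycatch22.catch22_all(ts)
for name, value in zip(features['names'], features['values']):
    print(f'{name}: {value:.4f}')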

Time Series Anomaly Detection: Detection of anomalous drops with limited features and sparse examples in noisy highly periodic data

Link

My rating: 6/10

My summary: Good example of a simple regression-based approach; not very generalisable, and the data and results are not really powerful.

Notes:

  • Typical ‘Expected Value’ regression-based approach (see the sketch after this list).
  • Focus on sustained anomalies as opposed to single timesteps.  
  • No semantic or domain-based understanding: just independent time series, all treated separately.
  • Data in this paper is “periodic but noisy” 5 min level byte counts.
  • Shout out to Numenta approach to anomaly likelihood, must revisit this.
  • Use of simulated data. 
  • Data normalization to a 0-1 scale. Unclear how this is implemented without data leakage or in an online manner.
  • Simple threshold-based approach to detection: the anomaly score is the gap between predicted and actual, AS = Yhat - Y.
  • Use of dummy data for model/approach comparison.
  • Pretty small dataset for DL approaches.
  • In the absence of labeled data, leveraging multiple approaches and comparing anomalies raised by each approach and their profiles could be a useful way to iterate towards golden datasets.
  • The DL models they used looked quite big and deep for the nature and size of the data; it’s not really motivated why they chose the more complex architectures.
  • LSTM or RNN did not do better than vanilla DNN.
  • Not the most convincing setup and approach. Very limited in terms of data and depth of the research.
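
To make the ‘expected value’ framing concrete, here is a toy sketch (my own, not the paper’s model): fit a simple regression on lagged values, predict the expected value, and flag points where the predicted-minus-actual gap exceeds a threshold:

import numpy as np
from sklearn.linear_model import LinearRegression

# toy "periodic but noisy" series with an injected sustained drop
t = np.arange(2000)
y = np.sin(2 * np.pi * t / 288) + np.random.normal(0, 0.1, len(t))
y[1500:1520] -= 2

# lagged values as features, the next value as the target
n_lags = 12
X = np.column_stack([y[i:len(y) - n_lags + i] for i in range(n_lags)])
target = y[n_lags:]

model = LinearRegression().fit(X, target)
y_hat = model.predict(X)

# anomaly score = predicted minus actual (a big positive gap means a drop), flag when above a threshold
anomaly_score = y_hat - target
threshold = 3 * anomaly_score.std()
print(np.where(anomaly_score > threshold)[0] + n_lags)  # indices flagged as drops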

Time-series anomaly detection service at Microsoft

Link

My rating: 8/10

My summary: Good walkthrough of end to end service, interesting computer vision application, some good leads to follow up. 

Notes:

  • Interesting to see they use InfluxDB (and Kafka, and some Flink).
  • Deployed as a service on Kubernetes.
  • Stuck with needing to treat each individual time series separately in the model.
  • They build the model to label each point in a window as anomalous or not; this seems potentially limiting if you are just interested in whether each window of data is anomalous.
  • SR (Spectral Residual) sounds interesting; i have not seen it before and it’s worth looking into (see the sketch after this list), although their SR example looks very similar to what you’d get by looking at pct change or something, so it feels maybe over-engineered.
  • Converting the problem into a computer vision friendly setting is interesting and not uncommon. In the multivariate setting we could encode the data visually, e.g. fft and wavelet frequency distributions etc. Heatmaps or even some custom encoding into a visual space based on specific characteristics of the data.  
  • Some of the windowed feature engineering stuff seemed interesting, as well as then layering ratios of those windowed features on top of each other. 
  • There is a way for users to label data, which seems to help build golden datasets used for later validation and experimentation.
  • Need to look into SPOT and DSPOT as EVT-based approaches i’ve only superficially looked at before.
  • Need to look into DONUT.
  • Some other good references to follow up on.
  • Seems like this is indeed a technical paper relating to this Azure API.
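
Since SR was new to me, here is a rough numpy sketch of the spectral residual saliency map as i understand it from the paper; a simplified illustration, not the actual implementation behind the Azure API:

import numpy as np

def spectral_residual_saliency(x, window=3):
    # fft of the series and its log amplitude spectrum
    fft = np.fft.fft(x)
    amp = np.abs(fft)
    log_amp = np.log(amp + 1e-8)
    # spectral residual = log spectrum minus its local average
    kernel = np.ones(window) / window
    avg_log_amp = np.convolve(log_amp, kernel, mode='same')
    residual = log_amp - avg_log_amp
    # back to the time domain using the residual amplitude and the original phase
    return np.abs(np.fft.ifft(np.exp(residual) * fft / (amp + 1e-8)))

# toy series with one injected spike
x = np.sin(np.linspace(0, 20 * np.pi, 500)) + np.random.normal(0, 0.05, 500)
x[250] += 3
saliency = spectral_residual_saliency(x)
print(np.argmax(saliency))  # the spike should stand out in the saliency map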

My First PyPI Package

I’ve been threatening to myself to do this for a long time and recently got around to it, so as usual i’m going to try to milk it for a blog post (Note: i’m not talking about getting into a box like the below picture, it’s something much less impressive).

Confession – I don’t know matplotlib

I have a confession to make that’s been eating away at me and that i need to get off my chest – i’m pretty useless at plotting anything in Python. I never really had the time/need to sit down and ‘learn’ matplotlib from first principles (does anyone?). I’ve usually had tools like Tableau or Looker to sit on top of whatever database i am using and make visualizations pretty painlessly.

When I’ve needed to do something custom or more complicated it usually goes like this: i spend about a day or two randomly googling around for something that looks close enough to what i need, start playing around with the code (copy paste), then i find some other example i like a little bit more that uses a different library (seaborn, bokeh, plotly etc.) and start the whole painful process over again!

Eventually i settle on some Frankenstein solution that gets me over the line until the next time. After living this cycle many times i decided to someday build my own plotting library that would short-circuit this shitshow and over time become the answer to all my plotting needs. And i was hoping it would also be a nice excuse to learn about Python packaging and deploying to PyPI.

Cookiecutter to the rescue

Turns out, like most other things, there are already great tools out there to make this much easier than i expected it would be – the main one being cookiecutter and in particular this cookiecutter template for PyPI packages (i also found this TalkPython course and these talks really useful starting points).

am4894plots

So after a bit of dicking around with cookiecutter i had the basis for my plotting package (see my minimal example ‘hello world’ type package on PyPI here) and just needed to build out my functionality (am4894plots on PyPI).

I’ve mostly been working with time series data recently so decided to start there with some typical plots i might often reach for when looking at such data. My main principles in the package are:

  • Usually my data is in a pandas dataframe and that’s what i want to pass into my plotting function, along with a list of the cols i want to plot and as little else as possible (see the hypothetical example after this list).
  • I don’t care what library i use under the hood and where possible i might even want to implement the same version of a plot in multiple underlying libraries for whatever reason (At the moment it’s mainly just either Plotly or Bokeh being used, but i can easily see myself adding more over time as needs arise).
  • This package is just for me to use, you are not allowed to use it 🙂
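
To make that concrete, here is a hypothetical usage sketch; the exact plot_lines() signature may differ, so treat the cols parameter name as an assumption rather than gospel:

import numpy as np
import pandas as pd
from am4894plots.plots import plot_lines

# toy dataframe of a few random-walk time series
df = pd.DataFrame(
    np.random.randn(500, 3).cumsum(axis=0),
    columns=['metric_a', 'metric_b', 'metric_c'],
    index=pd.date_range('2020-01-01', periods=500, freq='S'),
)

# hypothetical call: a dataframe, a list of cols to plot, and as little else as possible
plot_lines(df, cols=['metric_a', 'metric_b'])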

Moving parts

The great thing about leveraging something like cookiecutter is you can plug into as many best practice tools as possible with as little sweat as possible on your end. Below are some notable examples of tools or components you get pretty much out of the box that i expected to have to work much harder for.

Examples

I’ll finish with some quick examples to illustrate what the package actually does and some ways i’m planning to use it.

plot_lines()

plot_scatters()

plot_hists(), plot_boxes()

plot_heatmap()

That’s it

That’s it, now that i (technically) have a package on PyPI i feel just a little bit less of an impostor 🙂