Premature Optimization

I’ve been doing some work that necessitated using the same statistical test from scipy lots of times on a fairly wide pandas dataframe with lots of columns. I spent a bit too much time googling around for the most efficient ways to do this, and even more time re-writing things various ways, before realizing i should have RTFM a bit more in the first place. Yep, i’ve gone about a week down a path of premature optimization – but hey, *blog post* 🙂

The Set Up

I have a wide pandas dataframe of lots of time series metrics – one per column – and i have a ‘focus’ window of time during which i am interested to know which metrics look like they may have changed in some way, in reference to a ‘baseline’ window just before the focus window.

A rough first idea (before getting too fancy and building models – not there yet for various reasons) is to break out our old friend the KS Test and basically run a statistical test to see if, for each metric, the ‘focus’ distribution looks statistically significantly different than the ‘baseline’ distribution. The idea being that those metrics that do look to have ‘changed’ in this sense between the two windows might be worth looking at first.

So a pretty simple set up and application. The tricky part was doing this as quickly as possible on a dataframe with around 500-1000 columns and anywhere between 1000-10000 rows of data as a rough typical usage scenario.
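For concreteness, here’s a minimal sketch of the kind of dataframe i’m assuming throughout – all the names and sizes below are illustrative stand-ins, not the real data – one ‘window’ label column plus a numeric column per metric, and a single ks_2samp call on one metric:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
n_base, n_focus, n_cols = 800, 200, 5  # toy sizes, the real thing is ~500-1000 cols

# one numeric column per metric, plus a 'window' label column
df = pd.DataFrame(rng.normal(size=(n_base + n_focus, n_cols)),
                  columns=[f'metric_{i}' for i in range(n_cols)])
df['window'] = ['base'] * n_base + ['focus'] * n_focus

# shift one metric in the focus window so it should 'light up'
df.loc[df['window'] == 'focus', 'metric_0'] += 1.0

base = df.loc[df['window'] == 'base', 'metric_0'].values
focus = df.loc[df['window'] == 'focus', 'metric_0'].values
ks_stat, p_value = ks_2samp(base, focus)  # tiny p-value => distributions differ
```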

Dumb Approach

So my first approach, as usual, is to do the dumbest thing i can and just get something that works and go from there. So here is my ks_df_dumb() function.

from scipy.stats import ks_2samp

def ks_df_dumb(df, ks_mode):
    """Take in a df, loop over each column, split into base and focus, and apply the test."""
    results = []
    for col in df._get_numeric_data():
        base = df[df['window'] == 'base'][col].values
        focus = df[df['window'] == 'focus'][col].values
        ks_stat, p_value = ks_2samp(base, focus, mode=ks_mode)
        results.append((ks_stat, p_value))
    return results

If i run this on my test dataframe of 500 columns * 1000 rows i see the below timings.

%%timeit -n 5 -r 5
results = ks_df_dumb(df, ks_mode)
# 3.77 s ± 57.4 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

start_time = time.time()
results = ks_df_dumb(df, ks_mode)
end_time = time.time()
print(f'{round(end_time - start_time, 2)} seconds')
# ks_df_dumb
# 3.55 seconds

So about 3-4 seconds, which is not great for what i need (it may end up being something a user clicks to trigger, and so i want them to wait as little as possible for the results).

Vectorize it?

So now i start messing around with super cool tricks to try and be a hero. I know better than to be looping over stuff in python and pandas – i know, i’ll try to vectorize it!

def ks_df_vec(df, ks_mode):
    """Take in a df, and use np.vectorize to avoid the pandas loop."""
    def my_ks_2samp(a, b):
        """Wrapper function to pass args to the vectorized function."""
        return ks_2samp(a, b, mode=ks_mode)
    base = df[df['window'] == 'base']._get_numeric_data().transpose().values
    focus = df[df['window'] == 'focus']._get_numeric_data().transpose().values
    ks_2samp_vec = np.vectorize(my_ks_2samp, signature='(n),(m)->(),()')
    results = ks_2samp_vec(base, focus)
    results = list(zip(results[0], results[1]))
    return results

Now i see:

%%timeit -n 5 -r 5
results = ks_df_vec(df, ks_mode)
# 2.22 s ± 35.5 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

start_time = time.time()
results = ks_df_vec(df, ks_mode)
end_time = time.time()
print(f'{round(end_time - start_time, 2)} seconds')
# ks_df_vec
# 2.16 seconds

So a bit better at just over 2 seconds but still not great given this is still only 1000 rows of data.


Time to break out numpy! (confession: i never really learned numpy properly and find it very painful to work with and reason about the data and shapes etc as i do stuff to them – it all just feels so unnatural to me in some way – and i find it hard to keep track of things without any indexes or keys, i just don’t trust myself with it – i know i’m not supposed to say this out loud but hey).

So my approach now will be to just get the data into two separate numpy arrays and work solely with them.

def ks_np_dumb(arr_a, arr_b, ks_mode):
    results = []
    for n in range(arr_a.shape[1]):
        ks_stat, p_value = ks_2samp(arr_a[:, n], arr_b[:, n], mode=ks_mode)
        results.append((ks_stat, p_value))
    return results

%%timeit -n 5 -r 5
results = ks_np_dumb(arr_base, arr_focus, ks_mode)
# 2.43 s ± 200 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

start_time = time.time()
results = ks_np_dumb(arr_base, arr_focus, ks_mode)
end_time = time.time()
print(f'{round(end_time - start_time, 2)} seconds')
# ks_np_dumb
# 2.22 seconds

def ks_np_vec(arr_a, arr_b, ks_mode):
    def my_ks_2samp(a, b):
        return ks_2samp(a, b, mode=ks_mode)
    ks_2samp_vec = np.vectorize(my_ks_2samp, signature='(n),(m)->(),()')
    results = ks_2samp_vec(arr_a.T, arr_b.T)
    results = list(zip(results[0], results[1]))
    return results

%%timeit -n 5 -r 5
results = ks_np_vec(arr_base, arr_focus, ks_mode)
# 2.2 s ± 38.7 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

start_time = time.time()
results = ks_np_vec(arr_base, arr_focus, ks_mode)
end_time = time.time()
print(f'{round(end_time - start_time, 2)} seconds')
# ks_np_vec
# 2.29 seconds

Hmm – that did not seem to add too much, which i guess is kinda reassuring. It makes sense that the dumb numpy approach would be a little bit faster than the dumb pandas one, and it’s comforting that it’s not orders of magnitude different.

And it makes sense that numpy dumb and numpy vectorize are not that different, as the docs for np.vectorize state that it is essentially still just a loop (so to properly vectorize this i’d probably have to do much more work to figure it out).
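Just to show what that signature trick is doing with the shapes, here’s a toy example with a cheap function standing in for the KS test (the names here are mine, nothing from the real code):

```python
import numpy as np

def mean_diff(a, b):
    """Toy two-sample 'statistic': difference of means."""
    return float(np.mean(a) - np.mean(b))

# '(n),(m)->()' means: consume one length-n row and one length-m row,
# emit one scalar per row pair -- np.vectorize handles the outer loop.
mean_diff_vec = np.vectorize(mean_diff, signature='(n),(m)->()')

arr_a = np.array([[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]])
arr_b = np.array([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]])
result = mean_diff_vec(arr_a, arr_b)  # -> array([0., 6.]), one value per row pair
```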

Feck this, time for Cython!!!

Hell yeah, i’m going to cythonize the shit out of this! Let’s be honest, this is what i’ve wanted to do the whole time – do something with cython so i can boast about it to all my friends, and about how i got this awesome speedup even just by adding some typing information.

Let’s go.


import numpy as np
cimport numpy as np
cimport cython
from scipy.stats import ks_2samp

DTYPE = np.double

cpdef cy_ks_np(double[:, :] arr_a, double[:, :] arr_b, str ks_mode):

    cdef double k, p
    cdef Py_ssize_t i
    cdef Py_ssize_t m = arr_a.shape[1]
    result = np.zeros((m, 2), dtype=DTYPE)
    cdef double[:, :] result_view = result

    for i in range(m):
        k, p = ks_2samp(arr_a[:,i], arr_b[:,i], mode=ks_mode)
        result_view[i,0] = k
        result_view[i,1] = p

    return result

Ahhh look at it, very pleased with it if i do say so myself. I managed to wrangle this tutorial to fit my needs, and went to bed that night feeling pretty good about it.

But… what’s this…

%%timeit -n 5 -r 5
results = cy_ks_np(arr_base, arr_focus, ks_mode)
# 2.28 s ± 54 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

start_time = time.time()
results = cy_ks_np(arr_base, arr_focus, ks_mode)
end_time = time.time()
print(f'{round(end_time - start_time, 2)} seconds')
# cy_ks_np
# 2.1 seconds

2.2 seconds!!! What the heck, i was expecting some magical voodoo that would give me at least a 10x speedup. Come on cython, don’t do this to me – i was going to be a hero, they were going to chant my name in the office.

So i did the logical next step – made a reproducible example and asked StackOverflow to do it for me 🙂

Bro, do you even Profile!

So while i waited on SO to do its thing, i asked a few real engineers in my company what they thought. Their first response was – did you profile your code?

I began to panic – i’ve been found out, oh no it’s happening – so i quickly looked up the jupyter cell magic to profile my functions.

Well would ya look at that – 500 calls to ks_2samp() taking up ~1.8s of my ~2 seconds.
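For anyone without a notebook handy, the same view comes from the stdlib cProfile – here’s a sketch with a toy stand-in for my function (the array names and sizes are made up for illustration):

```python
import cProfile
import io
import pstats

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
arr_a = rng.normal(size=(1000, 20))  # toy stand-ins for the base/focus arrays
arr_b = rng.normal(size=(1000, 20))

def run_tests():
    # same shape of work as ks_df_dumb: one ks_2samp call per column
    return [ks_2samp(arr_a[:, i], arr_b[:, i]) for i in range(arr_a.shape[1])]

profiler = cProfile.Profile()
profiler.enable()
run_tests()
profiler.disable()

# sort by cumulative time -- this is what points the finger at ks_2samp itself
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
report = stream.getvalue()
print(report)
```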

Turns out this whole exercise has been a bit of a waste of time so far. So i went back and dug into the ks_2samp() docs and the code on github to see if anything could be done.

Wait, what’s this mode parameter – maybe i can play with that a bit. Oh, one option is “‘asymp’ : use asymptotic distribution of test statistic” – that sounds like it could be faster than “exact“.

So with ks_mode='asymp' i ran things again and held my breath.

Lo and behold, the solution was staring me in the face all along – obviously someone had already provided knobs and parameters to get a faster implementation under the hood.
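To see the two modes side by side (a little sketch on toy data – note the KS statistic itself is identical either way, the mode only changes how the p-value gets computed):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = rng.normal(loc=0.2, size=200)

# 'exact' computes the exact distribution of the statistic (expensive for big n),
# 'asymp' uses the asymptotic approximation (fast)
stat_exact, p_exact = ks_2samp(a, b, mode='exact')
stat_asymp, p_asymp = ks_2samp(a, b, mode='asymp')
# same statistic, (slightly) different p-values
```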

As per usual, i should have stuck to my number 1 rule of writing as little code as possible and using other people’s good work to make myself look better 🙂

p.s. all the code is in a notebook here.

Papers i’m reading #1

I’ve recently set myself the goal of reading one academic paper a week relating to the ML/AI things i’m working on in my current role.

To try to keep myself honest and diligent in this regard, I’ve decided to get into the habit of jotting down some quick notes on each paper, and every now and then, as i get through a batch of them, sticking them into a blog post (because i like to try to squeeze everything and anything into a blog post if i can get away with it – even better if it’s minimal extra effort on my part 🙂 ).

Anomaly Detection in Streaming Non-stationary Temporal Data


My Summary: Really interesting paper and application, considers a lot of different design aspects in it. Nice example of a different approach leveraging feature extraction and statistical techniques to get the job done.


  • Leverages EVT approaches, forecasts boundary for typical behavior in relation to the extremes.
  • Leverages a feature vector and dimensional reduction approach too which is interesting and somewhat independent of the AD algo. 
  • It is multivariate but the data they use are all sensor data so measuring the same thing, so not quite the same as multivariate measures measuring different things – so still questions on how one would normalize accordingly for this approach.
  • Some lovely pictures. 
  • It is online but does have a sort of offline or training phase where it fits to the ‘representative example’ of the data – and this may need to change/evolve over time. 
  • So it is streaming and unsupervised but with some small caveats. 
  • Interesting discussion on differences between density based and distance based approaches to anomaly detection.
    • “In contrast, defining an anomaly in terms of the density of the observations means that an anomaly is an observation (or cluster of observations) that has a very low chance of occurrence”.
  • Offline phase – estimate the properties of the typical dataset which will be used in the online phase of anomaly detection.
  • HDOutliers is another approach worth looking into.  
  • Interesting choice of 14 features which they then do pca on. Worth looking into these specific features.
  • Offline phase is implemented as just a burn in window on the streaming data so this is not too bad.  
  • Feature extraction and dimension reduction a big part of the preprocessing, interesting approach that could be applied to other algos.
  • Just using first 2 components of the PCA – found that interesting.
  • There is quite a few steps in the algos – quite involved. 
  • Sliding window with concept drift detection used to determine when need to refit the data – interesting approach as opposed to just refitting at regular intervals. Pros and cons to each potentially. 
  • The output at each timestep is a list of time series flagged as anomalous within that sliding window. So there is not really an anomaly score as such. 
  • They suggest that having the ‘informed’ concept drift based approach is more efficient overall as avoids wasteful refits.
  • Unclear to me how this would apply to multivariate ts data with many different types of measurements. Does not really discuss this in the paper – maybe worth a question on the repo if playing around with it. 
  • There still are some probably important params like window size and things like that.

A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data


My Summary: Very good reference paper for more traditional methods as opposed to deep learning based approaches. Good discussion on complexity and online setting too. Primarily concerned with traditional tabular data as opposed to time series but still some good ideas to pursue.


  • 2016 paper so maybe pre-deep learning hype which is nice. 
  • Time series data and setting not really any part of this paper so bear that in mind. 
  • Authors use similar taxonomy in terms of types of AD setting as I have seen in other papers. 
  • Mainly if we ‘flatten’ our time series data into feature vectors (tabular data) then we would be in a similar setting as this paper. 
  • Scores are much more useful than binary labels as outputs. 
  • AD Settings (similar to elsewhere):
    • Point AD.
    • Collective AD.
    • Contextual AD.
  • Normalization suggested as important but no mention of difficulty in normalizing streaming data. 
  • Another reference to the NASA shuttle data – must look into this. 
  • 4 Groups of unsupervised AD algos:
    • Nearest neighbor.
    • Clustering based.
    • Statistical.
    • Subspace techniques.
  • KNN based approaches: 10 < k < 50 as a rule of thumb.
  • KNN can miss local outliers as it relies on neighbours.
  • LOF nice in that can give a score between 0 and 1 ← this is a nice property to have.
  • They go through various extensions of LOF.
  • LoOP – Local Outlier Probability – makes some small changes such that you get back a probability. But that probability can be very specific to the particular model – it’s not really a probability you can compare across models. More useful for within-model observation comparisons.
  • Some extensions of LOF that first use clustering to reduce complexity. 
  • Clustering based approaches can be very sensitive to the choice of K. 
  • HBOS – histogram statistical based approach. Simple and fast, surprisingly performant.
  • One class SVM – a range of ways to implement, not really lends itself well to online setting. 
  • PCA – get components and then use them in some way to get AS. PCA can reduce to clustering equivalence under certain conditions. 
  • PCA can be fast if D not too large.
  • Metrics@TopN is a good way to evaluate AD systems. E.g so long as some anomalies appear in the top of the pile that can be progress (similar evaluation methods to information retrieval). 
  • Rank comparison approaches can be useful too (we should make sure any data we capture lends itself to this approach also).
  • Local vs Global anomalies are a big consideration in this paper. It’s not quite clear what this would mean in our setting. It’s probably true that we are more interested in global anomalies than local ones. But it’s also hard to know which setting you are in, especially in higher dimensions. 
  • k has a big impact on computation time for clustering algos, as does the size of the dataset.
  • HBOS is fast!
  • All algos in this paper are available via a rapidminer extension if we wanted to play with them.
  • Recommendation to start with global based algos as they can also work somewhat on local anomalies. 
  • Clustering approaches also sensitive to random start so good to restart a few times. 
  • Nearest neighbour approaches can be more robust to choice of parameters.
  • But clustering approaches can be faster than knn approaches.
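Since HBOS comes up twice in the notes above, here’s my own toy sketch of the idea as i understand it (not the paper’s code, and the bin count is an arbitrary choice): build a histogram per feature and score each point by summing the negative log of the density of the bins it falls into – low density bins mean high anomaly score.

```python
import numpy as np

def hbos_scores(X, bins=10):
    """Toy HBOS: per-feature histograms, sum of -log(bin density) per point."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        hist, edges = np.histogram(X[:, j], bins=bins, density=True)
        # map each point to its bin (inner edges give indices 0..bins-1)
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, bins - 1)
        scores += -np.log(hist[idx] + 1e-12)  # low density bin => high score
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[0] = [8.0, 8.0, 8.0]  # plant an obvious outlier
scores = hbos_scores(X)  # point 0 should get the top score
```

It really is fast – just d histogram passes over the data, no pairwise distances.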

Deep Learning for Anomaly Detection: A Survey


My Summary: A looot of references and got some good ideas out of it. Not much else to it. 


  • I like the general taxonomy they use for types of AD problems/framings.
    • Point, contextual, vs collective.
  • Good point about maybe an over focus on autoencoders, not clear what is driving that.
  • Interesting discussion around one class neural networks. 
  • Labels in practice not as big of an area for practical reasons (hard to collect) and in cases where anomalous patterns may change. 
  • Hybrid approaches could be interesting if provide efficiency at runtime. Use DL model for feature representation and then some other model for the scoring.
    • One problem is this is not end to end but could still be something to keep in mind.
    • One class NN as better option here.
  • Little discussion in the paper around considerations in productionising any of it or dealing with specific considerations involving streaming data.  
  • Could convert time series problem into a sequence problem and use things like language models or event based approaches.
  • Adaptivity of your model as a design param you need to think about and decide on.
  • The part about interconnectedness of IoT nodes resonates with some use cases for us.
  • Deep attention based models as useful in helping to explain and locate the anomaly in addition to just detecting it. 
  • GAN’s as an approach worth looking into. 
  • No clear theory or guidance on what choices to make in network architecture and hyper params.
  • Transfer learning based approaches as an open and active area of research. 
  • Hilbert transform and other DSP based approaches mentioned.

Time2Vec: Learning a Vector Representation of Time


My Summary: Nice little paper, and the idea of learning the frequency functions and representations seems really interesting. 


  • Key idea is that time2vec gives you a “general purpose model agnostic representation of time that can be potentially used in any architecture”.
  • Basically it’s trying to extend the notion of feature embeddings to the time series domain.
  • Time2vec as a featurizer essentially.
  • Talk of asynchronous time/sequence based models is interesting. Perhaps could be a class of models we could explore that could run on irregularly sampled data.
  • A large focus here on capturing various periodicity and time varying effects. It could be that 1 second monitoring data is not a great candidate for this by its nature.
  • Could use time2vec type approach to get a more universal feature representation?
  • Unclear if this is all univariate or multivariate.
  • Worth looking around to see if any time2vec implementations to play with.
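As i read it, the core transform is simple enough to sketch in numpy – a linear term plus k-1 periodic terms. In the paper the frequencies ω and phases φ are learned; here they are random stand-ins just to show the shape of the thing:

```python
import numpy as np

def time2vec(tau, w, b):
    """tau: (n,) scalar times; w, b: (k,) params. Returns (n, k) embedding."""
    out = np.outer(tau, w) + b       # linear pre-activations, shape (n, k)
    out[:, 1:] = np.sin(out[:, 1:])  # components 1..k-1 are periodic, 0 stays linear
    return out

rng = np.random.default_rng(0)
k = 8
w, b = rng.normal(size=k), rng.normal(size=k)  # would be learned in the paper
emb = time2vec(np.arange(10.0), w, b)  # 10 timesteps -> 10 embeddings of size 8
```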

catch22: CAnonical Time-series CHaracteristics


My Rating: 8/10

My Summary: Well done paper, limited application potentially to an online setting but great food for thought on the range of ts feature transformations literature already out there.


  • Idea to compress time series into useful ‘feature vectors’ that can be used for downstream ML tasks, mainly classification and clustering in this paper.
  • Starting point is hctsa matlab package feature space of ~5k features. Catch22 is a project to empirically discover the most useful (and computationally reasonable) of these.
  • Builds on a lot of literature and hand crafted feature engineering in the time series space. 
  • Catch22 is implemented in C with python, R, and MATLAB wrappers. This could be useful for netdata core C based stuff. 
  • Many features here may require the full ts to be available prior to calculation so not suitable for online streaming setting. Although could implement windowed versions potentially. 
  • E.g. how do you z-score normalise a stream of data?
  • They just used a decision tree as the classification model. I wonder how sensitive the results are to this. I guess it makes sense as they wanted to test the usefulness of the features themselves. Curious why no linear models. 
  • Clustered the ‘performance vectors’ to try to reduce redundancy and overlap of features. That was nice. 
  • Check out the tsfeatures package from Hyndman mentioned in this paper. 
  • It is interesting to look at some of the ts features themselves – don’t reinvent the wheel when all this already exists!
  • Look into compengine.
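On the ‘how do you z-score normalise a stream of data’ question above – one standard answer is Welford’s online algorithm, which keeps a running mean and variance in O(1) memory so each new point can be normalised against everything seen so far (my own sketch, nothing to do with the paper):

```python
import math

class OnlineZScore:
    """Running mean/variance via Welford's algorithm; z-scores against history."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n          # running mean update
        self.m2 += delta * (x - self.mean)   # running sum of squared deviations

    def zscore(self, x):
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))  # sample std so far
        return (x - self.mean) / std if std > 0 else 0.0

oz = OnlineZScore()
for v in [10.0, 12.0, 11.0, 13.0, 9.0]:
    oz.update(v)
# oz.mean is now 11.0 and oz.zscore(11.0) is 0.0
```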

Time Series Anomaly Detection; Detection of anomalous drops with limited features and sparse examples in noisy highly periodic data


My rating: 6/10

My summary: Good example of simple regression based approach, not very generalisable, data and results not really powerful. 


  • Typical ‘Expected Value’ regression based approach. 
  • Focus on sustained anomalies as opposed to single timesteps.  
  • No semantic or domain based understanding – just independent time series, all treated separately. 
  • Data in this paper is “periodic but noisy” 5 min level byte counts.
  • Shout out to Numenta approach to anomaly likelihood, must revisit this.
  • Use of simulated data. 
  • Data Normalization to 0-1 scale. Unclear how this is implemented without data leakage or in an online manner.
  • Simple threshold based approach to detection given, Yhat – Y = AS.  
  • Use of dummy data for model/approach comparison.
  • Pretty small dataset for DL approaches.
  • In the absence of labeled data, leveraging multiple approaches and comparing anomalies raised by each approach and their profiles could be a useful way to iterate towards golden datasets.
  • DL models they used looked quite big and deep for nature and size of the data – not really motivated why chose more complex architecture.
  • LSTM or RNN did not do better than vanilla DNN.
  • Not the most convincing set up and approach. Very limited in terms of data and depth of the research.

Time-series anomaly detection service at Microsoft


My rating: 8/10

My summary: Good walkthrough of end to end service, interesting computer vision application, some good leads to follow up. 


  • Interesting to see they use influxdb (and Kafka, and some flink).
  • Deployed as service on kubernetes.
  • Stuck with needing to treat each individual time series separately in the model.
  • They build model to label each point in window as anomalous or not – seems potentially limiting if just interested in if each window of data is anomalous or not. 
  • SR sounds interesting, have not seen that before, worth looking into, although their SR example looks very similar to what you’d get by looking at pct change or something so feels maybe over-engineered. 
  • Converting the problem into a computer vision friendly setting is interesting and not uncommon. In the multivariate setting we could encode the data visually, e.g. fft and wavelet frequency distributions etc. Heatmaps or even some custom encoding into a visual space based on specific characteristics of the data.  
  • Some of the windowed feature engineering stuff seemed interesting, as well as then layering ratios of those windowed features on top of each other. 
  • There is a way for users to label data, which seems to help build golden datasets used for later validation and experimentation.
  • Need to look into SPOT and DSPOT as EVT based approaches i’ve only previously superficially looked at.
  • Need to look into DONUT.
  • Some other good references to follow up on.
  • Seems like this is indeed a technical paper relating to this Azure api.
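Since i flagged SR as worth looking into, here’s my rough reading of the spectral residual idea in numpy (a sketch of the idea only, not the paper’s implementation – the averaging window q and the toy data are my own choices): take the log-amplitude spectrum, subtract its local average, and recombine with the original phase to get a saliency map that lights up at irregular points.

```python
import numpy as np

def spectral_residual(x, q=3):
    """Saliency map: log-amplitude spectrum minus its local average, recombined with phase."""
    fft = np.fft.fft(x)
    log_amp = np.log(np.abs(fft) + 1e-8)
    phase = np.angle(fft)
    avg_log_amp = np.convolve(log_amp, np.ones(q) / q, mode='same')  # local average
    residual = log_amp - avg_log_amp                                 # the 'spectral residual'
    return np.abs(np.fft.ifft(np.exp(residual + 1j * phase)))

t = np.arange(200)
x = np.sin(2 * np.pi * t / 20.0)  # smooth seasonal signal
x[100] += 5.0                     # inject a spike on top of the seasonality
sal = spectral_residual(x)        # saliency should peak at the spike
```

The smooth periodic structure dominates the averaged spectrum and gets suppressed, while the broadband contribution of the spike survives – which is roughly why it behaves a bit like a fancy pct-change detector on simple examples.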