This took me waaay too long to work out today and i was thinking it could make a nice little interview coding type question (which i’d probably fail).
Suppose you have 10,000 rows of data and need to continually retrain a model, training on at most 1,000 rows at a time and retraining every 500 rows. Can you tell me how many “batches” of data this will create and the start and end index of each batch?
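For what it’s worth, here is a rough sketch in Python of one interpretation of the question (my own variable names, assuming a new batch every 500 rows with the look-back capped at 1,000 rows):

```python
def get_batches(n_rows=10_000, max_train_rows=1_000, retrain_every=500):
    """Return (start, end) index pairs for each training batch.

    A new batch ends every `retrain_every` rows, and each batch
    looks back over at most `max_train_rows` rows.
    """
    batches = []
    end = retrain_every
    while end <= n_rows:
        start = max(0, end - max_train_rows)
        batches.append((start, end))
        end += retrain_every
    return batches

batches = get_batches()
print(len(batches))   # 20 batches for the numbers above
print(batches[:3])    # [(0, 500), (0, 1000), (500, 1500)]
```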
My Summary: Really interesting paper – PGMs, HMMs and all that good stuff. Quite complicated though, and no clear route to implementation. I also wonder how well it scales beyond tens of time series. Better for learning about some different approaches than as anything that could be implemented easily.
“Spatiotemporal” feature extraction, “Symbolic dynamics”, “Causal interactions”… some fancy sounding stuff. It seems to try to get at causal relationships, which might be a bit too strong a claim, but definitely interesting stuff that i had not come across before.
CPS = power plants, power grids, transport systems etc.
Symbolic dynamic filtering – look this up.
General setting in this paper is a probabilistic graphical model type set up.
State, Alphabet, Symbol generation matrix, Importance Score – similar sort of set up to HMM type models.
Partitioning → States → Spatiotemporal Pattern Network (STPN).
Energy based models another way to frame and think about these models.
I wonder how well this model scales to many time series? My gut says probably ok for a handful of time series as opposed to 100’s.
Some nice pictures of how it all hangs together.
All focused on short subsequences as the core inputs it learns on. Can be overlapping windows.
Lots of maths and symbols in this paper! Very complicated and hard to follow – would need to read 10 times.
Does not look like any code or implementations out there so not sure if easily implementable.
Nice results on synthetic data with multiple modalities.
Interesting results on real world smart home (HVAC and underfloor heating) data too. Just a handful of time series and a nice and clear case study with obvious causal chains.
Interesting to see global and local anomalies picked up.
Still seems like some parameters to tune/pick – also need to threshold the importance score.
Recent Advances in Anomaly Detection Methods applied to Aviation.
My Summary: Interesting paper focusing specifically on aviation, but in a broad sense and in an up to date manner covering many newer techniques too. In reality, though, it still seems that domain specific and traditional methods are what actually get used, as opposed to what shows up in the research.
Good food for thought on differences and similarities between novelty detection and anomaly detection.
They use the usual typical taxonomy of point, contextual, and collective.
Lots of references to explainability as well, which makes sense in an aviation setting.
“Temporal Logic Based Models” – sounds interesting need to look into this more.
“Model Driven” (rules from experts) vs “Data Driven” (Machine learned).
Really nice taxonomy below.
3 types of clustering based assumptions:
Anomaly as outside any cluster.
Anomaly as far away from centroids.
Density based, whereby a cluster can be a sparse cluster whose members are all anomalies.
Nice discussion and motivation for Isolation Forest based approach. “Anomalies should be easier to isolate and so should have a shorter path length on average.”
A section and discussion around cases when interpretability matters and “temporal logic based learning” that can learn signal temporal logic (STL) predicates that human domain experts can then understand. No idea what this stuff is, need to look into it. Maybe start here.
Good discussion on a range of different applications to the aviation sector.
Still, what’s actually in production a lot of the time is domain expert systems with manual thresholds.
OC-SVM seems to come up quite a bit in the paper.
Whole section on anomaly detection as an input into predictive maintenance.
Once you have flagged anomalies maybe you can build models to look for precursor events before the anomaly. Part of discussion on temporal logic based learning.
Anomaly Detection in Flight Recorder Data: A Dynamic Data-driven Approach (NASA).
My Summary: Seems to have some nice properties but not clear if is suitable at all for an online setting.
Nice plots and discussion around boxen plots and letter-value-boxplot or dixons plot.
IQR and traditional box plots flag way too much as outliers when data size is large – due to the way percentiles are calculated, this is an inherent property.
Good discussion on underlying assumptions, their typical violations etc.
Typical z-score based approaches assume normality but use the estimated mean and stdev from the data, which are very sensitive to the presence of outliers – so there is a circularity here to be careful of. Some robust extensions to this are possible.
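As a toy illustration of the sort of robust extension mentioned above (my own sketch, not from the paper), you can swap the mean/stdev for the median/MAD so the outliers themselves have much less influence on the scores:

```python
import numpy as np

def robust_z_scores(x, eps=1e-9):
    """z-score-like scores using median and MAD instead of mean and stdev."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # 1.4826 scales MAD to be comparable to stdev under normality
    return (x - med) / (1.4826 * mad + eps)

x = np.concatenate([np.random.normal(0, 1, 1000), [15, -12]])  # two planted outliers
scores = robust_z_scores(x)
print(np.where(np.abs(scores) > 4)[0])  # indices flagged as outliers
```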
Gaps based approach to univariate anomaly detection sounds interesting. Not really applicable to multivariate data.
All seems to revolve around a normalized X vector.
Unclear on the efficiency of the inference step if any.
Mentions LOF as one of the most popular algos and does some comparisons to it.
Interesting use of parallel coords chart to explore anomalies once detected.
Non-parametric smoothers are often used for the time series setting: smoother as model and residuals as anomaly score.
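A minimal sketch of that idea (my own, not from the paper), using an exponentially weighted smoother as the ‘model’ and the residuals as the anomaly score:

```python
import numpy as np
import pandas as pd

# toy series with a spike injected
y = pd.Series(np.sin(np.linspace(0, 20, 500)) + np.random.normal(0, 0.1, 500))
y.iloc[250] += 3

smooth = y.ewm(span=20).mean()                        # smoother as the model
resid = y - smooth                                    # residuals
score = (resid - resid.mean()).abs() / resid.std()    # residuals as anomaly score
print(score.idxmax())                                 # should point at the injected spike
```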
HDOutliers does give a probability based threshold which is a nice property.
Anomaly Detection for Discrete Sequences: A Survey
My Summary: Interesting enough survey of a totally different way of potentially framing time series AD. Paper is from 2012 so a little old, but a good overview of higher level approaches in a more traditional sense.
We can sometimes map our anomaly detection problem into a sequence detection problem with a predefined alphabet.
There is a design choice around full sequence vs subsequence approach.
Computational complexity can be a concern if focusing on subsequences within long sequences (potential parallels to genomics in this regard).
Wide range of domains and applications – this paper tries to give a more universal overview and approach that is agnostic of domain and application specifics.
Three broad problem formulations
Entire sequence anomaly detection
Subsequence anomaly detection
Frequency lookup/reference based anomaly detection given a specific query sequence.
Formulation 1 and 2 can be considered special cases of each other in certain circumstances.
For online time series AD we are in either formulation 1 or 2.
If anomaly length is known in advance then this can impact design a lot (e.g. known mutation lengths in DNA for example).
Looks to me like this approach is fundamentally univariate based which could be one big drawback.
There are similarities to something like a language model: if the probability of the ‘sentence’ is very small then it might be an anomaly.
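A rough toy sketch of that framing (my own, not from the survey): fit a first-order Markov model over the symbol alphabet and flag sequences whose average log-likelihood is very low:

```python
import numpy as np

def fit_markov(sequences, alphabet):
    """First-order Markov transition probabilities with add-one smoothing."""
    counts = {a: {b: 1.0 for b in alphabet} for a in alphabet}
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a][b] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
            for a in alphabet}

def avg_log_likelihood(seq, probs):
    """Average log-probability of the transitions in a sequence."""
    lls = [np.log(probs[a][b]) for a, b in zip(seq[:-1], seq[1:])]
    return float(np.mean(lls))

alphabet = ["a", "b", "c"]
train = ["abcabcabc", "abcabc", "abcabcabcabc"]
probs = fit_markov(train, alphabet)
print(avg_log_likelihood("abcabc", probs))  # relatively high – looks normal
print(avg_log_likelihood("aaccbb", probs))  # much lower – candidate anomaly
```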
Rolling window based approaches can better find localised anomalies.
Markovian and ‘model’ based approaches.
Sparse Markovian approaches allow for more flexible and inexact anomaly pattern matching.
HMM approaches can be very sensitive to underlying assumptions and params.
Seems like these sequence based approaches could be useful to explore as a totally different way to attack the problem. They might not be strong enough on their own but could be useful in a more ensemble based approach.
Anomalous ‘discord’ detection within a sequence.
Mentions Hot Sax paper – must look into this.
Some talk of bitmap based representations – need to look into this.
Third approach is to ask “What is the expected frequency of this sequence pattern?”.
Anomaly Detection in Streams with Extreme Value Theory (SPOT)
I have recently needed to watch and track various activities on specific github repos i’m working on; however, the REST API from Github can sometimes be a bit limited (for example, as best i could see, if you want to get the most recent list of people who began watching your repo you need to make a lot of paginated api calls and do battle with rate limiting 💩).
This is where Github Webhooks can be a very useful alternative way to trigger certain events of interest to some endpoint where you can then handle the data as you need. The use case i was interested in was triggering an event any time someone starred, unstarred, watched or forked a specific repository. I wanted to then store that info in a table in Google BigQuery where it can be used to track repository activity over time for various reasons you might want (outreach to the community around the repository, or just tracking growth over time).
After the usual few hours of googling around i landed upon the idea of having the webhook for Github send events to a Google Cloud Function, from there my cloud function can process and append the data onto a BigQuery table. To make developing and maintaining the cloud function easy i used Serverless and built on this example in particular.
p.s. i also found this repository very useful as well as this one from Bloomberg. Also i think you could maybe get something similar done without any code using something like Zapier (although i don’t think they have all the Github Webhook events available).
In a folder where we want the code to live we run the below to install Serverless if needed, and pull down the google-python-simple-http-endpoint template and save it into a new Serverless project called handle-github-events.
The approach i am taking also depends on using a .env file to handle secrets and environment variables, so we also need to install the serverless-dotenv-plugin and run npm install for everything else we need.
Step 2 – Cloud Function
Once we have the bare bones serverless template in place we can build on it to create the function we want for handling incoming requests from the Github webhook. All the code is in this repository and i’ll walk through the main points below.
The core of what we want to do in our Cloud function is in main.py. What it tries to do is:
Validate that the request is coming from a known Github ip address.
Validate that the hashed secret key stored in Github when you create your webhook matches what is expected by the cloud function as pulled from the GITHUB_WEBHOOK_SECRET environment variable.
Parse the json received from the Github request and append it to a table somewhere in BigQuery.
Return as the response to Github some info about the event.
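A heavily simplified sketch of what such a function might look like (my own stripped down version, not the exact code from the repo – it skips the ip check, flattens only a few fields, and assumes pandas-gbq for the append, with the env var names from the .env / serverless setup):

```python
import hashlib
import hmac
import json
import os

import pandas as pd

def github_event(request):
    """Minimal Cloud Function sketch for handling a Github webhook."""
    # 1. validate the payload signature against our shared webhook secret
    secret = os.environ["GITHUB_WEBHOOK_SECRET"].encode()
    signature = request.headers.get("X-Hub-Signature", "")
    expected = "sha1=" + hmac.new(secret, request.get_data(), hashlib.sha1).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return ("invalid signature", 401)

    # 2. parse the event json and flatten it into a one-row dataframe
    event = request.get_json()
    row = {
        "action": event.get("action"),
        "repo": event.get("repository", {}).get("full_name"),
        "sender": event.get("sender", {}).get("login"),
    }
    df = pd.DataFrame([row])

    # 3. append the row to a BigQuery table (names pulled from the environment)
    df.to_gbq(
        f"{os.environ['BQ_DATASET_NAME']}.{os.environ['BQ_TABLE_NAME']}",
        project_id=os.environ["GCP_PROJECT_NAME"],
        if_exists=os.environ.get("BQ_IF_EXISTS", "append"),
    )

    # 4. respond to Github with some info about the event
    return json.dumps(row)
```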
Our serverless.yml file looks like below. Note that it is pulling environment variables required for serverless to deploy from a .env file you would need to create yourself (here is an example in the repo).
Step 3 – Deploy
Once we are ready we run `serverless deploy` and if all goes well see output like below:
>serverless deploy -v
Serverless: DOTENV: Loading environment variables from .env:
Serverless: - GITHUB_WEBHOOK_SECRET
Serverless: - GCP_KEY_FILE
Serverless: - GCP_PROJECT_NAME
Serverless: - GCP_REGION_NAME
Serverless: - BQ_DATASET_NAME
Serverless: - BQ_TABLE_NAME
Serverless: - BQ_IF_EXISTS
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Compiling function "github_event"...
Serverless: Uploading artifacts...
Serverless: Artifacts successfully uploaded...
Serverless: Updating deployment...
Serverless: Checking deployment update progress...
project: <your project name will be here>
region: <your region will be here>
Serverless: Removing old artifacts...
Once your function is deployed (or in reality you might make the Github webhook first and then iterate on the function to get it doing what you want) you can create and test the Github webhook you want to send events from.
In my case and for this post i’m going to add the webhook to my andrewm4894/random repository for illustration. Payload URL is the url of the cloud function we created and Secret should be the same string you are storing in your .env file as “GITHUB_WEBHOOK_SECRET”.
Check whatever events you want to trigger on – in my case it was star, watch and fork events (Note: the function might not work if you were to send all events or different events – you would just need to adapt it accordingly).
Now we can try to see if it works by triggering some events. In this example i logged in as a second username i have and pressed some star, watch, and fork buttons to see what happened.
You can see recent triggers of the webhook in Github and this can be very useful for debugging things and while developing.
And you can also see the response received from the cloud function. In this case showing that “andrewm4894netdata” (my other user) deleted a star from the “andrewm4894/random” repository 😔.
And then finally we can see the stored events in our table in BigQuery:
And that’s it! We have our Github Webhook sending events to our Google Cloud Function which is in turn appending them onto a daily table in BigQuery. Go Webhooks!
I’ve recently set myself the goal of reading one academic paper a week relating to the ML/AI things i’m working on in my current role.
To try help keep me honest and diligent in this regard, I’ve decided to get into the habit of jotting down some quick notes on each paper and, every now and then as i get through a batch of them, sticking them into a blog post (because i like to try squeeze everything and anything into a blog post if i can get away with it, even better if it is minimal extra effort on my part 🙂 ).
Anomaly Detection in Streaming Non-stationary Temporal Data
My Summary: Really interesting paper and application, considers a lot of different design aspects in it. Nice example of a different approach leveraging feature extraction and statistical techniques to get the job done.
Leverages EVT approaches, forecasts boundary for typical behavior in relation to the extremes.
Leverages a feature vector and dimensional reduction approach too which is interesting and somewhat independent of the AD algo.
It is multivariate, but the data they use are all sensor data measuring the same thing, so not quite the same as multivariate measures measuring different things – so there are still questions on how one would normalize accordingly for this approach.
Some lovely pictures.
It is online but does have a sort of offline or training phase where it fits to the ‘representative example’ of the data – and this may need to change/evolve over time.
So it is streaming and unsupervised but with some small caveats.
Interesting discussion on differences between density based and distance based approaches to anomaly detection.
“In contrast, defining an anomaly in terms of the density of the observations means that an anomaly is an observation (or cluster of observations) that has a very low chance of occurrence”.
Offline phase – estimate the properties of the typical dataset which will be used in the online phase of anomaly detection.
HDOutliers is another approach worth looking into.
Interesting choice of 14 features which they then do PCA on. Worth looking into these specific features.
Offline phase is implemented as just a burn in window on the streaming data so this is not too bad.
Feature extraction and dimension reduction a big part of the preprocessing, interesting approach that could be applied to other algos.
Just using first 2 components of the PCA – found that interesting.
There are quite a few steps in the algo – quite involved.
Sliding window with concept drift detection used to determine when need to refit the data – interesting approach as opposed to just refitting at regular intervals. Pros and cons to each potentially.
The output at each timestep is a list of time series flagged as anomalous within that sliding window. So there is not really an anomaly score as such.
They suggest that having the ‘informed’ concept drift based approach is more efficient overall as avoids wasteful refits.
Unclear to me how this would apply to multivariate ts data with many different types of measurements. Does not really discuss this in the paper – maybe worth a question on the repo if playing around with it.
There are still some probably important params like window size and things like that.
A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data
My Summary: Very good reference paper for more traditional methods as opposed to deep learning based approaches. Good discussion on complexity and online setting too. Primarily concerned with traditional tabular data as opposed to time series but still some good ideas to pursue.
2016 paper so maybe pre-deep learning hype which is nice.
Time series data and setting not really any part of this paper so bear that in mind.
Authors use similar taxonomy in terms of types of AD setting as I have seen in other papers.
Mainly if we ‘flatten’ our time series data into feature vectors (tabular data) then we would be in a similar setting as this paper.
Scores are much more useful than binary labels as outputs.
AD Settings (similar to elsewhere):
Normalization suggested as important but no mention of difficulty in normalizing streaming data.
Another reference to the NASA shuttle data – must look into this.
4 Groups of unsupervised AD algos:
KNN based approaches, 10 < k <50 as rule of thumb.
KNN can miss local outliers as relies on neighbours.
LOF is nice in that it can give a score between 0 and 1 ← this is a nice property to have.
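For reference, a quick sketch of getting LOF-based scores on tabular data (using scikit-learn here, which is my own choice rather than anything from the paper):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.normal(0, 1, size=(1000, 5))
X[:5] += 6  # plant a few obvious outliers

lof = LocalOutlierFactor(n_neighbors=20)   # k within the 10 < k < 50 rule of thumb
labels = lof.fit_predict(X)                # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_     # higher = more anomalous
print(np.argsort(scores)[-5:])             # should include the planted outliers
```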
They go through various extensions of LOF.
LoOP – Local Outlier Probability – makes some small changes such that you get back a probability. But still, that probability can be very specific to the particular model. It’s not really a probability you can compare to other models. More useful for within-model observation comparisons.
Some extensions of LOF that first use clustering to reduce complexity.
Clustering based approaches can be very sensitive to the choice of K.
HBOS – histogram statistical based approach. Simple and fast, surprisingly performant.
One class SVM – a range of ways to implement, does not really lend itself well to the online setting.
PCA – get the components and then use them in some way to get an anomaly score. PCA can reduce to clustering equivalence under certain conditions.
PCA can be fast if D not too large.
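One common way to turn PCA into an anomaly score (a sketch of the general idea, not necessarily the exact methods surveyed in the paper) is the reconstruction error from the first few components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.normal(0, 1, size=(2000, 10))
X[:3] += 8  # a few planted anomalies

Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
Z = pca.fit_transform(Xs)
X_hat = pca.inverse_transform(Z)

# anomaly score = per-row reconstruction error
scores = np.sqrt(((Xs - X_hat) ** 2).sum(axis=1))
print(np.argsort(scores)[-3:])  # should include the planted anomalies
```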
Metrics@TopN is a good way to evaluate AD systems. E.g. so long as some anomalies appear at the top of the pile that can be progress (similar evaluation methods to information retrieval).
Rank comparison approaches can be useful too (we should make sure any data we capture lends itself to this approach also).
Local vs Global anomalies are a big consideration in this paper. It’s not quite clear what this would mean in our setting. It’s probably true that we are more interested in global anomalies than local ones. But it is also hard to know which setting you are in, especially in higher dimensions.
k has a big impact on computation time for clustering algos, as does the size of the dataset.
HBOS is fast!
All algos in this paper are available via a rapidminer extension if we wanted to play with them.
Recommendation to start with global based algos as they can also work somewhat on local anomalies.
Clustering approaches also sensitive to random start so good to restart a few times.
Nearest neighbour approaches can be more robust to choice of parameters.
But clustering approaches can be faster than knn approaches.
catch22: CAnonical Time-series CHaracteristics
My Summary: Well done paper, limited application potentially to an online setting but great food for thought on the range of ts feature transformation literature already out there.
Idea to compress time series into useful ‘feature vectors’ that can be used for downstream ML tasks, mainly classification and clustering in this paper.
Starting point is hctsa matlab package feature space of ~5k features. Catch22 is a project to empirically discover the most useful (and computationally reasonable) of these.
Builds on a lot of literature and hand crafted feature engineering in the time series space.
Catch22 is implemented in C with python, R, and Matlab wrappers. This could be useful for netdata core C based stuff.
Many features here may require the full ts to be available prior to calculation so not suitable for online streaming setting. Although could implement windowed versions potentially.
E.g. how do you z-score normalise a stream of data?
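One way around that (just my own sketch, not from the paper) is an exponentially weighted running mean and variance, so each new point is normalised using only the data seen so far:

```python
class OnlineZScore:
    """Exponentially weighted running mean/variance for streaming z-scores."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.mean = None
        self.var = None

    def update(self, x):
        if self.mean is None:
            self.mean, self.var = x, 1.0
            return 0.0
        z = (x - self.mean) / (self.var ** 0.5 + 1e-9)
        # update the running moments after scoring, so a point can't 'hide' itself
        self.mean = (1 - self.alpha) * self.mean + self.alpha * x
        self.var = (1 - self.alpha) * self.var + self.alpha * (x - self.mean) ** 2
        return z

scorer = OnlineZScore()
for x in [1.0, 1.1, 0.9, 1.0, 5.0]:
    print(round(scorer.update(x), 2))  # the final value should stand out
```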
They just used a decision tree as the classification model. I wonder how sensitive the results are to this. I guess it makes sense as they wanted to test the usefulness of the features themselves. Curious why no linear models.
Clustered the ‘performance vectors’ to try to reduce redundancy and overlap of features. That was nice.
Check out the tsfeatures package from Hyndman mentioned in this paper.
It is interesting to look at some of the ts features themselves – don’t reinvent the wheel when all this already exists!
My Summary: Good walkthrough of an end to end service, interesting computer vision application, some good leads to follow up.
Interesting to see they use influxdb (and Kafka, and some flink).
Deployed as service on kubernetes.
Stuck with needing to treat each individual time series separately in the model.
They build a model to label each point in the window as anomalous or not – seems potentially limiting if you are just interested in whether each window of data is anomalous or not.
SR sounds interesting, have not seen that before, worth looking into, although their SR example looks very similar to what you’d get by looking at pct change or something so feels maybe over-engineered.
Converting the problem into a computer vision friendly setting is interesting and not uncommon. In the multivariate setting we could encode the data visually, e.g. fft and wavelet frequency distributions etc. Heatmaps or even some custom encoding into a visual space based on specific characteristics of the data.
Some of the windowed feature engineering stuff seemed interesting, as well as then layering ratios of those windowed features on top of each other.
There is a way for users to label data, which seems to help build golden datasets used for later validation and experimentation.
Need to look into SPOT and DSPOT as EVT based approaches i’ve only previously superficially looked at.
Need to look into DONUT.
Some other good references to follow up on.
Seems like this is indeed a technical paper relating to this Azure api.
I’ve been threatening to myself to do this for a long time and recently got around to it, so as usual i’m going to try milk it for a blog post (Note: i’m not talking about getting into a box like the below picture, its something much less impressive).
Confession – I don’t know matplotlib
I have a confession to make that’s been eating away at me and i need to get off my chest – i’m pretty useless at plotting anything in Python. I never really had the time/need to sit down and ‘learn’ matplotlib from first principles (does anyone?). I’ve usually had tools like Tableau or Looker to sit on top of whatever database i am using and make visualizations pretty painlessly.
When I’ve needed to do something custom or more complicated it usually goes like this: i spend about a day or two randomly googling around for something that looks close enough to what i need, start playing around with the code (copy paste), then i find some other example i like a little bit more that uses a different library (seaborn, bokeh, plotly etc.) and start the whole painful process over again!
Eventually i settle on some Frankenstein solution that gets me over the line until the next time. After living this cycle many times i decided to some day build my own plotting library that would short circuit this shitshow and over time become the answer to all my plotting needs. And i was hoping it would also be a nice excuse to learn about Python packaging and deploying to PyPI.
So after a bit of dicking around with cookiecutter i had the basis for my plotting package (see my minimal example ‘hello world’ type package on PyPI here) and just needed to build out my functionality (am4894plots on PyPI).
I’ve mostly been working with time series data recently so decided to start there with some common typical plots i might often reach for when looking at such data. My main principles in the package are:
Usually my data is in a pandas dataframe and that is what i want to pass into my plotting function, along with a list of what cols i want to plot and as little else as possible.
I don’t care what library i use under the hood and where possible i might even want to implement the same version of a plot in multiple underlying libraries for whatever reason (At the moment it’s mainly just either Plotly or Bokeh being used, but i can easily see myself adding more over time as needs arise).
This package is just for me to use, you are not allowed to use it 🙂
The great thing about leveraging something like cookiecutter is you can plug into as many best practice tools as possible with as little sweat as possible on your end. Below are some notable examples of tools or components you get pretty much out of the box that i expected to have to work much harder for.
I’ve been playing around with KubeFlow a bit lately and found that a lot of the tutorials and examples of Jupyter notebooks on KubeFlow do a lot of the pip install and other sort of setup and config stuff in the notebook itself, which feels icky.
But, in reality, if you were working in Jupyter notebooks on KubeFlow for real you’d want to build a lot of this into the image used to build the notebook server. Luckily, as with most of KubeFlow, it’s pretty flexible to customize and extend as you want, in this case by adding custom jupyter images.
Two main example use cases where you’d want to do this are ensuring some custom python package (e.g. my_utils) you have built is readily available in all your notebooks, and making other external libraries that you use all the time also available – e.g. kubeflow pipelines.
To that end, here is a Dockerfile that illustrates this (and here is corresponding image on docker hub).
Once you have such a custom image building fine it’s pretty easy to just point KubeFlow at it when creating a Jupyter notebook server.
Now when you create a new notebook on that jupyter server you have all your custom goodness ready to go.
Github for notebooks
As i was looking around it seems like there are currently plans to implement some git functionality into the notebooks on KubeFlow in a bit more of a native way (see this issue).
For now i decided to just create a ssh key (help docs) for the persistent workspace volume connected to the notebook server (see step 10 here).
Then once you want to git push from your notebook server you can just hack together a little notebook like this that you can use as a poor man’s git ui 🙂
This post will walk through a synthetic example illustrating one way to use a multi-variate, multi-step LSTM for anomaly detection.
Imagine you have a matrix of k time series data coming at you at regular intervals and you look at the last n observations for each metric.
One approach to doing anomaly detection in such a setting is to build a model to predict each metric over each time step in your forecast horizon and when you notice your prediction errors start to change significantly this can be a sign of some anomalies in your incoming data.
This is essentially an unsupervised problem that can be converted into a supervised one. You train the model to predict its own training data. Then once it gets good at this (assuming your training data is relatively typical of normal behavior of your data), if you see some new data for which your prediction error is much higher than expected, that can be a sign that your new data is anomalous in some way.
Note: This example is adapted and built off of this tutorial which i found a very useful starting point. All the code for this post is in this notebook. The rest of this post will essentially walk though the code.
Imports & Parameters
Below shows the imports and all the parameters for this example; you should be able to play with them and see what different results you get.
Note: There is a Pipfile here that shows the Python libraries needed. If you are not familiar, you should really check out pipenv – it’s really useful once you play with it a bit.
We will generate some random data, and then smooth it out to look realistic. This will be our ‘normal’ data that we will use to train the model.
Then we will make a copy of this normal data and inject in some random noise at a certain point and for a period of time. This will be our ‘broken’ data.
So this ‘broken’ data is the data that we should see the model struggle with in terms of prediction error. It’s this error (aggregated and summarized in some way, e.g. turned into a z-score) that you could then use to drive an anomaly score (you could also use loss from the continually re-training on new data whereby the training loss should initially spike once the broken data comes into the system but over time the training would then adapt the model to the new data).
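In rough outline the data generation looks something like the below sketch (hedged – the exact code and parameter values are in the linked notebook; N_ROWS, N_FEATURES and the broken segment indices here are just illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
N_ROWS, N_FEATURES = 2000, 5

# random walk per feature, smoothed out to look a bit more like real metrics
data = pd.DataFrame(np.random.normal(0, 1, (N_ROWS, N_FEATURES)).cumsum(axis=0))
data = data.rolling(25, min_periods=1).mean()

# 'broken' copy: same data but with a noisy segment injected
data_new = data.copy()
broken_start, broken_end = 1500, 1600
data_new.iloc[broken_start:broken_end] += np.random.normal(
    0, data.std().values * 5, (broken_end - broken_start, N_FEATURES)
)
```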
This gives us our normal-ish real world looking data that we will use to train the model.
To make our ‘broken’ data (called data_new in the code) i lazily just copy the ‘normal’ data but mess up a segment of it with some random noise.
And so below we can see our ‘broken’ data. I’ve set the broken segment to be quite wide here and it’s very obvious the broken data is totally different. The hope is that in reality the model, once trained, would be good at picking up much more nuanced changes in the data that are less obvious to the human eye.
For example if all metrics were to suddenly become more or less correlated than normal but all still each move by a typical amount individually then this is the sort of change you’d like the model to highlight (this is probably something i should have tried to do when making the ‘broken’ data to make the whole example more realistic, feel free to try this yourself and let me know how you get on).
Some Helper Functions
I’ve built some helper functions to make life easier in the example notebook. I’ll share the code below and talk a little about each.
data_reshape_for_model() : This function basically takes in a typical dataframe type array, loops through that data and reshapes it all into a numpy array of the shape expected by the keras LSTM model for both training and prediction. Figuring out how to reshape the data based on the N_TIMESTEPS, N_FEATURES and length of the data was actually probably the trickiest part of this whole example. I’ve noticed that many tutorials online just reshape the data but do so in an incomplete way by essentially just pairing off rows. But what you really want to do is step through all the rows to make sure you roll your N_TIMESTEPS window properly over the data so as to use all possible windows in your training (see the small sketch after this list of helpers).
train() : This is just a simple wrapper for the keras train function. There is no real need for it.
predict() : Similar to train() is just a wrapper function that does not really do much.
model_data_to_df_long() : This function takes in a data array as used by the keras model and unrolls it into one big long pandas dataframe (numpy arrays freak me out a bit sometimes so i always try to fall back to pandas when i can get away with it 😉).
model_df_long_to_wide() : This function then takes the long format dataframe created by model_data_to_df_long() and converts it into a wide format that is closer to the original dataset of one row per observation and one column for each input feature (plus lots more columns for predictions for each feature for each timestep).
df_out_add_errors() : This function adds errors and error aggregation columns to the main df_out dataframe which stores all the predictions and errors for each original row of data.
yhat_to_df_out() : This function takes in the model formatted training data and model formatted prediction outputs and wraps all the above functions to make a nice little “df_out” dataframe that has everything we want in it and is one row per observation, so it lines up more naturally with the original data.
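To make that reshaping concrete, here is a small standalone sketch of the rolling-window reshape (my own simplified version of what data_reshape_for_model() does – the real function is in the notebook):

```python
import numpy as np

def reshape_for_lstm(values, n_timesteps):
    """Roll a window of length n_timesteps over the rows of a 2-d array and
    stack the windows into the (samples, timesteps, features) shape keras expects."""
    windows = [values[i:i + n_timesteps]
               for i in range(len(values) - n_timesteps + 1)]
    return np.stack(windows)

values = np.arange(20).reshape(10, 2)   # 10 rows, 2 features
X = reshape_for_lstm(values, n_timesteps=3)
print(X.shape)  # (8, 3, 2) – every possible window, not just paired-off rows
```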
Build & Train The Model
The below code builds the model, trains it and also calls predict on all the training data to be able to get errors on the original ‘normal’ training data.
We then call our “do everything” yhat_to_df_out() function on the training data and the predictions from the model.
Now we can plot lots of things from df_out. For example, here are the errors averaged across all five features at each timestep prediction horizon.
In the above plot we can see the averaged error of the model on its training data. Each line represents a different forecast horizon. We can see that the lines are sort of ‘stacked’ on top of each other, which makes sense as you’d generally expect the error 5 timesteps out (red line “t4_error_avg”) to be higher than the one step ahead forecast (greeny/orangy line “t0_error_avg”).
If we look at the standard deviation of our errors in a similar way, we can see that it generally tends to increase at times when our 5 original features are diverging from each other – as you can imagine, these are the hardest parts of our time series for this model to predict.
Let’s Break It
So now that we have our model trained on our ‘normal’ data we can use it to see how well it does at predicting our new ‘broken’ data.
From the above we can see that as soon as the random broken data comes into the time series the model prediction errors explode.
As mentioned, this is a very obvious and synthetic use case just for learning on, but the main idea is that if your data changed in a more complicated and harder to spot way then your error rates would reflect this change everywhere. These error rates could then be used as input into a more global anomaly score for your system.
That’s it, thanks for reading and feel free to add any comments or questions below. I may add some more complicated or real world examples building on this approach at a later stage.
UPDATE: Here is a Google Colab notebook that’s a bit better as i’ve worked a bit more on this since the original blog post.
I’m pretty sure i’ll be looking this up again at some stage so that passed one of my main thresholds for a blog post.
I’ve recently been porting some data and model development pipelines over to AWS Lambda and was mildly horrified to see how clunky the whole process for adding custom python packages to your Lambda was (see docs here).
This was probably the best post i found but it still did not quite cover custom python packages you might need to include beyond just the more typical pypi ones like numpy, pandas, etc. (p.s. this video was really useful if you are working in Cloud9).
So i set out to hack together a process that would automate 90% of the work in packaging up any python packages you might want to make available to your AWS Lambda including local custom python packages you might have built yourself.
The result involves a Docker container to build your packages in (i have to use this as a windows based local python package install does not work in Lambda – the install contains some windows stuff Lambda won’t like), and a jupyter notebook (of course there is some jupyter 🙂 ) to take some inputs (what packages you want, what to call the AWS Layer, etc.), build local installs of the packages, add them to a zip file, load the zip file to S3 and then finally use the awscli to make a new layer from said S3 zip file.
The first place to start is with the below Dockerfile that creates a basic conda ready docker container with jupyter installed. Note it also includes conda-build and copies over the packages/ folder into the container (required as i wanted to install my “my_utils” package and have it available to the jupyter notebook).
$ docker run -it --name my-aws-python-packages \
    -e AWS_ACCESS_KEY_ID=$(aws --profile default configure get aws_access_key_id) \
    -e AWS_SECRET_ACCESS_KEY=$(aws --profile default configure get aws_secret_access_key) \
The above runs the container, port forwards 8888 (for jupyter), mounts both the /packages and /work folders (as for these files we want changes from outside docker or inside to be reflected and vice versa), and passes in my AWS credentials as environment variables to the container (needed for the aws cli commands we will run inside the container). Its last step is to then launch jupyter lab, which you should then be able to get to at http://localhost:8888/lab using the token provided by jupyter.
Notebook time – make_layer.ipynb
Once the docker container is running and you are in jupyter, the make_layer notebook automates the local installation of a list of python packages, zipping them into /work/python.zip as expected by AWS Layers (when unzipped your root folder needs to be /python/…), loading the zip to an S3 location, and then using the awscli to add a new layer or a new version (if the layer already exists).
The notebook itself is not that big so i’ve included it below.
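In rough outline it does something like the below sketch (a simplified version of my own – the layer name, bucket and runtime here are just placeholders, and i’m using boto3 where the notebook itself shells out to the awscli):

```python
import subprocess

import boto3

PACKAGES = ["pandas", "./packages/my_utils"]            # pypi and local packages to include
LAYER_NAME = "my-python-packages"                       # placeholder layer name
S3_BUCKET, S3_KEY = "my-bucket", "layers/python.zip"    # placeholder S3 location

# 1. install everything into a local python/ folder (the structure AWS Layers expect)
subprocess.run(["pip", "install", *PACKAGES, "-t", "python/"], check=True)

# 2. zip it up
subprocess.run(["zip", "-r", "python.zip", "python/"], check=True)

# 3. upload the zip to S3
boto3.client("s3").upload_file("python.zip", S3_BUCKET, S3_KEY)

# 4. publish a new layer (or a new version of an existing layer)
boto3.client("lambda").publish_layer_version(
    LayerName=LAYER_NAME,
    Content={"S3Bucket": S3_BUCKET, "S3Key": S3_KEY},
    CompatibleRuntimes=["python3.7"],  # assumption – match whatever runtime your Lambda uses
)
```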
For this example i’ve included two custom packages along with pandas in my AWS Layer. The custom packages are just two little basic hello_world() type packages (one actually creates the subprocess_execute() function used in the make_layer notebook). I’ve then included pandas as well to illustrate how to include a pypi package.
To round off the example we then also need to create a little AWS Lambda function to validate that the packages installed in our layer can actually be used by Lambda.
And we can then go into the AWS console to the Lambda function we just created. We can test it in the UI and see the expected output, whereby our custom functions work as expected, as does pandas:
That’s it for this one – i’m hoping someone might find this useful as i was really surprised by how painful it was to get a simple custom package, or even pypi packages for that matter, available to your AWS Lambda functions.
If you wanted you could convert the ipynb notebook into a python script and automate the whole thing. Although i’m pretty sure Amazon will continue to make the whole experience a bit more seamless and easier over time.
Sometimes you end up with a very wide pandas dataframe and you are interested in doing the same types of operations (data processing, building a model etc.) but focused on subsets of the columns.
For example, if we had a wide df with different time series kpis represented as columns then we might want to do something like look at one kpi at a time, apply some pre-processing and build something like an ARIMA time series model perhaps.
This is the situation i found myself in recently and it took me the best part of an afternoon to figure out. Usually when i find myself in that situation i try and squeeze out a blog post in case it might be useful for someone else or future me.
Note: repository with all code is here. p.s. thanks to this and this post that i built off of.
For this example i’m afraid i’m going to use the Iris dataset :0 . This example is as minimal and easy as i could throw together, basically the aim of the code is to:
Build some function to take in a df, do some processing and spit out a new df.
Have that function be parameterized in some way as might be needed (e.g if you wanted to do slightly different work for one subset of columns).
Apply that function in parallel across the different subsets of your df that you want to process.
There are two main functions of interest here, parallelize_dataframe() and do_work(), both of which live in their own file called my_functions.py which can be imported into your jupyter notebook.
parallelize_dataframe() does the below things:
Break out df into a list of df’s based on the col_subsets list passed in as a parameter.
Wrap the function that was passed in into a partial along with the kwargs (this is how your parameters make it into the do_work() function).
Use map() from multiprocessing to apply the func (along with the args you want to send it) to each subset of columns from your df in parallel.
Reduce all this back into one final df by joining all the resulting df’s from the map() output into one wide df again (note the assumption here of joining back on the df indexes – they need to be stable and meaningful).
The do_work() function in this example is just a simple function to add some new columns as examples of the types of pandas (or any other) goodness you might want to do. In reality, in my case it would be more like an apply_model() type function that would take each subset of columns, do some feature extraction, train a model and then also score the data as needed.
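A stripped-down sketch of the two functions (the real versions are in the linked repo – this is just to show the shape of the approach, with a toy do_work()):

```python
from functools import partial, reduce
from multiprocessing import Pool

import pandas as pd

def do_work(df, multiply_by=1.0):
    """Toy work function: add a new column per input column."""
    out = df.copy()
    for col in df.columns:
        out[f"{col}_x"] = df[col] * multiply_by
    return out

def parallelize_dataframe(df, func, col_subsets, n_jobs=4, **kwargs):
    """Apply func to each column subset of df in parallel and join the results."""
    df_list = [df[cols] for cols in col_subsets]      # break df into subsets
    func_partial = partial(func, **kwargs)            # bake the kwargs into the func
    with Pool(n_jobs) as pool:
        results = pool.map(func_partial, df_list)     # map over the subsets in parallel
    return reduce(lambda a, b: a.join(b), results)    # join back on the (stable) index

if __name__ == "__main__":
    df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
    df_out = parallelize_dataframe(df, do_work, [["a", "b"], ["c", "d"]], multiply_by=2)
    print(df_out)
```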
Having the ability to do this for multiple subsets of columns in your wide df can really free up your time to focus on the more important things like dickying around with model parameters and different pre-processing steps 🙂
That’s pretty much it, a productive afternoon (in the play center with kids i might add) and am quite pleased with myself.
Update: One addition i made to this, as things got more complicated when i went to implement it, was the ability to apply different function params to each subset df – for example, if you wanted to pass in different parameters to the function for different columns. This is implemented in do_parallel_zip.ipynb and the corresponding my_functions_zip.py (i’m calling them “_zip” as they use zip() to “zip” up both the df_list and the corresponding kwargs to go with it, to be unpacked later by do_work_zip()).
To be concrete, if we wanted to multiply the “sepal_…” cols by 100 and the “petal_..” cols by 0.5. We could use the “zip” approach like below (notebook here):
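A minimal sketch of that “zip” idea (again a simplified version of my own, with toy kwargs):

```python
from multiprocessing import Pool

import pandas as pd

def do_work_zip(df_and_kwargs):
    """Unpack (df, kwargs) and apply the per-subset parameters."""
    df, kwargs = df_and_kwargs
    return df * kwargs.get("multiply_by", 1.0)

def parallelize_dataframe_zip(df, col_subsets, kwargs_list, n_jobs=2):
    df_list = [df[cols] for cols in col_subsets]
    with Pool(n_jobs) as pool:
        # zip each subset up with its own kwargs so each can be processed differently
        results = pool.map(do_work_zip, list(zip(df_list, kwargs_list)))
    return pd.concat(results, axis=1)

if __name__ == "__main__":
    df = pd.DataFrame({"sepal_length": [1.0], "sepal_width": [2.0],
                       "petal_length": [3.0], "petal_width": [4.0]})
    out = parallelize_dataframe_zip(
        df,
        [["sepal_length", "sepal_width"], ["petal_length", "petal_width"]],
        [{"multiply_by": 100}, {"multiply_by": 0.5}],
    )
    print(out)
```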