Multi-Variate, Multi-Step LSTM for Anomaly Detection

This post will walk through a synthetic example illustrating one way to use a multi-variate, multi-step LSTM for anomaly detection.

Imagine you have k time series of metrics arriving at regular intervals, and at any point in time you look at the last n observations for each metric.

A matrix of 5 metrics from period t to t-n

One approach to anomaly detection in such a setting is to build a model that predicts each metric over each time step in your forecast horizon; when your prediction errors start to change significantly, that can be a sign of anomalies in your incoming data.

This is essentially an unsupervised problem that can be converted into a supervised one: you train the model to predict its own training data. Once it gets good at this (assuming your training data is reasonably typical of the normal behaviour of your data), and you then see new data for which the prediction error is much higher than expected, that can be a sign that your new data is anomalous in some way.

Note: This example is adapted and built off of this tutorial, which I found a very useful starting point. All the code for this post is in this notebook. The rest of this post will essentially walk through the code.

Imports & Parameters

Below are the imports and all the parameters for this example; you should be able to play with them and see what different results you get.

Note: There is a Pipfile here that shows the Python libraries needed. If you are not familiar with it, you should really check out pipenv; it's really useful once you play with it a bit.

import numpy as np
import pandas as pd
from numpy import concatenate
from matplotlib import pyplot
from keras.models import Sequential
from keras.callbacks import Callback
from keras.layers import LSTM, Dense, Activation
import matplotlib.pyplot as plt
%matplotlib inline
N_DATA_ORIG = 3000 # length of data to generate
N_FEATURES = 5 # number of synthetic features to create
N_ROLLING = 1000 # length of window over which to smooth the random data to make it look realistic
N_TIMESTEPS = 5 # number of timesteps you want to both train on and predict out to
N_DATA = N_DATA_ORIG - N_ROLLING # length of final data after smoothing
N_TRAIN_ITERATIONS = 5 # number of times to iterate training of the model
N_EPOCHS = 5 # within each train call how many epochs to run for
BATCH_SIZE = 100 # batch size to train on
N_LAYERS = 3 # number of layers for the LSTM
N_LSTM_UNITS = 2 # number of hidden units in each LSTM layer
BREAK_LEN = 1000 # length of the break in the data we will create
random_break_point = np.random.choice(N_DATA) # pick a random point in the data at which to swap in the broken data

Fake Data!

We will generate some random data, and then smooth it out to look realistic. This will be our ‘normal’ data that we will use to train the model.

I couldn’t help myself.

Then we will make a copy of this normal data and inject in some random noise at a certain point and for a period of time. This will be our ‘broken’ data.

So this ‘broken’ data is the data that we should see the model struggle with in terms of prediction error. It's this error (aggregated and summarized in some way, e.g. turned into a z-score) that you could then use to drive an anomaly score. (You could also use the loss from continually re-training on new data: the training loss should initially spike once the broken data comes into the system, but over time the training would adapt the model to the new data.)
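As a concrete, if simplistic, illustration of the z-score idea, here is a minimal sketch of turning a series of prediction errors into a rolling z-score. The function name and window size are my own arbitrary choices rather than anything from the notebook; later in the post the aggregated error columns would be the natural input to it.

# minimal sketch: convert a series of prediction errors into a rolling z-score anomaly signal
def error_to_zscore(errors, window=100):
    errors = pd.Series(errors).abs()
    # how far the current error is from its recent typical level, in standard deviations
    return (errors - errors.rolling(window).mean()) / errors.rolling(window).std()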

# make some noisy but smooth looking data
data = np.sqrt(np.random.rand(N_DATA_ORIG,N_FEATURES))
df_data = pd.DataFrame(data)
df_data = df_data.rolling(window=N_ROLLING).mean()
df_data = df_data.dropna()
df_data = df_data.head(N_DATA)
print(df_data.shape)
data = df_data.values
# plot the normal healthy data
fig, ax = plt.subplots(num=None, figsize=(14, 6), dpi=80, facecolor='w', edgecolor='k')
size = len(data)
for x in range(data.shape[1]):
    ax.plot(range(0,size), data[:,x], '-', linewidth=1)

This gives us our normal-ish, real-world-looking data that we will use to train the model.

5 random time series that have been smoothed a bit to look realistic.

To make our ‘broken’ data (called data_new in the code), I lazily just copy the ‘normal’ data and mess up a segment of it with some random noise.

# make some random data
data_rand = np.random.rand(N_DATA,N_FEATURES)
data_new = np.copy(data)
# at a random point for a certain number of steps, swap out the smooth data with some random data
data_new[random_break_point:(random_break_point+BREAK_LEN)] = data_rand[random_break_point:(random_break_point+BREAK_LEN)]
# plot the new data
fig, ax = plt.subplots(num=None, figsize=(14, 6), dpi=80, facecolor='w', edgecolor='k')
size = len(data_new)
for x in range(data_new.shape[1]):
    ax.plot(range(0,size), data_new[:,x], '-', linewidth=0.5)

And so below we can see our ‘broken’ data. I've set the broken segment to be quite wide here, and it's very obvious that the broken data is totally different. The hope is that in reality the model, once trained, would be good at picking up much more nuanced changes in the data that are less obvious to the human eye.

For example, if all metrics were to suddenly become more or less correlated than normal, but each still moved by a typical amount individually, that is the sort of change you'd like the model to highlight (this is probably something I should have tried when making the ‘broken’ data to make the whole example more realistic; feel free to try this yourself and let me know how you get on).

Same as the “normal” data but I've messed up a huge chunk of it.
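If you wanted to try the correlation idea above, here is a purely illustrative sketch (my own, not from the notebook): give every feature a shared smoothed latent series so they move together, then, in one segment, let each feature follow its own independent latent series instead, so each metric still looks individually normal but the correlation structure changes.

# hypothetical helper: a smoothed random series of length n
def smooth_series(n, window=50):
    return pd.Series(np.random.rand(n)).rolling(window=window, min_periods=1).mean().values

# features built from a shared latent component, so they are correlated
latent = smooth_series(N_DATA)
data_corr = np.stack([latent + 0.05 * np.random.randn(N_DATA) for _ in range(N_FEATURES)], axis=1)
# in the broken segment each feature follows its own independent latent instead
data_corr_broken = np.copy(data_corr)
segment = slice(random_break_point, random_break_point + BREAK_LEN)
for i in range(N_FEATURES):
    data_corr_broken[segment, i] = smooth_series(N_DATA)[segment] + 0.05 * np.random.randn(N_DATA)[segment]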

Some Helper Functions

I’ve built some helper functions to make life easier in the example notebook. I’ll share the code below and talk a little about each.

  • data_reshape_for_model() : This function takes in a typical dataframe-like array, loops through that data and reshapes it all into a numpy array of the shape expected by the keras LSTM model, for both training and prediction. Figuring out how to reshape the data based on N_TIMESTEPS, N_FEATURES and the length of the data was probably the trickiest part of this whole example. I've noticed that many tutorials online reshape the data in an incomplete way, essentially just pairing off rows. What you really want to do is step through all the rows so that you roll your N_TIMESTEPS window properly over the data and include every possible window in your training (there is a quick shape check of this after the helper code below).
  • train() : This is just a simple wrapper for the keras fit function. There is no real need for it.
  • predict() : Similar to train(), this is just a wrapper function that does not really do much.
  • model_data_to_df_long() : This function takes in a data array as used by the keras model and unrolls it into one big, long pandas dataframe (numpy arrays freak me out a bit sometimes, so I always try to fall back to pandas when I can get away with it 😉).
  • model_df_long_to_wide() : This function then takes the long-format dataframe created by model_data_to_df_long() and converts it into a wide format that is closer to the original dataset: one row per observation and one column for each input feature (plus lots more columns for predictions for each feature at each timestep).
  • df_out_add_errors() : This function adds error and error-aggregation columns to the main df_out dataframe, which stores all the predictions and errors for each original row of data.
  • yhat_to_df_out() : This function takes in the model-formatted training data and model-formatted prediction outputs and wraps all the above functions to make a nice little “df_out” dataframe that has everything we want in it, one row per observation, so it lines up more naturally with the original data.

def data_reshape_for_model(data_in,n_timesteps,n_features,print_info=True):
    ''' Function to reshape the data into model ready format, either for training or prediction.
    '''
    # get original data shape
    data_in_shape = data_in.shape
    # create a dummy row with desired shape and one empty observation
    data_out = np.zeros((1,n_timesteps,n_features))
    # loop through each row of data and reshape accordingly
    for row in range(len(data_in)):
        # for each row look ahead as many timesteps as needed and then transpose the data to give the shape keras wants
        tmp_array = np.array([data_in[row:(row+n_timesteps),].transpose()])
        # if the reshaped data is as expected then concatenate the new observation into data_out
        if tmp_array.shape == (1,n_timesteps,n_features):
            data_out = np.concatenate((data_out,tmp_array))
    # drop first dummy row of data_out
    data_out = data_out[1:]
    # get output data shape
    data_out_shape = data_out.shape
    if print_info: print(f'{data_in_shape} -> {data_out_shape}')
    return data_out
def train(model,data,n_epochs=10,batch_size=50,print_info=False,callbacks=None,shuffle=False,verbose=1):
    ''' Function to take in model and data and train the model using defined params.
    '''
    # fit the model to the data
    model.fit(data, data, epochs=n_epochs, batch_size=batch_size,
              validation_data=(data, data), verbose=verbose, shuffle=shuffle,
              callbacks=callbacks)
    return model
def predict(model,data,print_info=True):
    ''' Function to take in model and data and return predictions in model data format.
    '''
    # get prediction from model
    yhat = model.predict(data)
    if print_info: print(yhat.shape)
    return yhat
def model_data_to_df_long(data,n_timesteps,n_features):
    ''' Function to take model data numpy array and translate it into a long format dataframe.
    '''
    # define empty list to collect data into
    data_tmp = []
    # for each row in the data
    for r in range(len(data)):
        row = data[r]
        # for each feature in each row
        for f in range(n_features):
            # for each timestep of each feature in each row
            for t in range(n_timesteps):
                # add an element to the list encoding what it represents
                tmp = [r,f'f{f}',f't{t}',row[f,t]]
                # append that element to the data
                data_tmp.append(tmp)
    # now use the collected data to create a pandas df
    df_long = pd.DataFrame(data_tmp,columns=['row','feature','timestep','value'])
    # add a label col that can be used to go from long format to wide
    df_long['label'] = df_long['feature'] + '_' + df_long['timestep']
    return df_long
def model_df_long_to_wide(df_long,key_col='label'):
    ''' Function that can translate a long format model data df into a wide version of it.
    '''
    # use pivot to go from long to wide
    df_wide = df_long[['row','label','value']].pivot(index='row',columns=key_col,values='value')
    return df_wide
def df_out_add_errors(df_out,n_timesteps,n_features):
    ''' Function to take in a df_out type df and add in error columns
    '''
    # loop through to get errors
    f_cols = [f'f{f}' for f in range(n_features)]
    t_cols = [f't{t}' for t in range(n_timesteps)]
    for f_col in f_cols:
        for t_col in t_cols:
            lag = int(t_col.replace('t','')) + 1
            df_out[f'{f_col}_{t_col}_error'] = df_out[f_col].shift(lag*1) - df_out[f'{f_col}_{t_col}_yhat']
    # get summary error metrics by timestep across all features
    for t_col in t_cols:
        # collect the raw error columns for this timestep before adding any summary columns
        error_cols = [col for col in df_out.columns if f'{t_col}_error' in col]
        df_out[f'{t_col}_error_avg'] = df_out[error_cols].mean(axis=1)
        df_out[f'{t_col}_error_med'] = df_out[error_cols].median(axis=1)
        df_out[f'{t_col}_error_min'] = df_out[error_cols].min(axis=1)
        df_out[f'{t_col}_error_max'] = df_out[error_cols].max(axis=1)
        df_out[f'{t_col}_error_rng'] = df_out[error_cols].max(axis=1) - df_out[error_cols].min(axis=1)
    return df_out
def yhat_to_df_out(data_train,yhat,n_timesteps,n_features):
    ''' Function to take data train and yhat prediction output array from model and turn it into final form of df_out.
    '''
    # helper df's
    df_train_long = model_data_to_df_long(data_train,n_timesteps,n_features)
    df_yhat_long = model_data_to_df_long(yhat,n_timesteps,n_features)
    df_train_wide = model_df_long_to_wide(df_train_long)
    df_yhat_wide = model_df_long_to_wide(df_yhat_long)
    df_yhat_wide.columns = [f'{col}_yhat' for col in df_yhat_wide.columns]
    # begin process to collect final data frame
    # make df_out from the latest timestep of the training data
    train_cols_latest = [col for col in df_train_wide.columns if f't{n_timesteps-1}' in col]
    df_out = df_train_wide[train_cols_latest]
    # clean up col names
    df_out.columns = [col.split('_')[0] for col in df_out.columns]
    # now concat train cols and cols from df_yhat_wide
    df_out = pd.concat([df_out,df_yhat_wide],axis=1)
    # add in error cols
    df_out = df_out_add_errors(df_out,n_timesteps,n_features)
    return df_out
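
As a quick sanity check of that windowing logic (not output copied from the notebook, just what the parameters above imply): with 2000 rows of smoothed data and a window of 5 timesteps you should get 2000 - 5 + 1 = 1996 windows, so the reshape should print something like (2000, 5) -> (1996, 5, 5).

# quick shape check of the reshaping helper on the 'normal' data
_ = data_reshape_for_model(data, N_TIMESTEPS, N_FEATURES)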

Build & Train The Model

The code below builds the model, trains it, and also calls predict on all the training data so we can get errors on the original ‘normal’ training data.
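
One small note before that: the training loop passes a LossHistory callback to keras, but its definition isn't included in the code above. Here is a minimal sketch of what such a callback could look like, just a keras Callback that records the loss after each batch (the losses attribute name is my own choice).

# minimal sketch of a LossHistory callback: record the training loss at the end of each batch
class LossHistory(Callback):
    def on_train_begin(self, logs=None):
        # start a fresh list of losses at the beginning of training
        self.losses = []
    def on_batch_end(self, batch, logs=None):
        # append the loss for this batch, if keras provides it
        self.losses.append(logs.get('loss') if logs else None)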

# build network
model = Sequential()
# add the number of layers specified
for layer in range(N_LAYERS):
    model.add(LSTM(N_LSTM_UNITS,input_shape=(N_TIMESTEPS,N_FEATURES),return_sequences=True))
model.add(Dense(N_FEATURES))
model.compile(loss='mae', optimizer='adam')
# print model summary
print(model.summary())
# reshape data for training
print(f'… reshaping data for training …')
data_train = data_reshape_for_model(data,N_TIMESTEPS,N_FEATURES)
# begin training iterations
for i in range(N_TRAIN_ITERATIONS):
    print(f'… training iteration {i} …')
    model = train(model,data_train,callbacks=[LossHistory()])
# get predictions on healthy data using final trained model
yhat = predict(model,data_train)

We then call our “do everything” yhat_to_df_out() function on the training data and the predictions from the model.

df_out = yhat_to_df_out(data_train,yhat,N_TIMESTEPS,N_FEATURES)

Now we can plot lots of things from df_out. For example, here are the errors averaged across all five features at each timestep of the prediction horizon.

plot_cols = [col for col in df_out.columns if 'error_avg' in col]
print(plot_cols)
# plot the new data
fig, ax = plt.subplots(num=None, figsize=(14, 6), dpi=80, facecolor='w', edgecolor='k')
size = len(df_out)
for col in plot_cols:
    ax.plot(range(0,size), df_out[col], '-', linewidth=0.5)

In the above plot we can see the averaged error of the model on its training data. Each line represents a different forecast horizon. We can see that the lines are sort of ‘stacked’ on top of each other, which makes sense, as you'd generally expect the error 5 timesteps out (red line “t4_error_avg”) to be higher than the one-step-ahead forecast (greeny/orangy line “t0_error_avg”).

If we look at the standard deviation of our errors in a similar way, we can see that it generally tends to increase at times when our 5 original features are diverging from each other; as you can imagine, these are the hardest parts of our time series for this model to predict.
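
The df_out_add_errors() helper above does not actually create a standard deviation column, but one can be bolted on in the same pattern as the other aggregations (the _error_std column name here is my own choice).

# add a per-timestep standard deviation of the errors across all features,
# following the same pattern as the avg/med/min/max columns above
for t_col in [f't{t}' for t in range(N_TIMESTEPS)]:
    error_cols = [f'f{f}_{t_col}_error' for f in range(N_FEATURES)]
    df_out[f'{t_col}_error_std'] = df_out[error_cols].std(axis=1)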

Let's Break It

So now that we have our model trained on our ‘normal’ data we can use it to see how well it does at predicting our new ‘broken’ data.

# now train on new data
print(f'… reshaping data for new data training …')
data_train_new = data_reshape_for_model(data_new,N_TIMESTEPS,N_FEATURES)
print("… begin training on new data …")
model = train(model,data_train_new,n_epochs=1)
yhat_new = predict(model,data_train_new)
df_out_new = yhat_to_df_out(data_train_new,yhat_new,N_TIMESTEPS,N_FEATURES)
plot_cols = [col for col in df_out_new.columns if 'error_avg' in col]
print(plot_cols)
# plot the new data
fig, ax = plt.subplots(num=None, figsize=(14, 6), dpi=80, facecolor='w', edgecolor='k')
size = len(df_out_new)
for col in plot_cols:
    ax.plot(range(0,size), df_out_new[col], '-', linewidth=0.5)
Here we can see that as soon as we hit the broken data the prediction errors go through the roof.

From the above we can see that as soon as the random broken data comes into the time series the model prediction errors explode.

As mentioned, this is a very obvious and synthetic use case, just for learning, but the main idea is that if your data changed in a more complicated and harder-to-spot way, your error rates would reflect this change. These error rates could then be used as input into a more global anomaly score for your system.
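
As a rough sketch of what that could look like, using the error_to_zscore() function sketched earlier in the post (the column, window and threshold below are arbitrary choices, not tuned values):

# turn one of the aggregated error columns into a rolling z-score and flag large values
df_out_new['anomaly_score'] = error_to_zscore(df_out_new['t0_error_avg'], window=100)
df_out_new['anomaly_flag'] = (df_out_new['anomaly_score'] > 3).astype(int)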

That’s it, thanks for reading and feel free to add any comments or questions below. I may add some more complicated or real world examples building on this approach at a later stage.

UPDATE: Here is a Google Colab notebook that's a bit better, as I've worked a bit more on this since the original blog post.