
Details of anomaly detection

1 Introduction

In my previous post, I argued for the theoretical value of anomalies. I introduced information theory, which quantifies our intuitive notion of information. In this quantification, the more unlikely an event is, the more information it contains about the distribution from which it comes. Anomalies are unlikely events; if they were expected or likely, they would be the norm, and hence not anomalies. I concluded that from an information theory perspective, anomalies are more informative than normal events.

That's great, in theory. In this post I will briefly discuss the details of automated anomaly detection. Using the right models to find anomalies is the first step in tapping the informational potential of anomalies, but as with most things in data science, this is easier said than done.

Achieving accurate modeling becomes particularly difficult when time and space constraints are introduced into the system, and they often are. Most high-velocity businesses have thousands or millions of metrics which affect the key performance indicators (KPIs) of the business, and each metric can be receiving thousands or millions of data points per second, depending on how big and fast the business is. So, not only do our models for these metrics have to work accurately, but they also have to be time and space efficient.

Furthermore, a lot of these metrics are usually correlated in some way, so for maximum value, our models need to capture these correlations so that the insight gained reflects the big picture of what is going on in the system. Notice that if we have n metrics, we have on the order of 2^n potential correlations, since any subset of the metrics can interact, so we have to be smart about how we go about finding these correlations (20 metrics gives us 2^20 ≈ 1 million).

In this post I discuss meanings of outliers, temporality (whether an approach considers order and time), and dimensionality (whether the data and the method used are univariate or multivariate). Temporality is discussed mainly to establish that we are dealing with time series, a fact that is easy to take for granted. The dimensionality of an approach categorizes it very broadly, and utilizing both univariate and multivariate approaches is crucial to achieving meaningful results with anomaly detection. I will not go into detail on specific methods here, but I plan to do so in future posts.

2 Details of accurate anomaly detection


2.1 Meanings of outliers

Conventionally, “outlier” means one of two things:

  • unwanted data,

  • events of interest.

The aim of identifying unwanted data is to clean the data so that the observations better represent the distribution we care about, without the “noise” of outliers. We may want to remove this unwanted data because:

  • the observation has a measurement error (and hence we don't wish to include it in our calculations), or

  • we don't believe there is a measurement error, but we still have some reason to only consider “normal” events in our calculations.

Such an approach is useful when we wish to use the data to build a model whose principal aim is making predictions.

With events of interest, we are after the informational value of the observation. Something significant and unusual has happened, and we wish to explore this happening itself. Why did it happen? Do we expect it to happen again? Is it an anomalous event that we want to recur, or one we want to prevent?

Our approach is based on finding events of interest. Detection techniques have subtleties; some are specifically designed to find unwanted data, some to find events of interest, and some work well for both. Recognizing this difference and accurately reflecting it in the models used is important for precision.


2.2 Temporality

To provide real-time anomaly detection, we have to work with time series data as inputs. To achieve this in a scalable manner, we use Kafka through a micro-service architecture.
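To make this concrete, here is a minimal sketch of such a consumer using the kafka-python client. The topic name, broker address, and message schema are illustrative assumptions, not our actual configuration:

```python
import json

from kafka import KafkaConsumer  # kafka-python client

# Hypothetical topic, broker, and schema, purely for illustration.
consumer = KafkaConsumer(
    "metrics",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    point = message.value  # e.g. {"metric": "cpu_load", "ts": 1700000000, "value": 0.73}
    # Each point would be handed off to the relevant detector here.
    print(point["metric"], point["value"])
```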

Time series data introduces some intricacies to detecting anomalies. We are not working with one fixed batch of data; we are working with an ordered, continuous stream. Notice that time series data has the potential to carry more information than a single batch, as we know the temporal order of the observations. That event a happened after event b gives us strictly more information than merely knowing that events a and b happened.

Some methods inherently include temporal information, while others ignore it. The main difference between the two approaches is that the latter methods produce the same results when applied to any permutation of the series. This doesn't seem ideal; in almost any real-life scenario, the temporal order of observations is crucial to accurate classification.

Hence, we either use methods that inherently use temporal information, or we introduce windowing to methods that normally ignore it. Windowing of any kind makes an algorithm pseudo-temporal. It is temporal, because now the order of the sequence matters.

Note that the window size and the windowing protocol used (we could, for instance, use sliding windows) will affect the outcome of the algorithm. The choice of protocol and its parameters usually depends on the data. These choices can be made manually, but also through augmented machine learning (partly manual, but informed by knowledge extracted through machine learning). To choose protocols and parameters optimally, we need some insight into how the data behaves overall with respect to time; we need to detect seasonality. Seasonality and seasonality detection will be discussed in a later post.

I call it “pseudo”-temporal, because just adding windowing to a method that is not temporal gives the method time-temporality, but not order-temporality. Notice that what's important in a windowed method is that the observations arrived in different windows, NOT that they arrived in different orders. The method still doesn't take into account the order of the observations that arrive in the same window. Therefore, just using windowing as a way of adding temporality to our methods isn't ideal in the sense that it heavily depends on the window size, and fails to capture intra-window ordering.
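To make the windowing idea concrete, here is a minimal sliding-window z-score sketch (the window size and threshold are arbitrary illustrative parameters). Shuffling the points inside a window leaves its mean and standard deviation unchanged, so the flags it raises are exactly as order-blind as described above:

```python
from collections import deque
import statistics

def sliding_window_zscores(stream, window_size=50, threshold=3.0):
    """Flag points that deviate strongly from their trailing window.

    Time-temporal: which window a point falls into matters.
    Not order-temporal: permuting points within the window gives the
    same mean/stdev, hence the same flags.
    """
    window = deque(maxlen=window_size)
    for i, x in enumerate(stream):
        if len(window) >= 2:
            mu = statistics.fmean(window)
            sigma = statistics.stdev(window)
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                yield i, x  # point outlier relative to its window
        window.append(x)
```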

Since we are dealing with time series, we have a lot of temporal information in our data streams. In real life, the order and time of things matter. We ideally want to make the best use of this information.


2.3 Types of outlier detection techniques

2.3.1 Input data

The simplest type of data to deal with is a univariate time series.

Intuitively put, a univariate time series is just one time series, whereas a multivariate time series is a collection of multiple time series. In the case of a multivariate time series, each variable could depend not only on its past behavior, but also on the other variables' behaviors. We will have a lot more to say about this later on.
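In code, the distinction is just the shape of the data. A quick sketch with NumPy (the generated series are made up for illustration):

```python
import numpy as np

T = 1000  # number of time steps
rng = np.random.default_rng(0)

# Univariate: a single series, shape (T,).
univariate = rng.normal(size=T)

# Multivariate: three series sharing a latent signal, shape (T, 3).
# Each variable's behavior is informative about the others'.
latent = np.cumsum(rng.normal(size=T))
multivariate = latent[:, None] + rng.normal(size=(T, 3))
```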


2.3.2 Outlier type

There are two conventional categorizations of outliers. I will introduce both.

A commonly used categorization is global, contextual, and collective outliers. Global outliers deviate significantly from the rest of the data set; contextual outliers deviate significantly with respect to a specific context of the object.

A collective outlier does not necessarily have to be an outlier individually, but it is part of a set of objects that as a whole deviate significantly from the entire data set.

Say it is 25°C outside. Is this anomalous? Depends on the time and place. If it is a summer day in Turkey, then it isn't. If it is a winter day in northern Russia, then it is. On the other hand, if it were somehow 100°C outside, this would be an anomaly regardless of the context - it would be a global outlier. This is the intuitive difference between global and contextual outliers.

Contextual outliers are just global outliers with condition(s) introduced. In the previous example, the conditions were time and location.
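One way to see this concretely is that a contextual score is just the usual score computed within a conditioning group. A toy pandas sketch (the temperatures and context labels are made up):

```python
import pandas as pd

# Made-up temperatures with a (place, season) context label.
df = pd.DataFrame({
    "temp":    [25, 27, 24, 26, 28, -5, -7, -6, -4, 25],
    "context": ["TR-summer"] * 5 + ["RU-winter"] * 5,
})

# Global score: z-score against the whole column.
df["z_global"] = (df["temp"] - df["temp"].mean()) / df["temp"].std()

# Contextual score: the same z-score, conditioned on the context.
grp = df.groupby("context")["temp"]
df["z_context"] = (df["temp"] - grp.transform("mean")) / grp.transform("std")

# The winter 25 looks unremarkable globally but stands out
# once we condition on its context.
print(df)
```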

Collective outliers are a bit different. As the name suggests, an observation cannot be a “collective outlier” on its own; we speak of a collectively outlying group, and a collective outlier is an object that is part of such a group.

Consider this: buying toilet paper is not anomalous (I hope). However, if everyone decides to buy toilet paper on the same day, there is certainly something out of the ordinary happening! Another generic example is a variable that normally has some variance staying at exactly the same value for an extended period of time (e.g. a stock price staying static for days). Although the price itself is perhaps not anomalous, the collective behavior surely is.
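The second example translates directly into a simple check: flag any window in which a normally fluctuating variable shows (near-)zero variance. A rough sketch, with an arbitrary window length and tolerance:

```python
import numpy as np

def flag_static_runs(x, window=30, eps=1e-8):
    """Flag windows where a normally noisy series stays (almost) constant.

    Each individual value may be perfectly normal; it is the collective
    behavior (no variance for `window` consecutive steps) that is odd.
    """
    x = np.asarray(x, dtype=float)
    flags = []
    for start in range(len(x) - window + 1):
        if x[start:start + window].std() < eps:
            flags.append((start, start + window))
    return flags
```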

Notice that in both examples, we are using the concept of time by conditioning on it. Most sources give similar examples of collective outliers.

This categorization seems to be almost universally accepted in the more practical, business-oriented, non-academic data science world. Why this is so escapes me, as Blázquez-García et al. offer a more concise categorization, one which also better reflects how detection approaches are categorized.

It is as follows:

  1. Point outliers. A point outlier is a datum that behaves unusually in a specific time instant when compared either to the other values in the time series (global outlier) or to its neighboring points (local outlier). Point outliers can be univariate or multivariate depending on whether they affect one or more time-dependent variables, respectively.

  2. Subsequence outliers. Subsequence outliers refer to consecutive points in time whose joint behavior is unusual, although each individual observation is not necessarily a point outlier. Subsequence outliers can also be global or local, and can affect one or more time-dependent variables.

  3. Outlier time series. Entire time series can be outliers. Naturally, this makes sense only when we are playing with multivariate time series. If the behavior of a variable differs significantly from the behaviors of the other variables, it can be considered an outlier time series.

Let us first compare this categorization to the global/contextual/collective categorization. The reason I discussed the earlier categorization at all is that it is widely used and cited. Upon some reflection, one realizes that contextual outliers are just point outliers with extra condition(s).

To illustrate further, consider the height distribution of a population. A 200 cm tall person would be a global outlier, as it is an unusual height for both men and women. If we had no way of distinguishing between the data from men and women, a 180 cm person would probably not be considered an outlier - although it is an unusual height for a woman, and would be a contextual outlier if we had labels for gender in our data. My point being, contextual outliers are just point outliers with an extra dimension added. There is no inherent difference in the meaning of a global and a contextual outlier.

On the other hand, the second categorization distinguishes between types of outliers, because point/subsequence/time series outliers are inherently different, and hence, require different methods entirely to be detected. Global and contextual outliers can both be classified under point outliers.

The tricky bit is collective outliers - are they equivalent to subsequence outliers? Subsequence outliers refer specifically to consecutive points in time, whereas the definition of collective outliers does not specify that the collection has to be defined by temporal jointness - only that the objects need to be irregular when considered collectively. As I stated before, I have not come across a use of the concept of collective outliers except for observations collected together by time, so to the best of my understanding, collective outliers can also be seen as equivalent to subsequence outliers.

Outlier time series are not considered at all by the first categorization. They are important, as outlier time series detection gives a more macroscopic insight into the delicate interplay of possibly thousands of metrics. If some variable is behaving completely differently from the rest of the data collected, this variable probably needs some exploring, as would any anomalous observation. Hence, outlier time series detection has to be a part of any machine learning anomaly detection system that aims to be complete.
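As a crude illustration (not a method prescribed by the taxonomy), one can score each variable by its average dissimilarity to all the others, using 1 minus the absolute Pearson correlation as the dissimilarity:

```python
import numpy as np

def outlier_series_scores(X):
    """Score each variable of a multivariate series X (shape T x n) by how
    far its behavior is, on average, from every other variable's."""
    corr = np.corrcoef(X.T)         # n x n correlation matrix
    dissim = 1.0 - np.abs(corr)     # crude behavioral distance
    scores = dissim.mean(axis=1)    # average distance to the rest
    return scores.argmax(), scores  # most outlying variable, all scores
```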

So the second categorization offers a more parsimonious theory of the meanings of outliers, and leads to a more concise and clear understanding of what we're dealing with. Such an understanding is foundational to successful anomaly detection. We have to understand what it is we are looking for - simply slapping data with a machine learning algorithm isn't going to achieve optimal results (although it is better than no anomaly detection!).

Lastly, observe that types of outliers are closely related to the input data type and the time context.

Point and subsequence outliers can be found in both univariate and multivariate data, but multivariate point/subsequence outliers - those affecting more than one time-dependent variable - can only be found in multivariate data.

And if the detection method uses the entire time series as contextual information (the model), then the outliers detected are time-global outliers. If the method uses a certain segment of the series, then the outliers detected are time-local. All time-global outliers are also time-local, but not all time-local outliers are time-global. Some observations seem normal when we consider the whole data set, whereas within their time neighborhoods, they might seem anomalous.

I am using the prefix time- to disambiguate, as global and local are overloaded in most fields of data science, and anomaly detection is no exception (see earlier categorization and local outlier factor algorithms for non-time uses of global and local, respectively).


2.3.3 Nature of method used

Earlier, I discussed the inherent temporality of a method. Some methods take time and order into account by design; to others we can introduce pseudo-temporality with the use of windows or other alterations which will not be discussed in this post.

Similarly, there are univariate detection methods which consider only a single time-dependent variable, and multivariate detection methods that by default can work with more than one variable. First, note that we can work with multivariate time series using univariate detection methods, simply by performing individual analysis on each of the variables and ignoring possible dependencies between variables.
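This column-by-column approach is trivially expressed in code; here detect stands for any univariate routine that returns a boolean anomaly mask:

```python
import numpy as np

def columnwise(detect, X):
    """Apply a univariate detector independently to each variable of a
    multivariate series X (shape T x n), ignoring all dependencies
    between the variables."""
    return np.stack([detect(X[:, j]) for j in range(X.shape[1])], axis=1)
```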

Although appealing due to its simplicity, if we are after a holistic picture of what is happening in any system, this is not a wise option. To get meaningful results from possibly hundreds or thousands of metrics, the dependencies have to be taken into account in order to prevent loss of information.

The issue is that we simply have better intuition for univariate data. It's very difficult to intuitively grasp higher-dimensional structures, so our univariate anomaly detection techniques are better developed than our multivariate ones.

Therefore, many “multivariate” techniques actually apply a preprocessing method to the multivariate data to find a new set of variables to which univariate techniques can be applied. These preprocessing methods are based on dimensionality reduction: the reduction of the codependent multivariate data into fewer, ideally (close to) independent, variables, to which univariate techniques are then applied. Note that if either the dimensionality reduction technique or the univariate technique applied considers temporality, then the overall multivariate detection technique also considers temporality.
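A common instance of this pattern, sketched here with scikit-learn's PCA purely as one example of a reduction technique (nothing in this post commits us to PCA specifically):

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_then_detect(X, n_components=3, threshold=3.0):
    """Project a multivariate series X (shape T x n) onto a few roughly
    independent components, then run a simple univariate detector
    (here, a time-global z-score) on each component."""
    Z = PCA(n_components=n_components).fit_transform(X)  # shape (T, k)
    z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    return np.abs(z) > threshold  # boolean anomaly mask, shape (T, k)
```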

Of course, there are techniques which are multi-dimensional by nature (requiring no dimensionality reduction). Unfortunately, all approaches have their ups and downs. There is no one approach that works accurately and quickly on all kinds of data that can identify all kinds of outliers.

Univariate techniques are usually the easiest to use and scale. However, as mentioned before, they come with the cost of information loss. To add to this, if we are working with a system that has a lot of codependent variables to which we are applying univariate anomaly detection, we will be faced with way too many anomalies to make sense of. Our picture of what is happening will be too zoomed in, and as a result, we might not understand what is actually going on.

Dimensionality reduction followed by the application of univariate techniques is similarly easy to use and scale, and can work with multivariate data without completely disregarding the codependencies. The issue with this approach is that it is highly dependent on the reduction technique's success. If we are dealing with a series containing groups of variables that are highly dependent on each other, but not dependent on variables from other groups (i.e. we can get a good non-fuzzy clustering of the variables), then dimensionality reduction will be highly effective. However, this is not always the case. Most of the time, the dependencies between variables are very delicate, and dimensionality reduction will still lead to some informational loss. Whether trading this information for the speed-up is worth it depends on the situation.

Lastly, with no dimensionality reduction, multivariate anomaly detection does not suffer the informational loss that comes with univariate techniques (with or without dimensionality reduction). The first major issue with this approach is that it's (expectedly) harder to scale than the other approaches. As we work with more and more variables, it becomes very difficult to get real-time feedback from multivariate anomaly detection.

Another issue is that it may be hard to interpret the results of multivariate detection. After all, we want to respond in some way to the anomaly detected. While univariate anomaly detection says “there was something anomalous in this particular variable at this particular time”, multivariate detection says “there was something anomalous in the system at this particular time”. While applying univariate detection to multivariate data can be regarded as “too zoomed in”, multivariate detection can be “too zoomed out” in some cases.
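The Mahalanobis distance, a classic multivariate point outlier score (used here only as an example), illustrates the “zoomed out” character well: a large score says the system looked unusual at time t, but not which variable was responsible.

```python
import numpy as np

def mahalanobis_scores(X):
    """Score each time step of a multivariate series X (shape T x n) by
    its Mahalanobis distance from the series mean. High scores mark
    'the system was anomalous here', with no per-variable attribution."""
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X.T))
    return np.sqrt(np.einsum("ti,ij,tj->t", diff, cov_inv, diff))
```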


3 What to do

The key takeaway from this post is that there are multiple kinds of data, multiple kinds of outliers, and multiple ways of detecting outliers. Note also that under these broad categorizations there are more layers of categorization. For example, multivariate point outlier detection methods are categorized as: model-based, dissimilarity-based, and histogramming methods. Naturally, each category has different algorithms, each with their ups and downs when used with a particular set of data.

All this to say: accurate, real-time anomaly detection is very difficult. Our aim is to mine as much valuable information as possible from the data available, in as little time as possible. Valuable is contextual; in some cases we care particularly about point outliers, in some about subsequence outliers.

The truth is that real data often does not adhere to our clear cut categorizations. For optimal results, we can't ignore the possible benefits of diverse approaches. At Snomaly, we aim to utilize the benefits of multiple approaches in order to capture the most amount of information from the anomalous data.

Univariate detection is the simplest to use and scale, and its results are easy to interpret. By singling metrics out, we can relatively easily learn what's normal for each metric on its own. This is key to understanding the data in reasonable time, so univariate methods have to be employed to some extent.

However, as said, the relationships between variables cannot be ignored, and univariate detection alone is not sufficient to explore them. To explore these relationships, we utilize a combination of dimensionality reduction, pure multi-dimensional detection methods, and other machine learning techniques to find correlations between the different variables, creating a behavioral topology of the variables. This is probably the most challenging part of anomaly detection. The algorithm running on the stream of data has to be fast enough to provide real-time insight, but should provide this insight with as little loss of information as possible. The combination of methods we employ aims to give us the best of both worlds: fast and concise.


Bulut Buyru
