
A Theory of Anomalies

Updated: Jun 20, 2021

Abstract

In this post I briefly discuss anomalies, anomaly detection, and why anomaly detection is valuable. My arguments are mostly based on information theory, so I briefly introduce information theory as well. I argue that anomalies contain the most information about the distributions from which they come, and that because of this wealth of information, there is enormous value in accurately detecting and reacting to anomalies in data, in real time. This is what we do at Snomaly, so this post can also be seen as a sales pitch :)

Throughout the post, anomaly and outlier are used synonymously. I've found that conventionally, anomaly is used in more practical, data science related literature, whereas outlier is used in works that take a more theoretical, computer-sciency approach. They represent identical ideas.


Relevant concepts


What is an anomaly/outlier?

A basic, intuitive definition of an anomaly is “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” (Hawkins 1980).

Note that the definition does not precisely define either the deviation or the generating mechanism. Hence, what counts as an anomaly depends on our assumptions about the statistical processes that generate the observations, and on how we define “deviation” in these observations. Different assumptions and definitions of deviation lead to different models of “outlierness”, and hence to different flavors of anomaly detection. These flavors are part of the reason why anomaly detection is non-trivial and interesting. But before we go into this, we first have to answer why we should even care about detecting anomalies.


Why are anomalies interesting?

We have, I hope, agreed on what an anomaly is. To answer this question properly (and to treat words equally), we must also agree on what interesting is. I will briefly discuss some concepts and results from what is known as information theory. For those of a not-so-theoretical nature, the discussion might seem esoteric. However, it gives us the best quantification of information that we have, by mathematically defining “information” - or how interesting an observation is. I hope that information theory's interesting is analogous to your interesting, at least to some extent.


(briefly) What is information theory?

(For an overview of the history of the field, I recommend: https://web.mit.edu/6.933/www/Fall2001/Shannon2.pdf)


Information theory is the scientific study of information. Unlike most scientific fields, the origin of what we consider information theory can be traced back mainly to a single work: Claude Shannon's 1948 paper “A Mathematical Theory of Communication”. If you take a CD and scratch it, it will most likely still play perfectly; the error-correcting codes that make this possible are a practical application of the theory. In fact, pretty much all modern digital communication and storage is built on results of information theory, and Shannon is considered the father of the Digital Age.

What's remarkable is how widely applicable the results of information theory are. For example, biological information is widely seen as analogous to the concept of information in information theory. Similar to a CD scratched by a knife, animals can remain almost perfectly functional despite receiving “informational damage”, and this can be seen at every level:

  • animals, including humans, have been observed to survive the removal of certain parts of their brains, let alone minor neural damage;

  • DNA sequences, which we believe to be the fundamental source of our genetic information, are routinely damaged and, almost magically, repaired.

But I digress. For an introduction to the fascinating analogy between the two fields, I recommend: https://plato.stanford.edu/entries/information-biological/.


It should be noted that the founding fathers of information theory considered their work to be about the transmission of information, not about information, and hence did not like the name information theory. But information and the transmission of information are mutually dependent; without information there can be no transmission of it, and non-transmissible information is alien to our sensory-based consciousnesses.

In information theory, “anything is a source of information if it has a number of alternative states that might be realized on a particular occasion” (Stanford). We speak about the information content of an event drawn from a random variable (the random variable is the aforementioned source of information with a number of alternative states; the particular occasion is the event whose information content we speak of). Let us agree (I hope) that it is fair to say that the more information content an event carries, the more interesting it is.

Mathematically, the information content of an event x, with probability p, is defined as:

I(x) = -log_b(p) = log_b(1/p)
Note that the base of the logarithm, b, is left unspecified; its choice corresponds only to a scaling factor. Different choices of base correspond to different units of information: if b = 2, the unit is the commonly known bit; if b = e, the unit is the nat; and so on. If you need a quick refresher on logarithms, I will include a brief discussion at the end of the post.
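
To make the definition concrete, here is a minimal Python sketch of it (the function name self_information is my own choice for this post, not something from a standard library):

```python
import math

def self_information(p, base=2):
    """Information content of an event with probability p.
    The base sets the unit: base=2 gives bits, base=math.e gives nats."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p, base)

print(self_information(1.0))          # 0.0 - a certain event is uninformative
print(self_information(0.5))          # 1.0 bit
print(self_information(0.5, math.e))  # ~0.693 nats
```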


The choice of the logarithm function for information content is not arbitrary; in fact, it is determined by Claude Shannon's three axioms of self-information:

  1. An event with probability 100% is perfectly unsurprising and yields no information.

  2. The less probable an event is, the more surprising it is and the more information it yields.

  3. If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.

Only certain functions satisfy these conditions, and they are all logarithmic functions of the above form. The function mathematically captures the ideas that Shannon describes in his axioms. I have omitted the derivation, but it is easy to follow, and its simplicity is beautiful. You can find it on https://en.wikipedia.org/wiki/Information_content.
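
As a quick numerical sanity check of the third axiom, using two independent events (a fair coin flip and a fair dice roll) as an example:

```python
import math

# Axiom 3: for independent events, the information of the joint outcome
# equals the sum of the informations of the individual outcomes.
p_heads, p_six = 1 / 2, 1 / 6
joint = -math.log2(p_heads * p_six)
separate = -math.log2(p_heads) + -math.log2(p_six)
print(abs(joint - separate) < 1e-12)  # True
```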


To illustrate, let's compare the information content of the result of a coin flip and that of a dice roll. Assuming a fair coin, the probability of either outcome (H/T) of a coin flip is:

P(H) = P(T) = 1/2

Hence, the information gain from measuring heads (without loss of generality) is:

I(H) = -log_2(1/2) = log_2(2) = 1
So the information gain is 1 shannon (bit).

Similarly, the probability of any of the six outcomes of a fair dice roll is:

P(any face) = 1/6

And the information content of rolling 6 (or any other valid roll) is:

I(6) = -log_2(1/6) = log_2(6) ≈ 2.585
Let's reflect on what these results mean. In what sense does a dice roll contain more information than a coin flip? Let me entertain you with an analogous and more intuitive example.

We know that there are 26 letters in the English alphabet. Every word, meaningful or not, that we create is a collection of these letters. We also know that computers don't speak our language; they speak binary, the language of the alphabet {0,1}. Although much simpler, a binary alphabet comes at the price of expensive computation, which our brains cannot handle. For example, the word "hi" becomes "0110100001101001". In the English alphabet, we have many more symbols to choose from, so our descriptions become much shorter. We can make 26*26=676 two-letter words in the English alphabet, whereas there are only 2*2=4 in the binary alphabet. One character of the English alphabet is more valuable to us than one character of the binary alphabet.
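
As an aside, the binary string above is simply the 8-bit ASCII encoding of the two characters, which a few lines of Python can reproduce:

```python
# "hi" encoded as 8-bit ASCII reproduces the binary string above.
bits = "".join(format(ord(ch), "08b") for ch in "hi")
print(bits)     # 0110100001101001

# Number of distinct two-symbol "words" in each alphabet:
print(26 ** 2)  # 676 in the English alphabet
print(2 ** 2)   # 4 in the binary alphabet
```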

This is reflected in the information content of the characters of each alphabet. If we assume a uniform distribution (which is not true for the English language; some letters -'a', 'e', 't' - are used way more often than others - 'x', 'q' ):

I(English letter) = -log_2(1/26) = log_2(26) ≈ 4.7 bits

whereas in the case of the binary alphabet,

I(binary symbol) = -log_2(1/2) = 1 bit
(the derivation is analogous to coin-flipping). This is the sense in which a dice roll contains more information than a coin flip.
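
The same numbers fall out of a few one-liners (a small illustration, not a rigorous argument):

```python
import math

# Information content (in bits) of drawing one symbol uniformly at random
# from alphabets of different sizes.
print(-math.log2(1 / 2))   # 1.0    - a binary symbol (or a fair coin flip)
print(-math.log2(1 / 6))   # ~2.585 - a fair dice roll
print(-math.log2(1 / 26))  # ~4.700 - a letter of the English alphabet
```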

Let's consider another example, this time with a non-uniform distribution (i.e. one where not all outcomes are equally likely). Knowing that a particular number is not the winning number in the lottery provides very little information, whereas knowing the winning number tells me pretty much all I need to know.
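
To put rough numbers on this, here is a sketch with made-up lottery odds (real lotteries vary):

```python
import math

# A hypothetical lottery with 10 million possible tickets.
p_win = 1 / 10_000_000
print(-math.log2(p_win))      # ~23.25 bits      - learning the winning number
print(-math.log2(1 - p_win))  # ~0.00000014 bits - learning one losing number
```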

All this to say: information theory is one of the greatest intellectual triumphs of the 20th century, as it allows for the universal quantification of the human notion of 'information'. According to the theory's quantification, the informational value of an event depends on how surprising the event is. The more unlikely an event is to occur, the more informative it is. I hope it is not too controversial to say that anomalies can also be seen as unlikely events, and this unlikeliness is what makes anomalies informative, or interesting.


So what?

I'm sure I didn't have to go into information theory, or really go into anything to make you believe that the winning number in the lottery is far more informative than the non-winning number.

The issue is that this example is quite extreme. In real-life scenarios, things are usually much more subtle. Data doesn't come to us with “anomaly” and “non-anomaly” labels (i.e. winning and non-winning), and even after we know which observations are abnormal and which are not, it is not immediately clear how this information can be used for future benefit (in the lottery case, I can simply buy the winning ticket).

So we must somehow:

  • 'accurately' judge whether observations are normal, and

  • change our behavior accordingly for maximum benefit.

The first step is what we call “anomaly detection”, whether done by a computer program or a human. However, in our context, anomaly detection refers to anomaly detection done by computer programs, as computers are tremendously stronger at computation when the computation is well-defined (by this I mean that there are specific instructions to follow). And as we all know, with great data comes the need for great computation.

This is where the value of streaming anomaly detection lies. Anomaly detection techniques are ubiquitous, but the more sophisticated and generally successful approaches are designed to work statically, on static data structures. Once the data gets very large, using these techniques becomes too costly in terms of the time and space required to execute the detection. This is why we use Kafka Streams, and why we are focusing on both developing and implementing streaming anomaly detection. Streaming approaches allow us to model the information gained from all of the data without the operations involved becoming too costly. The aim is to gain as much insight about the data as possible while keeping the time and space requirements low enough to react in real time.
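
To make the streaming idea concrete, here is a toy sketch of an online detector that keeps only a running mean and variance (via Welford's algorithm) and flags points far from the mean. This is deliberately simplistic and is not how Snomaly's detectors work; it only illustrates the constant-memory, single-pass nature of streaming approaches:

```python
class StreamingZScoreDetector:
    """Toy streaming detector: maintains a running mean and variance and
    flags observations that sit far from the mean. Constant memory,
    one pass over the stream."""

    def __init__(self, threshold=3.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # running sum of squared deviations
        self.threshold = threshold

    def update(self, x):
        """Feed one observation; return True if it looks anomalous."""
        is_anomaly = False
        if self.n > 1:
            std = (self.m2 / (self.n - 1)) ** 0.5
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_anomaly = True
        # Update the running statistics regardless of the verdict.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly

detector = StreamingZScoreDetector()
stream = [10, 11, 9, 10, 12, 10, 11, 9, 10, 55]   # 55 is the oddball
print([x for x in stream if detector.update(x)])  # [55]
```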

Explicit use cases of anomaly detection algorithms include fraud detection (detecting credit card abuse through abnormal buying patterns), medicine (unusual symptoms, abnormal test results), and cybersecurity (detecting whether a certain user is a human being or a computer program). Despite these, we believe that anomaly detection is extremely underutilized. This, we believe, is because of the:

  • undervaluation of anomaly detection, and

  • lack of easy-to-access anomaly detection tools.

We expect both of these to change soon.


Summary

I introduced information theory, specifically what information theory defines to be the “information content” of an event. I argued that this quantification is useful and accurate, and that it corresponds to our human notion of information. I introduced the concept of anomalies, and argued that anomalies contain the most information about the distributions from which they come, as they are observations we deem “unlikely to have come from a certain distribution”. I argued for the value of automated anomaly detection, especially streaming anomaly detection, due to the vastness of the data it can handle in real time.


Coming up

Different assumptions about the model lead to different algorithms, and also to different classifications of the data. So, using an algorithm that is based on assumptions that accurately reflect the distribution of the data is crucial to valuable outlier detection. For example, if the data has a lot of clusters, the algorithm must in some way reflect this. As said before, different assumptions lead to different flavors of outlier detection, and there is no one universally successful approach yet (one with no assumptions). This is why we offer various different anomaly detection techniques at Snomaly.

In my next post, I will discuss some of these techniques, their assumptions, and advantages. I will specifically compare streaming and static approaches to “local outlier factor” algorithms, which are density-based algorithms.
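
As a small preview, scikit-learn ships a static (batch) implementation of the local outlier factor, which gives a feel for what the density-based flavor does. The data below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Two dense clusters plus a single far-away point.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
    [[20.0, 20.0]],
])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)       # -1 marks outliers, 1 marks inliers
print(np.where(labels == -1)[0])  # typically just index 100, the far-away point
```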


Bulut Buyru
