Anomaly Detection: (Dis-)advantages of k-means clustering

Posted on: July 4, 2017

Julian Keppel und Sascha Schmalz

In the previous post we talked about network anomaly detection in general and introduced a clustering approach using the very popular k-means algorithm. In this blog post we will show you some of the advantages and disadvantages of using k-means. Furthermore, we will give a general overview of techniques other than clustering that can be used for anomaly detection.

Simple k-means is one of the best-known and most widely used clustering algorithms. One of its biggest advantages is that it is really easy to implement and, even more importantly, most of the time you don't even have to implement it yourself: for most of the common programming languages used in data science an efficient implementation of k-means already exists. Pick a language of your choice with an appropriate package or module and get started! But this simplicity also comes with some drawbacks. The most important limitations of Simple k-means are:

• The user has to specify $$k$$ (the number of clusters) in the beginning
• k-means can only handle numerical data
• k-means assumes that we deal with spherical clusters and that each cluster contains roughly the same number of observations
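To make the simplicity concrete, here is a minimal NumPy sketch of the algorithm itself (in practice you would use a ready-made library implementation; the data here is made up for illustration): assign each point to its nearest centroid, recompute the centroids, and repeat until nothing changes.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd-style k-means sketch (illustrative, not production code)."""
    rng = np.random.default_rng(seed)
    # initialise the centroids with k randomly chosen observations
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every point to every centroid, then hard assignment
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# two well separated synthetic blobs; k-means recovers them easily
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

The whole algorithm fits in a few lines, which is exactly why so many off-the-shelf implementations exist.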

There are several more downsides of k-means clustering which we will not cover in this post, as they are not crucial for the results we obtained.

Choosing an appropriate k

Choosing the number of clusters $$k$$ can be difficult even if we have a static data set and previous domain knowledge about the data. But in network anomaly detection our data is neither static nor do we know much about attacks in the future. So how do we choose the parameter $$k$$?

There are several ways to choose an appropriate $$k$$, and for most of them we do not necessarily need domain knowledge. One of the simplest methods is the so-called elbow method: we run k-means clustering for a range of values of $$k$$ (e.g. 1 to 150), compute the sum of squared errors (SSE) for each value of $$k$$, and plot both in a line chart. Illustration 1 shows an exemplary curve for a range of values of $$k$$ and the corresponding SSE. We want to choose $$k$$ so that we get a small SSE, but as we increase $$k$$ the SSE tends towards 0 (if $$k$$ is equal to the number of observations in our data set, each data point forms its own cluster and the SSE is 0). Hence we choose $$k$$ such that the SSE is fairly small while the rate of change of the SSE is still relatively high. This point is usually the elbow of the curve.

In illustration 1 we can see that it might be hard to determine where the elbow of the curve actually is. The elbow method just gives an orientation where the optimal number of $$k$$ might be, but it is a very subjective method and for some data sets it might not work.
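The elbow curve can be reproduced with a few lines of code. Here is a hedged sketch using a minimal NumPy k-means; the data and the range of $$k$$ are made up for illustration:

```python
import numpy as np

def kmeans_sse(X, k, n_iter=50, seed=0):
    """Run a basic k-means and return the sum of squared errors (SSE)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    # SSE: squared distance of every point to its nearest centroid
    d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

# three well separated blobs, so the SSE curve should bend ("elbow") at k = 3
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.2, (40, 2)) for c in (0.0, 4.0, 8.0)])
# best of several random initialisations per k, to avoid bad local optima
sse = {k: min(kmeans_sse(X, k, seed=s) for s in range(5)) for k in range(1, 7)}
```

Plotting the values of `sse` against $$k$$ yields a curve like the one in Illustration 1: a steep drop up to the true number of clusters, and only marginal improvement afterwards.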

There are also more analytical ways to determine the optimal number of clusters, such as the one implemented in X-means. X-means is a variation of k-means that tries to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC). Both are well-known criteria for model selection¹.
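For reference, both criteria balance goodness of fit against model complexity. With $$\hat{L}$$ the maximized likelihood of the model, $$p$$ the number of free parameters and $$n$$ the number of observations:

$$BIC = p \cdot \ln(n) - 2\ln(\hat{L}) \qquad AIC = 2p - 2\ln(\hat{L})$$

In both cases a lower value is better. Since every additional cluster adds parameters, BIC in particular penalizes large $$k$$ more strongly as $$n$$ grows.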

Besides finding an optimal $$k$$ there is another problem: we do not have a fixed data set, so we cannot assume that $$k$$ is a static number. Like the data, it may change over time, and we have to check for the optimal $$k$$ periodically.

Numerical data

We already mentioned in the previous post that some pre-processing of the data is necessary. This is because k-means can only handle numerical data, and the results might be skewed if we do not normalize it. Please have a look at the previous post for further information.
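As a small illustration, a common normalization step is z-score standardization, which puts features on a comparable scale so that no single feature dominates the Euclidean distance. The feature names and values below are hypothetical:

```python
import numpy as np

# hypothetical network-flow features on very different scales:
# duration (seconds), bytes transferred, packet count
X = np.array([
    [0.5,  12_000,  10],
    [1.2, 450_000, 300],
    [0.1,   2_000,   4],
], dtype=float)

# z-score standardisation: zero mean and unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

Without this step, the byte counts alone would determine almost every distance computed by k-means.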

Spherical and equally sized clusters

To keep it short: the k-means algorithm is a special case of the well-known Expectation Maximization (EM) algorithm and assumes that all clusters are equally sized and have the same variance. In most cases this assumption is not satisfied: clusters will differ in size, density and variance. Violating this assumption does not make it impossible to use k-means, but one has to be careful when interpreting the results!

Illustration 2 shows an exemplary clustering of the mouse data set using k-means and EM clustering. We can clearly see the downside of k-means: even though the clusters are spherical, k-means is not able to detect them correctly, because its other assumptions are violated.

Clustering … what else?

The introduced k-means algorithm is a typical clustering (unsupervised learning) algorithm. Besides clustering the following techniques can be used for anomaly detection:

• Supervised learning (classification) is the task of training an ordinary classifier on fully labeled training data and applying it to test data. One should keep in mind that data sets for anomaly detection can be heavily skewed: there may be a lot of “good” data points and only a few anomalies, which can cause problems for some classifiers. Support Vector Machines (SVM) or Artificial Neural Networks (ANN) are usually a good choice.²
• Semi-supervised learning is a combination of supervised and unsupervised learning. The data for training is partially labeled, partially unlabeled. The aim of semi-supervised learning is to incorporate the information of the unlabeled observations to enhance the predictive performance.
• Hybrids of supervised / unsupervised learning combine unsupervised and supervised algorithms. Clustering algorithms are typically used to pre-cluster the data for a subsequent classifier. This usually enhances the quality of the results and speeds up the classification process.

In general one can say that the introduced alternatives to clustering have better predictive performance. So why do we not use them? Supervised algorithms and hybrids can only be used if we have fully labeled data sets. When it comes to anomaly or fraud detection we normally deal with very large data sets, and most of the time these data sets don’t have labels. Labeling the data, if done by a human, is a very time-consuming and expensive task, and hence often not feasible.

Semi-supervised learning algorithms need partially labeled data. The information from the unlabeled data can be incorporated via several methods which we will not explain in detail, but just to name a few³:

• Generative models
• Semi-supervised SVM
• Bootstrapping (wrapper)

In general, incorporating the unlabeled data will improve the performance of the classifier.
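To give a flavor of the bootstrapping (wrapper) idea, here is a hedged, purely illustrative sketch of self-training with a simple nearest-centroid classifier on synthetic data: fit the classifier on the few labeled points, pseudo-label the most confident predictions on the unlabeled points, and repeat.

```python
import numpy as np

rng = np.random.default_rng(3)
# two classes, but only two labeled examples each (hypothetical toy data)
X0, X1 = rng.normal(0.0, 0.5, (60, 2)), rng.normal(4.0, 0.5, (60, 2))
X_lab = np.vstack([X0[:2], X1[:2]])
y_lab = np.array([0, 0, 1, 1])
X_unl = np.vstack([X0[2:], X1[2:]])  # the remaining 116 points are unlabeled

def class_centroids(X, y):
    """Nearest-centroid 'classifier': one mean vector per class."""
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

for _ in range(10):  # self-training rounds
    if len(X_unl) == 0:
        break
    cents = class_centroids(X_lab, y_lab)
    d = np.linalg.norm(X_unl[:, None] - cents[None, :], axis=2)
    pred, conf = d.argmin(axis=1), d.min(axis=1)
    # adopt the most confident 30 % of predictions as pseudo-labels
    take = conf <= np.quantile(conf, 0.3)
    X_lab = np.vstack([X_lab, X_unl[take]])
    y_lab = np.concatenate([y_lab, pred[take]])
    X_unl = X_unl[~take]

final = class_centroids(X_lab, y_lab)
```

After a few rounds the class centroids, initially estimated from only four labeled points, have been refined using the unlabeled observations.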

So when it comes to large unlabeled data sets, clustering is almost the only possibility we have to analyze the underlying structure of the data.
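To close the loop back to anomaly detection: once the clusters are found, one common scoring scheme (simplified here, with made-up centroids and data) flags every new point whose distance to its nearest centroid exceeds a threshold learned from normal traffic:

```python
import numpy as np

rng = np.random.default_rng(0)
# assume these centroids came from a previous k-means run on normal traffic
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
train = np.vstack([rng.normal(c, 0.3, (100, 2)) for c in (0.0, 5.0)])

def dist_to_nearest(X, centroids):
    """Distance of every point to its nearest cluster centroid."""
    return np.linalg.norm(X[:, None] - centroids[None, :], axis=2).min(axis=1)

# threshold: 99th percentile of the distances seen on normal data
threshold = np.percentile(dist_to_nearest(train, centroids), 99)

# a normal-looking point and a far-away outlier
new_points = np.array([[0.1, -0.2], [9.0, 9.0]])
is_anomaly = dist_to_nearest(new_points, centroids) > threshold
```

Points well inside a cluster score below the threshold, while points far from every centroid are flagged as anomalies, which is exactly the behavior we want from an unsupervised detector.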