Anomaly Detection: (Dis-)advantages of k-means clustering

Notice:
This post is older than 5 years – the content might be outdated.

In the previous post we talked about network anomaly detection in general and introduced a clustering approach using the very popular k-means algorithm. In this blog post we will show you some of the advantages and disadvantages of using k-means. Furthermore we will give a general overview about techniques other than clustering which can be used for anomaly detection.

Simple k-means is one of the most known and used algorithms for clustering. One of the biggest advantages of k-means is that it is really easy to implement and—even more important—most of the time you don’t even have to implement it yourself! For most of the common programming languages used in data science an efficient implementation of k-means already exists.Pick a language of your choice with an appropriate package / module and get started! But this simplicity also comes with some drawbacks. The most important limitations of Simple k-means are:

The user has to specify \(k\) (the number of clusters) in the beginning
k-means can only handle numerical data
k-means assumes that we deal with spherical clusters and that each cluster has roughly equal numbers of observations

There are several more downsides of the clustering which we will not cover in this post as they are not crucial for the results we obtained.

Choosing an appropriate k

Choosing the number of clusters \(k\) can be difficult even if we have a static data set and previous domain knowledge about the data. But in network anomaly detection our data is neither static nor do we know much about attacks in the future. So how do we choose the parameter \(k\)?

There are several ways to choose an appropriate \(k\). For most of them we do not necessarily need domain knowledge. One of the simplest methods is the so called elbow method. Using the elbow method we run k-means clustering for a range of values of k. (e.g. 1 to 150). For each value of \(k\) we then compute the sum of squared errors (SSE) and add both into a line plot. Illustration 1 shows an exemplary curve of a range of values of \(k\) and the corresponding SSE. We want to choose \(k\) so that we have a small SSE, but as we increase \(k\) the SSE tends to decrease towards 0 (If \(k\) is equal to the number of observations in our data set each data point is considered a cluster and so the SSE is 0). Hence we choose \(k\) such that the SSE is fairly small but the rate of change of the SSE is relatively high. This point is usually the elbow of the curve.

An elbow-shaped curve

In illustration 1 we can see that it might be hard to determine where the elbow of the curve actually is. The elbow method just gives an orientation where the optimal number of \(k\) might be, but it is a very subjective method and for some data sets it might not work.

There are also more analytical ways to determine the optimal number of clusters like implemented in X-means. X-means is a variation of k-means and tries to optimize the Bayesian Information Criteria (BIC) or the Akaike Information Criteria (AIC). Both are well known criteria for model selection¹.

Despite finding an optimal \(k\) there is also another problem: We do not have a fixed data set and therefore we don’t know if \(k\) is a static number. Like the data, it may change over time and we have to check for the optimal \(k\) periodically.

Numerical data

We already mentioned in the previous post that some pre-processing of the data is necessary. This is due to the fact that k-means can only handle numerical data and the results might be skewed if we do not normalize it. Please have a look at the previous post for further information.

Spherical and equally sized clusters

To keep it short: the k-means algorithm is a special form of the well known Expectation Maximization (EM) algorithm and assumes that all clusters are equally sized and have the same variances. In most of the cases this assumption is not satisfied. Clusters will differ in their size, density and variance. Violating this assumption will not make it impossible to use k-means, but one has to be careful interpreting the results!

3 clustering methods compared to the original data

Illustration 2 shows an exemplary clustering of the mouse data set using k-means and EM clustering. We can clearly see the downside of k-means clustering. Even if the clusters are spherical k-means is not able to detect the clusters correctly, which happens because crucial assumptions are violated.

Clustering … what else?

The introduced k-means algorithm is a typical clustering (unsupervised learning) algorithm. Besides clustering the following techniques can be used for anomaly detection:

Supervised learning (classification) is the task of training and applying an ordinary classifier to fully labeled train and test data. One should consider that data sets for anomaly detection can be heavily skewed. There might be a lot of “good“ data points and only a few anomalies. This can cause problems with some classifiers. Normally Support Vector Machines (SVM) or Artificial Neural Networks (ANN) are a good choice.²
Semi-supervised learning is a combination of supervised and unsupervised learning. The data for training is partially labeled, partially unlabeled. The aim of semi-supervised learning is to incorporate the information of the unlabeled observations to enhance the predictive performance.
Hybrids of supervised / unsupervised learning can be both, a combination of supervised or unsupervised and supervised algorithms. Clustering algorithms are normally used to pre-cluster the data for a subsequent classifier. This usually enhances the quality of the results and speeds up the classification process.

In general one can say that the introduced alternatives to clustering have a better predictive performance. But why do we not use them? Supervised algorithms or hybrids can only be used if we have fully labeled data sets. When it comes to anomaly / fraud detection we normally talk about very large data sets. And most of the time these data sets don’t have labels. Labeling the data—if done by a human—is a very time consuming and expensive task. Hence it is not feasible.

Semi-supervised learning algorithms need partly labeled data. Information from unlabeled data can be incorporated via several methods which we will not explain in detail. But just to name a few³:

Generative models
Semi-supervised SVM
Bootstrapping (wrapper)

In general, incorporating the data will improve the performance of the classifier.

So when it comes to large unlabeled data sets, clustering is almost the only possibility we have to analyze the underlying structure of the data.

Conclusion and further reading

Despite the fact that k-means is easy to use we showed that is also comes with some disadvantages. These disadvantages don’t make it impossible to use k-means but they might affect the results of the analysis. This should be considered while interpreting the results. We also gave a more general overview about classification / clustering techniques which can be used for anomaly detection. We may consider supervised, hybrids or semi-supervised algorithms if we have a data set which is at least partially labeled. In the next post we will introduce a hybrid (supervised and unsupervised) approach for anomaly detection in network traffic.

Literature

D. Pelleg, A. Moore (2000): X-means: Extending K-means with Efficient Estimation of the Number of Clusters; ICML ’00 Proceedings of the Seventeenth International Conference on Machine Learning Pages 727-734.
C. Phua, V. Lee, K. Smith, R. Gayler (2010); Comprehensive Survey of Data Mining-based Fraud Detection Research, ICICTA ’10 Proceedings of the 2010 International Conference on Intelligent Computation Technology and Automation Volume 1, p. 50-53.
S. Cheng, J. Liu, X. Tang (2014); Using unlabeled Data to Improve Inductive Models by Incorporating Transductive Models; International Journal of Advanced Research in Artificial Intelligence Volume 3 Number 2, p. 33-38.

Mastering Python: Fortgeschrittene Techniken für Entwickler:innen

Dieses Training führt praxisorientiert in fortgeschrittene Konzepte von Python ein. Im Verlauf des Trainings lernen die Teilnehmer:innen anhand von interaktiven Beispielen und umfangreichen Praxisaufgaben alle wichtigen Konzepte der Sprache kennen.

Zum Training

4 Kommentare

Pingback: K-Means Clustering in Python – Big Data Science, Machine Learning and more…
Pingback: Network Anomaly Detection: Online vs. Offline Machine Learning
Javier sagt:

08.11.2020 um 10:29 p.m. Uhr

Could you give references, please?

Antworten
1. Lukas Funk sagt:
  
  10.11.2020 um 1:31 p.m. Uhr
  
  Well, there are references in the last paragraph. I’m afraid we can’t point you further.
  
  Antworten

Name	Borlabs Cookie
Anbieter	Eigentümer dieser Website
Zweck	Speichert die Einstellungen der Besucher, die in der Cookie Box von Borlabs Cookie ausgewählt wurden.
Cookie Name	borlabs-cookie
Cookie Laufzeit	1 Jahr

Akzeptieren
Name	Google Analytics
Anbieter	Google LLC
Zweck	Cookie von Google für Website-Analysen. Erzeugt statistische Daten darüber, wie der Besucher die Website nutzt.
Datenschutzerklärung	https://policies.google.com/privacy?hl=de
Cookie Name	_ga,_gat,_gid
Cookie Laufzeit	2 Jahre

Akzeptieren
Name	Hotjar
Anbieter	Hotjar Ltd.
Zweck	Hotjar ist ein Analysewerkzeug für das Benutzerverhalten von Hotjar Ltd. Wir verwenden Hotjar, um zu verstehen, wie Benutzer mit unserer Website interagieren.
Datenschutzerklärung	https://www.hotjar.com/legal/policies/privacy/
Host(s)	*.hotjar.com
Cookie Name	_hjClosedSurveyInvites, _hjDonePolls, _hjMinimizedPolls, _hjDoneTestersWidgets, _hjIncludedInSample, _hjShownFeedbackMessage, _hjid, _hjRecordingLastActivity, hjTLDTest, _hjUserAttributesHash, _hjCachedUserAttributes, _hjLocalStorageTest, _hjptid
Cookie Laufzeit	Sitzung / 1 Jahr

Akzeptieren
Name	HubSpot
Anbieter	HubSpot Inc.
Zweck	HubSpot ist ein Verwaltungsdienst für Benutzerdatenbanken bereitgestellt von HubSpot, Inc. Wir nutzen HubSpot auf dieser Website für unsere Online Marketing-Aktivitäten.
Datenschutzerklärung	https://legal.hubspot.com/privacy-policy
Host(s)	*.hubspot.com, hubspot-avatars.s3.amazonaws.com, hubspot-realtime.ably.io, hubspot-rest.ably.io, js.hs-scripts.com
Cookie Name	__hs_opt_out, __hs_d_not_track, hs_ab_test, hs-messages-is-open, hs-messages-hide-welcome-message, __hstc, hubspotutk, __hssc, __hssrc, messagesUtk
Cookie Laufzeit	Sitzung / 30 Minuten / 1 Tag / 1 Jahr / 13 Monate

Akzeptieren
Name	Leadfeeder
Anbieter	Dealfront Group GmbH

Anomaly Detection: (Dis-)advantages of k-means clustering

Choosing an appropriate k

Numerical data

Spherical and equally sized clusters

Clustering … what else?

Conclusion and further reading

Literature

Mastering Python: Fortgeschrittene Techniken für Entwickler:innen

4 Kommentare

Hat dir der Beitrag gefallen? Antwort abbrechen

Ähnliche Artikel

Hierarchical Time Series Forecasting for Electricity Demand

Iterative Regression for the Prediction of Battery Discharge Curves

From Data to Decision: Pioneering Predictive Maintenance in Food Safety

Akzeptieren
Name	OpenStreetMap
Anbieter	OpenStreetMap Foundation
Zweck	Wird verwendet, um OpenStreetMap-Inhalte zu entsperren.
Datenschutzerklärung	https://wiki.osmfoundation.org/wiki/Privacy_Policy
Host(s)	.openstreetmap.org
Cookie Name	_osm_location, _osm_session, _osm_totp_token, _osm_welcome, _pk_id., _pk_ref., _pk_ses., qos_token
Cookie Laufzeit	1-10 Jahre

Akzeptieren
Name	Podigee
Anbieter	Podigee
Zweck	Wird verwendet, um Podigee-Inhalte automatisch zu entsperren.
Datenschutzerklärung	https://www.podigee.com/de/ueber-uns/datenschutz
Host(s)	podigee., podigee.com, podigee.io

Anomaly Detection: (Dis-)advantages of k-means clustering

Choosing an appropriate k

Numerical data

Spherical and equally sized clusters

Clustering … what else?

Conclusion and further reading

Literature

Mastering Python: Fortgeschrittene Techniken für Entwickler:innen

4 Kommentare

Hat dir der Beitrag gefallen? Antwort abbrechen

Ähnliche Artikel

Hierarchical Time Series Forecasting for Electricity Demand

Iterative Regression for the Prediction of Battery Discharge Curves

From Data to Decision: Pioneering Predictive Maintenance in Food Safety

inoNews