The previous two posts gave a short introduction of network anomaly detection in general. We also introduced the k-means algorithm as a simple clustering technique and discussed some advantages and drawbacks of the algorithm. Furthermore we gave some general information about techniques other than clustering which can be used for anomaly detection. In this post we want to introduce a hybrid unsupervised/supervised approach. We are going to use Balanced Iterative Reducing and Clustering using Hierarchies, also known as BIRCH as a pre-clustering step for a subsequent Support Vector Machine (SVM) classifier.
Unfortunately we do not have a labeled data set of the inovex network traffic to train the classifier on. So we will first introduce the general structure of the hybrid approach and the result of the analysis of the NSL-KDD data set. Afterwards we are going to briefly introduce an adaptation of the hybrid approach which can also handle unlabeled data.
The NSL-KDD data set
For training and testing purposes we used the NSL-KDD data set. The NSL-KDD data set is based on the KDD 99 data set, which is a popular choice for intrusion detection tasks. The training data set consists of 125,973 records where 53.46% are normal records and 46.54% represent attacks. The test data set consists of 22,543 records where 43.08% are normal records and 56.92% represent attacks. For training and testing purposes the whole training resp. testing data was used. The data was preprocessed according to the following steps:
- Categorical features: One Hot Encoding
- Continuous features: No adjustments were made
- Binary features: No adjustments were made
All features have been scaled in a second step to ensure that the clustering algorithm is not biased by different valuations.
Clustering & classification
In order to understand the functionality of the hybrid approach you should first get a grasp of the underlying clustering and classification algorithms. Below we are going to introduce them briefly.
BIRCH was originally introduced by Zhang et al. in 1996. The main idea of BIRCH is to scan a data set only once and thereby build a so called Cluster-Feature tree (CF tree) in-memory. The CF tree is a rooted tree consisting of non-leaf nodes, leaf nodes and undirected edges. It stores information about the clusters in so-called Cluster Features. Cluster Features are vectors which store information in the following manner:
\(CF_x = (N, LS, SS)\)
with N as the number of elements in the cluster, LS the linear sum and SS the squared sum of all elements. The cluster features are a compressed way of representing the points of each cluster.
The size and the shape of the tree is mainly influenced by the values of the branching factor B, the threshold T and the maximum number of elements in a leaf node L. B and T are hyperparameters. By changing the value of B and T one can influence the size and the shape of the tree. Every non-leaf node can have a maximum number of B child nodes. Every leaf node contains up to L data points which are no further away as T from each other.
In BIRCH data points are added to the CF tree point by point. This makes it perfectly suitable for handling streaming data as well as large amounts of data. Inserting new data points is done like so:
- Identify the closest leaf node by traversing the CF tree
- Update the leaf node and consider the following rules
- If the entry does not fit into existing leaf nodes create a new leaf if the control parameter B will not be violated
- Else split the leaf nodes and redistribute all entries
- Update all non-leaf nodes (cluster features) if they are affected by the insertion
Not only that inserting new data points can be done dynamically, BIRCH also comes with some additional, very handy properties.
Time complexity: BIRCH scans the data set once. One scan of the data set implies a linear time complexity O(n).
Hyperparameters: In clustering, the user often has to specify k – the expected number of clusters (e.g. in k-means). In BIRCH there is no need to specify k beforehand. This can be pretty handy because often the user doesn’t have enough domain knowledge to specify k. Instead the user specifies some shape parameters for the CF tree. A threshold T which defines the biggest distance from an observation to its cluster center. And a branching factor B which limits the number of possible child nodes in the tree.
Dynamic: The CF tree can be built dynamically. Inserting new observations into an already existing tree is possible.
Using BIRCH one should always consider the following things. First, it can only handle numerical data – so non-numerical features need an additional pre-processing step. Second, BIRCH is sensitive to the order of the elements in the data stream. Supplying BIRCH with a permuted data set might result in a different tree, hence different clusters.
For our purposes we used the scikit-learn library. With this implementation it is fairly easy to use BIRCH. One first defines the model specifications.
model = Birch(threshold=0.5, branching_factor=150, n_clusters=None)
After specifying the model one can train the algorithm either with streaming or batch data.
model.fit(data) # batch training
model.partial_fit(data) # online training
Support Vector Machine
SVMs are very common in today’s machine learning. An SVM tries to map the observations of the training data (vectors) into a higher dimensional space by a function Θ. The SVM then tries to find a hyperplane in that space which separates the data points from each other with the highest margin. The separating hyperplane is depending on the chosen kernel function. There are several kernel functions to choose from, e.g. a linear kernel:
\(k(x_i, x_j) = \theta (x_i)^T\theta (x_j)\)
Usually SVMs perform very well but they also have some drawbacks. SVMs have a time complexity somewhere between \(O(n2)\) and \(O(n3)\) which makes them not suitable for the analysis of huge amount of data. The computational complexity is highly dependant on the number of support vectors.
In this section we are going to show you how we combined the two algorithms to get a scalable SVM classifier for network intrusion detection. We use BIRCH as a pre-processing step for the SVM. We do this mainly because of two reasons: First, with BIRCH we can generalize from the data and thereby remove outliers. Second, we reduce the amount of input data for the SVM classifiers dramatically. The hybrid approach has the following structure:
- Preprocessing of the data (transforming features, setting class labels, …).
- Apply BIRCH
- Construct CF trees, one for each attack type and one for the normal records
- The cluster centers of the newly generated CF trees are used as new input data
- Train four SVM classifiers, one for each attack type. So for each attack type do the following:
- Set the class labels of all observations which don’t belong to the current attack type to ‘normal’.
- Train the SVM classifier with the adjusted data set
- Combine the four SVM classifiers to build an IDS. If at least one of the four classifiers predicts a record as an anomaly, the record will be classified as an anomaly.
Unfortunately we do not have a labeled network traffic data set of the inovex network. For our analysis we are going to use the NSL-KDD data set which we briefly introduced in a previous section.
In a first step we measured the performance of the hybrid approach with respect to its accuracy and false-positive rate on the test data set. The results are shown in figure 1. As we want to misclassify the least possible attacks as normal records we decided to choose the configuration of the algorithm with the lowest false-positive rate – although this reduces the overall accuracy. We were able to achieve a false-positive rate of 1.4% and a total accuracy of 87.83% on the test data set. This result was obtained with a threshold of T = 0.5 and a branching factor of B = 160.
The detailed results per attack type for this configuration are shown in the following table.
|Type of traffic||Accuracy on training data (%)||Accuracy on testing data (%)|
|Denial of Service (dos)||92.26||88.86|
With an overall accuracy of 87.17%, the hybrid approach performs quite good. For the attack types U2R and R2L we can obtain a really good result and are able to detect 99.83% respectively 97.48% of all attacks in the test set.
Other than its accuracy there is an even more interesting fact. Table 2 shows the number of cluster centers for each category after applying BIRCH (B was set to 160). In the original data set we have ~125.000 observations. Applying BIRCH and increasing the threshold T from 0.05 to 0.5 gradually will decrease the number of clusters dramatically. But why? If we increase the value of the threshold parameter T, more and more points will be within the range T. Thus there’s no need for splitting the nodes and hence we end up with less clusters.
|Data type||normal||dos||probe||R2L||U2R||Rate of compression (%)|
|Original data set||67343||45927||11656||995||52||-|
|T = 0.05||20444||7391||3680||325||49||74.69|
|T = 0.1||9750||2049||1831||159||41||89.02|
|T = 0.2||3498||588||765||73||32||96.06|
|T = 0.3||1654||254||421||43||24||98.01|
|T = 0.4||859||114||214||26||17||99.02|
|T = 0.5||538||57||141||18||14||99.40|
Less clusters imply less training data for the SVM classifier used later on. We already mentioned that an SVM has a very bad computational complexity. Thus, decreasing the number of training points leads to a heavy decrease in training time. Table 3 shows the decrease of training time for different threshold values. In our final approach we use a threshold value T = 0.5. This will decrease the amount of data points from ~125.000 to 768 in total and the training time for the SVM classifier from 1117 to only 0.11 seconds!
|Threshold||Training time (s)||Training time (%)|
So using BIRCH as a pre-processing step helps to speed up things dramatically and makes the SVM classifier applicable on even bigger data sets.
Conclusions & further reading
We introduced a hybrid approach for network intrusion detection using BIRCH in combination with SVM classifiers. Using BIRCH we were able to handle the huge amount of data to train SVM classifiers. The SVM classifiers showed a reasonable performance on the training set as well as the test data set.
One drawback of the approach is that we can only apply it on labeled (or partially) labeled data. Often we do not have a labeled network data set and labeling the data set would be too expensive – maybe even impossible. In this case we can use a One Class SVM (OCSVM) instead of the proposed standard SVM classifier. OCSVMs are particularly useful when you have a lot of ‘normal’ observations and only a few ‘abnormal’ observations which is the case when it comes to network traffic data.
For further reading regarding BIRCH and SVMs have a look at the following links: