Confusion Matrix in CyberSecurity

Predictive modeling is a mathematical process that predicts future outcomes by studying and analyzing historical data and understanding the patterns in it. It involves several steps, including data collection, feature selection, pre-processing, data wrangling, model creation, and model evaluation. Data wrangling can be defined as converting raw data, by cleaning, structuring, and so on, into a format that improves the quality of a model's decisions. Model evaluation is the part of the model development lifecycle that decides whether the model performs effectively. It is therefore critical to consider the model's outcomes under every applicable evaluation method, since different methods provide different perspectives. Various metrics are available for model evaluation, such as the confusion matrix, accuracy, precision, recall, specificity, and F1 score. This article will mainly focus on the confusion matrix.

The confusion matrix is one of the metrics used in the evaluation of a model, i.e., to check the efficiency or accuracy of the model. The main objective when creating a model is to achieve low bias and low variance, which calls for evaluating the model more critically than a single accuracy number allows.

Let’s look at a standard definition: a confusion matrix is a table widely used to determine the performance of a classification model on a set of test data for which the true values are known. For a binary classifier the matrix is 2×2, built from the actual and predicted values, and holds four counts: True Positive, True Negative, False Positive, and False Negative.

We will see each of these values in detail with an example:

1. True Positive — the predicted value and the actual value are both positive. For example, in a sample of 180 people, 100 people are diagnosed with a disease ‘X’ and the model predicts the disease for them.
2. True Negative — the predicted value and the actual value are both negative, i.e., the patient is not suffering from disease ‘X’ and the model agrees. In our example, 50 people fall into this category.
3. False Negative — the predicted value is negative but the actual value is positive. In our case, we predicted no disease, but the person does have it; 20 people fall into this category.
4. False Positive — the predicted value is positive but the actual value is negative: we claim the person is suffering from the disease, but they are not; 10 people fall into this category.
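The four counts above can be arranged into the 2×2 matrix directly. A minimal sketch in plain Python, using the counts from the disease-‘X’ example (the row/column layout is one common convention, actual class by predicted class):

```python
# Reconstructing the 2x2 confusion matrix from the disease-'X' example:
# 180 people, TP=100, TN=50, FN=20, FP=10 (counts from the article).
TP, TN, FN, FP = 100, 50, 20, 10

# Rows = actual class (positive, negative); columns = predicted class (positive, negative)
confusion_matrix = [
    [TP, FN],  # actually positive: predicted positive / predicted negative
    [FP, TN],  # actually negative: predicted positive / predicted negative
]

total = sum(sum(row) for row in confusion_matrix)
print(confusion_matrix)  # [[100, 20], [10, 50]]
print(total)             # 180
```

The same matrix can be produced by libraries such as scikit-learn from raw label arrays; the hand-built version above just makes the four cells explicit.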

Let’s start by evaluating the results calculated from the confusion matrix.

Accuracy — how often the value predicted by the classifier is correct, calculated by the formula: Accuracy = (TN + TP) / (TN + FP + FN + TP) = (50 + 100) / 180 ≈ 0.83

Misclassification Rate — how often the value predicted by the classifier is incorrect, calculated by the formula:

Misclassification rate = (FN + FP) / (TN + FP + FN + TP) = (20 + 10) / (50 + 10 + 20 + 100) = 30/180 ≈ 16.7%

Precision — when the classifier predicts yes, what percentage of those predictions is correct?

Precision = TP / predicted yes = 100 / (100 + 10) ≈ 0.91

Recall — how many true positives are found out of all the actual positives.

Recall = TruePositives / (TruePositives + FalseNegatives) = 100 / (100 + 20) ≈ 0.83

Prevalence — how often does the yes condition actually occur in our sample?

Prevalence = actual yes / total = (100 + 20) / 180 ≈ 0.67

F1-Score — it is difficult to compare two models when one has low precision and high recall or vice versa. To make them comparable, we use the F-score, which measures recall and precision at the same time. It uses the harmonic mean in place of the arithmetic mean, punishing extreme values more.

F1 = 2 × (Precision × Recall) / (Precision + Recall) = 2 × 0.91 × 0.83 / (0.91 + 0.83) ≈ 0.87
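All of the metrics walked through above can be computed in a few lines from the four counts of the example. A minimal sketch in plain Python (same TP/TN/FP/FN as before):

```python
# Computing every metric discussed above from the example counts
# TP=100, TN=50, FP=10, FN=20 (180 people in total).
TP, TN, FP, FN = 100, 50, 10, 20
total = TP + TN + FP + FN

accuracy = (TP + TN) / total                        # 150/180 ~ 0.833
misclassification = (FN + FP) / total               # 30/180  ~ 0.167
precision = TP / (TP + FP)                          # 100/110 ~ 0.909
recall = TP / (TP + FN)                             # 100/120 ~ 0.833
prevalence = (TP + FN) / total                      # 120/180 ~ 0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ~ 0.870

for name, value in [("accuracy", accuracy), ("misclassification", misclassification),
                    ("precision", precision), ("recall", recall),
                    ("prevalence", prevalence), ("f1", f1)]:
    print(f"{name}: {value:.3f}")
```

Note that accuracy and misclassification rate always sum to 1, and the F1 value sits between precision and recall, closer to the smaller of the two.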

Sensitivity — the ratio of positive cases correctly detected; it is another name for recall. This metric tells how good the model is at recognizing the positive class.

CyberSecurity refers to the process of protecting computers and other devices from theft of information, damage to software or hardware, and loss of other intellectual property whose compromise could cause a serious problem for the person or organization that owns or is in charge of that information.

AI and ML technologies have become critical in the information security world, as they can quickly analyze millions of events and identify many different types of threats. Let’s start by understanding the threats we are discussing. Our focus is mainly on cyber-attacks, which can be defined as any attempt to gain unauthorized access to a computer, computing system, or computer network with the intent to cause damage. They may cause:

1. Identity theft and extortion of information, which might result in blackmail
2. Malware induction into systems, affecting multiple systems by injecting viruses
3. Spoofing, phishing, and spamming
4. Denial of various services, which may further lead to multiple attacks
5. Sabotage of vital information
6. Vandalism of websites
7. Exploitation of privacy through web browsers
8. Account hacks and money scams
9. Ransomware
10. Theft of intellectual property

An attack can succeed because of detection errors of two kinds: a type-1 error (False Positive) raises a false alarm, while a type-2 error (False Negative) lets a real attack pass unnoticed.

Confusion matrix application in machine learning

An IDS (Intrusion Detection System) monitors traffic for suspicious activity and issues alerts when such activity is discovered. It is a software application that scans a network or a system for harmful activity or policy breaches. Any malicious venture or violation is normally reported either to an administrator or collected centrally by a security information and event management (SIEM) system. A SIEM system integrates outputs from multiple sources and uses alarm-filtering techniques to differentiate malicious activity from false alarms.

Although intrusion detection systems monitor networks for potentially malicious activity, they are also prone to false alarms. Hence, organizations need to fine-tune their IDS products when they first install them, which means configuring the intrusion detection system to recognize what normal traffic on the network looks like compared to malicious activity. The IDS also monitors network packets inbound to the system, checks them for malicious activity, and sends warning notifications at once.

In the case of a binary-classifier IDS, four outcomes are possible: attacks correctly predicted as attacks (TP) or incorrectly predicted as normal (FN), and normal traffic correctly predicted as normal (TN) or incorrectly predicted as an attack (FP). False Positives and False Negatives are the errors, and the tradeoff between these two factors can be intuitively analyzed with the help of the Receiver Operating Characteristic (ROC) curve. However, in the case of multi-class classifiers, when one class of attack is incorrectly predicted as another class of attack, the result does not fit any of the four instances. For this reason, a new approach has been proposed to evaluate anomaly-based IDS: a metric called F-score per Cost (FPC), a single value calculated for each attack predictor.
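The four binary IDS outcomes can be tallied directly from labeled traffic. A minimal sketch, assuming hypothetical “attack”/“normal” labels (the six connections below are illustrative, not from a real IDS):

```python
# Tallying the four binary IDS outcomes from labeled traffic.
# Labels and counts are illustrative, not from a real data set.
actual    = ["attack", "attack", "normal", "normal", "attack", "normal"]
predicted = ["attack", "normal", "normal", "attack", "attack", "normal"]

TP = sum(a == "attack" and p == "attack" for a, p in zip(actual, predicted))
FN = sum(a == "attack" and p == "normal" for a, p in zip(actual, predicted))
TN = sum(a == "normal" and p == "normal" for a, p in zip(actual, predicted))
FP = sum(a == "normal" and p == "attack" for a, p in zip(actual, predicted))

detection_rate = TP / (TP + FN)    # fraction of real attacks caught
false_alarm_rate = FP / (FP + TN)  # fraction of normal traffic flagged

print(TP, FN, TN, FP)  # 2 1 2 1
```

Sweeping the IDS alert threshold and plotting detection rate against false alarm rate at each setting is exactly what produces the ROC curve mentioned above.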

In this setting there are two classes of connection points, “Normal” and “Attack”; the attack type is not identified here. A data set containing labeled or unlabeled data points is used to evaluate the IDS. For example, the KDD CUP ’99 competition presented a data set of five classes: a normal class and four classes of different attacks. The KDD ’99 data set is considered a benchmark data set for evaluating IDS, and most previous studies have used it for training, testing, and validating their proposed IDS.

The task for the KDD ’99 Cup was to build a classifier capable of distinguishing between legitimate and illegitimate connections in a computer network. This data set is now considered the de facto data set for intrusion detection. The connections in the data set are either normal connections or intrusions, of which there are four main categories: Probing (surveillance, port scanning, etc.), DoS (Denial of Service), U2R (unauthorized access to local superuser privileges), and R2L (unauthorized access from a remote machine). The evaluation metrics applied were derived and calculated from the four instances TP, TN, FP, and FN, which are the outcome of comparing the two actual classes with the two predicted classes. However, attacks of one class that are wrongly predicted as a different class of attack cannot be related to any of these four instances.
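The multi-class situation can be made concrete with a small confusion matrix in the spirit of KDD ’99. The counts below are made up for the sketch; only the five class names come from the data set description:

```python
# Illustrative 5-class confusion matrix (classes as in KDD'99).
# Counts are hypothetical. Rows = actual class, columns = predicted class.
classes = ["Normal", "Probing", "DoS", "U2R", "R2L"]
cm = [
    [90,  2,  3, 0, 0],  # Normal
    [ 1, 40,  4, 0, 0],  # Probing: 4 Probing attacks predicted as DoS
    [ 2,  1, 70, 0, 0],  # DoS
    [ 0,  0,  0, 5, 1],  # U2R
    [ 1,  0,  0, 2, 9],  # R2L
]

def per_class(cm, i):
    """Derive TP/FP/FN for class i from the multi-class matrix."""
    tp = cm[i][i]
    fn = sum(cm[i]) - tp                            # actual i, predicted elsewhere
    fp = sum(cm[r][i] for r in range(len(cm))) - tp  # predicted i, actually elsewhere
    return tp, fp, fn

tp, fp, fn = per_class(cm, classes.index("Probing"))
print(tp, fp, fn)  # 40 3 5
```

The 4 Probing connections predicted as DoS illustrate the problem noted above: from a binary “attack vs. normal” viewpoint they are still detected attacks, yet for the Probing class they count as errors, so the plain four-instance scheme does not describe them cleanly.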

Here PD stands for Probability of Detection (equivalent to recall) and FAR stands for False Alarm Rate.

Metrics for evaluating the effectiveness of IDS

The accuracy (AC) is the proportion of the total number of correct predictions to the actual data set size. It is determined using the equation:

AC = (TP + TN) / (TP + TN + FP + FN)

The recall (R) is the proportion of correctly predicted attack cases to the actual size of the attack class, calculated using the equation:

R = TP / (TP + FN)

Precision (P) is defined as the proportion of attack cases that were correctly predicted relative to the predicted size of the attack class, calculated using the equation:

P = TP / (TP + FP)

Specificity is the proportion of true negative points to negative elements, calculated using the equation:

Specificity = TN / (TN + FP)

The F-score scores the balance between precision and recall and is a measure of the accuracy of a test. It can be considered the harmonic mean of recall and precision, given as:

F = 2 × P × R / (P + R)
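The IDS evaluation formulas above can be collected into one small helper. A sketch with hypothetical counts (the numbers are not from the KDD ’99 study); note that FAR is simply 1 − specificity:

```python
# The IDS evaluation formulas above, collected into one helper.
def ids_metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    p = tp / (tp + fp)  # Precision
    r = tp / (tp + fn)  # Recall, i.e. PD (probability of detection)
    return {
        "accuracy": (tp + tn) / total,
        "recall": r,
        "precision": p,
        "specificity": tn / (tn + fp),
        "false_alarm_rate": fp / (fp + tn),  # FAR = 1 - specificity
        "f_score": 2 * p * r / (p + r),
    }

# Hypothetical IDS result: 100 attacks (80 caught), 100 normal (5 false alarms)
m = ids_metrics(tp=80, tn=95, fp=5, fn=20)
print({k: round(v, 3) for k, v in m.items()})
```

A single dictionary like this makes it easy to compare several IDS configurations side by side, which is the setting of the next paragraph.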

The use case above compares the false alarm rates of three different IDS.

The area under the curve (AUC) is a convenient way of comparing the three IDS and determining which one suits the use case at hand. The ROC curves show the average intrusion detection rates of the three IDS models, but the ROC curve alone cannot tell us which of the three IDS is suitable for certain circumstances; the AUC condenses each curve into a single comparable value. This is a simple use case of the confusion matrix in CyberSecurity.