Expose Hidden Threats: Logistic Regression in Action Against Phishing Emails

StratosAlly

December 16, 2024

Logistic regression is a fundamental statistical technique with wide applications in finance, healthcare, cybersecurity, and other fields. Despite its name, logistic regression is mainly used for classification rather than regression. It is particularly suited to binary classification problems, where the goal is to predict one of two possible outcomes.

At its core, logistic regression models the probability of an event occurring given one or more independent variables.

Characteristics of logistic regression include:

Input data: Any numeric data

Output: Binary, 0 or 1, where 0 means false and 1 means true

Predicted output: A probability between 0 and 1

The Logistic Function

The heart of logistic regression is the logistic or Sigmoid function. It’s an S-shaped curve. This function is fitted to the data and used to make probability predictions. This Sigmoid function can be represented as:

P(X) = 1 / (1 + e^-(β₀ + β₁X))

Where: e denotes Euler’s number, which is a mathematical constant; β₀ and β₁ represent coefficients; and X is the input variable.

The result of this function always falls between 0 and 1, making it ideal for probability predictions.
Once the logistic function has been fitted to the data, it can be used to make predictions. For example, given a specific amount of rain, how likely is a flood? The model maps each input onto the logistic curve and returns the result as a probability estimate.
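To make the formula concrete, here is a minimal Python sketch of the sigmoid function applied to the rain and flood example. The coefficient values `BETA0` and `BETA1` are hypothetical numbers chosen for illustration, not values fitted to real data.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(x: float, beta0: float, beta1: float) -> float:
    """P(X) = 1 / (1 + e^-(beta0 + beta1 * x))."""
    return sigmoid(beta0 + beta1 * x)


# Hypothetical coefficients for illustration only (not fitted to real data).
BETA0, BETA1 = -4.0, 0.08

for rain_mm in (0, 50, 100):
    p = predict_probability(rain_mm, BETA0, BETA1)
    print(f"rain = {rain_mm} mm -> P(flood) = {p:.3f}")
```

With these assumed coefficients, the probability rises along the S-shaped curve as rainfall increases and always stays between 0 and 1.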

Fitting the Model: Maximum Likelihood Estimation

The logistic function is fitted to the data using maximum likelihood estimation (MLE). MLE finds the coefficients (the β values) that maximize the likelihood of the observed data. The steps are:

Calculate the likelihood of every data point.

Find the product of those likelihoods to get the overall likelihood.

Find the coefficients that maximize that overall likelihood.

These coefficients can be optimized using techniques such as gradient descent or Newton’s method.
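The three steps above can be sketched in plain Python. This is an illustrative sketch, not a production fitting routine: it works with the logarithm of the likelihood (which turns the product into a sum and has the same maximizer) and climbs it with simple gradient steps. The toy data, the learning rate, and the step count are all assumptions made for the example.

```python
import math


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_likelihood(xs, ys, b0, b1):
    """Sum of log-likelihoods of each point (log of the product of likelihoods)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit(xs, ys, lr=0.1, steps=2000):
    """Gradient ascent on the average log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # gradient contribution of one point
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Toy data (assumed for illustration): the label tends to be 1 for larger x.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit(xs, ys)
print("fitted coefficients:", b0, b1)
print("log-likelihood before:", log_likelihood(xs, ys, 0.0, 0.0))
print("log-likelihood after: ", log_likelihood(xs, ys, b0, b1))
```

Libraries such as scikit-learn use more sophisticated solvers and add regularization by default, so their fitted coefficients will generally differ from this sketch.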

Steps to Classify Phishing Emails Using Logistic Regression

Step 1: Collect a dataset of legitimate and phishing/spam emails

Step 2: Extract relevant features from emails that might indicate phishing

Step 3: Preprocess the data, for example by making sure the dataset is balanced

Step 4: Split the dataset for training and testing

Step 5: Train the logistic regression model using the training data

Step 6: Make predictions on the test set by using the trained model

Step 7: Evaluate the model (for example, with a confusion matrix)

Step 8: Fine-tune the model

Step 9: Deployment and monitoring
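For the evaluation in Step 7, a confusion matrix counts how many emails fall into each combination of true and predicted label. The sketch below computes those counts by hand; the label lists are made-up values for illustration, with 1 assumed to mean phishing and 0 legitimate.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classifier."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # phishing correctly flagged
            else:
                fp += 1  # legitimate email wrongly flagged
        else:
            if truth == positive:
                fn += 1  # phishing email missed
            else:
                tn += 1  # legitimate email correctly passed
    return tp, fp, tn, fn


# Made-up labels for illustration: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

For phishing detection, the missed-phishing count (false negatives) is often worth tracking separately, since accuracy alone can hide it.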

Implementing Phishing Emails Classifier Using Python

We will implement this project in Google Colab.

The email dataset has been taken from the UCI Machine Learning Repository. We will upload this to Google Colab.

Below is a screenshot of the dataset used:

The dataset has 30 features.

Because machine learning models operate on numbers, yes and no are encoded as 1 and -1.

Now, we will import the necessary libraries:

The logistic regression implementation comes from the scikit-learn library.

The function used to read the dataset comes from the NumPy library.
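As a rough sketch of the loading step, the snippet below reads comma-separated rows with NumPy's `loadtxt`. The file layout is an assumption: each row holds the feature values followed by the label in the last column, all encoded as 1 or -1. A tiny in-memory sample with three features stands in for the uploaded file so that the snippet runs on its own.

```python
import io

import numpy as np

# In-memory stand-in for the uploaded CSV file (hypothetical values).
# Assumed layout: feature columns first, label in the last column.
sample_csv = io.StringIO(
    "1,-1,1,-1\n"
    "-1,1,1,1\n"
    "1,1,-1,-1\n"
)

data = np.loadtxt(sample_csv, delimiter=",", dtype=int)

X = data[:, :-1]  # every column except the last: features
y = data[:, -1]   # last column: label (1 or -1)

print("features shape:", X.shape)
print("labels:", y)
```

In Colab, the `io.StringIO` object would be replaced by the path of the uploaded file.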

We will then split our data for training and testing. As logistic regression is a supervised machine learning algorithm, the model is trained on labelled data whose output is known.

Now, this is where logistic regression comes into play:

Finally, we check the accuracy of the algorithm:
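The split, training, and accuracy steps can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions, not the article's original notebook: randomly generated data with 30 features encoded as 1 and -1 stands in for the UCI dataset, and the 80/20 split, `random_state`, and `max_iter` values are illustrative choices. The accuracy it prints therefore says nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 samples, 30 features in {-1, 1}.
rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(500, 30))

# Synthetic labels: driven by the first five features plus noise.
score = X[:, :5].sum(axis=1) + rng.normal(0.0, 1.0, size=500)
y = np.where(score > 0, 1, -1)

# Hold out 20% of the samples for testing, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Train the logistic regression model on the labelled training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the held-out test set and measure accuracy.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {accuracy:.3f}")

# Class probabilities for the first test sample (columns follow model.classes_).
print(model.classes_, model.predict_proba(X_test[:1]))
```

To use the real dataset, `X` and `y` would be replaced by the feature matrix and label vector loaded from the uploaded file.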

Hence, we have implemented a Logistic Regression algorithm for classifying phishing emails.

The logistic regression model built here detects phishing emails with an accuracy of 84.5%.

