Logistic regression is a fundamental statistical technique with wide applications in finance, healthcare, cybersecurity, and other fields. Despite its name, logistic regression is mainly used for classification rather than regression. It is particularly useful for binary classification problems, where the goal is to predict one of two possible outcomes.
At its core, logistic regression models the probability of an event occurring given one or more independent variables.
Characteristics of logistic regression include:
- Input data: any numeric data
- Output: binary (0 or 1), where 0 means false and 1 means true
- Predicted output: a probability between 0 and 1
The Logistic Function
The heart of logistic regression is the logistic or Sigmoid function. It’s an S-shaped curve. This function is fitted to the data and used to make probability predictions. This Sigmoid function can be represented as:
P(X) = 1 / (1 + e^-(β₀ + β₁X))
Where: e denotes Euler’s number, which is a mathematical constant; β₀ and β₁ represent coefficients; and X is the input variable.
The result of this function always falls between 0 and 1, making it ideal for probability predictions.
Once the logistic function has been fitted to the data, it can be used to make predictions. For example, given a specific amount of rain, how likely is a flood? The model maps each input value onto the logistic curve and reads off the corresponding probability.
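To make the formula concrete, here is a minimal sketch of the logistic function applied to the rain and flood example. The coefficient values (BETA0, BETA1) and the rainfall amounts are hypothetical, chosen only for illustration; they are not fitted to any real data.

```python
import math

def logistic(x, beta0, beta1):
    """Return P(X) = 1 / (1 + e^-(beta0 + beta1 * x))."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only (not fitted to real data).
BETA0 = -5.0  # intercept
BETA1 = 0.1   # change in log-odds per millimetre of rain

for rain_mm in (0, 50, 100):
    p = logistic(rain_mm, BETA0, BETA1)
    print(f"rain = {rain_mm:3d} mm -> P(flood) = {p:.3f}")

# rain =   0 mm -> P(flood) = 0.007
# rain =  50 mm -> P(flood) = 0.500
# rain = 100 mm -> P(flood) = 0.993
```

Whatever the input, the output stays strictly between 0 and 1, which is why it can be read as a probability.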
Fitting the Model: Maximum Likelihood Estimation
The logistic function is fitted to the data using maximum likelihood estimation (MLE). MLE finds the coefficients (the β values) that maximize the likelihood of the observed data. The steps are:
- Calculate the likelihood of every data point.
- Multiply those likelihoods together to get the overall likelihood.
- Find the coefficients that maximize that overall likelihood.
There is no closed-form solution for these coefficients, so they are found with numerical optimization techniques such as gradient descent or Newton’s method.
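The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production fitting routine: the toy data, the learning rate, and the iteration count are all assumptions made for the example. It works with the log of the likelihood (a sum of logs instead of a product), which has the same maximum but is numerically safer, and it climbs that log-likelihood with plain gradient steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """Log of the product of per-point likelihoods (i.e. a sum of logs)."""
    p = sigmoid(X @ beta)
    eps = 1e-12  # guards against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_logistic(x, y, lr=0.1, n_iter=5000):
    """Estimate (beta0, beta1) by gradient ascent on the log-likelihood."""
    X = np.column_stack([np.ones_like(x), x])  # column of ones for beta0
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        gradient = X.T @ (y - p) / len(y)  # gradient of the mean log-likelihood
        beta += lr * gradient              # step uphill
    return beta, log_likelihood(beta, X, y)

# Toy data (hypothetical): the classes overlap, so the estimate stays finite.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)

beta, ll = fit_logistic(x, y)
print("beta0, beta1:", beta)
print("log-likelihood:", ll)
```

Gradient descent on the negative log-likelihood is the same procedure with the sign flipped. Library implementations typically use faster solvers, but the objective is the same.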
Steps to Classify Phishing Emails Using Logistic Regression
Step 1: Data Collection of legitimate and phishing/spam emails
Step 2: Extract relevant features from emails that might indicate phishing
Step 3: Preprocess the data (for example, make sure the dataset is balanced)
Step 4: Split the dataset for training and testing
Step 5: Train the logistic regression model using the training data
Step 6: Make predictions on the test set by using the trained model
Step 7: Evaluate the model (for example, with a confusion matrix)
Step 8: Fine-tune the model
Step 9: Deployment and monitoring
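As an illustration of the evaluation step, the sketch below counts the four cells of a confusion matrix by hand and derives accuracy, precision, and recall from them. The label lists are hypothetical, and the 1 / -1 encoding (1 = phishing, -1 = legitimate) is an assumption made for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification result."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # phishing correctly flagged
            else:
                fp += 1  # legitimate email wrongly flagged
        else:
            if actual == positive:
                fn += 1  # phishing email missed
            else:
                tn += 1  # legitimate email correctly passed
    return tp, fp, tn, fn

# Hypothetical labels: 1 = phishing, -1 = legitimate.
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, -1, 1, -1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

For a phishing filter, the false negatives (missed phishing emails) and false positives (blocked legitimate emails) usually matter more than accuracy alone, which is why the confusion matrix is worth inspecting.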
Implementing a Phishing Email Classifier Using Python
We will implement this project in Google Colab.
The email dataset has been taken from the UCI Machine Learning Repository. We will upload this to Google Colab.
Below is a screenshot of the dataset used:
The dataset contains 30 features.
Because computers work with numbers, “yes” and “no” are encoded as 1 and -1.
Now, we will import the necessary libraries:
The logistic regression implementation is provided by the scikit-learn library.
The function used to read the dataset is provided by the NumPy library.
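As a rough sketch of the loading step, the snippet below reads a small comma-separated table with NumPy and separates the features from the label. The in-memory text, the column names, and the assumption that the label sits in the last column are all hypothetical stand-ins; the real dataset file and its layout may differ.

```python
import io
import numpy as np

# Hypothetical stand-in for the dataset file: three feature columns and a
# label column, all encoded as 1 / -1.
csv_text = """f1,f2,f3,label
1,-1,1,-1
-1,1,1,1
1,1,-1,1
"""

# np.loadtxt accepts a file path or a file-like object.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, dtype=int)

X = data[:, :-1]  # every column except the last: features
y = data[:, -1]   # last column: label (assumed position)

print(X.shape, y.shape)
```

In Colab, the io.StringIO object would be replaced by the path of the uploaded file.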
We will then split our data for training and testing. As logistic regression is a supervised machine learning algorithm, the model is trained on labelled data whose output is known.
Now, this is where logistic regression comes into play:
Finally, we check the accuracy of the model:
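The splitting, training, and accuracy steps described above might look roughly like the following. This is a sketch under stated assumptions, not the original notebook code: it uses randomly generated stand-in data with 30 features encoded as 1 / -1 instead of the UCI dataset, and the 80/20 split, the random seeds, and max_iter are choices made for the example. Its accuracy therefore says nothing about the result on the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 1000 samples, 30 features in {-1, 1}.
X = rng.choice([-1, 1], size=(1000, 30))
# Labels depend on the first five features plus noise, encoded as 1 / -1.
score = X[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=1000)
y = np.where(score > 0, 1, -1)

# Hold out 20% of the labelled data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the logistic regression model on the training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on unseen data and compare with the known labels.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
```

Calling model.predict_proba(X_test) instead of predict returns the underlying probabilities, which is useful when the decision threshold needs tuning.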
Hence, we have implemented a logistic regression model for classifying phishing emails.
We have observed that the logistic regression model created here detects phishing emails with an accuracy of 84.5%.