A Practical Guide To Logistic Regression

Written: 2024-07-05 14:39:35 Last update: 2024-07-18 15:00:00

Table of Contents

Introduction
Prerequisites
The Code Walkthrough
Loading and Visualizing Data
Creating and Training the Model
Evaluating the Model
Visualizing Performance
Saving and Loading the Model
Running the Script
Conclusion

Introduction

Logistic Regression is a fundamental statistical technique in the field of machine learning and data analysis, used extensively for binary classification problems. Unlike linear regression, which predicts a continuous output, logistic regression predicts the probability of a binary outcome, such as yes/no, true/false, or success/failure. In this article, we'll walk through a practical demonstration of logistic regression using Python, emphasizing key concepts and steps involved in building, evaluating, and using the model.
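To make the idea of "predicting a probability" concrete, here is a minimal sketch of what a fitted logistic regression model computes for a single sample. The weights, intercept, and feature values are made up purely for illustration and do not come from the article's dataset.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and intercept, chosen only for illustration.
weights = np.array([0.8, -0.5])
intercept = 0.1

# One sample with two features.
x = np.array([2.0, 1.0])

# Linear part of the model: weighted sum of the features plus the intercept.
score = np.dot(weights, x) + intercept

# The sigmoid turns the score into the probability of the positive class.
probability = sigmoid(score)

# The usual decision rule: predict class 1 when the probability is at least 0.5.
predicted_class = int(probability >= 0.5)

print(f"score={score:.2f} probability={probability:.3f} class={predicted_class}")
```

A score of zero maps to a probability of exactly 0.5, which is why the sign of the linear score decides the predicted class under the default threshold.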

Prerequisites

To follow along with this demonstration, you'll need a basic understanding of Python and its data science libraries such as Pandas, NumPy, and Matplotlib. Familiarity with machine learning concepts will also be beneficial. Ensure you have the necessary libraries installed:

pip install pandas numpy matplotlib scikit-learn seaborn ipympl

The Code Walkthrough

Let's dive into the source code to see how logistic regression is implemented in Python. The script starts with the necessary imports:

import matplotlib.pyplot as plt
import seaborn as sns
import sys
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve, confusion_matrix, classification_report, ConfusionMatrixDisplay
import pickle

Loading and Visualizing Data

The first step in any machine learning project is to load and understand your data. The function load_csv_into_dataframe loads data from a CSV file into a Pandas DataFrame.

def load_csv_into_dataframe(csv_filename):
    """Load a CSV file into a DataFrame; return None if the file is missing."""
    try:
        df = pd.read_csv(csv_filename)
        return df
    except FileNotFoundError:
        print(f"*** load_csv_into_dataframe(): '{csv_filename}' not found")
    return None
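Before plotting anything, a few quick checks on the loaded DataFrame can catch problems early. The sketch below uses a tiny made-up CSV held in memory (the column names and values are assumptions for illustration, not the article's dataset) and assumes, like the rest of the script, that the last column is the target.

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for the real file; values are illustrative only.
csv_text = """Age,Chol,Target
63,233,1
37,250,0
41,204,0
56,236,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Size of the dataset: rows are samples, columns are features plus the target.
n_rows, n_cols = df.shape

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# How many samples belong to each class in the last (target) column.
class_counts = df.iloc[:, -1].value_counts()

print(f"{n_rows} rows, {n_cols} columns")
print(df.dtypes)
print(missing_per_column)
print(class_counts)
```

A heavily unbalanced class count is worth knowing about up front, because it changes how accuracy should be read later on.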

Visualization helps in understanding the distribution and relationships in the data. The functions visualize_data and visualize_correlation_matrix are used for this purpose.

def visualize_data(df):
    numerical_features = ['Age', 'Sex', 'Trestbps', 'Chol', 'Thalach', 'Oldpeak']
    plt.figure(figsize=(15, 10))
    for i, feature in enumerate(numerical_features):
        plt.subplot(2, 3, i+1)
        plt.hist(df[feature], bins=20, edgecolor='k')
        plt.title(f'{feature} Distribution')
        plt.xlabel(feature)
        plt.ylabel('Frequency')
    plt.tight_layout()
    plt.show()

def visualize_correlation_matrix(df):
    corr_matrix = df.corr()
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', vmin=-1, vmax=1)
    plt.title('Correlation Matrix')
    plt.show()
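The heatmap is easier to act on when paired with a ranked list. As a rough sketch, the same correlation matrix can be used to sort features by the strength of their linear relationship with the target. The small DataFrame here is made up for illustration, and the last column is assumed to be the target.

```python
import pandas as pd

# Made-up data: column "A" rises with the target, column "B" is unrelated to it.
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [5, 1, 4, 2],
    "Target": [0, 0, 1, 1],
})

# Assume the last column is the target, as the rest of the script does.
target = df.columns[-1]

# Take the target's column of the correlation matrix, drop its self-correlation,
# and sort the remaining features by absolute correlation, strongest first.
ranked = df.corr()[target].drop(target).abs().sort_values(ascending=False)

print(ranked)
```

Correlation only captures linear, pairwise relationships, so a low value here does not prove that a feature is useless to the model.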

Creating and Training the Model

Once the data is understood, we proceed to model creation. The create_model function is designed to train a logistic regression model.

def create_model(df):
    print("### create_model(DataFrame)")
    # Every column must be numeric; coercion turns invalid values into NaN.
    numeric_df = df.apply(pd.to_numeric, errors='coerce')
    is_all_numeric = not numeric_df.isnull().values.any()
    if not is_all_numeric:
        print("*** ERROR: the CSV used for training contains non-numeric values, please fix this!")
        print(df)
        return None

    # Features are all columns except the last; the last column is the target.
    X = df.drop(df.columns[-1], axis=1)
    y = df.iloc[:, -1]  # 1-D Series, the shape scikit-learn expects for y

    # Hold out 20% of the rows for testing; random_state makes the split repeatable.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train.values, y_train.values)
    print("Logistic Regression model is created")
    print(f"***\nCoefficients = {model.coef_}\nIntercept = {model.intercept_}\n***")
    print("\n***** Evaluate the model using the same TRAINING data *****")
    predict_data(model, X_train.values, y_train.values)
    print("\n***** Evaluate the model using UNSEEN TESTING data *****")
    predict_data(model, X_test.values, y_test.values)
    return model
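The coefficients and intercept printed by create_model are all the model needs to make a prediction. The following sketch, which uses synthetic data from scikit-learn's make_classification rather than the article's CSV, rebuilds the predicted probabilities by hand and converts the coefficients into odds ratios.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for the article's CSV (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Linear score for every row: features times coefficients, plus the intercept.
scores = X @ model.coef_[0] + model.intercept_[0]

# Applying the sigmoid to the scores gives the positive-class probabilities.
manual_prob = 1.0 / (1.0 + np.exp(-scores))
sklearn_prob = model.predict_proba(X)[:, 1]

# exp(coefficient) is the factor by which the odds change when that feature
# increases by one unit and the other features stay fixed.
odds_ratios = np.exp(model.coef_[0])

print("Manual probabilities match predict_proba:", np.allclose(manual_prob, sklearn_prob))
print("Odds ratios:", odds_ratios)
```

An odds ratio above 1 means the feature pushes predictions toward the positive class, and one below 1 pushes them away from it. The magnitudes are only comparable across features when the features are on similar scales.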

Evaluating the Model

Evaluating the model is crucial to understand its performance. The predict_data function evaluates the model using various metrics.

def predict_data(model, X, y):
    y_pred = model.predict(X)
    y_pred_prob = model.predict_proba(X)[:, 1]
    accuracy = accuracy_score(y, y_pred)
    precision = precision_score(y, y_pred)
    recall = recall_score(y, y_pred)
    roc_auc = roc_auc_score(y, y_pred_prob)
    conf_matrix = confusion_matrix(y, y_pred)
    class_report = classification_report(y, y_pred)
    print(f'Accuracy: {accuracy:.2f}')
    print(f'Precision: {precision:.2f}')
    print(f'Recall: {recall:.2f}')
    print(f'ROC-AUC: {roc_auc:.2f}')
    print('Confusion Matrix:')
    print(conf_matrix)
    print('Classification Report:')
    print(class_report)
    visualize_confusion_matrix(model, conf_matrix)
    visualize_roc_curve(roc_auc, y, y_pred_prob)
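Accuracy, precision, and recall are all derived from the four cells of the confusion matrix. As a small worked example with made-up labels, the sketch below computes them by hand and checks the results against scikit-learn's own functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up true labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels 0/1, ravel() returns the cells in this order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are real
recall = tp / (tp + fn)                     # share of real positives that were found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print("Matches scikit-learn:",
      accuracy == accuracy_score(y_true, y_pred),
      precision == precision_score(y_true, y_pred),
      recall == recall_score(y_true, y_pred))
```

ROC-AUC is the odd one out: it is computed from the predicted probabilities rather than the predicted classes, which is why predict_data passes y_pred_prob to roc_auc_score.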

Visualizing Performance

The model's performance is visualized with a confusion matrix and an ROC curve.

def visualize_confusion_matrix(model, conf_matrix):
    disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=model.classes_)
    disp.plot(cmap='Blues')
    plt.title('Confusion Matrix')
    plt.show()

def visualize_roc_curve(roc_auc, y, y_predicted):
    fpr, tpr, thresholds = roc_curve(y, y_predicted)
    plt.figure()
    plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.show()
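The "area" shown in the ROC plot's legend is literally the area under the points returned by roc_curve. The sketch below, using a handful of made-up labels and scores, shows that integrating the curve with sklearn.metrics.auc gives the same number as roc_auc_score.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.9]

# Each threshold yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the plotted curve, computed from the points themselves.
area_from_curve = auc(fpr, tpr)

# The same quantity computed directly from labels and scores.
area_direct = roc_auc_score(y_true, scores)

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f} fpr={f:.2f} tpr={t:.2f}")
print(f"area from curve={area_from_curve:.4f}, roc_auc_score={area_direct:.4f}")
```

An area of 0.5 corresponds to the gray diagonal in the plot (random guessing), while 1.0 means every positive sample is scored above every negative one.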

Saving and Loading the Model

To avoid retraining the model each time, it can be saved and loaded using the pickle module.

def save_model(model, filename):
    print('### save_model()')
    # The with-block closes the file even if pickling fails.
    with open(filename, 'wb') as f:
        pickle.dump(model, f)

def load_model(filename):
    try:
        with open(filename, 'rb') as f:
            model = pickle.load(f)
        print(f"### load_model(): '{filename}' loaded")
        return model
    except FileNotFoundError:
        print(f"### load_model(): error '{filename}' not found")
    return None
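Here is a minimal sketch of the full round trip: train a model, pickle it, load it back, and use the restored model on new samples. The one-feature training set and the file name demo.model are made up for illustration, and a temporary directory is used so the sketch leaves no files behind.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set: one feature, label is 1 for the larger values.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo.model")  # hypothetical file name

    # Save the trained model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back as a separate object.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same probabilities as the original.
new_samples = np.array([[1.5], [5.5]])
original_prob = model.predict_proba(new_samples)[:, 1]
restored_prob = restored.predict_proba(new_samples)[:, 1]

print("Predicted classes:", restored.predict(new_samples))
print("Same probabilities:", np.allclose(original_prob, restored_prob))
```

Two cautions apply to pickled models: only unpickle files you trust, because loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that saved it where possible.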

Running the Script

The script includes a main block that handles the command-line argument and runs the appropriate functions. It calls a show_usage() helper that is not shown in this walkthrough.

if __name__ == "__main__":
    # Expect exactly one argument: the path to the training CSV.
    if len(sys.argv) != 2:
        show_usage()
        sys.exit(-1)
    input_filename = sys.argv[1]
    operation = None
    if input_filename.endswith(".csv"):
        operation = 'csv'
    else:
        show_usage()
        sys.exit(-2)

    if operation == 'csv':
        df = load_csv_into_dataframe(input_filename)
        if df is None:
            print(f"load csv '{input_filename}' failed")
            sys.exit(-3)
        visualize_data(df)
        visualize_correlation_matrix(df)
        model = create_model(df)
        if model is None:
            sys.exit(-10)
        else:
            # Save the model next to the CSV, swapping ".csv" for ".model".
            model_filename = input_filename[0:-4] + '.model'
            save_model(model, model_filename)
    print("*** app is ended normally ***")
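The main block relies on show_usage(), whose body does not appear in this article. The sketch below is a hypothetical stand-in, not the author's implementation, together with an alternative way to derive the model file name using pathlib instead of string slicing. The function name model_filename_for and the file name heart.csv are assumptions made for illustration.

```python
import sys
from pathlib import Path

def show_usage():
    """Hypothetical helper: explain how to invoke the script."""
    script_name = Path(sys.argv[0]).name
    print(f"Usage: python {script_name} <training_data.csv>")

def model_filename_for(csv_filename):
    """Swap the .csv extension for .model (hypothetical helper)."""
    return str(Path(csv_filename).with_suffix(".model"))

# "heart.csv" is a made-up file name used only to demonstrate the helper.
print(model_filename_for("heart.csv"))
```

Using with_suffix avoids hard-coding the extension length, so the helper would keep working if the accepted input extension ever changed.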

Conclusion

This article walked through a simple yet comprehensive logistic regression implementation in Python. By understanding the steps involved in data loading, visualization, model creation, evaluation, and saving/loading, you should now have a solid foundation to apply logistic regression to your own datasets. This practical guide aims to demystify the process and encourage you to experiment and refine your models for better accuracy and reliability in your predictive analyses.

The full source code can be found on GitHub: https://github.com/HMaxF/Logistic-Regression

Note: this is my first article written with the help of ChatGPT 4o. I wrote the Python code and asked ChatGPT to write the article, then I edited it to improve the flow and to suit my preferences. I like how ChatGPT is able to write an article based on Python code.
