Table of Contents
- Introduction
- Prerequisites
- The Code Walkthrough
- Loading and Visualizing Data
- Creating and Training the Model
- Evaluating the Model
- Visualizing Performance
- Saving and Loading the Model
- Running the Script
- Conclusion
Introduction
Logistic Regression is a fundamental statistical technique in the field of machine learning and data analysis, used extensively for binary classification problems. Unlike linear regression, which predicts a continuous output, logistic regression predicts the probability of a binary outcome, such as yes/no, true/false, or success/failure. In this article, we'll walk through a practical demonstration of logistic regression using Python, emphasizing key concepts and steps involved in building, evaluating, and using the model.
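To make the "predicts a probability" idea concrete, here is a minimal sketch of the logistic (sigmoid) function that turns a weighted sum of features into a value between 0 and 1. The weights, bias, and sample values are hypothetical and chosen only for illustration; they do not come from any trained model in this article.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score z to a probability in (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(features, weights, bias):
    """Probability of the positive class for a single sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for two features (assumed values, not a fitted model).
weights = [0.8, -0.5]
bias = -0.2

p = predict_probability([1.0, 2.0], weights, bias)
label = 1 if p >= 0.5 else 0  # 0.5 is the usual default decision threshold
print(f"probability = {p:.3f}, predicted class = {label}")
```

A score of zero maps to a probability of exactly 0.5, which is why the decision boundary of the model is the set of points where the weighted sum equals zero.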
Prerequisites
To follow along with this demonstration, you'll need a basic understanding of Python and its data science libraries such as Pandas, NumPy, and Matplotlib. Familiarity with machine learning concepts will also be beneficial. Ensure you have the necessary libraries installed:
pip install pandas numpy matplotlib scikit-learn seaborn ipympl
The Code Walkthrough
Let's dive into the source code to understand the process of implementing logistic regression in Python. The script starts with the necessary imports:
# pyplot is the recommended plotting interface (pylab is discouraged)
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
    confusion_matrix,
    classification_report,
    ConfusionMatrixDisplay,
)
import pickle
Loading and Visualizing Data
The first step in any machine learning project is to load and understand your data. The function load_csv_into_dataframe loads data from a CSV file into a Pandas DataFrame.
def load_csv_into_dataframe(csv_filename):
    try:
        df = pd.read_csv(csv_filename)
        return df
    except FileNotFoundError:
        print(f"*** load_csv_into_dataframe(): '{csv_filename}' not found")
        return None

Visualization helps in understanding the distribution and relationships in the data. The functions visualize_data and visualize_correlation_matrix are used for this purpose.
def visualize_data(df):
    numerical_features = ['Age', 'Sex', 'Trestbps', 'Chol', 'Thalach', 'Oldpeak']
    plt.figure(figsize=(15, 10))
    # One histogram per feature, laid out on a 2 x 3 grid
    for i, feature in enumerate(numerical_features):
        plt.subplot(2, 3, i+1)
        plt.hist(df[feature], bins=20, edgecolor='k')
        plt.title(f'{feature} Distribution')
        plt.xlabel(feature)
        plt.ylabel('Frequency')
    plt.tight_layout()
    plt.show()

def visualize_correlation_matrix(df):
    # Pairwise correlation between all columns, drawn as an annotated heatmap
    corr_matrix = df.corr()
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', vmin=-1, vmax=1)
    plt.title('Correlation Matrix')
    plt.show()
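Each cell of the correlation heatmap is a Pearson correlation coefficient, which is the default method of DataFrame.corr(). As a rough illustration of what a single cell means, the sketch below computes the coefficient by hand for two columns. The helper name pearson and the sample values are made up for this example and are not taken from the article's dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values, for illustration only
age = [40, 50, 60, 70]
max_heart_rate = [180, 170, 150, 140]

r = pearson(age, max_heart_rate)
print(f"correlation = {r:.2f}")
```

A value close to -1 or +1 indicates a strong linear relationship between two columns, while a value near 0 indicates little linear relationship.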
Creating and Training the Model
Once the data is understood, we proceed to model creation. The create_model function is designed to train a logistic regression model.
def create_model(df):
    print("### create_model(DataFrame)")

    # Features are every column except the last; the last column is the target
    X = df.drop(df.columns[-1], axis=1)
    y = df.iloc[:, -1]  # 1-D Series, the shape scikit-learn expects for y

    # Reject the data if any value is missing or cannot be parsed as a number
    numeric_df = df.apply(pd.to_numeric, errors='coerce')
    is_all_numeric = not numeric_df.isnull().values.any()
    if not is_all_numeric:
        print("*** ERROR: the CSV data used to train the model contains non-numeric values, please fix this!")
        print(df)
        return None

    # Hold out 20% of the rows for testing; random_state makes the split repeatable
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train.values, y_train.values)
    print("Logistic Regression model is created")
    print(f"***\nCoefficients = {model.coef_}\nIntercept = {model.intercept_}\n***")

    print("\n***** Evaluate the model using the same TRAINED data *****")
    predict_data(model, X_train.values, y_train.values)

    print("\n***** Evaluate model using UNSEEN TESTING data *****")
    predict_data(model, X_test.values, y_test.values)

    return model
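The coefficients and intercept printed by create_model are the quantities that turn a feature row into a probability. The sketch below, which uses synthetic data from scikit-learn's make_classification as a stand-in for the article's CSV, reproduces predict_proba by hand for a binary model with default settings. The dataset size and feature counts are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed shape: 200 rows, 4 features)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Linear score: one coefficient per feature plus the intercept
z = X_test @ model.coef_.ravel() + model.intercept_[0]
manual_prob = 1.0 / (1.0 + np.exp(-z))  # sigmoid of the score

# Column 1 of predict_proba is the probability of the positive class
library_prob = model.predict_proba(X_test)[:, 1]
print("manual and library probabilities match:", np.allclose(manual_prob, library_prob))
```

Seen this way, a positive coefficient pushes the predicted probability up as its feature grows, and a negative coefficient pushes it down.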
Evaluating the Model
Evaluating the model is crucial to understanding its performance. The predict_data function evaluates the model using various metrics.
def predict_data(model, X, y):
    # Hard class predictions and positive-class probabilities
    y_pred = model.predict(X)
    y_pred_prob = model.predict_proba(X)[:, 1]

    accuracy = accuracy_score(y, y_pred)
    precision = precision_score(y, y_pred)
    recall = recall_score(y, y_pred)
    roc_auc = roc_auc_score(y, y_pred_prob)
    conf_matrix = confusion_matrix(y, y_pred)
    class_report = classification_report(y, y_pred)

    print(f'Accuracy: {accuracy:.2f}')
    print(f'Precision: {precision:.2f}')
    print(f'Recall: {recall:.2f}')
    print(f'ROC-AUC: {roc_auc:.2f}')
    print('Confusion Matrix:')
    print(conf_matrix)
    print('Classification Report:')
    print(class_report)

    visualize_confusion_matrix(model, conf_matrix)
    visualize_roc_curve(roc_auc, y, y_pred_prob)
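Accuracy, precision, and recall are all derived from the four counts in the confusion matrix. The following sketch computes them by hand for a small set of labels. The helper confusion_counts and the label lists are hypothetical and exist only for this illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


# Made-up labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # share of predicted positives that are real
recall = tp / (tp + fn)                     # share of real positives that were found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

For labels 0 and 1, scikit-learn's confusion_matrix arranges the same counts as [[TN, FP], [FN, TP]], with true labels on the rows and predicted labels on the columns.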
Visualizing Performance
The model’s performance is visualized with a confusion matrix and an ROC curve.
def visualize_confusion_matrix(model, conf_matrix):
    disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=model.classes_)
    disp.plot(cmap='Blues')
    plt.title('Confusion Matrix')
    plt.show()

def visualize_roc_curve(roc_auc, y, y_predicted):
    # y_predicted holds positive-class probabilities, not hard labels
    fpr, tpr, thresholds = roc_curve(y, y_predicted)
    plt.figure()
    plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    # Diagonal reference line: the performance of random guessing
    plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.show()
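An ROC curve is built by sweeping the decision threshold over the predicted probabilities and recording the false positive rate and true positive rate at each step. The sketch below does this by hand and integrates the curve with the trapezoid rule. It is a simplified illustration, not scikit-learn's implementation, and the helper names and four-sample dataset are assumptions made for this example.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists, one point per distinct threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    fpr, tpr = [0.0], [0.0]  # start at the origin (threshold above every score)
    for thr in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the curve using the trapezoid rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Made-up labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, scores)
auc = auc_trapezoid(fpr, tpr)
print("FPR:", fpr)
print("TPR:", tpr)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 corresponds to the gray diagonal in the plot (random guessing), and an AUC of 1.0 corresponds to a model that ranks every positive sample above every negative one.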
Saving and Loading the Model
To avoid retraining the model each time, it can be saved and loaded using the pickle module.
def save_model(model, filename):
    print('### save_model()')
    # The with block closes the file even if pickling fails
    with open(filename, 'wb') as f:
        pickle.dump(model, f)

def load_model(filename):
    try:
        with open(filename, 'rb') as f:
            model = pickle.load(f)
        print(f"### load_model(): '{filename}' loaded")
        return model
    except FileNotFoundError:
        print(f"### load_model(): error '{filename}' not found")
        return None
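As a quick sanity check of the save and load round trip, the sketch below trains a tiny model, pickles it to a temporary file, restores it, and confirms that both copies give the same predictions. The one-feature dataset and the file name demo.model are made up for this example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, class 1 for the larger values
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo.model")  # hypothetical file name

    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should behave exactly like the original
new_samples = np.array([[0.5], [4.5]])
same = np.array_equal(model.predict(new_samples), restored.predict(new_samples))
print("predictions match after reload:", same)
```

Two practical cautions apply: only unpickle files from sources you trust, because loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that saved it where possible.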
Running the Script
The script includes a main block to handle command-line arguments and execute the appropriate functions.
if __name__ == "__main__":
    # Note: show_usage() is not listed in this article
    if len(sys.argv) != 2:
        show_usage()
        sys.exit(-1)

    input_filename = sys.argv[1]
    operation = None
    if input_filename.endswith(".csv"):
        operation = 'csv'
    else:
        show_usage()
        sys.exit(-2)

    if operation == 'csv':
        df = load_csv_into_dataframe(input_filename)
        if df is None:
            print(f"load csv '{input_filename}' failed")
            sys.exit(-3)

        visualize_data(df)
        visualize_correlation_matrix(df)

        model = create_model(df)
        if model is None:
            sys.exit(-10)
        else:
            # Replace the '.csv' extension with '.model'
            model_filename = input_filename[0:-4] + '.model'
            save_model(model, model_filename)

    print("*** app is ended normally ***")
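The main block calls show_usage(), which is not listed in this article, and derives the model file name by slicing off the last four characters of the input name. The sketch below shows one hypothetical way to write such a usage helper, along with a pathlib-based alternative to the slice. The usage text, the helper model_filename_for, and the file name heart.csv are assumptions made for illustration.

```python
import sys
from pathlib import Path


def show_usage():
    """Hypothetical usage message; the real one may differ."""
    script_name = Path(sys.argv[0]).name
    print(f"Usage: python {script_name} <training_data.csv>")


def model_filename_for(csv_filename):
    """Swap the file extension for '.model' without counting characters."""
    return str(Path(csv_filename).with_suffix(".model"))


# 'heart.csv' is a made-up file name
print(model_filename_for("heart.csv"))
```

Using with_suffix avoids hard-coding the length of the extension, so the helper keeps working if the accepted input extensions ever change.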
Conclusion
This article walked through a simple yet comprehensive logistic regression implementation in Python. By understanding the steps involved in data loading, visualization, model creation, evaluation, and saving/loading, you should now have a solid foundation to apply logistic regression to your own datasets. This practical guide aims to demystify the process and encourage you to experiment and refine your models for better accuracy and reliability in your predictive analyses.
The full source code can be found on GitHub: https://github.com/HMaxF/Logistic-Regression
Note: this is my first article written with the help of ChatGPT 4o. I wrote the Python code and asked ChatGPT to write the article, then edited it to improve the flow and suit my preferences. I like how ChatGPT is able to write an article based on Python code.