Are Regression Models Fair Enough to Be Used in Real-World Decision Making?

Introduction
What I Learned
What I Built
Challenges and Solutions
Conclusion

Introduction

This week, I learned about data preprocessing and regression, two essential steps in building effective machine learning and data analysis models. Raw data collected from real-world sources is often messy, incomplete, or inconsistent. Before applying any model, the data must be cleaned and prepared so that meaningful patterns can be discovered.

Regression is a powerful technique used to understand relationships between variables and make predictions. By combining preprocessing with regression, I was able to transform raw data into a form that could be analyzed and used for prediction.

What I Learned

Key Concepts

Data preprocessing: Cleaning and preparing raw data for analysis
Handling missing values: Removing or filling in missing data
Feature scaling: Making sure all variables are on a similar scale
Regression: A method used to model relationships between variables and predict outcomes
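The preprocessing ideas above can be sketched on a tiny table. This is only an illustration: the column names and values are made up and are not from the project dataset, and median filling plus z-score scaling are just one reasonable combination of the options listed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame({
    "marketing_spend": [100.0, 200.0, np.nan, 400.0, 200.0],
    "team_size": [5.0, 10.0, 15.0, np.nan, 10.0],
})

# Remove exact duplicate rows (the last row repeats the second one)
df = df.drop_duplicates().reset_index(drop=True)

# Fill each missing value with the median of its own column
df = df.fillna(df.median())

# Feature scaling: subtract the column mean and divide by the population
# standard deviation (ddof=0), so every column has mean 0 and std 1
scaled = (df - df.mean()) / df.std(ddof=0)

print(scaled)
```

The median is less sensitive to outliers than the mean, which is why it is often preferred for skewed columns such as spending figures.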

Frameworks

Pandas: Used to clean, organize, and manipulate datasets
NumPy: Used for numerical operations
Scikit-learn: Used to build and evaluate regression models

Techniques Mastered

Removing duplicate and irrelevant data
Filling missing values using mean or median
Splitting data into training and testing sets
Building a simple linear regression model
Evaluating model performance using error metrics like R², Adjusted R², MSE, and RMSE
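To make the error metrics concrete, here is a minimal sketch that computes all four with NumPy. The function name regression_metrics and the sample numbers are my own for illustration; scikit-learn offers r2_score and mean_squared_error for the same purpose, while Adjusted R² usually has to be computed by hand as shown.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return R², Adjusted R², MSE and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size

    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation

    mse = ss_res / n
    rmse = np.sqrt(mse)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R² penalizes R² for the number of predictors used
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"r2": r2, "adj_r2": adj_r2, "mse": mse, "rmse": rmse}

# Made-up values, purely for illustration
y_true = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
y_pred = [11.0, 11.0, 15.0, 15.0, 19.0, 19.0]
metrics = regression_metrics(y_true, y_pred, n_features=2)
print(metrics)
```

Adjusted R² is the useful one when comparing models with different numbers of features, because plain R² never decreases when a feature is added.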

What I Built

Project Name: Startup Profit Prediction Using Multiple Linear Regression

Description:
I built a multiple linear regression project to help a startup company predict its monthly profit based on different business metrics. The goal was not only to build an accurate prediction model, but also to identify which factors have the most significant impact on profit using feature selection techniques.

Code Snippet:

# Import the libraries used below
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the dataset
startupData = pd.read_csv("Assignment-Datasets/assignment3_startup_profit.csv")
startupData.head()

# Explore the dataset (column types, missing values, summary statistics)
startupData.info()
startupData.describe()

# One-hot encode Location; drop_first=True drops one category to avoid the dummy variable trap
startupData = pd.get_dummies(startupData, columns=['Location'], drop_first=True)

# Separate the features (X) from the target (y)
X = startupData.drop('Profit', axis=1)
y = startupData['Profit']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit a multiple linear regression on the training set and predict on the test set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Key Features of the Project

Cleaned and prepared the dataset for modeling
Encoded the categorical location variable and avoided the dummy variable trap
Split the dataset into training (80%) and testing (20%) sets
Built an initial multiple linear regression model using all features
Evaluated the model using R², Adjusted R², MSE, and RMSE
Applied backward elimination to remove statistically insignificant features
Rebuilt and evaluated an optimized regression model
Created visualizations to compare predictions and analyze residuals

Technical Discussion

The initial regression model was trained on all available features to establish a baseline. To improve it, I applied backward elimination, a feature selection technique that repeatedly refits the model and drops the feature with the highest p-value until every remaining feature has a p-value of 0.05 or less.

By iteratively eliminating insignificant features and rebuilding the model, I was able to improve interpretability while maintaining or improving predictive performance. This approach helped identify the most influential business factors affecting profit, making the model more useful for real-world decision-making.

Challenges and Solutions

Handling Categorical Data (Location Variable)

Challenge:
The dataset included a categorical variable, Location, which could not be used directly in a regression model.

Solution:
I applied one-hot encoding with pandas get_dummies to convert the categorical variable into numerical dummy variables. To avoid the dummy variable trap, I set drop_first=True so that one category was dropped, which keeps the dummy columns from being perfectly collinear with the model's intercept.
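A small sketch shows why one category has to go. The city names and profit figures here are hypothetical stand-ins, not values from the project dataset; the encoding call mirrors the pandas get_dummies call used in the code snippet earlier.

```python
import pandas as pd

# Hypothetical rows; the real Location values may differ
df = pd.DataFrame({
    "Location": ["Lagos", "Abuja", "Kano", "Lagos"],
    "Profit": [120.0, 95.0, 80.0, 130.0],
})

# Keeping every category: the dummy columns always add up to 1 in each row,
# which duplicates the information carried by the regression intercept
full = pd.get_dummies(df, columns=["Location"])
dummy_cols = [c for c in full.columns if c.startswith("Location_")]
row_sums = full[dummy_cols].sum(axis=1)
print(row_sums.tolist())

# Dropping the first category removes that redundancy; the dropped
# category becomes the baseline the other coefficients are compared to
reduced = pd.get_dummies(df, columns=["Location"], drop_first=True)
print(list(reduced.columns))
```

With the first category dropped, each remaining dummy coefficient reads as the difference in expected profit relative to the baseline location.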

Selecting Relevant Features

Challenge:
Using all available features initially made the model harder to interpret and included variables that did not significantly affect profit.

Solution:
I used backward elimination with p-values from OLS regression to systematically remove features with p-values greater than 0.05. This ensured only statistically significant predictors were retained.

Conclusion

Learning about data preprocessing and regression gave me a deeper understanding of how data becomes useful in real-world applications. Preparing data correctly is just as important as choosing the right model. By cleaning data and applying regression techniques, I gained practical experience that will help me build more accurate and reliable data-driven solutions in the future.
