Are Regression Models Fair Enough to Be Used in Real-World Decision Making?


Currently learning and developing my skills in data science.

Table of Contents

  1. Introduction

  2. What I Learned

  3. What I Built

  4. Challenges and Solutions

  5. Conclusion

Introduction

This week, I learned about data preprocessing and regression, two essential steps in building effective machine learning and data analysis models. Raw data collected from real-world sources is often messy, incomplete, or inconsistent. Before applying any model, the data must be cleaned and prepared so that meaningful patterns can be discovered.

Regression is a powerful technique used to understand relationships between variables and make predictions. By combining preprocessing with regression, I was able to transform raw data into a form that could be analyzed and used for prediction.

What I Learned

Key Concepts

  • Data preprocessing: Cleaning and preparing raw data for analysis

  • Handling missing values: Removing or filling in missing data

  • Feature scaling: Making sure all variables are on a similar scale

  • Regression: A method used to model relationships between variables and predict outcomes
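As a minimal sketch of the missing-value and scaling concepts above, the example below fills gaps with each column's median and then standardizes the columns. The toy DataFrame and its column names (`spend`, `staff`) are hypothetical and not taken from the project dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "spend": [100.0, 200.0, np.nan, 400.0],
    "staff": [1.0, 2.0, 3.0, np.nan],
})

# Fill each missing value with the median of its own column.
filled = df.fillna(df.median())

# Standardize every column to zero mean and unit (population) standard deviation.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(filled),
    columns=filled.columns,
)

print(filled)
print(scaled)
```

In a real workflow the scaler would usually be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into preprocessing.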

Frameworks

  • Pandas: Used to clean, organize, and manipulate datasets

  • NumPy: Used for numerical operations

  • Scikit-learn: Used to build and evaluate regression models

Techniques Mastered

  • Removing duplicate and irrelevant data

  • Filling missing values using mean or median

  • Splitting data into training and testing sets

  • Building a simple linear regression model

  • Evaluating model performance using error metrics like R², Adjusted R², MSE, and RMSE
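The four error metrics listed above can be computed along these lines. This is a sketch under stated assumptions: `regression_report` is a hypothetical helper name, the numbers are made up for illustration, and Adjusted R² is calculated by hand from R² because scikit-learn does not provide it directly.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features):
    """Return R², Adjusted R², MSE, and RMSE for a set of predictions."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    # Adjusted R² penalizes R² for the number of predictors in the model.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mse = mean_squared_error(y_true, y_pred)
    return {"r2": r2, "adj_r2": adj_r2, "mse": mse, "rmse": np.sqrt(mse)}

# Made-up values purely for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.5, 11.5, 13.0])
report = regression_report(y_true, y_pred, n_features=2)
print(report)
```

Adjusted R² is the more useful of the two R² variants when comparing models with different numbers of features, since plain R² never decreases when a feature is added.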

What I Built

Project Name: Startup Profit Prediction Using Multiple Linear Regression

Description:
I built a multiple linear regression project to help a startup company predict its monthly profit based on different business metrics. The goal was not only to build an accurate prediction model, but also to identify which factors have the most significant impact on profit using feature selection techniques.

Code Snippet:

# Imports
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the dataset and preview the first rows
startupData = pd.read_csv("Assignment-Datasets/assignment3_startup_profit.csv")
startupData.head()

# Explore column types, missing values, and summary statistics
startupData.info()
startupData.describe()

# One-hot encode Location; drop_first=True drops one dummy column to avoid the dummy variable trap
startupData = pd.get_dummies(startupData, columns=['Location'], drop_first=True)

# Separate the features from the target
X = startupData.drop('Profit', axis=1)
y = startupData['Profit']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit a multiple linear regression model and predict on the test set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Key Features of the Project

  • Cleaned and prepared the dataset for modeling

  • Encoded the categorical location variable and avoided the dummy variable trap

  • Split the dataset into training (80%) and testing (20%) sets

  • Built an initial multiple linear regression model using all features

  • Evaluated the model using R², Adjusted R², MSE, and RMSE

  • Applied backward elimination to remove statistically insignificant features

  • Rebuilt and evaluated an optimized regression model

  • Created visualizations to compare predictions and analyze residuals
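The prediction and residual visualizations mentioned in the list could look roughly like the sketch below. It uses synthetic data as a stand-in for the startup dataset, so the coefficients, sample size, and plot layout are all assumptions rather than the project's actual figures.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred

fig, (ax_pred, ax_res) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: actual vs. predicted; points near the dashed diagonal are good predictions.
ax_pred.scatter(y_test, y_pred)
limits = [y_test.min(), y_test.max()]
ax_pred.plot(limits, limits, linestyle="--")
ax_pred.set_xlabel("Actual")
ax_pred.set_ylabel("Predicted")

# Right panel: residuals vs. predicted; a random band around zero suggests a reasonable linear fit.
ax_res.scatter(y_pred, residuals)
ax_res.axhline(0, linestyle="--")
ax_res.set_xlabel("Predicted")
ax_res.set_ylabel("Residual")

fig.tight_layout()
```

Calling `fig.savefig(...)` or switching to an interactive backend and calling `plt.show()` would display the two panels. A visible curve or funnel shape in the residual panel would hint that a linear model is missing some structure in the data.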

Technical Discussion

The initial regression model was trained on all available features to establish a baseline. To improve the model, I applied backward elimination, a feature selection technique that refits the model repeatedly and, at each step, removes the feature with the highest p-value until every remaining feature has a p-value of 0.05 or lower.

By iteratively eliminating insignificant features and rebuilding the model, I was able to improve interpretability while maintaining or improving predictive performance. This approach helped identify the most influential business factors affecting profit, making the model more useful for real-world decision-making.

Challenges and Solutions

  1. Handling Categorical Data (Location Variable)

Challenge:
The dataset included a categorical variable, Location, which could not be used directly in a regression model.

Solution:
I one-hot encoded the variable with pandas' get_dummies, which converts each category into a numerical dummy variable. To avoid the dummy variable trap, I dropped one category (drop_first=True), so the remaining dummy columns are not perfectly collinear with the model's intercept.
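The dummy variable trap can be made concrete by checking the rank of the design matrix. In this small sketch the location labels `A`, `B`, and `C` are hypothetical, and `design_rank` is an illustrative helper rather than part of the project code.

```python
import numpy as np
import pandas as pd

# Hypothetical location labels, used only to illustrate the encoding.
df = pd.DataFrame({"Location": ["A", "B", "C", "A", "B", "C"]})

# Keep every dummy column vs. drop the first one.
full = pd.get_dummies(df, columns=["Location"], dtype=int)
reduced = pd.get_dummies(df, columns=["Location"], drop_first=True, dtype=int)

def design_rank(dummies):
    """Return (rank, number of columns) of the design matrix with an intercept."""
    design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return np.linalg.matrix_rank(design), design.shape[1]

print(design_rank(full))     # rank is lower than the column count: redundant column
print(design_rank(reduced))  # rank equals the column count: no redundancy
```

With all three dummies present, the columns sum to the intercept column, so one of them carries no extra information. Dropping one category removes that redundancy, and the dropped category becomes the baseline that the other coefficients are measured against.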

  2. Selecting Relevant Features

Challenge:
Using all available features initially made the model harder to interpret and included variables that did not significantly affect profit.

Solution:
I used backward elimination with p-values from OLS regression to systematically remove features with p-values greater than 0.05. This ensured only statistically significant predictors were retained.

Conclusion

Learning about data preprocessing and regression gave me a deeper understanding of how data becomes useful in real-world applications. Preparing data correctly is just as important as choosing the right model. By cleaning data and applying regression techniques, I gained practical experience that will help me build more accurate and reliable data-driven solutions in the future.