Are Regression Models Fair Enough to Be Used in Real-World Decision Making?

Table of Contents
Introduction
What I Learned
What I Built
Challenges and Solutions
Conclusion
Introduction
This week, I learned about data preprocessing and regression, two essential steps in building effective machine learning and data analysis models. Raw data collected from real-world sources is often messy, incomplete, or inconsistent. Before applying any model, the data must be cleaned and prepared so that meaningful patterns can be discovered.
Regression is a powerful technique used to understand relationships between variables and make predictions. By combining preprocessing with regression, I was able to transform raw data into a form that could be analyzed and used for prediction.
What I Learned
Key Concepts
Data preprocessing: Cleaning and preparing raw data for analysis
Handling missing values: Removing or filling in missing data
Feature scaling: Making sure all variables are on a similar scale
Regression: A method used to model relationships between variables and predict outcomes
Frameworks
Pandas: Used to clean, organize, and manipulate datasets
NumPy: Used for numerical operations
Scikit-learn: Used to build and evaluate regression models
Techniques Mastered
Removing duplicate and irrelevant data
Filling missing values using mean or median
Splitting data into training and testing sets
Building a simple linear regression model
Evaluating model performance using metrics such as R², Adjusted R², MSE, and RMSE
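The preprocessing techniques listed above can be sketched on a tiny, made-up dataset. This is only an illustration under assumptions: the column names `spend` and `staff` and all values are hypothetical and are not taken from the project data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one duplicated row and two missing values.
raw = pd.DataFrame({
    "spend": [10.0, 10.0, np.nan, 40.0, 50.0],
    "staff": [1.0, 1.0, 3.0, np.nan, 5.0],
})

# Remove exact duplicate rows.
deduped = raw.drop_duplicates().reset_index(drop=True)

# Fill each column's missing values with that column's median.
filled = deduped.fillna(deduped.median())

# Feature scaling: standardize each column to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(filled),
    columns=filled.columns,
)
```

In a real workflow the scaler would usually be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into the model.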
What I Built
Project Name: Startup Profit Prediction Using Multiple Linear Regression
Description:
I built a multiple linear regression project to help a startup company predict its monthly profit based on different business metrics. The goal was not only to build an accurate prediction model, but also to identify which factors have the most significant impact on profit using feature selection techniques.
Code Snippet:
# Imports
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Load the dataset
startupData = pd.read_csv("Assignment-Datasets/assignment3_startup_profit.csv")
startupData.head()
# Explore the dataset: column types, missing values, summary statistics
startupData.info()
startupData.describe()
# Encode the categorical Location column; drop_first=True avoids the dummy variable trap
startupData = pd.get_dummies(startupData, columns=['Location'], drop_first=True)
# Separate the features from the target
X = startupData.drop('Profit', axis=1)
y = startupData['Profit']
# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fit a multiple linear regression model and predict on the test set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Key Features of the Project
Cleaned and prepared the dataset for modeling
Encoded the categorical location variable and avoided the dummy variable trap
Split the dataset into training (80%) and testing (20%) sets
Built an initial multiple linear regression model using all features
Evaluated the model using R², Adjusted R², MSE, and RMSE
Applied backward elimination to remove statistically insignificant features
Rebuilt and evaluated an optimized regression model
Created visualizations to compare predictions and analyze residuals
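The evaluation step in the list above could look roughly like the sketch below. The helper name `regression_report` and the example values are hypothetical. Adjusted R² is computed by hand as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def regression_report(y_true, y_pred, n_features):
    """Return R², adjusted R², MSE and RMSE for one set of predictions."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    # Adjusted R² penalizes extra predictors; it requires n > n_features + 1.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mse = mean_squared_error(y_true, y_pred)
    return {"r2": r2, "adj_r2": adj_r2, "mse": mse, "rmse": float(np.sqrt(mse))}


# Hypothetical values, used only to show the call.
report = regression_report(
    [3.0, 5.0, 7.0, 9.0, 11.0],
    [2.5, 5.5, 7.0, 8.0, 11.0],
    n_features=2,
)
```

With the project's variables, the call would presumably be `regression_report(y_test, y_pred, X_test.shape[1])`.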
Technical Discussion
The initial regression model was trained on all available features to establish a baseline. To improve on it, I applied backward elimination, a feature selection technique that fits the model, removes the feature with the highest p-value if that p-value is greater than 0.05, and refits until every remaining feature is statistically significant.
By iteratively eliminating insignificant features and rebuilding the model, I was able to improve interpretability while maintaining or improving predictive performance. This approach helped identify the most influential business factors affecting profit, making the model more useful for real-world decision-making.
Challenges and Solutions
- Handling Categorical Data (Location Variable)
Challenge:
The dataset included a categorical variable, Location, which could not be used directly in a regression model.
Solution:
I applied one-hot encoding with pandas' get_dummies to convert the categorical variable into numerical dummy variables. To avoid the dummy variable trap, I set drop_first=True so that one category was dropped, which prevents perfect multicollinearity between the dummy columns and the intercept.
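The effect of dropping one category can be seen on a small example that uses `pd.get_dummies` in the same way as the code snippet earlier in this post. The location names here are made up for illustration.

```python
import pandas as pd

# Hypothetical locations, not the values from the project dataset.
df = pd.DataFrame({"Location": ["Austin", "Boston", "Chicago", "Boston"]})

# Full encoding: one column per category.
full = pd.get_dummies(df, columns=["Location"], dtype=int)

# Reduced encoding: the first category (Austin) becomes the baseline.
reduced = pd.get_dummies(df, columns=["Location"], drop_first=True, dtype=int)

# In the full encoding every row sums to 1, which duplicates the
# information carried by the regression intercept.
row_sums = full.sum(axis=1)
```

In the reduced encoding, a row with zeros in every dummy column represents the baseline category, so no information is lost.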
- Selecting Relevant Features
Challenge:
Using all available features initially made the model harder to interpret and included variables that did not significantly affect profit.
Solution:
I used backward elimination with p-values from OLS regression to systematically remove features with p-values greater than 0.05. This ensured only statistically significant predictors were retained.
Conclusion
Learning about data preprocessing and regression gave me a deeper understanding of how data becomes useful in real-world applications. Preparing data correctly is just as important as choosing the right model. By cleaning data and applying regression techniques, I gained practical experience that will help me build more accurate and reliable data-driven solutions in the future.



