April 24, 2025
This post provides an overview of important regression models in machine learning. For this read, I’ll give a basic rundown of the following models:

- Simple Linear Regression (SLR)
- Multiple Linear Regression (MLR)
- Polynomial Regression
- Support Vector Regression (SVR)
- Decision Tree Regression
- Random Forest Regression
This post is only about the theory behind the models: a short description of how each model works, together with examples on a dataset. Let's get started •ᴗ•
Simple Linear Regression (SLR) is a supervised machine learning algorithm. With supervised learning, the data provides examples where the answer is already known.
Let’s say, for example, that you have a dataset of students with the total hours studied for a test and their test scores. The "answer" in this case could be the score the student received. You want a simple way to predict a student's score, and that is exactly what SLR gives you.
Now, I’ll explain some terms you might hear often:

- Dependent variable (Y): the value you want to predict, also called the label (here, the test score).
- Independent variable (X): the value you use to make the prediction, also called the feature (here, the hours studied).
- β₀ (intercept): the predicted value of Y when X is 0, i.e. the point where the line crosses the y-axis.
- β₁ (slope): how much Y changes for every one-unit increase in X.

The formula for SLR looks like this: y = β₀ + β₁x
As someone who is also still learning, I can imagine this being quite overwhelming. Since I don't have any background in math or statistics, it can be quite hard to get a grasp of the notation. But an example might help us get a better feeling for how SLR works.
Let’s say we have a dataset of students’ study hours and their test scores. What is the dependent variable and independent variable? Yes, well done! The independent variable (feature) is the hours studied (X). The dependent variable (label) is the test scores (Y).
| Hours Studied (X) | Test Score (Y) |
|---|---|
| 1 | 50 |
| 2 | 55 |
| 3 | 65 |
| 4 | 70 |
For this data, the fitted formula (y = β₀ + β₁x) turns out to be:

Test Score = 42.5 + 7 × Hours Studied
Now, let’s say we have a student who studied for 5 hours. What do you think their test score would be? Perfect ·‿· ! The formula gives: Y = 42.5 + 7 × 5 = 77.5, so the predicted score is 77.5. You can also see that for every extra hour studied, the predicted score increases by 7 points.
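If you want to see this in code, here's a minimal sketch using scikit-learn (the library choice and variable names are mine; any linear regression implementation would give the same fit):

```python
# A minimal sketch: fit SLR on the toy table above and predict for 5 hours.
import numpy as np
from sklearn.linear_model import LinearRegression

# X must be 2D: one row per student, one column per feature.
X = np.array([[1], [2], [3], [4]])   # hours studied
y = np.array([50, 55, 65, 70])       # test scores

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)        # β₀ ≈ 42.5
print(model.coef_[0])          # β₁ ≈ 7.0
print(model.predict([[5]]))    # ≈ 77.5 for a student who studied 5 hours
```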
Before we continue, I want to talk a bit about errors within linear regression. When building a linear regression model, we are trying to draw the best possible line through all data points — in other words, to minimize the errors.
We calculate an error like this:
Error = (Actual value) - (Predicted value)
Example:
If you overpredict (for example, predicting 90 when the real score is 85), the error is negative (85 - 90 = -5). If you underpredict, the error is positive.
Now, if you simply sum up all the errors, the positive and negative values can cancel each other out, which is not helpful. To solve this, we square each error before summing them. Linear regression finds the line that minimizes the sum of all squared errors. This method is called Ordinary Least Squares (OLS).
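To make this concrete, here's a small numpy sketch that reuses the fitted line from the example above:

```python
# Compute errors, show why summing raw errors is useless, and sum the squares instead.
import numpy as np

hours = np.array([1, 2, 3, 4])
actual = np.array([50, 55, 65, 70])
predicted = 42.5 + 7 * hours            # predictions from the fitted line

errors = actual - predicted             # positive = underprediction, negative = overprediction
print(errors)                           # [ 0.5 -1.5  1.5 -0.5]
print(errors.sum())                     # 0.0 -> positives and negatives cancel out
print((errors ** 2).sum())              # 5.0 -> the sum of squared errors that OLS minimizes
```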
SLR only uses one independent variable (X: hours studied). With Multiple Linear Regression (MLR), you use multiple independent variables to predict the outcome (Y).
Remember our SLR equation: y = β₀ + β₁x. The idea is the same, but now with multiple independent variables.
Putting this together we get:
y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ... + βₙXₙ
Breaking it down:

- β₀ is still the intercept: the predicted value of Y when all features are 0.
- β₁, β₂, ..., βₙ are the coefficients: how much Y changes when that feature increases by one unit (holding the others fixed).
- X₁, X₂, ..., Xₙ are the independent variables (features).
Let’s again use an example — that always works best for me, and maybe for you too •‿•
We go back to our students and test scores!
Before, we predicted the test score based only on hours studied.
But now, we add other factors like hours slept and number of classes attended:
| Hours Studied (X₁) | Hours Slept (X₂) | Classes Attended (X₃) | Test Score (Y) |
|---|---|---|---|
| 1 | 7 | 4 | 71 |
| 3 | 6 | 2 | 65 |
| 2 | 4 | 1 | 45 |
The new features are Hours Slept (X₂) and Classes Attended (X₃); Hours Studied (X₁) stays the same as before.
Suppose the model creates this equation: Test Score (Y) = 30 + 5(Hours Studied) + 3(Hours Slept) + 2(Classes Attended)
Meaning:

- The baseline score (with all features at 0) is 30 points.
- Every extra hour studied adds 5 points.
- Every extra hour of sleep adds 3 points.
- Every extra class attended adds 2 points.
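Here's a small sketch of how you'd use that (made-up) equation, and how the same fit looks in scikit-learn; the new student's numbers below are invented for illustration:

```python
# Plug a hypothetical new student into the example equation, then fit MLR in scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Y = 30 + 5*(hours studied) + 3*(hours slept) + 2*(classes attended)
hours_studied, hours_slept, classes_attended = 2, 8, 3
score = 30 + 5 * hours_studied + 3 * hours_slept + 2 * classes_attended
print(score)  # 70

# Fitting works exactly like SLR, just with one column per feature.
# (A real dataset would have far more than three rows.)
X = np.array([[1, 7, 4],
              [3, 6, 2],
              [2, 4, 1]])              # hours studied, hours slept, classes attended
y = np.array([71, 65, 45])
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # one intercept, one coefficient per feature
```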
One important concept in MLR is overfitting. If you have a huge dataset with lots of features, and you include all of them, the model might just memorize the training data, but perform badly on new, unseen data.
It’s like only studying past exam papers. You memorize old questions, but when the real exam asks something new, you struggle! Overfitting = memorizing the training data instead of learning real patterns.
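Here's a toy sketch of what that looks like in practice, using synthetic data where only one of twenty features actually matters; all numbers are made up:

```python
# Many noisy features + few samples: great score on seen data, poor score on unseen data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))            # 30 students, 20 features (mostly noise)
y = 5 * X[:, 0] + rng.normal(size=30)    # only the first feature really matters

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print(model.score(X_train, y_train))   # near-perfect R² on the data it memorized
print(model.score(X_test, y_test))     # noticeably worse on new, unseen data
```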
So far, we’ve assumed relationships to be linear, meaning you can predict the outcome with a straight line. But sometimes, a straight line just isn’t good enough. Polynomial Regression allows the model to fit a curve instead.
New concepts:

- Degree: the highest power of X in the equation; it controls how many bends the curve can have.
- Polynomial terms: X², X³, and so on, which allow the model to bend instead of staying a straight line.
So when should I use Polynomial Regression instead of MLR? When the relationship between X and Y is curved rather than straight. The equation then becomes:
Y = β₀ + β₁X + β₂X² + ... + βₙXⁿ
Suppose we study the effect of fertilizer (X) on crop yield (Y).
If you plot the data, you get a ∩-shaped curve: yield goes up with more fertilizer, peaks, and then drops when you over-fertilize. A straight line won't fit this well, but a degree-2 polynomial (a parabola) can fit it nicely.
Thus, our formula would look like: Yield (Y) = β₀ + β₁ × Fertilizer + β₂ × Fertilizer²
The challenge is picking the right degree to avoid overfitting or underfitting.
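A quick sketch of a degree-2 fit with scikit-learn; the fertilizer and yield numbers are made up purely to produce a ∩ shape:

```python
# Degree-2 polynomial regression: PolynomialFeatures adds the Fertilizer² column,
# LinearRegression then fits β₀, β₁ and β₂.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

fertilizer = np.array([[1], [2], [3], [4], [5]])
crop_yield = np.array([10, 18, 22, 19, 11])   # rises, peaks, then falls

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(fertilizer, crop_yield)
print(model.predict([[3.5]]))                 # a prediction near the top of the curve
```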
Support Vector Regression (SVR) tries to fit a curve (often a line) within a certain margin of error. This margin is called epsilon (ϵ). Imagine a "tube" around the line: points inside the tube are fine and carry no penalty (small errors), while points outside the tube are penalized (big errors).
Some important terms:

- Epsilon (ϵ): the width of the tube, i.e. how much error is acceptable before a point gets penalized.
- Support vectors: the data points on or outside the tube; they are the points that actually determine where the line ends up.
Say we predict a test score of 85. If the real score is 86, that's within a margin of 2 (ϵ=2), no problem. If the real score is 90, that's 5 points away, penalized!
In short, SVR tries to fit a tube of width ϵ around the best fit line, allowing small errors inside but punishing bigger ones outside.
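A minimal SVR sketch on the study-hours data from earlier; the C and epsilon values are just illustrative:

```python
# SVR with a linear kernel and an epsilon "tube" of width 2.
import numpy as np
from sklearn.svm import SVR

X = np.array([[1], [2], [3], [4]])   # hours studied
y = np.array([50, 55, 65, 70])       # test scores

# epsilon=2: errors of up to 2 points fall inside the tube and carry no penalty.
model = SVR(kernel="linear", C=100, epsilon=2)
model.fit(X, y)

print(model.predict([[5]]))       # predicted score for 5 hours of study
print(model.support_vectors_)     # the points on or outside the tube that define the fit
```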
A decision tree is like a flowchart. Each node in the tree asks a yes/no question. Based on the answer, the data gets split into two branches. This continues until we reach a leaf node (the final predicted number).
Variance measures how spread out numbers are: [55, 56, 57] = low variance (numbers close together); [40, 60, 80] = high variance (numbers spread out).
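You can check those two examples quickly with numpy:

```python
import numpy as np
print(np.var([55, 56, 57]))   # ≈ 0.67   -> low variance
print(np.var([40, 60, 80]))   # ≈ 266.67 -> high variance
```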
Decision Trees split the data to lower the variance after each split, trying to make groups as tight as possible.
We also call this minimizing the Mean Squared Error (MSE).
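Here's a minimal regression-tree sketch on the study-hours toy data; the depth limit is arbitrary, just to keep the flowchart small:

```python
# A small regression tree: print the yes/no questions it learned, then predict.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X = np.array([[1], [2], [3], [4]])   # hours studied
y = np.array([50, 55, 65, 70])       # test scores

tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, y)

print(export_text(tree, feature_names=["hours_studied"]))  # the flowchart of splits
print(tree.predict([[5]]))   # the prediction is the value of the leaf we end up in
```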
Instead of building one decision tree, we build many and average their results.
Random Forest helps reduce overfitting that a single decision tree can suffer from.
Imagine asking just one teacher to predict students' test scores, that's a decision tree.
But imagine asking hundreds of teachers, each with different groups of students, then averaging all their answers. That’s Random Forest!
Let's use a well-known idea to understand it better: imagine 100 students guessing how many jellybeans are inside a jar. Each student might guess wrong individually; some guess too high, others too low. But if you take the average of all their guesses, the result is surprisingly close to the real number! (The Wisdom of Crowds)
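Here's a sketch of both ideas side by side; the jar count, the guesses, and the forest settings are all made up:

```python
# 1) Wisdom of the crowd: average many noisy guesses.
# 2) Random forest: average many decision trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# 100 students guess the number of jellybeans in a jar that actually holds 500.
guesses = 500 + rng.normal(0, 80, size=100)   # individual guesses are often far off
print(guesses.mean())                         # the average lands close to 500

# A random forest applies the same trick to trees: many trees, one averaged answer.
X = np.array([[1], [2], [3], [4]])   # hours studied
y = np.array([50, 55, 65, 70])       # test scores
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict([[5]]))          # the average of 100 trees' predictions
```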
To wrap things up, here's a short summary of each model:

Simple Linear Regression (SLR): Predicts a dependent variable (Y) using a single independent variable (X). It is represented by a straight line. Example: predicting test scores based on study hours.
Multiple Linear Regression (MLR): Uses multiple independent variables (X₁, X₂, X₃) to predict a dependent variable (Y). Example: predicting test scores based on study hours, sleep, and class attendance.
Polynomial Regression: Used when the relationship between variables is non-linear. It fits a curve to the data rather than a straight line. Example: predicting crop yield based on fertilizer, where yield first rises and then falls (a ∩-shaped relationship).
Support Vector Regression (SVR): Fits a model within a "tube" defined by a margin of error (epsilon). SVR aims to keep as many points as possible inside this margin and only penalizes the ones that fall outside.
Decision Tree Regression: Uses a flowchart structure to make decisions and split data at each node, minimizing variance at each step. The final prediction comes from the leaf nodes.
Random Forest Regression: Builds multiple decision trees and averages their predictions, which helps reduce overfitting.
Each model is suited for different types of data and relationships. Linear models work well for simple, linear relationships, while decision trees and random forests handle more complex, non-linear patterns.
Finally, a short quiz to test yourself:

1. What is the main goal of regression models?
To predict a dependent variable (Y) based on independent variables (X)
2. In Simple Linear Regression, what does β₀ represent?
The point where the line crosses the y-axis (intercept)
3. What is the dependent variable in the example where study hours (X) predict test scores (Y)?
Test scores
4. In Multiple Linear Regression, how many independent variables are used to make a prediction?
Two or more
5. What does "overfitting" mean in machine learning models?
The model memorizes the training data and doesn't work well with new data
6. What is the primary difference between Linear Regression and Polynomial Regression?
Linear Regression fits a straight line, while Polynomial Regression fits a curve
7. What does the degree in Polynomial Regression control?
The number of bends in the curve
8. In Support Vector Regression (SVR), what does the margin of error (epsilon ϵ) represent?
The acceptable level of error before penalizing predictions
9. What is the function of a decision tree in Regression?
It splits data based on yes/no questions to minimize variance
10. How does Random Forest Regression improve over Decision Tree Regression?
By combining predictions from multiple trees to reduce overfitting
11. What does a decision tree aim to minimize during its splits?
Variance or Mean Squared Error (MSE)
12. What is the role of "support vectors" in Support Vector Regression (SVR)?
They are the data points that define the boundaries of the error margin
13. What type of data relationship is best suited for Polynomial Regression?
Curved or non-linear relationship