April 24, 2025
This post provides an overview of important regression models in machine learning. For this read, I’ll give a basic rundown of the following models:

- Simple Linear Regression (SLR)
- Multiple Linear Regression (MLR)
- Polynomial Regression
- Support Vector Regression (SVR)
- Decision Tree Regression
- Random Forest Regression
This post is only about the theory behind the models: a short description of how each model works, together with examples on a dataset. Let's get started •ᴗ•
Simple Linear Regression (SLR) is a supervised machine learning algorithm. With supervised learning, the data provides examples where the answer is already known.
Let’s say, for example, that you have a dataset of students with the total hours studied for a test and their test scores. The "answer" in this case could be the score the student received. You want a simple way to predict a student's score, and that is exactly what SLR gives you.
Now, I’ll explain some terms you might hear often:

- Dependent variable (Y): the value you want to predict, also called the label (here, the test score).
- Independent variable (X): the value you use to make the prediction, also called the feature (here, the hours studied).
- β₀ (intercept): the predicted value of Y when X is 0, i.e. the point where the line crosses the y-axis.
- β₁ (slope): how much Y changes for every one-unit increase in X.

The formula for SLR looks like this: y = β₀ + β₁x
As someone who is also still learning, I can imagine this being quite overwhelming. Since I don't have any background in math or statistics, it can be quite hard to get a grasp of the notation. But an example might help us get a better feeling for how SLR works.
Let’s say we have a dataset of students’ study hours and their test scores. What is the dependent variable and independent variable? Yes, well done! The independent variable (feature) is the hours studied (X). The dependent variable (label) is the test scores (Y).
| Hours Studied (X) | Test Score (Y) |
|---|---|
| 1 | 50 |
| 2 | 55 |
| 3 | 65 |
| 4 | 70 |
For this data, the fitted formula (y = β₀ + β₁x) turns out to be:

Test Score = 42.5 + 7 × Hours Studied
Now, let’s say we have a student who studied for 5 hours. What do you think their test score would be? Perfect ·‿· ! The formula gives: Y = 42.5 + 7 × 5 = 77.5, so the predicted score is 77.5. You can also see that for every extra hour studied, the predicted score increases by 7 points.
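If you want to see this in code, here's a minimal sketch using scikit-learn (the library choice and variable names are mine; any linear regression implementation would give the same fit):

```python
# A minimal sketch: fit SLR on the toy table above and predict for 5 hours.
import numpy as np
from sklearn.linear_model import LinearRegression

# X must be 2D: one row per student, one column per feature.
X = np.array([[1], [2], [3], [4]])   # hours studied
y = np.array([50, 55, 65, 70])       # test scores

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)        # β₀ ≈ 42.5
print(model.coef_[0])          # β₁ ≈ 7.0
print(model.predict([[5]]))    # ≈ 77.5 for a student who studied 5 hours
```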
Before we continue, I want to talk a bit about errors within linear regression. When building a linear regression model, we are trying to draw the best possible line through all data points — in other words, to minimize the errors.
We calculate an error like this:
Error = (Actual value) - (Predicted value)
Example:
If you overpredict (for example, predicting 90 when the real score is 85), the error is negative (85 - 90 = -5). If you underpredict, the error is positive.
Now, if you simply sum up all the errors, the positive and negative values can cancel each other out, which is not helpful. To solve this, we square each error before summing them. Linear regression finds the line that minimizes the sum of all squared errors. This method is called Ordinary Least Squares (OLS).
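To make this concrete, here's a small numpy sketch that reuses the fitted line from the example above:

```python
# Compute errors, show why summing raw errors is useless, and sum the squares instead.
import numpy as np

hours = np.array([1, 2, 3, 4])
actual = np.array([50, 55, 65, 70])
predicted = 42.5 + 7 * hours            # predictions from the fitted line

errors = actual - predicted             # positive = underprediction, negative = overprediction
print(errors)                           # [ 0.5 -1.5  1.5 -0.5]
print(errors.sum())                     # 0.0 -> positives and negatives cancel out
print((errors ** 2).sum())              # 5.0 -> the sum of squared errors that OLS minimizes
```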
SLR only uses one independent variable (X: hours studied). With Multiple Linear Regression (MLR), you use multiple independent variables to predict the outcome (Y).
Remember our SLR equation: y = β₀ + β₁x. The idea is the same, but now with multiple independent variables.
Putting this together we get:
y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ... + βₙXₙ
Breaking it down:

- β₀ is still the intercept: the predicted value of Y when all features are 0.
- β₁, β₂, ..., βₙ are the coefficients: how much Y changes when that feature increases by one unit (holding the others fixed).
- X₁, X₂, ..., Xₙ are the independent variables (features).
Let’s again use an example — that always works best for me, and maybe for you too •‿•
We go back to our students and test scores!
Before, we predicted the test score based only on hours studied.
But now, we add other factors like hours slept and number of classes attended:
| Hours Studied (X₁) | Hours Slept (X₂) | Classes Attended (X₃) | Test Score (Y) |
|---|---|---|---|
| 1 | 7 | 4 | 71 |
| 3 | 6 | 2 | 65 |
| 2 | 4 | 1 | 45 |
The new features are Hours Slept (X₂) and Classes Attended (X₃); Hours Studied (X₁) stays the same as before.
Suppose the model creates this equation: Test Score (Y) = 30 + 5(Hours Studied) + 3(Hours Slept) + 2(Classes Attended)
Meaning:

- The baseline score (with all features at 0) is 30 points.
- Every extra hour studied adds 5 points.
- Every extra hour of sleep adds 3 points.
- Every extra class attended adds 2 points.
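Here's a small sketch of how you'd use that (made-up) equation, and how the same fit looks in scikit-learn; the new student's numbers below are invented for illustration:

```python
# Plug a hypothetical new student into the example equation, then fit MLR in scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Y = 30 + 5*(hours studied) + 3*(hours slept) + 2*(classes attended)
hours_studied, hours_slept, classes_attended = 2, 8, 3
score = 30 + 5 * hours_studied + 3 * hours_slept + 2 * classes_attended
print(score)  # 70

# Fitting works exactly like SLR, just with one column per feature.
# (A real dataset would have far more than three rows.)
X = np.array([[1, 7, 4],
              [3, 6, 2],
              [2, 4, 1]])              # hours studied, hours slept, classes attended
y = np.array([71, 65, 45])
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # one intercept, one coefficient per feature
```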
One important concept in MLR is overfitting. If you have a huge dataset with lots of features, and you include all of them, the model might just memorize the training data, but perform badly on new, unseen data.
It’s like only studying past exam papers. You memorize old questions, but when the real exam asks something new, you struggle! Overfitting = memorizing the training data instead of learning real patterns.
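Here's a toy sketch of what that looks like in practice, using synthetic data where only one of twenty features actually matters; all numbers are made up:

```python
# Many noisy features + few samples: great score on seen data, poor score on unseen data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))            # 30 students, 20 features (mostly noise)
y = 5 * X[:, 0] + rng.normal(size=30)    # only the first feature really matters

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print(model.score(X_train, y_train))   # near-perfect R² on the data it memorized
print(model.score(X_test, y_test))     # noticeably worse on new, unseen data
```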
So far, we’ve assumed relationships to be linear, meaning you can predict the outcome with a straight line. But sometimes, a straight line just isn’t good enough. Polynomial Regression allows the model to fit a curve instead.
New concepts:

- Degree: the highest power of X in the equation; it controls how many bends the curve can have.
- Polynomial terms: X², X³, and so on, which allow the model to bend instead of staying a straight line.
So when should I use Polynomial Regression instead of MLR? When the relationship between X and Y is curved rather than straight. The equation then becomes:
Y = β₀ + β₁X + β₂X² + ... + βₙXⁿ
Suppose we study the effect of fertilizer (X) on crop yield (Y).
If you plot the data, you get a ∩-shaped curve: yield goes up with more fertilizer, peaks, and then drops when you over-fertilize. A straight line won't fit this well, but a degree-2 polynomial (a parabola) can fit it nicely.
Thus, our formula would look like: Yield (Y) = β₀ + β₁ × Fertilizer + β₂ × Fertilizer²
The challenge is picking the right degree to avoid overfitting or underfitting.
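A quick sketch of a degree-2 fit with scikit-learn; the fertilizer and yield numbers are made up purely to produce a ∩ shape:

```python
# Degree-2 polynomial regression: PolynomialFeatures adds the Fertilizer² column,
# LinearRegression then fits β₀, β₁ and β₂.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

fertilizer = np.array([[1], [2], [3], [4], [5]])
crop_yield = np.array([10, 18, 22, 19, 11])   # rises, peaks, then falls

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(fertilizer, crop_yield)
print(model.predict([[3.5]]))                 # a prediction near the top of the curve
```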
Support Vector Regression (SVR) tries to fit a curve (often a line) within a certain margin of error. This margin is called epsilon (ϵ). Imagine a "tube" around the line: points inside the tube are fine and carry no penalty (small errors), while points outside the tube are penalized (big errors).
Some important terms:

- Epsilon (ϵ): the width of the tube, i.e. how much error is acceptable before a point gets penalized.
- Support vectors: the data points on or outside the tube; they are the points that actually determine where the line ends up.
Say we predict a test score of 85. If the real score is 86, that's within a margin of 2 (ϵ=2), no problem. If the real score is 90, that's 5 points away, penalized!
In short, SVR tries to fit a tube of width ϵ around the best fit line, allowing small errors inside but punishing bigger ones outside.
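A minimal SVR sketch on the study-hours data from earlier; the C and epsilon values are just illustrative:

```python
# SVR with a linear kernel and an epsilon "tube" of width 2.
import numpy as np
from sklearn.svm import SVR

X = np.array([[1], [2], [3], [4]])   # hours studied
y = np.array([50, 55, 65, 70])       # test scores

# epsilon=2: errors of up to 2 points fall inside the tube and carry no penalty.
model = SVR(kernel="linear", C=100, epsilon=2)
model.fit(X, y)

print(model.predict([[5]]))       # predicted score for 5 hours of study
print(model.support_vectors_)     # the points on or outside the tube that define the fit
```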
A decision tree is like a flowchart. Each node in the tree asks a yes/no question. Based on the answer, the data gets split into two branches. This continues until we reach a leaf node (the final predicted number).
Variance measures how spread out numbers are: [55, 56, 57] = low variance (numbers close together); [40, 60, 80] = high variance (numbers spread out).
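You can check those two examples quickly with numpy:

```python
import numpy as np
print(np.var([55, 56, 57]))   # ≈ 0.67   -> low variance
print(np.var([40, 60, 80]))   # ≈ 266.67 -> high variance
```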
Decision Trees split the data to lower the variance after each split, trying to make groups as tight as possible.
We also call this minimizing the Mean Squared Error (MSE).
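Here's a minimal regression-tree sketch on the study-hours toy data; the depth limit is arbitrary, just to keep the flowchart small:

```python
# A small regression tree: print the yes/no questions it learned, then predict.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X = np.array([[1], [2], [3], [4]])   # hours studied
y = np.array([50, 55, 65, 70])       # test scores

tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, y)

print(export_text(tree, feature_names=["hours_studied"]))  # the flowchart of splits
print(tree.predict([[5]]))   # the prediction is the value of the leaf we end up in
```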
Instead of building one decision tree, we build many and average their results.
Random Forest helps reduce overfitting that a single decision tree can suffer from.
Imagine asking just one teacher to predict students' test scores, that's a decision tree.
But imagine asking hundreds of teachers, each with different groups of students, then averaging all their answers. That’s Random Forest!
Let's use a well-known idea to understand it better: imagine 100 students guessing how many jellybeans are inside a jar. Each student might guess wrong individually; some guess too high, others too low. But if you take the average of all their guesses, the result is surprisingly close to the real number! (The Wisdom of Crowds)
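Here's a sketch of both ideas side by side; the jar count, the guesses, and the forest settings are all made up:

```python
# 1) Wisdom of the crowd: average many noisy guesses.
# 2) Random forest: average many decision trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# 100 students guess the number of jellybeans in a jar that actually holds 500.
guesses = 500 + rng.normal(0, 80, size=100)   # individual guesses are often far off
print(guesses.mean())                         # the average lands close to 500

# A random forest applies the same trick to trees: many trees, one averaged answer.
X = np.array([[1], [2], [3], [4]])   # hours studied
y = np.array([50, 55, 65, 70])       # test scores
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict([[5]]))          # the average of 100 trees' predictions
```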
To wrap things up, here's a short summary of each model:

Simple Linear Regression (SLR): Predicts a dependent variable (Y) using a single independent variable (X). It is represented by a straight line. Example: predicting test scores based on study hours.
Multiple Linear Regression (MLR): Uses multiple independent variables (X₁, X₂, X₃) to predict a dependent variable (Y). Example: predicting test scores based on study hours, sleep, and class attendance.
Polynomial Regression: Used when the relationship between variables is non-linear. It fits a curve to the data rather than a straight line. Example: predicting crop yield based on fertilizer, where yield first rises and then falls (a ∩-shaped relationship).
Support Vector Regression (SVR): Fits a model within a "tube" defined by a margin of error (epsilon). SVR aims to keep as many points as possible inside this margin and only penalizes the ones that fall outside.
Decision Tree Regression: Uses a flowchart structure to make decisions and split data at each node, minimizing variance at each step. The final prediction comes from the leaf nodes.
Random Forest Regression: Builds multiple decision trees and averages their predictions, which helps reduce overfitting.
Each model is suited for different types of data and relationships. Linear models work well for simple, linear relationships, while decision trees and random forests handle more complex, non-linear patterns.
Finally, a short quiz to test yourself:

1. What is the main goal of regression models?
To predict a dependent variable (Y) based on independent variables (X)
2. In Simple Linear Regression, what does β₀ represent?
The point where the line crosses the y-axis (intercept)
3. What is the dependent variable in the example where study hours (X) predict test scores (Y)?
Test scores
4. In Multiple Linear Regression, how many independent variables are used to make a prediction?
Two or more
5. What does "overfitting" mean in machine learning models?
The model memorizes the training data and doesn't work well with new data
6. What is the primary difference between Linear Regression and Polynomial Regression?
Linear Regression fits a straight line, while Polynomial Regression fits a curve
7. What does the degree in Polynomial Regression control?
The number of bends in the curve
8. In Support Vector Regression (SVR), what does the margin of error (epsilon ϵ) represent?
The acceptable level of error before penalizing predictions
9. What is the function of a decision tree in Regression?
It splits data based on yes/no questions to minimize variance
10. How does Random Forest Regression improve over Decision Tree Regression?
By combining predictions from multiple trees to reduce overfitting
11. What does a decision tree aim to minimize during its splits?
Variance or Mean Squared Error (MSE)
12. What is the role of "support vectors" in Support Vector Regression (SVR)?
They are the data points that define the boundaries of the error margin
13. What type of data relationship is best suited for Polynomial Regression?
Curved or non-linear relationship