April 29, 2025
We will walk through how to implement a Simple Linear Regression model step-by-step using Python. You'll learn how to load data, train a model, make predictions, and visualize results using pandas, scikit-learn, and matplotlib.
To read the theory about Linear Regression Models, follow this link
Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
For this example, we will use a dataset available on Kaggle. You can find it here: https://www.kaggle.com/datasets/karthickveerakumar/salary-data-simple-linear-regression?select=Salary_Data.csv
YearsExperience | Salary |
---|---|
1.1 | 39343.00 |
1.3 | 46205.00 |
1.5 | 37731.00 |
2.0 | 43525.00 |
2.2 | 39891.00 |
data = pd.read_csv('Salary_Data.csv')
X = data.iloc[:, :-1].values # All columns except the last
y = data.iloc[:, -1].values # Only the last column
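The two iloc slices return arrays of different shapes, which matters because scikit-learn expects the features as a 2-D array and the target as a 1-D array. The sketch below rebuilds the five sample rows from the table inline (the names sample, X_sample and y_sample are illustrative, not part of the original code) so it runs without downloading the CSV:

```python
import pandas as pd

# The five sample rows shown in the table above, built inline so the
# example runs without Salary_Data.csv.
sample = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.5, 2.0, 2.2],
    "Salary": [39343.0, 46205.0, 37731.0, 43525.0, 39891.0],
})

# Slicing columns with :-1 keeps a 2-D result: one row per sample,
# one column per feature.
X_sample = sample.iloc[:, :-1].values

# Selecting a single column with -1 gives a 1-D array of targets.
y_sample = sample.iloc[:, -1].values

print(X_sample.shape)
print(y_sample.shape)
```

The same shapes apply to the full dataset; only the number of rows changes.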
We will now use scikit-learn to split the data into a training and test set.
from sklearn.model_selection import train_test_split
# Use 20% of data as test data. Set random_state to keep results reproducible
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=0
)
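To see what test_size and random_state do, here is a minimal sketch on a small synthetic array (X_demo and y_demo are made-up values for illustration, not the salary data). It checks the size of each part and that repeating the call with the same random_state gives the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic samples with a single feature (illustrative values only).
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0

# Two identical calls with the same random_state.
first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

X_tr, X_te, y_tr, y_te = first
print(len(X_tr), len(X_te))  # 8 training samples, 2 test samples

# Same seed, same split: every returned array matches.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```

Changing random_state, or leaving it out, will generally shuffle the rows differently on each run.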
from sklearn.linear_model import LinearRegression
# Create an instance of the model
model = LinearRegression()
# Train the model with the training data
model.fit(X_train, y_train)
# Predict the test set results
y_pred = model.predict(X_test)
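With a single feature, a fitted LinearRegression is just a straight line: predict() evaluates intercept + slope * x. The sketch below shows this on synthetic data that follows y = 2x + 1 exactly (demo_model and the data are assumptions for illustration). With the tutorial's model, the same attributes are model.coef_ and model.intercept_.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 2x + 1 (illustrative only).
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([3.0, 5.0, 7.0, 9.0])

demo_model = LinearRegression().fit(X_demo, y_demo)

slope = demo_model.coef_[0]        # one coefficient per feature
intercept = demo_model.intercept_  # value of the line at x = 0

# Reproduce predict() by hand for two unseen inputs.
X_new = np.array([[5.0], [6.0]])
manual = intercept + slope * X_new.ravel()
predicted = demo_model.predict(X_new)

print(slope, intercept)
print(np.allclose(manual, predicted))
```

On the salary data, the slope can be read as the estimated change in salary per additional year of experience.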
# Let's visualize the results for the Training Set first:
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, model.predict(X_train), color='blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
[Figure: Simple Linear Regression, training set]
To plot the test set, follow the same steps: scatter the test points and draw the same regression line, which was fitted on the training data.
plt.scatter(X_test, y_test, color='red')
plt.plot(X_train, model.predict(X_train), color='blue') # Line stays the same
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
[Figure: Simple Linear Regression, test set]
1. What does model.fit(X_train, y_train) do?
It trains the model on the training data.
2. What percentage of the dataset is typically used for testing?
20% (test_size=0.2 in this example)
3. When plotting the regression results, why do we reuse the model’s prediction on the training data even when plotting the test set?
Because the regression line is based only on the training data and stays fixed.
4. With pandas, select all columns except the last one
data.iloc[:, :-1].values
5. With pandas, select only the last column
data.iloc[:, -1].values