Underfitting vs Overfitting¶
Overview¶
Overfitting and underfitting are two common problems in machine learning that affect a model's ability to generalize to new data. Ideally, a model should recognize the underlying patterns in the data without being sensitive to its noise; such a model is considered well fit.
Underfitting¶
Underfitting is when a model is too simple to recognize patterns in data. A model might be underfit if:
- The model is too simple
- The model is undertrained
- Its features are missing or irrelevant
Underfit models have many issues, including:
- Poor performance on both training and testing data
- High bias
- Slow improvement when training
You can fix underfit models by:
- Using a more complex model
- Adding more relevant features
- Training for longer or tuning the training process
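The first fix above can be sketched as follows. This is a minimal illustration, assuming a noisy sine curve as toy data and a fixed seed; the helper name `fit_and_score` is hypothetical. It compares a straight-line model with a fourth-degree polynomial model on the same split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumed toy data: a noisy sine curve (the seed is fixed only for repeatability)
rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.2, X.shape[0])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def fit_and_score(degree):
    """Fit a polynomial regression of the given degree; return (train MSE, test MSE)."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    return train_mse, test_mse


simple_train, simple_test = fit_and_score(1)  # straight line: too simple for a sine curve
richer_train, richer_test = fit_and_score(4)  # adds curvature through x^2, x^3, and x^4

print(f"degree 1: train MSE {simple_train:.3f}, test MSE {simple_test:.3f}")
print(f"degree 4: train MSE {richer_train:.3f}, test MSE {richer_test:.3f}")
```

An underfit model tends to show a high error on both the training and the testing data, so an improvement on both after adding complexity is a sign that the simpler model was underfit.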
Overfitting¶
Overfitting is when a model fits the training data too closely, learning its noise along with the underlying pattern. This means it performs very well on the training data but fails on unseen data. A model may overfit when:
- It is trained on a small or noisy dataset
- It is too complex for the given task
- It is overtrained
Overfitting can cause many problems, such as:
- Overconfident incorrect predictions
- Poor real-world performance
- High sensitivity to noise in the data
An overfit model can be fixed by:
- Using more training data
- Using a simpler model
- Using cross-validation or early stopping
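As one possible illustration of the cross-validation fix, the sketch below picks a polynomial degree by 5-fold cross-validation. The toy data, the seed, the range of degrees, and the variable names `cv_mse` and `best_degree` are all assumptions made for this example, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumed toy data: a noisy sine curve (the seed is fixed only for repeatability)
rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.2, X.shape[0])

# Shuffle the folds because X is sorted; unshuffled folds would hold out
# contiguous ranges of X and force the model to extrapolate
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # scikit-learn reports negated MSE so that larger scores are always better
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
for degree, mse in cv_mse.items():
    print(f"degree {degree:2d}: mean validation MSE {mse:.3f}")
print(f"lowest validation MSE at degree {best_degree}")
```

Because every candidate is scored on data it was not fit to, a model that only memorizes noise gains nothing from its extra complexity here, which is why the validation error is a more reliable guide than the training error.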
Demonstration¶
The code below creates a visual demonstration of how underfitting, overfitting, and a good fit affect performance by comparing models of varying complexity that are trained on the same data.
First, we import the necessary libraries:
import numpy as np # linear algebra and arrays
from sklearn.model_selection import train_test_split # splitting data into train/test sets
from sklearn.preprocessing import PolynomialFeatures # adding polynomial features to input data
from sklearn.linear_model import LinearRegression # model used in example
from sklearn.metrics import mean_squared_error # evaluation metric
import matplotlib.pyplot as plt # plotting data points and results
Next, we use NumPy to generate our data, and sklearn to split it into train/test sets.
X = np.sort(5 * np.random.rand(100, 1), axis=0) # generate and sort 100 numbers between 0 and 5 to use as features
y = np.sin(X).ravel() + np.random.normal(0, 0.2, X.shape[0]) # generate the sine of X with Gaussian noise
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # split the data into train and test sets
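The data above is random, so the plots and MSE values differ on every run. If repeatable results are wanted, one option is to seed both the data generation and the split. The sketch below assumes a hypothetical helper named `make_data` and an arbitrary seed of 0; neither is part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def make_data(seed=0):
    """Generate the noisy sine data and split it, repeatably for a given seed."""
    rng = np.random.default_rng(seed)  # seeded generator instead of the global state
    X = np.sort(5 * rng.random((100, 1)), axis=0)  # 100 sorted values in [0, 5)
    y = np.sin(X).ravel() + rng.normal(0, 0.2, X.shape[0])  # sine plus Gaussian noise
    # random_state makes the shuffle inside train_test_split repeatable as well
    return train_test_split(X, y, test_size=0.2, random_state=seed)


X_train, X_test, y_train, y_test = make_data(seed=0)
print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)
```

Calling `make_data` twice with the same seed returns identical arrays, while a different seed gives a fresh dataset and split.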
Then, we define a function that fits a polynomial regression model of a specified degree, plots its predictions, and displays the model's mean squared error (MSE) on the training and testing data.
def plot_model(degree):
    # create polynomial features
    polynomial = PolynomialFeatures(degree) # create a feature transformer for the specified degree
    polynomial_X = polynomial.fit_transform(X_train) # fit the transformer and transform training features
    # fit regression model using transformed data
    model = LinearRegression() # define a linear regression model
    model.fit(polynomial_X, y_train) # fit model on polynomial data
    # generate smooth curve
    X_plot = np.linspace(0, 5, 100).reshape(-1, 1) # 100 evenly spaced points for the prediction curve
    polynomial_X_plot = polynomial.transform(X_plot) # transform the curve points
    y_plot = model.predict(polynomial_X_plot) # predict the curve points
    # predict train and test data
    train_predictions = model.predict(polynomial_X) # predict training data
    polynomial_X_test = polynomial.transform(X_test) # transform testing data
    test_predictions = model.predict(polynomial_X_test) # predict testing data
    # calculate MSE of training and testing predictions
    train_mse = mean_squared_error(y_train, train_predictions) # evaluate training predictions
    test_mse = mean_squared_error(y_test, test_predictions) # evaluate testing predictions
    # create a visualization
    plt.figure(figsize=(8, 6)) # create an 8x6 inch figure
    plt.scatter(X_train, y_train, color="blue", label="train") # plot training data
    plt.scatter(X_test, y_test, color="green", label="test") # plot testing data
    plt.plot(X_plot, y_plot, color="red", label="Model Predictions") # plot model predictions
    plt.title(f"Degree {degree} | Training MSE: {train_mse:.2f} | Testing MSE: {test_mse:.2f}") # title plot with model information
    plt.legend() # display legend on graph
    plt.show() # display graph
Finally, we call the previous function for the following degrees:
- 1 to show underfitting
- 4 to show a well fit model
- 15 to show overfitting
for degree in [1, 4, 15]: # degrees of an underfit (1), well fit (4), and overfit (15) model
    plot_model(degree) # call function with specified degree
Because the data is generated randomly, the exact numbers change from run to run, but the underfit (first-degree) and overfit (fifteenth-degree) models typically have a higher testing MSE than the well-fit (fourth-degree) model. The underfit model is too simple to capture the curve of the data, the overfit model follows the noise too closely, and the well-fit model balances the two.
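Since a single random split can be lucky or unlucky, one way to check the comparison is to repeat the experiment over several seeds and average the testing MSE. The sketch below is only an illustration under assumptions: the helper name `held_out_mse`, the choice of 20 seeds, and the use of a scikit-learn pipeline are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def held_out_mse(degree, seed):
    """Generate a fresh noisy sine dataset and return the testing MSE for one degree."""
    rng = np.random.default_rng(seed)
    X = np.sort(5 * rng.random((100, 1)), axis=0)
    y = np.sin(X).ravel() + rng.normal(0, 0.2, X.shape[0])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))


seeds = range(20)  # assumed number of repetitions
mean_test_mse = {
    degree: float(np.mean([held_out_mse(degree, seed) for seed in seeds]))
    for degree in [1, 4, 15]
}

for degree, mse in mean_test_mse.items():
    print(f"degree {degree:2d}: mean testing MSE over {len(seeds)} runs = {mse:.3f}")
```

Averaging over runs separates the effect of model complexity from the luck of one particular split, so the ranking of the three degrees is easier to trust.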
Summary¶
- Model fitting is the process of finding the parameters that let a model make accurate predictions
- Overfitting is when a model fits the training data too closely and becomes sensitive to noise
- Underfitting is when a model is too simple to find the patterns in the data
- Well-fit models balance overfitting and underfitting, finding patterns without being sensitive to noise
Author and License¶
This notebook was authored by Aiden Flynn and is available under the Apache 2.0 License.