Underfitting vs Overfitting¶

Overview¶

Overfitting and underfitting are two common problems in machine learning that affect a model's ability to generalize to new data. Ideally, a model should recognize the underlying patterns in the data without being sensitive to its noise; such a model is considered well fit.

Underfitting¶

Underfitting is when a model is too simple to recognize patterns in data. A model might be underfit if:

  • The model is too simple for the task
  • The model is undertrained
  • Its features are missing or irrelevant

Underfit models have several issues, including:

  • Poor performance on both training and testing data
  • High bias
  • Slow improvement during training

You can fix underfit models by:

  • Using a more complex model
  • Adding more relevant features
  • Training for longer or with better-tuned settings
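
As a rough illustration of the second remedy, the sketch below fits a linear regression to noisy sine data twice: once on the raw input alone, and once with one extra feature added. The data, the seed, the variable names, and the choice of `np.sin(X_raw)` as the "relevant" feature are all assumptions made for this example; in practice the useful feature is rarely known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)                            # fixed seed (assumption) so the sketch is repeatable
X_raw = np.sort(5 * rng.random((100, 1)), axis=0)         # 100 inputs between 0 and 5
y_demo = np.sin(X_raw).ravel() + rng.normal(0, 0.2, 100)  # noisy sine targets

# underfit: a straight line on the raw input cannot follow the sine curve
simple_model = LinearRegression().fit(X_raw, y_demo)
mse_simple = mean_squared_error(y_demo, simple_model.predict(X_raw))

# same model type, with one more relevant feature added as a second column
X_richer = np.column_stack([X_raw, np.sin(X_raw)])
richer_model = LinearRegression().fit(X_richer, y_demo)
mse_richer = mean_squared_error(y_demo, richer_model.predict(X_richer))

print(f"Training MSE, raw input only:     {mse_simple:.3f}")
print(f"Training MSE, with added feature: {mse_richer:.3f}")
```

A high error on the training data itself is the usual sign of underfitting, which is why this sketch compares training MSE rather than testing MSE.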

Overfitting¶

Overfitting is when a model fits the training data too closely, learning its noise along with its patterns. This means it performs very well on the training data but fails on unseen data. A model may overfit when:

  • It is trained on small or noisy data
  • It is too complex for the given task
  • It is overtrained

Overfitting can cause many problems, such as:

  • Overconfidence in incorrect predictions
  • Poor real-world performance
  • High sensitivity to noise in the training data
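
The sensitivity to noise can be made concrete with a small experiment. The sketch below, which is only an illustration under assumed settings, fits the same polynomial model twice on identical inputs whose targets differ only in their random noise, then measures how far apart the two sets of predictions are. The helper `prediction_spread`, the seed, and the degrees 3 and 15 are hypothetical choices for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)                  # fixed seed (assumption) for repeatability
X_demo = np.linspace(0, 5, 30).reshape(-1, 1)   # 30 evenly spaced inputs between 0 and 5
signal = np.sin(X_demo).ravel()                 # noise-free targets

def prediction_spread(degree, n_trials=20):
    """Average squared gap between two fits that differ only in the noise they saw."""
    gaps = []
    for _ in range(n_trials):
        predictions = []
        for _ in range(2):
            y_noisy = signal + rng.normal(0, 0.2, signal.shape[0])  # fresh noise for each fit
            model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
            model.fit(X_demo, y_noisy)
            predictions.append(model.predict(X_demo))
        gaps.append(np.mean((predictions[0] - predictions[1]) ** 2))
    return float(np.mean(gaps))

spread_simple = prediction_spread(3)    # modest complexity
spread_complex = prediction_spread(15)  # high complexity

print(f"Degree 3 prediction spread:  {spread_simple:.4f}")
print(f"Degree 15 prediction spread: {spread_complex:.4f}")
```

A larger spread means the model's predictions depend more on the particular noise it happened to see, which is the behavior the last bullet describes.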

An overfit model can be fixed by:

  • Using more training data
  • Using a simpler model
  • Using cross-validation or early stopping
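
One way cross-validation might be used here is to choose a model complexity by its error on held-out folds rather than on the training data. The sketch below is a minimal example under assumed settings (synthetic sine data, 5 shuffled folds, polynomial degrees 1 through 10); the variable names are hypothetical. It uses scikit-learn's `cross_val_score` with the `neg_mean_squared_error` scorer, whose sign is flipped to recover MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)                          # fixed seed (assumption) for repeatability
X_cv = np.sort(5 * rng.random((100, 1)), axis=0)        # 100 inputs between 0 and 5
y_cv = np.sin(X_cv).ravel() + rng.normal(0, 0.2, 100)   # noisy sine targets

# shuffle the folds because X_cv is sorted; unshuffled folds would each cover one end of the range
folds = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X_cv, y_cv, cv=folds, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()                     # scorer returns negative MSE

best_degree = min(cv_mse, key=cv_mse.get)               # degree with the lowest held-out error
for degree, mse in cv_mse.items():
    print(f"degree {degree:2d}: cross-validated MSE = {mse:.4f}")
print(f"Lowest cross-validated MSE at degree {best_degree}")
```

Because every fold's error is measured on data the model did not train on, a model that only memorizes noise gains nothing from its extra complexity in this comparison.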

Demonstration¶

The code below visualizes how underfitting, overfitting, and a good fit affect a model's performance by comparing models of varying complexity that are trained on the same data.

First, we import necessary libraries:

In [1]:
import numpy as np                                   # linear algebra and arrays
from sklearn.model_selection import train_test_split # splitting data into train/test sets
from sklearn.preprocessing import PolynomialFeatures # adding polynomial features to input data
from sklearn.linear_model import LinearRegression    # model used in example
from sklearn.metrics import mean_squared_error       # evaluation metric
import matplotlib.pyplot as plt                      # plotting data points and results

Next, we use NumPy to generate our data, and sklearn to split it into train/test sets.

In [2]:
X = np.sort(5 * np.random.rand(100, 1), axis=0)              # generate and sort 100 numbers between 0 and 5 to use as features
y = np.sin(X).ravel() + np.random.normal(0, 0.2, X.shape[0]) # generate the sine of X with Gaussian noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # split the data into train and test sets
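
The cell above draws new random data and a new random split on every run, so the plots and MSE values will differ each time. If repeatable results are wanted, one option is to seed both steps, as in the sketch below. The seed value 42 and the `_seeded` names are assumptions chosen so this example does not overwrite the variables above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)                                 # seeded generator (seed value is arbitrary)
X_seeded = np.sort(5 * rng.random((100, 1)), axis=0)            # 100 sorted inputs between 0 and 5
y_seeded = np.sin(X_seeded).ravel() + rng.normal(0, 0.2, 100)   # sine of the inputs with Gaussian noise

# random_state fixes which rows land in the train and test sets
X_train_seeded, X_test_seeded, y_train_seeded, y_test_seeded = train_test_split(
    X_seeded, y_seeded, test_size=0.2, random_state=42
)

print(X_train_seeded.shape, X_test_seeded.shape)
```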

Then, we define a function that fits a polynomial regression model of a specified degree, plots its predictions, and displays the model's mean squared error (MSE) on the training and testing sets.

In [3]:
def plot_model(degree):
    # create polynomial features
    polynomial = PolynomialFeatures(degree)          # create a feature transformer for a specified degree
    polynomial_X = polynomial.fit_transform(X_train) # transform training features

    # fit regression model using transformed data
    model = LinearRegression()       # define a linear regression model
    model.fit(polynomial_X, y_train) # fit model on polynomial data

    # generate smooth curve
    X_plot = np.linspace(0, 5, 100).reshape(-1, 1)   # generate X plot for predictions curve
    polynomial_X_plot = polynomial.transform(X_plot) # transform X plot
    y_plot = model.predict(polynomial_X_plot)        # predict input points

    # predict train and test data
    train_predictions = model.predict(polynomial_X)     # predict training data
    polynomial_X_test = polynomial.transform(X_test)    # transform testing data
    test_predictions = model.predict(polynomial_X_test) # predict testing data

    # calculate MSE of training and testing predictions
    train_mse = mean_squared_error(y_train, train_predictions) # evaluate training predictions
    test_mse = mean_squared_error(y_test, test_predictions)    # evaluate testing predictions

    # create a visualization
    plt.figure(figsize=(8, 6))                                                                  # create an 8x6 inch figure
    plt.scatter(X_train, y_train, color="blue", label="train")                                  # plot training data
    plt.scatter(X_test, y_test, color="green", label="test")                                    # plot testing data
    plt.plot(X_plot, y_plot, color="red", label="Model Predictions")                            # plot model predictions
    plt.title(f"Degree {degree} | Training MSE: {train_mse:.2f} | Testing MSE: {test_mse:.2f}") # title plot with model information
    plt.legend()                                                                                # display legend on graph
    plt.show()                                                                                  # display the figure

Finally, we call the previous function for the following degrees:

  • 1 to show underfitting
  • 4 to show a well-fit model
  • 15 to show overfitting

In [4]:
for degree in [1, 4, 15]: # define degree of an underfit (1), well fit (4), and overfit (15) model
    plot_model(degree)    # call function with specified degree
[Output: three plots, one for each degree (1, 4, and 15), each showing the training data, the testing data, and the model's prediction curve, with the training and testing MSE in the title.]

As you can see, the underfit (first-degree) and overfit (fifteenth-degree) models performed worse than the well-fit (fourth-degree) one. The underfit model was too simple to capture the curve of the data, and the overfit model followed the noise in the training data too closely, while the well-fit model struck a balance between the two. Because the data is generated randomly without a fixed seed, the exact MSE values differ from run to run.
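
The same comparison can also be read from numbers alone. The sketch below, a self-contained variation rather than part of the original notebook, prints the training and testing MSE for each degree side by side. The seed, the `score_degree` helper, and the use of `make_pipeline` in place of the manual transform steps are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)                            # fixed seed (assumption) for repeatability
X_all = np.sort(5 * rng.random((100, 1)), axis=0)         # 100 inputs between 0 and 5
y_all = np.sin(X_all).ravel() + rng.normal(0, 0.2, 100)   # noisy sine targets
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=1)

def score_degree(degree):
    """Fit a polynomial regression of the given degree and return (train MSE, test MSE)."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    return train_mse, test_mse

scores_by_degree = {degree: score_degree(degree) for degree in [1, 4, 15]}
for degree, (train_mse, test_mse) in scores_by_degree.items():
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

When reading the output, a high error on both sets suggests underfitting, while a testing error well above the training error suggests overfitting.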

Summary¶

  • Model fitting is the process of finding the best parameters to make accurate predictions
  • Overfitting is when a model learns the training data too well and becomes sensitive to noise
  • Underfitting is when the model is too simple, meaning it cannot find patterns in the data
  • Well-fit models strike a balance between overfitting and underfitting, finding patterns without being sensitive to noise

Author and License¶

This notebook was authored by Aiden Flynn and is available under the Apache 2.0 License.

Kaggle | GitHub