Residual Analysis

Plotting and Analyzing Residuals

The residuals from a fitted model are defined as the differences between the response data and the fit to the response data at each predictor value.

residual = data – fit

You can display the residuals in the Curve Fitter app by clicking Residuals Plot in the Visualization section of the Curve Fitter tab.

Mathematically, the residual for a specific predictor value is the difference between the response value y and the predicted response value ŷ.

r = y – ŷ
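Outside the app, the same quantity can be computed directly from this definition. The following is a minimal sketch in base MATLAB, assuming hypothetical straight-line data (it is not the data behind the figures in this topic) and using polyfit and polyval in place of the Curve Fitter app.

```matlab
% Hypothetical data: a noisy straight line (not the data from the figures).
rng(0);                               % fix the seed so the run is repeatable
x = (0:0.5:10)';
y = 3 + 2*x + 0.5*randn(size(x));

p    = polyfit(x, y, 1);              % least-squares first-degree polynomial
yhat = polyval(p, x);                 % predicted response at each predictor value
r    = y - yhat;                      % residuals: data minus fit

% For a least-squares line with a constant term, the residuals sum to
% approximately zero (up to floating-point round-off).
fprintf('sum of residuals = %.2e\n', sum(r));
```

Each element of r corresponds to one predictor value, so r can be plotted against x to inspect the residuals visually.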

If the model you fit to the data is correct, the residuals approximate the random errors. Therefore, residuals that appear to behave randomly suggest that the model fits the data well, whereas residuals that display a systematic pattern are a clear sign that the model fits the data poorly. Keep in mind that many results of model fitting, such as confidence bounds, are invalid if the model is grossly inappropriate for the data.

A graphical display of the residuals for a first-degree polynomial fit is shown below. The top plot shows that the residuals are calculated as the vertical distance from the data point to the fitted curve. The bottom plot displays the residuals relative to the fit, which is the zero line.

Plots of residuals for a first-degree polynomial fit

The residuals appear randomly scattered around zero, indicating that the model describes the data well.

A graphical display of the residuals for a second-degree polynomial fit is shown below. The model includes only the quadratic term, and does not include a linear or constant term.

Plots of residuals for a second-degree polynomial fit

The residuals are systematically positive for much of the data range, indicating that this model is a poor fit for the data.
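The contrast between the two fits can be reproduced at the command line. This is a sketch only: the data is hypothetical (a noisy straight line, not the data used for the figures), and it assumes that the Curve Fitting Toolbox functions fit and fittype are available. The quadratic-only model is written as a linear model with the single term x^2.

```matlab
% Hypothetical data (not the data behind the figures): a noisy straight line.
rng(0);
x = (0:0.5:10)';
y = 3 + 2*x + 0.5*randn(size(x));

% First-degree polynomial: slope and constant term.
[fLin, ~, outLin] = fit(x, y, 'poly1');

% Quadratic term only: a*x^2, with no linear or constant term.
ftQuad = fittype({'x^2'});
[fQuad, ~, outQuad] = fit(x, y, ftQuad);

% Fraction of residuals above zero. A value far from 0.5 hints at a
% systematic pattern rather than random scatter.
fprintf('poly1:    %.2f of residuals are positive\n', mean(outLin.residuals > 0));
fprintf('x^2 only: %.2f of residuals are positive\n', mean(outQuad.residuals > 0));
```

The fraction of positive residuals is a crude summary. A plot of the residuals against x remains the more informative check.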

Example: Residual Analysis

This example fits several polynomial models to generated data and evaluates how well those models fit the data and how precisely they can predict. The data is generated from a cubic curve with added random noise, and there is a large gap in the range of the x variable (between 3 and 9) where no data exists.

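x = [1:0.1:3 9:0.1:10]';   % predictor values, with a gap between 3 and 9
c = [2.5 -0.5 1.3 -0.1];   % cubic coefficients, constant term first
y = c(1) + c(2)*x + c(3)*x.^2 + c(4)*x.^3 + (rand(size(x))-0.5);   % cubic plus uniform noise between -0.5 and 0.5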

Fit the data in the Curve Fitter app using a cubic polynomial and a fifth-degree polynomial. The data, fits, and residuals are shown below.
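For readers who prefer the command line, the following sketch performs comparable fits with the fit function, assuming the Curve Fitting Toolbox is installed. The rng call is an addition for repeatability and is not part of the original example, so the numbers will differ from those shown in the app. The fifth-degree fit may produce a warning that the equation is badly conditioned.

```matlab
rng(1);                               % assumption: seed added for repeatability
x = [1:0.1:3 9:0.1:10]';
c = [2.5 -0.5 1.3 -0.1];
y = c(1) + c(2)*x + c(3)*x.^2 + c(4)*x.^3 + (rand(size(x))-0.5);

% Fit a cubic and a fifth-degree polynomial. The third output holds the
% residuals for each data point.
[f3, gof3, out3] = fit(x, y, 'poly3');
[f5, gof5, out5] = fit(x, y, 'poly5');

% Plot the residuals of each fit against x, with a zero reference line.
figure
subplot(2,1,1)
plot(x, out3.residuals, '.', x, zeros(size(x)), '-')
title('poly3 residuals')
subplot(2,1,2)
plot(x, out5.residuals, '.', x, zeros(size(x)), '-')
title('poly5 residuals')
```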

Plots for a cubic polynomial fit and a fifth-degree polynomial fit

Both models appear to fit the data well, and the residuals appear to be randomly distributed around zero. Therefore, a graphical evaluation of the fits does not reveal any obvious differences between the two equations.

Look at the numerical fit results in the Results pane and compare the confidence bounds for the coefficients.

The results show that the cubic fit coefficients are known precisely: the 95% confidence bounds for poly3 are narrow. This is expected because the generated data follows a cubic curve. In contrast, the 95% confidence bounds for poly5 are wide, indicating that the coefficients of the fifth-degree polynomial are not known precisely.
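The same comparison can be sketched at the command line with confint, which returns the lower and upper confidence bounds for each coefficient (95% by default). The seed and the variable names are assumptions for illustration, so the values will not match the Results pane exactly.

```matlab
rng(1);                               % assumption: seed added for repeatability
x = [1:0.1:3 9:0.1:10]';
c = [2.5 -0.5 1.3 -0.1];
y = c(1) + c(2)*x + c(3)*x.^2 + c(4)*x.^3 + (rand(size(x))-0.5);

f3 = fit(x, y, 'poly3');
f5 = fit(x, y, 'poly5');

% confint returns a 2-by-n matrix: row 1 holds the lower bounds and
% row 2 the upper bounds, one column per coefficient.
ci3 = confint(f3);
ci5 = confint(f5);

% Width of each 95% confidence interval.
w3 = diff(ci3);
w5 = diff(ci5);

disp('poly3 coefficients and interval widths:')
disp([coeffvalues(f3); w3])
disp('poly5 coefficients and interval widths:')
disp([coeffvalues(f5); w5])
```

An interval that is wide relative to the coefficient value indicates that the data does not determine that coefficient well.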

The goodness-of-fit statistics are shown in the Table Of Fits pane. By default, the adjusted R-square and RMSE statistics are displayed in the table. The statistics do not reveal a substantial difference between the two equations. To choose statistics to display or hide, right-click the column headers.
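At the command line, the second output of fit is a structure that contains these statistics. The sketch below assumes the same generated data with an added seed, and reads the adjrsquare and rmse fields for both fits.

```matlab
rng(1);                               % assumption: seed added for repeatability
x = [1:0.1:3 9:0.1:10]';
c = [2.5 -0.5 1.3 -0.1];
y = c(1) + c(2)*x + c(3)*x.^2 + c(4)*x.^3 + (rand(size(x))-0.5);

% The second output is a goodness-of-fit structure with the fields
% sse, rsquare, dfe, adjrsquare, and rmse.
[~, gof3] = fit(x, y, 'poly3');
[~, gof5] = fit(x, y, 'poly5');

fprintf('poly3: adjusted R-square = %.4f, RMSE = %.4f\n', gof3.adjrsquare, gof3.rmse);
fprintf('poly5: adjusted R-square = %.4f, RMSE = %.4f\n', gof5.adjrsquare, gof5.rmse);
```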

The 95% nonsimultaneous prediction bounds for new observations are shown below. To display prediction bounds in the Curve Fitter app, select 95% from the Prediction Bounds list in the Visualization section of the Curve Fitter tab.

Fit plots with prediction bounds for two polynomial fits

The prediction bounds for poly3 indicate that new observations can be predicted with a small uncertainty throughout the entire data range. This is not the case for poly5. It has wider prediction bounds in the area where no data exists, apparently because the data does not contain enough information to estimate the higher-degree polynomial terms accurately. In other words, a fifth-degree polynomial overfits the data.
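The widening of the poly5 bounds in the gap can be checked numerically with predint. This is a sketch under the same assumptions as before (added seed, Curve Fitting Toolbox available). The query grid xq and the variable names are chosen for illustration.

```matlab
rng(1);                               % assumption: seed added for repeatability
x = [1:0.1:3 9:0.1:10]';
c = [2.5 -0.5 1.3 -0.1];
y = c(1) + c(2)*x + c(3)*x.^2 + c(4)*x.^3 + (rand(size(x))-0.5);

f3 = fit(x, y, 'poly3');
f5 = fit(x, y, 'poly5');

% 95% nonsimultaneous bounds for new observations on a grid that also
% covers the gap. Each result is n-by-2: [lower upper].
xq  = (1:0.25:10)';
pi3 = predint(f3, xq, 0.95, 'observation', 'off');
pi5 = predint(f5, xq, 0.95, 'observation', 'off');

% Interval widths, and the widest interval inside the gap (3 < x < 9).
w3 = pi3(:,2) - pi3(:,1);
w5 = pi5(:,2) - pi5(:,1);
inGap = xq > 3 & xq < 9;
fprintf('widest interval in the gap: poly3 = %.2f, poly5 = %.2f\n', ...
    max(w3(inGap)), max(w5(inGap)));
```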

The 95% prediction bounds for the fitted function using poly5 are shown below. The uncertainty in predicting the function is large in the center of the data range, where no data exists. Therefore, you would conclude that more data must be collected before you can make precise predictions using a fifth-degree polynomial.
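Bounds for the fitted function differ from bounds for new observations: the former reflect only the uncertainty in the estimated curve, while the latter also include the random error of a new measurement. As a sketch, assuming the same seeded data as in the earlier command-line examples, predint can compute both kinds of bounds for poly5 so that they can be compared.

```matlab
rng(1);                               % assumption: seed added for repeatability
x = [1:0.1:3 9:0.1:10]';
c = [2.5 -0.5 1.3 -0.1];
y = c(1) + c(2)*x + c(3)*x.^2 + c(4)*x.^3 + (rand(size(x))-0.5);

f5 = fit(x, y, 'poly5');

% 95% nonsimultaneous bounds on a grid that includes the gap.
xq    = (1:0.25:10)';
piFun = predint(f5, xq, 0.95, 'functional',  'off');   % fitted function
piObs = predint(f5, xq, 0.95, 'observation', 'off');   % new observations

wFun = piFun(:,2) - piFun(:,1);
wObs = piObs(:,2) - piObs(:,1);

% Compare the widths at the center of the gap and inside the data.
iMid  = find(xq == 6);
iData = find(xq == 2);
fprintf('functional width at x = 2: %.2f, at x = 6: %.2f\n', wFun(iData), wFun(iMid));
fprintf('observation width at x = 2: %.2f, at x = 6: %.2f\n', wObs(iData), wObs(iMid));
```

At every query point the functional bounds are narrower than the observation bounds, because they omit the noise in a new measurement.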

In conclusion, you should examine all available goodness-of-fit measures before deciding on the fit that is best for your purposes. A graphical examination of the fit and residuals should always be your initial approach. However, some fit characteristics are revealed only through numerical fit results, statistics, and prediction bounds.