When we use a Linear Regression to analyse our data, we may assume that the model is a good fit because the p-value is significant (p-value < 0.05) and the R-squared value is high (R² close to 1). However, this may not always be the case. We should always assess the appropriateness of the model by computing the residuals and examining the residual plots.
But what is a residual?
A residual is the difference between the observed value of the dependent variable (y) and the predicted value (ŷ). Each data point has one residual.
Residual = Observed value – Predicted value
e = y – ŷ
Note: For a least-squares linear regression that includes an intercept, the sum and the mean of the residuals are equal to zero.
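To make the definition concrete, here is a minimal Python sketch that fits a straight line by ordinary least squares and computes e = y – ŷ for each point. The data values and the helper names (fit_line, residuals) are made up for illustration and are not taken from the example in this post.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(xs, ys):
    """Return e = y - y_hat (observed minus predicted) for each data point."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up illustrative values (assumption), not data from this post.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

es = residuals(xs, ys)
print(es)       # one residual per data point
print(sum(es))  # zero up to floating-point rounding, because the fit has an intercept
```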
And what is a residual plot?
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is likely to be more appropriate.
Residual plots typically show one of two patterns: a random pattern, indicating a good fit for a linear model, or a non-random pattern (U-shaped or inverted U), suggesting a better fit for a non-linear model.
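The difference between the two patterns can also be seen numerically. The sketch below is only a rough illustration under stated assumptions: it fits a straight line to two made-up datasets (one curved, one roughly linear) and counts how often the residuals change sign when ordered by x. Counting sign runs is a simple heuristic chosen here for illustration, not a method used in this post; a U-shaped pattern gives a few long runs, while scattered residuals change sign often.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(xs, ys):
    """Return e = y - y_hat for each data point."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def count_sign_runs(values):
    """Count runs of consecutive same-signed values (exact zeros are skipped)."""
    signs = [v > 0 for v in values if v != 0]
    if not signs:
        return 0
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)


xs = [float(x) for x in range(11)]

# Hypothetical curved data: a straight line is the wrong model for y = x^2.
curved = [x ** 2 for x in xs]

# Hypothetical linear data with small alternating noise around y = 2x + 1.
noisy_line = [2 * x + 1 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

curved_runs = count_sign_runs(residuals(xs, curved))
linear_runs = count_sign_runs(residuals(xs, noisy_line))

# Few runs: residuals are positive, then negative, then positive (U shape).
print("curved data:", curved_runs)
# Many runs: residuals keep switching sign around zero.
print("linear data:", linear_runs)
```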
How do we create a residual plot in Tableau?
1 – On the sheet where you have visualised your scatterplot, go to the Worksheet menu and select the option to export data
2 – In the dialog box, select ‘Connect after Export’ and click OK (the data will be saved as an Access file and will open automatically in Tableau)
3 – On a new sheet, drag the newly created field ‘residuals’ to Rows and your independent variable to Columns (x axis), in this example ‘wind speed’
- We can now see that the residuals aren’t randomly distributed; they follow a pattern, a ‘sigmoid distribution’. This tells us that, contrary to what we believed (based on the p-value and R-squared), the Linear Regression model isn’t the best fit for our data. We should now look at other types of models, namely non-linear models.
Example of a residual plot showing that a linear regression model is a good fit for the data
Random Distribution around zero
Note: when assessing the p-value and R-squared, don’t forget to check the number of observations and the degrees of freedom, as these may point to an artificially high R-squared value.
A high number of observations combined with low degrees of freedom suggests that the high R-squared value may be due to something other than a genuinely good fit.
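As a rough illustration of why degrees of freedom matter when reading R-squared, the sketch below takes the extreme case: a straight line (two parameters) fitted to only two points has zero residual degrees of freedom and always yields R² = 1, whatever the data. The values and helper names are made up for illustration, and residual degrees of freedom are assumed here to be the number of observations minus the number of fitted parameters.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys):
    """R^2 = 1 - SS_residual / SS_total for a straight-line fit."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def residual_dof(n_obs, n_params=2):
    """Residual degrees of freedom: observations minus fitted parameters (assumed definition)."""
    return n_obs - n_params


# Made-up values (assumption): the same noisy relationship, sampled twice.
xs_two, ys_two = [1.0, 2.0], [3.0, 7.0]
xs_many, ys_many = [1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 7.0, 4.0, 9.0, 6.0]

r2_two = r_squared(xs_two, ys_two)
r2_many = r_squared(xs_many, ys_many)

# Two points: the line passes through both, so the fit looks perfect.
print(residual_dof(len(xs_two)), r2_two)
# Five points: more residual degrees of freedom, and a much lower R-squared.
print(residual_dof(len(xs_many)), r2_many)
```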