What does an r2 value of 0.9 mean?

In statistical analysis, the coefficient of determination (more commonly known as R2) is a measure of how well variation in one variable explains variation in another: for instance, how well the variation in hours of darkness explains the variation in the electricity consumption of yard lighting.

R2 varies between zero, meaning the driving factor explains none of the variation, and 1.0, which would signify a perfect relationship with no error. It is commonly held that higher R2 is better, and you will often see a value of (say) 0.9 stated as the threshold below which you cannot trust the relationship. But that is nonsense, and one reason can be seen from the diagrams below, which show how, for two different objects, energy consumption on the vertical (y) axis might relate to a particular driving factor, or independent variable, on the horizontal (x) axis.

[Figure: two scatter charts of energy consumption against a driving factor, constructed to have identical scatter but very different R2 values]

In both cases, the relationship between consumption and its driving factor is imperfect. But the data were arranged to have exactly the same degree of dispersion. This is shown by the CV(RMSE) value, which is the root mean square deviation expressed as a percentage of the average consumption. R2 is 0.96 (so-called “good”) in one case but only 0.10 (“bad”) in the other. But why would we regard the right-hand model as worse than the left? If we were to use either model to predict expected consumption, the absolute error in the estimates would be the same.
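The effect is easy to reproduce. In the pure-Python sketch below (the numbers are illustrative, not the data behind the diagrams), two datasets share exactly the same scatter about their best-fit lines, so the absolute prediction error (RMSE) is identical; yet one has a near-perfect R2 and the other a “bad” one, purely because the underlying trend is steeper in the first:

```python
import statistics

def ols_r2_rmse(xs, ys):
    """Fit y = a + b*x by ordinary least squares; return (R^2, RMSE)."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    ss_res = ss_tot - sxy ** 2 / sxx   # shortcut valid for simple (one-x) OLS
    return 1 - ss_res / ss_tot, (ss_res / n) ** 0.5

xs = [0, 1, 2, 3, 4, 5]
noise = [1.0, -1.0, 0.5, -0.5, 1.5, -1.5]   # identical scatter in both datasets

steep = [100.0 + 10.0 * x + e for x, e in zip(xs, noise)]   # strong trend
flat = [123.75 + 0.5 * x + e for x, e in zip(xs, noise)]    # weak trend, same mean

r2_steep, rmse_steep = ols_r2_rmse(xs, steep)
r2_flat, rmse_flat = ols_r2_rmse(xs, flat)
# rmse_steep == rmse_flat, but r2_steep is near 1 while r2_flat is low.
```

Because both series have the same mean, the CV(RMSE) is the same too: neither model predicts any better than the other, whatever R2 says.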

By the way, if anyone ever asks how to get R2 = 1.0 the answer is simple: use only two data points. By definition, the two points will lie exactly on the best-fit line through them!
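A minimal check, assuming nothing beyond the definition of R2:

```python
# With only two points, the fitted line passes through both exactly,
# so the residual sum of squares is zero and R^2 = 1 by construction.
xs, ys = [3.0, 7.0], [12.0, 40.0]

slope = (ys[1] - ys[0]) / (xs[1] - xs[0])
intercept = ys[0] - slope * xs[0]

ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - sum(ys) / 2) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot
print(r2)  # 1.0
```

Any two distinct points would give the same result; the model has as many parameters as data points, so there is nothing left to miss.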

Another common misconception is that a low value of R2 in the case of heating fuel signifies poor control of the building. This is not a safe assumption. Try this thought experiment. Suppose that a building’s fuel consumption is being monitored against locally-measured degree days. You can expect a linear relationship with a certain R2 value. Now suppose that the local weather monitoring fails and you switch to using published degree-day figures from a meteorological station 35 km away. The error in the driving factor data caused by using remote weather observations will reduce R2 because the estimates of expected consumption are less accurate; more of the apparent variation in consumption will be attributable to error and less to the measured degree days. Does the reduced R2 signify worse control? No; the building’s performance hasn’t changed.
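The thought experiment can be simulated. In this pure-Python sketch the degree-day figures, base load, slope and error sizes are all made-up illustrative assumptions; the point is only that adding measurement error to the driving factor lowers R2 while the building itself is unchanged:

```python
import random

def r2(xs, ys):
    """R^2 of an ordinary least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - (ss_tot - sxy ** 2 / sxx) / ss_tot

rng = random.Random(1)

# Hypothetical monthly degree days and the fuel use they drive.
local_dd = [rng.uniform(50, 400) for _ in range(36)]
fuel = [500.0 + 2.0 * d + rng.gauss(0, 60) for d in local_dd]

# Degree days from a distant station: the true figures plus measurement error.
remote_dd = [d + rng.gauss(0, 80) for d in local_dd]

r2_local = r2(local_dd, fuel)
r2_remote = r2(remote_dd, fuel)
# r2_remote comes out lower, although fuel consumption is identical in both fits.
```

The fuel data are the same in both regressions; only the quality of the driving-factor data differs, and that alone is enough to depress R2.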

Footnote: for a deeper, informative and highly readable treatment of this subject, see the excellent paper by Mark Stetz.

R-squared is a measure of how well a linear regression model “fits” a dataset. Also commonly called the coefficient of determination, R-squared is the proportion of the variance in the response variable that can be explained by the predictor variable.

The value for R-squared can range from 0 to 1. A value of 0 indicates that the response variable cannot be explained by the predictor variable at all. A value of 1 indicates that the response variable can be perfectly explained without error by the predictor variable.

In practice, you will likely never see a value of 0 or 1 for R-squared. Instead, you’ll likely encounter some value between 0 and 1.

For example, suppose you have a dataset that contains the population size and number of flower shops in 30 different cities. You fit a simple linear regression model to the dataset, using population size as the predictor variable and flower shops as the response variable. In the output of the regression results, you see that R2  = 0.2. This indicates that 20% of the variance in the number of flower shops can be explained by the population size.
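The arithmetic behind that reading, with illustrative sums of squares (the figures below are invented to make R2 come out at 0.2, not taken from any real dataset):

```python
# R^2 is the explained share of the total variation:
# R^2 = SS_explained / SS_total = 1 - SS_residual / SS_total.
ss_total = 250.0      # total variation in shop counts (illustrative figure)
ss_residual = 200.0   # variation the fitted line leaves unexplained
r_squared = 1 - ss_residual / ss_total   # ~0.2, i.e. 20% of variance explained
```

The remaining 80% of the variation in shop counts is due to factors the model doesn’t capture, plus noise.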

This leads to an important question: is this a “good” value for R-squared?

The answer to this question depends on your objective for the regression model. Namely:

1. Are you interested in explaining the relationship between the predictor(s) and the response variable?

OR

2. Are you interested in predicting the response variable?

Depending on the objective, the answer to “What is a good value for R-squared?” will be different.

Explaining the Relationship Between the Predictor(s) and the Response Variable

If your main objective for your regression model is to explain the relationship between the predictor(s) and the response variable, the R-squared is mostly irrelevant.

For example, suppose that in the regression example above you see that the coefficient for the predictor population size is 0.005 and that it is statistically significant. This means that an increase of one in population size is associated with an average increase of 0.005 in the number of flower shops in a particular city. In other words, population size is a statistically significant predictor of the number of flower shops in a city.

Whether the R-squared value for this regression model is 0.2 or 0.9 doesn’t change this interpretation. Since you are simply interested in the relationship between population size and the number of flower shops, you don’t have to be overly concerned with the R-squared value of the model.

Predicting the Response Variable

If your main objective is to predict the value of the response variable accurately using the predictor variable, then R-squared is important.

In general, the larger the R-squared value, the more precisely the predictor variables are able to predict the value of the response variable.

How high an R-squared value needs to be depends on how precise you need to be. For example, in scientific studies, the R-squared may need to be above 0.95 for a regression model to be considered reliable. In other domains, an R-squared of just 0.3 may be sufficient if there is extreme variability in the dataset.

To find out what is considered a “good” R-squared value, you will need to explore what R-squared values are generally accepted in your particular field of study. If you’re performing a regression analysis for a client or a company, you may be able to ask them what is considered an acceptable R-squared value.

Prediction Intervals

A prediction interval specifies a range in which a new observation is likely to fall, at a given confidence level, based on the values of the predictor variables. Narrower prediction intervals indicate that the predictor variables can predict the response variable with more precision.

Often a prediction interval can be more useful than an R-squared value because it gives you a concrete range of values in which a new observation is expected to fall. This is particularly useful when the primary objective of your regression is to predict new values of the response variable.

For example, suppose a population size of 40,000 produces a prediction interval of 30 to 35 flower shops in a particular city. This may or may not be considered an acceptable range of values, depending on what the regression model is being used for.
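A sketch of how such an interval is computed for simple linear regression, using made-up population and shop-count data (not the figures from the example above); the 95% t critical value for 8 degrees of freedom (2.306) is hard-coded because the standard library has no t-distribution quantile function:

```python
import math

# Illustrative data: populations in thousands and flower-shop counts.
xs = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
ys = [8, 11, 16, 18, 24, 25, 31, 33, 38, 41]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
a = my - b * mx

ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(ss_res / (n - 2))          # residual standard error

x0 = 40                                  # predict for a population of 40,000
y_hat = a + b * x0
# 95% prediction interval for a NEW observation at x0:
half_width = 2.306 * s * math.sqrt(1 + 1 / n + (x0 - mx) ** 2 / sxx)
lower, upper = y_hat - half_width, y_hat + half_width
```

Note the leading 1 under the square root: it accounts for the scatter of individual observations, which is why a prediction interval is always wider than the confidence interval for the fitted mean at the same point.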

Conclusion

In general, the larger the R-squared value, the more precisely the predictor variables are able to predict the value of the response variable.

How high an R-squared value needs to be to be considered “good” varies based on the field. Some fields require higher precision than others. 

To find out what is considered a “good” R-squared value, consider what is generally accepted in the field you’re working in, ask someone with specific subject area knowledge, or ask the client/company you’re performing the regression analysis for what they consider to be acceptable.

If you’re interested in explaining the relationship between the predictor and response variable, the R-squared is largely irrelevant since it doesn’t impact the interpretation of the regression model.

If you’re interested in predicting the response variable, prediction intervals are generally more useful than R-squared values.

Further Reading:

Pearson Correlation Coefficient
Introduction to Simple Linear Regression

What does the R2 value tell you?

R-Squared (R² or the coefficient of determination) is a statistical measure in a regression model that determines the proportion of variance in the dependent variable that can be explained by the independent variable. In other words, r-squared shows how well the data fit the regression model (the goodness of fit).

Is a R2 value of 1 GOOD?

An R2=1 indicates perfect fit. That is, you've explained all of the variance that there is to explain. In ordinary least squares (OLS) regression (the most typical type), your coefficients are already optimized to maximize the degree of model fit (R2) for your variables and all linear transforms of your variables.

What does an R2 of 0.8 mean?

R-squared or R2 explains the degree to which your input variables explain the variation of your output / predicted variable. So, if R-square is 0.8, it means 80% of the variation in the output variable is explained by the input variables.

What does an R2 value greater than 1 mean?

Bottom line: R2 can be greater than 1.0 only when an invalid (or nonstandard) equation is used to compute R2 and when the chosen model (with constraints, if any) fits the data really poorly, worse than the fit of a horizontal line.