Econometrics Beat: Dave Giles' Blog: More on the Distribution of R-Squared

Tuesday, October 1, 2013

More on the Distribution of R-Squared

Some time ago, I had a post that discussed the fact that the usual coefficient of determination (R²) for a linear regression model is a sample statistic, and as such it has its own sampling distribution. Some of the characteristics of that sampling distribution were discussed in that earlier post.

You probably know already that we can manipulate the formula for calculating R² to show that it can be expressed as a simple function of the usual F-statistic that we use to test if all of the slope coefficients in the regression model are zero. This being the case, there are some interesting things that we can say about the behaviour of R², as a random variable, when the null hypothesis associated with that F-test is in fact true.

Let's explore this a little.

Consider the usual standard linear multiple regression model,

y = Xβ + ε ; ε ~ N[0 , σ²I_n] (1)

where X is non-random and of full rank, k, and includes an intercept variable as its first column.

Consider the null hypothesis, H₀: β₂ = β₃ = .... = β_k = 0 vs. H_A: Not H₀.

We'll be interested in two statistics:

R² = 1 - Σ [e_i²] / Σ [(y_i - y*)²]
F = [( Σ (u_i²) - Σ (e_i²)) / (k - 1)] / [ Σ (e_i²) / (n - k)] ,

where e_i is the i^th residual when (1) is estimated by OLS; y* is the sample average of the y_i values; and u_i is the i^th residual when (1) is estimated subject to the restrictions implied by H₀. That is, u_i = (y_i - y*); i = 1, 2, ..., n.

With some simple manipulations, we can show that

F = [R² / (k - 1)] / [(1 - R²) / (n - k)] (2)
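As a quick numerical check of (2), here is a minimal sketch that computes F once from the residual sums of squares and once from R². It assumes the simplest case, k = 2 (an intercept and one regressor), and uses made-up simulated data; the helper names `ols_simple` and `r2_and_f` are hypothetical, chosen just for this illustration.

```python
import random

def ols_simple(x, y):
    """Fit y = b1 + b2*x by OLS and return the residuals e_i."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    return [yi - b1 - b2 * xi for xi, yi in zip(x, y)]

def r2_and_f(x, y, k=2):
    """Return R-squared, F from its definition, and F computed via (2)."""
    n = len(y)
    y_bar = sum(y) / n
    e = ols_simple(x, y)
    sse = sum(ei ** 2 for ei in e)             # sum of e_i^2 (unrestricted)
    ssu = sum((yi - y_bar) ** 2 for yi in y)   # sum of u_i^2, u_i = y_i - y*
    r2 = 1 - sse / ssu
    f_def = ((ssu - sse) / (k - 1)) / (sse / (n - k))
    f_from_r2 = (r2 / (k - 1)) / ((1 - r2) / (n - k))
    return r2, f_def, f_from_r2

# Illustrative data only: n = 25, with an arbitrary slope of 0.5.
rng = random.Random(123)
x = [rng.gauss(0, 1) for _ in range(25)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

r2, f_def, f_from_r2 = r2_and_f(x, y)
print("R-squared:", r2)
print("F (definition):", f_def)
print("F (from R-squared):", f_from_r2)
```

The two F values should agree up to floating-point rounding, whatever data are used.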

Now, let's suppose that the null hypothesis being tested using the F-statistic is in fact true. That's to say, there's actually no linear relationship between y and the set of (non-constant) regressors.

In that case, the statistic, F, follows Snedecor's F-distribution with (k - 1) and (n - k) degrees of freedom. Very importantly, note that if H₀ is false, then F follows a non-central F distribution. The non-centrality parameter associated with the latter distribution depends on the values of β₂, β₃, ...., β_k, and σ²; and the values of the X data. Essentially, it's the value of this parameter that determines the power of the F-test.

So, what can we say about the distribution of R² when H₀ is true? Returning to the relationship, (2), we can easily establish that

R² = [(k - 1)F] / [(n - k) + (k - 1)F] .

Then, we can exploit a well-known connection between the (central) F-distribution and the Beta distribution, with support (0 , 1). Specifically, if F follows an F distribution with v₁ and v₂ degrees of freedom, then the random variable [v₁F] / [v₂ + v₁F] follows a Beta distribution, with shape parameters (v₁ / 2) and (v₂ / 2).
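For readers who want to see where this connection comes from, here is a short sketch of the standard argument. The symbols X₁ and X₂ are introduced here only for illustration: they denote the independent chi-square variables whose scaled ratio defines a central F variate.

```latex
% Assumption: F is built from independent chi-square variables,
%   X_1 ~ chi^2(v_1),  X_2 ~ chi^2(v_2),  F = (X_1 / v_1) / (X_2 / v_2).
\begin{align*}
\frac{v_1 F}{v_2 + v_1 F}
  &= \frac{v_2 \, X_1 / X_2}{v_2 + v_2 \, X_1 / X_2}
   = \frac{X_1}{X_1 + X_2} .
\end{align*}
% A chi^2(v) variable is Gamma-distributed with shape v/2 and scale 2, and for
% independent Gamma variables with a common scale the ratio X_1 / (X_1 + X_2)
% is Beta-distributed with shape parameters v_1/2 and v_2/2.
```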

This gives us the distribution for R² when H₀ is true - that is, when in essence the "population R²" is actually zero. In this case, R² is Beta-distributed with shape parameters [(k - 1) / 2] and [(n - k) / 2].

Now, if we have a Beta-distributed random variable with shape parameters a and b (say), the mean of the random variable is [a / (a + b)], and its variance is (ab) / [(a + b)² (a + b + 1)]. So, setting a = (k - 1) / 2 and b = (n - k) / 2, we see right away that:

E[R²] = [(k - 1) / (n - 1)] and Var.[R²] = [2(k - 1)(n - k)] / [(n + 1) (n - 1)²] .
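These two moments are easy to check. The sketch below first evaluates the Beta mean and variance exactly with a = (k - 1)/2 and b = (n - k)/2, and then compares them with a small Monte Carlo experiment under H₀. It assumes k = 2 (so that R² is just the squared sample correlation), n = 10, a fixed simulated regressor, and standard normal errors; the function names and the seed are hypothetical choices for this illustration.

```python
import random
from fractions import Fraction

def r2_null_moments(n, k):
    """Exact mean and variance of R-squared under H0, from the Beta moments."""
    a = Fraction(k - 1, 2)
    b = Fraction(n - k, 2)
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, var

def simulate_r2(n, reps, seed=2013):
    """Draw R-squared under H0 for k = 2: y is pure noise, unrelated to x."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]   # X held fixed across replications
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    draws = []
    for _ in range(reps):
        y = [rng.gauss(0, 1) for _ in range(n)]
        y_bar = sum(y) / n
        syy = sum((yi - y_bar) ** 2 for yi in y)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        # With an intercept and one regressor, R-squared = squared correlation.
        draws.append(sxy ** 2 / (sxx * syy))
    return draws

n, k = 10, 2
mean, var = r2_null_moments(n, k)   # exact values: 1/9 and 16/891
draws = simulate_r2(n, reps=20000)
sim_mean = sum(draws) / len(draws)
sim_var = sum((d - sim_mean) ** 2 for d in draws) / len(draws)

print("theory:    ", float(mean), float(var))
print("simulation:", sim_mean, sim_var)
```

The simulated mean and variance should land close to the theoretical values, with the gap shrinking as the number of replications grows.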

Among other things, notice that the (sample) R² is an upward-biased estimator of the population R² (zero) when H₀ is true. However, both the mean and variance of R² approach zero as n → ∞. Its mean squared error (squared bias plus variance) therefore also goes to zero, so R² is mean-square consistent, and hence weakly consistent, for the population R² in this case.

You might ask yourself, what emerges if we go through a similar analysis using the "adjusted" coefficient of determination? Is the "adjusted R²" more or less biased than R² itself, when there is actually no linear relationship between y and the columns of X?
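One way to start on that question is sketched below. It assumes the usual textbook definition of the adjusted coefficient of determination, which this post does not spell out, and simply carries the moments of R² under H₀ through that definition.

```latex
% Assumption: the usual definition of the adjusted R-squared,
%   \bar{R}^2 = 1 - (1 - R^2) (n - 1) / (n - k).
\begin{align*}
E[\bar{R}^2]
  &= 1 - \frac{n - 1}{n - k}\,\bigl(1 - E[R^2]\bigr)
   = 1 - \frac{n - 1}{n - k}\cdot\frac{n - k}{n - 1}
   = 0 , \\[4pt]
\mathrm{Var}[\bar{R}^2]
  &= \left(\frac{n - 1}{n - k}\right)^{2} \mathrm{Var}[R^2]
   = \frac{2(k - 1)}{(n - k)(n + 1)} .
\end{align*}
% Under H0, and with this definition, the adjusted R-squared is unbiased for
% the population value of zero, at the cost of a larger variance than R^2.
```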

10 comments:

Anonymous, October 2, 2013 at 6:45 AM
Very interesting. I think the result doesn't apply to the case of univariate regression or simple scatter plot. Otherwise, it suggests var(R-squared) equals zero!
KM Station, May 7, 2014 at 10:23 AM
What would be the estimate of the variance of R-sq-adj?
Anonymous, July 1, 2015 at 8:59 PM
I am not sure about the derivation of the variance: shouldn't it be (2(k-1)(n-k) ) / ( (n-1)^2(n+1) ) . Thank you
Unknown, January 6, 2016 at 3:40 AM
Nice. Are there any results for the distribution of $R^2$ for nonzero $\beta$?