## Tuesday, October 1, 2013

### More on the Distribution of R-Squared

Some time ago, I had a post that discussed the fact that the usual coefficient of determination ($R^2$) for a linear regression model is a sample statistic, and as such it has its own sampling distribution. Some of the characteristics of that sampling distribution were discussed in that earlier post.

You probably know already that we can manipulate the formula for calculating $R^2$ to show that it can be expressed as a simple function of the usual F-statistic that we use to test if all of the slope coefficients in the regression model are zero. This being the case, there are some interesting things that we can say about the behaviour of $R^2$, as a random variable, when the null hypothesis associated with that F-test is in fact true.

Let's explore this a little.

Consider the usual standard linear multiple regression model,

$$y = X\beta + \varepsilon ; \qquad \varepsilon \sim N[0 , \sigma^2 I_n] \tag{1}$$

where X is non-random and of full rank, k, and includes an intercept variable as its first column.

Consider the null hypothesis, $H_0: \beta_2 = \beta_3 = \dots = \beta_k = 0$ vs. $H_A$: Not $H_0$.

We'll be interested in two statistics:
1. $R^2 = 1 - \sum_i e_i^2 \, / \, \sum_i (y_i - y^*)^2$
2. $F = \dfrac{\left(\sum_i u_i^2 - \sum_i e_i^2\right) / (k - 1)}{\sum_i e_i^2 / (n - k)}$,

where $e_i$ is the $i$th residual when (1) is estimated by OLS; $y^*$ is the sample average of the $y_i$ values; and $u_i$ is the $i$th residual when (1) is estimated subject to the restrictions implied by $H_0$. That is, $u_i = (y_i - y^*)$; $i = 1, 2, \dots, n$.
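As a concrete illustration, both statistics can be computed directly from the two sets of residuals. The sketch below assumes the simplest case, k = 2 (an intercept and one regressor), and uses simulated data; the function names `ols_simple` and `r2_and_f` and the parameter values are hypothetical choices made only for this example.

```python
import random

def ols_simple(x, y):
    """OLS fit of y on an intercept and one regressor; returns the residuals e_i."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [yi - intercept - slope * xi for xi, yi in zip(x, y)]

def r2_and_f(x, y):
    """R-squared and the F-statistic for H0: slope = 0, with k = 2."""
    n, k = len(y), 2
    y_bar = sum(y) / n
    e = ols_simple(x, y)              # unrestricted residuals
    u = [yi - y_bar for yi in y]      # restricted residuals, u_i = y_i - y*
    sse = sum(ei ** 2 for ei in e)
    ssu = sum(ui ** 2 for ui in u)
    r2 = 1 - sse / ssu
    f = ((ssu - sse) / (k - 1)) / (sse / (n - k))
    return r2, f

# Simulated data (assumed values, for illustration only).
rng = random.Random(0)
n = 30
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

r2, f = r2_and_f(x, y)
print("R-squared:", r2)
print("F:", f)
```

Extending this to k > 2 would only change how the unrestricted residuals are obtained; the two formulas themselves stay the same.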

With some simple manipulations, we can show that

$$F = \frac{R^2 / (k - 1)}{(1 - R^2) / (n - k)} \tag{2}$$
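The manipulations are short enough to write out. A sketch, using only the definitions above: because the restricted residuals are the deviations of y from its mean, the restricted sum of squares is the total sum of squares, and it cancels from the numerator and denominator of F.

```latex
\begin{align*}
\sum_i e_i^2 &= (1 - R^2) \sum_i u_i^2 , \qquad
\sum_i u_i^2 - \sum_i e_i^2 = R^2 \sum_i u_i^2 , \\
F &= \frac{R^2 \sum_i u_i^2 \, / \, (k - 1)}{(1 - R^2) \sum_i u_i^2 \, / \, (n - k)}
   = \frac{R^2 / (k - 1)}{(1 - R^2) / (n - k)} .
\end{align*}
```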

Now, let's suppose that the null hypothesis being tested using the F-statistic is in fact true. That's to say, there's actually no linear relationship between y and the set of (non-constant) regressors.

In that case, the statistic, F, follows Snedecor's F-distribution with (k - 1) and (n - k) degrees of freedom. Very importantly, note that if $H_0$ is false, then F follows a non-central F distribution. The non-centrality parameter associated with the latter distribution depends on the values of $\beta_2, \beta_3, \dots, \beta_k$, and $\sigma^2$; and the values of the X data. Essentially, it's the value of this parameter that determines the power of the F-test.
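For reference, one common way of writing this non-centrality parameter is sketched below. The notation is an assumption introduced only for this illustration: $X_*$ denotes the n × (k - 1) matrix of non-constant regressors, $\beta_* = (\beta_2, \dots, \beta_k)'$, and $\iota$ is a column of ones, so that M converts a variable into deviations from its sample mean. Some texts include an extra factor of 1/2 in the definition.

```latex
\lambda = \frac{\beta_*' X_*' M X_* \beta_*}{\sigma^2} ,
\qquad M = I_n - \frac{1}{n} \iota \iota' .
```

With X of full rank, this quantity is zero exactly when $\beta_* = 0$, which is the central case.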

So, what can we say about the distribution of $R^2$ when $H_0$ is true? Returning to the relationship, (2), we can easily establish that

$$R^2 = \frac{(k - 1)F}{(n - k) + (k - 1)F} .$$

Then, we can exploit a well-known connection between the (central) F-distribution and the Beta distribution, with support (0 , 1). Specifically, if F follows an F distribution with $v_1$ and $v_2$ degrees of freedom, then the random variable $[v_1 F] / [v_2 + v_1 F]$ follows a Beta distribution, with shape parameters $(v_1 / 2)$ and $(v_2 / 2)$.
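A brief sketch of why this connection holds, using the standard representation of an F variate as a ratio of independent chi-square variates (U and V are symbols introduced here for the illustration):

```latex
\begin{align*}
F &= \frac{U / v_1}{V / v_2} , \qquad
U \sim \chi^2_{v_1} , \;\; V \sim \chi^2_{v_2} , \;\; U \text{ and } V \text{ independent} , \\
\frac{v_1 F}{v_2 + v_1 F} &= \frac{v_2 \, U / V}{v_2 + v_2 \, U / V}
 = \frac{U}{U + V} \sim \text{Beta}\!\left( \tfrac{v_1}{2} , \tfrac{v_2}{2} \right) .
\end{align*}
```

The last step uses the fact that the ratio U / (U + V), for independent chi-square variates, is Beta-distributed with shape parameters equal to half the respective degrees of freedom.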

This gives us the distribution for $R^2$ when $H_0$ is true - that is, when in essence the "population $R^2$" is actually zero. In this case, $R^2$ is Beta-distributed with shape parameters [(k - 1) / 2] and [(n - k) / 2].
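A small Monte Carlo experiment is one way to see this result at work. The sketch below is illustrative only: it assumes k = 2 (so that $R^2$ is just the squared sample correlation), n = 20, standard normal errors, and an arbitrary seed, and it compares simulated values of $R^2$ under $H_0$ with direct draws from the Beta distribution with the shape parameters given above. The function name `r2_simple` is hypothetical.

```python
import random
import statistics

def r2_simple(x, y):
    """R-squared from an OLS fit of y on an intercept and x (k = 2).

    In this case R-squared equals the squared sample correlation of x and y.
    """
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

n, k, reps = 20, 2, 20000                 # assumed values, for illustration
rng = random.Random(12345)
x = [rng.gauss(0, 1) for _ in range(n)]   # regressor held fixed across replications

# H0 is true here: y is pure noise, unrelated to x.
r2_draws = [r2_simple(x, [rng.gauss(0, 1) for _ in range(n)])
            for _ in range(reps)]

# Direct draws from Beta((k - 1)/2, (n - k)/2) for comparison.
a, b = (k - 1) / 2, (n - k) / 2
beta_draws = [rng.betavariate(a, b) for _ in range(reps)]

print("mean:  ", statistics.mean(r2_draws), statistics.mean(beta_draws))
print("median:", statistics.median(r2_draws), statistics.median(beta_draws))
```

If the result holds, the two columns of printed values should agree up to simulation noise.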

Now, if we have a Beta-distributed random variable with shape parameters a and b (say), the mean of the random variable is $a / (a + b)$, and its variance is $ab / [(a + b)^2 (a + b + 1)]$. So, we see right away that:

$$E[R^2] = \frac{k - 1}{n - 1} \qquad \text{and} \qquad \text{Var}[R^2] = \frac{2(k - 1)(n - k)}{(n + 1)(n - 1)^2} .$$
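The substitution of a = (k - 1)/2 and b = (n - k)/2 into the Beta moments can be checked with exact rational arithmetic. This is a minimal sketch; the function names and the particular (n, k) pairs are arbitrary choices for illustration.

```python
from fractions import Fraction

def beta_moments(a, b):
    """Mean and variance of a Beta(a, b) random variable."""
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, var

def r2_moments_under_h0(n, k):
    """Mean and variance of R-squared when H0 is true, via the Beta result."""
    a = Fraction(k - 1, 2)
    b = Fraction(n - k, 2)
    return beta_moments(a, b)

# Compare with the closed-form expressions for a few (n, k) pairs.
for n, k in [(20, 2), (50, 4), (200, 4)]:
    mean, var = r2_moments_under_h0(n, k)
    assert mean == Fraction(k - 1, n - 1)
    assert var == Fraction(2 * (k - 1) * (n - k), (n + 1) * (n - 1) ** 2)
    print(n, k, mean, var)
```

Using `Fraction` rather than floating point means the comparison is exact, so any algebra slip in the closed forms would trigger an assertion error.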

Among other things, notice that the (sample) $R^2$ is an upward-biased estimator of the population $R^2$ (zero) when $H_0$ is true. However, both the mean and variance of $R^2$ approach zero as $n \to \infty$, so the latter statistic is both mean-square and weakly consistent for the population $R^2$ in this case.

You might ask yourself, what emerges if we go through a similar analysis using the "adjusted" coefficient of determination? Is the "adjusted $R^2$" more or less biased than $R^2$ itself, when there is actually no linear relationship between y and the columns of X?
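One way to start on this question is sketched below. It assumes the usual definition of the adjusted coefficient of determination (written here as $\bar{R}^2$, a symbol not used elsewhere in this post), and it relies only on the mean of $R^2$ under $H_0$ obtained above.

```latex
\begin{align*}
\bar{R}^2 &= 1 - \frac{n - 1}{n - k} \, (1 - R^2) , \\
E[\bar{R}^2] &= 1 - \frac{n - 1}{n - k} \left( 1 - \frac{k - 1}{n - 1} \right)
 = 1 - \frac{n - 1}{n - k} \cdot \frac{n - k}{n - 1} = 0 .
\end{align*}
```

Under these assumptions, the adjusted statistic has mean zero when $H_0$ is true, in contrast to the upward bias of $R^2$ itself.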

1. Very interesting. I think the result doesn't apply to the case of univariate regression or simple scatter plot. Otherwise, it suggests var(R-squared) equals zero!

1. No - you have misunderstood. In that case, k=2 (see the definition of k), and var. (R^2)=2(n-2)/[(n+1)(n-1)^2] > 0.

2. I see. Thanks.

2. What would be the estimate of the variance of R-sq-adj?

1. If the null hypothesis is true, then the variance of the adjusted R-squared is 2(k - 1) / [(n + 1)(n - k)], and this is greater than or equal to the variance of the R-squared itself in this case. I'll prepare a short post on this.

3. I am not sure about the derivation of the variance: shouldn't it be (2(k-1)(n-k)) / ((n-1)^2(n+1))? Thank you.

1. Thank you - yes it should be, and I have now corrected it.

4. Nice. Are there any results for the distribution of $R^2$ for nonzero $\beta$?

1. In that case the F statistic will be non-central F in distribution. R-squared should then be able to be written in terms of a non-central Beta statistic. The mean and variance could then easily be established.