Monday, June 29, 2015

The Econometrics of Temporal Aggregation - VI - Tests of Linear Restrictions

This post is one of several related posts. The previous ones can be found here, here, here, here and here. These posts are based on Giles (2014).

Many of the statistical tests that we perform routinely in econometrics can be affected by the level of aggregation of the data. Here, let's focus on time-series data and temporal aggregation. I'm going to show you some preliminary results from work that I have in progress with Ryan Godwin. These results relate to one particular test, but the work covers a variety of testing problems.

I'm not supplying the EViews program code that was used to obtain the results below - at least, not for the moment. That's because the results that I'm reporting are based on work in progress. Sorry!

As in the earlier posts, let's suppose that the aggregation is over "m" high-frequency periods. A lower case symbol will represent a high-frequency observation on a variable of interest; and an upper-case symbol will denote the aggregated series.

               Y_t = y_t + y_{t-1} + ... + y_{t-m+1} .

If we're aggregating monthly (flow) data to quarterly data, then m = 3. In the case of aggregation from quarterly to annual data, m = 4, etc.
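
A minimal sketch of this definition in Python (an illustration only, not the EViews program behind the results). The function name `aggregate` and its `step` argument are hypothetical. With `step=1` the formula is applied at every t, which gives an overlapping moving sum; with `step=m` only every m-th sum is kept, which corresponds to turning monthly flows into quarterly totals. The post does not say which of these the simulations use.

```python
def aggregate(series, m, step=1):
    """Return sums of m consecutive observations.

    step=1 gives an overlapping moving sum (one value for every t);
    step=m gives non-overlapping totals (e.g. monthly -> quarterly).
    Both the name and the step argument are illustrative assumptions.
    """
    sums = [sum(series[t - m + 1:t + 1]) for t in range(m - 1, len(series))]
    return sums[::step]


monthly = [1, 2, 3, 4, 5, 6, 7]
print(aggregate(monthly, 3))          # [6, 9, 12, 15, 18]
print(aggregate(monthly, 3, step=3))  # [6, 15]
```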

Now, let's investigate how such aggregation affects the performance of standard tests of linear restrictions on the coefficients of an OLS regression model. The simplest example would be a t-test of the hypothesis that one of the coefficients is zero. Another example would be the F-test of the hypothesis that all of the "slope" coefficients in such a regression model are zero.

Consider the following simple Monte Carlo experiment, based on 20,000 replications.
The data-generating process (DGP) is of the form:

              yt = β0 + β1 xt + ut    ;    ut ~ N[0 , 1]     ;   where    xt = 0.1t + N[0 , 1]  .

However, the model that is actually estimated uses aggregated data:

             Yt = β0 + β1Xt + vt  .

We've looked at two aggregation levels - m = 3, and m = 12 - but the results discussed below are mainly for m = 3.
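
To make the setup concrete, here is a rough Python sketch of one draw from the DGP, followed by OLS on the aggregated series. It is not the program used for the results reported here. The helper names, the seed, the sample size and the value β1 = 2 are all arbitrary choices for illustration, and the aggregation is assumed to be the overlapping moving sum obtained by applying the formula for Yt at every t.

```python
import random


def simulate_dgp(n, beta0, beta1, rng):
    """Draw n high-frequency observations from the DGP in the post."""
    x = [0.1 * t + rng.gauss(0.0, 1.0) for t in range(1, n + 1)]
    y = [beta0 + beta1 * xt + rng.gauss(0.0, 1.0) for xt in x]
    return x, y


def moving_sum(series, m):
    """Overlapping m-period sums (an assumed reading of the aggregation)."""
    return [sum(series[t - m + 1:t + 1]) for t in range(m - 1, len(series))]


def ols(x, y):
    """OLS of y on a constant and x; returns (b0, b1, t-ratio for b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = (rss / (n - 2) / sxx) ** 0.5  # usual OLS standard error
    return b0, b1, b1 / se_b1


rng = random.Random(123)  # arbitrary seed, for reproducibility only
x, y = simulate_dgp(300, beta0=1.0, beta1=2.0, rng=rng)
b0, b1, t_ratio = ols(moving_sum(x, 3), moving_sum(y, 3))
print(b0, b1, t_ratio)
```

Summing the DGP over m periods leaves the slope unchanged but multiplies the intercept by m, so in this sketch the estimated intercept should be close to 3 rather than 1.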

The Monte Carlo experiment looks first of all at the distortion in the size (significance level) of the t-test of the hypothesis, H0: β1 = 0, when we have this type of aggregation. The alternative hypothesis is HA: β1 > 0. Then we consider the power of the test, and how that is affected by the aggregation of the data.

Recall that if there is no mis-specification of the model (here, if the same data that appear in the DGP were used in the estimation of the model), then the t-test is Uniformly Most Powerful against the HA that we're using.

The results that follow are based on β0 = 1, but they are invariant to the values of this parameter and the variance of the error term. Various sample sizes are considered, ranging from T = 12 to T = 5,000. By setting β1 = 0 in the DGP above, we can force the null hypothesis to be true. Then, by assigning positive values to β1 in the DGP, we can ensure that H0 is false. Using increasingly positive values of β1 will enable us to trace out the power curve for the test.

In the following tables, α* is the "nominal" size of the test. It's the significance level that we think we're using. That is, we pick a significance level (α*) and then, based on the null distribution of the t-test statistic (which is Student-t with T - 2 degrees of freedom), we have a critical value, c(α*).

In the experiment, the actual size of the test will be the number of times that the t-test statistic exceeds c(α*), expressed as a proportion of 20,000, when the null is true. We call the difference between the actual and nominal sizes of the test the extent of the "size distortion" that arises due to the difference between the DGP and the estimated model.
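
The calculation just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of the results in this post: the aggregation is taken to be the overlapping moving sum, the number of replications is cut to 2,000 to keep it quick, and the critical value comes from the standard normal distribution as a large-sample stand-in for Student-t (the Python standard library has no Student-t quantile function). All function names are hypothetical.

```python
import random
from statistics import NormalDist


def moving_sum(series, m):
    """Overlapping m-period sums (an assumed reading of the aggregation)."""
    return [sum(series[t - m + 1:t + 1]) for t in range(m - 1, len(series))]


def slope_t_ratio(x, y):
    """t-ratio for the slope in an OLS regression of y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b1 / (rss / (n - 2) / sxx) ** 0.5


def rejection_rate(n, m, beta1, alpha, reps, seed):
    """Proportion of replications in which the one-sided t-test rejects."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1.0 - alpha)  # normal approximation to c(alpha)
    rejections = 0
    for _ in range(reps):
        x = [0.1 * t + rng.gauss(0.0, 1.0) for t in range(1, n + 1)]
        y = [1.0 + beta1 * xt + rng.gauss(0.0, 1.0) for xt in x]
        if m > 1:
            x, y = moving_sum(x, m), moving_sum(y, m)
        rejections += slope_t_ratio(x, y) > crit
    return rejections / reps


# beta1 = 0 makes the null true, so the rejection rate estimates the actual size.
size_raw = rejection_rate(n=120, m=1, beta1=0.0, alpha=0.05, reps=2000, seed=1)
size_agg = rejection_rate(n=120, m=3, beta1=0.0, alpha=0.05, reps=2000, seed=1)
print(size_raw, size_agg)
```

Subtracting the nominal 0.05 from each printed rate gives the estimated size distortion for that design.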

There will undoubtedly be some size distortion coming from the aggregation of the data, and this may be present whether T is small or large. Let's take a look at this.

In the table above, we consider three typical values for the nominal significance level - 1%, 5%, and 10%. The figures in black in the table are the actual significance levels for the case of temporal aggregation with m = 3. (Recall, this is like aggregating monthly flow data into quarterly flow data.)

We see that there is considerable size distortion as a result of using the aggregated data. Moreover, this distortion actually tends to increase as the sample size increases!

If it were not for the data aggregation, the distribution of the t-statistic would, of course, become standard normal as T → ∞. Looking more carefully at our simulation results it turns out that the reason for the size distortion when T is large is that the variance of the sampling distribution of the t-statistic is not converging to one in value.
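
A back-of-the-envelope calculation suggests why the variance need not settle at one. It is only a heuristic, and it rests on assumptions that go beyond what is stated in this post: the aggregated series is the overlapping moving sum, ut is i.i.d. with unit variance, and the aggregated regressor is dominated by its smooth trend. Under those assumptions the aggregated error vt is a moving average with autocovariances γj, and the usual OLS variance formula picks up only γ0 rather than the full long-run variance:

```latex
\[
v_t = u_t + u_{t-1} + \dots + u_{t-m+1}, \qquad
\gamma_j = \operatorname{Cov}(v_t, v_{t-j}) = m - |j| \quad (|j| < m),
\]
\[
\sum_{|j| < m} \gamma_j = m^2
\quad \text{versus} \quad \gamma_0 = m ,
\qquad \text{so} \qquad
\operatorname{Var}(t\text{-statistic}) \approx \frac{m^2}{m} = m .
\]
```

On this rough reckoning the variance of the t-statistic would be about 3 when m = 3, and larger still when m = 12.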

Temporal aggregation introduces a particular form of moving average process into the data. This suggests that it may be wise to compute the t-tests using the Newey-West standard errors. The results of doing this are shown in red in the table above. We see that although this essentially eliminates the size distortion when m = 3 and T is extremely large, there is still a problem for moderately sized samples.
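
For readers who want to see what the Newey-West correction involves, here is a small sketch for the slope in a regression with an intercept and one regressor. It uses the standard Bartlett-kernel weights and no small-sample adjustment. The function name and the `lags` argument are hypothetical, and the truncation lag used for the results in this post is not stated, so any value passed here is an assumption.

```python
def newey_west_slope(x, y, lags):
    """Return (b1, Newey-West standard error of b1) for y = b0 + b1*x + e.

    Bartlett-kernel weights, no small-sample correction. With lags=0 this
    reduces to White's heteroskedasticity-consistent (HC0) standard error.
    """
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    xd = [xi - xbar for xi in x]
    sxx = sum(d * d for d in xd)
    b1 = sum(d * (yi - ybar) for d, yi in zip(xd, y)) / sxx
    b0 = ybar - b1 * xbar
    # Score contributions for the slope: demeaned regressor times residual.
    g = [d * (yi - b0 - b1 * xi) for d, xi, yi in zip(xd, x, y)]
    long_run = sum(gt * gt for gt in g)
    for j in range(1, lags + 1):
        weight = 1.0 - j / (lags + 1.0)  # Bartlett weight
        gamma_j = sum(g[t] * g[t - j] for t in range(j, n))
        long_run += 2.0 * weight * gamma_j
    return b1, (long_run / sxx ** 2) ** 0.5


x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 5.0]
b1, se_nw = newey_west_slope(x, y, lags=1)
print(b1, b1 / se_nw)  # slope and its Newey-West t-ratio
```

Dividing the slope estimate by this standard error, instead of the usual OLS one, gives the corrected t-ratio.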

Finally, the last column in the table, in blue, shows the results for T = 5,000 when m = 12. (This is the case where we aggregate monthly flow data into annual flow data.) In that case, even using the Newey-West standard errors, the size distortion for the t-test is 50% to 300% even for this very large sample size. Our other results support the conclusion that the higher the level of aggregation, the greater the size distortion in the t-test.

These results show, quite dramatically, the impact that temporal aggregation can have on the "real" significance level of the t-test. There is an excessive tendency for the test to reject the null hypothesis when it is in fact true. This means, in practice, that there will be a tendency to "over-fit" our regression model. The same is true for the F-test of a set of linear restrictions on the regression coefficients.

Now let's investigate the power of the t-test. To do this we need to consider the rate at which the null hypothesis, that β1 = 0, is rejected when that hypothesis is false. Remember that we can do this by simply assigning positive values to β1 in the DGP.

Here is one illustrative result for the power of the Newey-West-corrected t-test, with and without temporal aggregation of the data:

Comparing the red and blue power curves, we see that temporal aggregation reduces the power of the test, once the size distortion is taken into account. Even if you don't "size-adjust" the test - which you wouldn't do in practice - aggregation reduces the "raw power" of the t-test when the null hypothesis is very false.
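
The idea of "size-adjusting" can be sketched as follows: simulate the test statistic under the null, take its empirical (1 - α) quantile as the critical value, and then count rejections under the alternative. The Python below is an illustration only. It uses the ordinary OLS t-ratio rather than the Newey-West version shown in the figure, it assumes overlapping moving sums, and the sample size, grid of β1 values, seed and function names are all arbitrary choices.

```python
import random


def moving_sum(series, m):
    """Overlapping m-period sums (an assumed reading of the aggregation)."""
    return [sum(series[t - m + 1:t + 1]) for t in range(m - 1, len(series))]


def slope_t_ratio(x, y):
    """Ordinary OLS t-ratio for the slope (not Newey-West corrected)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b1 / (rss / (n - 2) / sxx) ** 0.5


def simulate_t_ratios(n, m, beta1, reps, rng):
    """Simulated t-ratios for a given beta1 and aggregation level m."""
    stats = []
    for _ in range(reps):
        x = [0.1 * t + rng.gauss(0.0, 1.0) for t in range(1, n + 1)]
        y = [1.0 + beta1 * xt + rng.gauss(0.0, 1.0) for xt in x]
        if m > 1:
            x, y = moving_sum(x, m), moving_sum(y, m)
        stats.append(slope_t_ratio(x, y))
    return stats


def size_adjusted_power(n, m, beta1, alpha, reps, seed):
    """Rejection rate under beta1, using a simulated null critical value."""
    rng = random.Random(seed)
    null_stats = sorted(simulate_t_ratios(n, m, 0.0, reps, rng))
    k = round((1.0 - alpha) * reps)
    crit = null_stats[k - 1]  # approximate empirical (1 - alpha) quantile
    alt_stats = simulate_t_ratios(n, m, beta1, reps, rng)
    return sum(t > crit for t in alt_stats) / reps


for beta1 in [0.0, 0.1, 0.2, 0.3]:
    p_raw = size_adjusted_power(60, 1, beta1, 0.05, 500, seed=7)
    p_agg = size_adjusted_power(60, 3, beta1, 0.05, 500, seed=7)
    print(beta1, p_raw, p_agg)
```

Plotting the two rejection rates against β1 traces out size-adjusted power curves for the unaggregated and aggregated designs in this sketch. They need not match the curves in the figure.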

In other words, just when you want the test to lead to a rejection of the null hypothesis, it will do so with lower probability than it should. You'll tend to "under-fit" the model more often than you should, and wrongly omitting relevant regressors has dire consequences! As far as I know, this result isn't one that you'll find anywhere in the literature.

The main lesson to be taken away from this post is the following one. When you're working with time-series data that have been aggregated (e.g., monthly to quarterly) before you use them in regression analysis, be very careful when you apply the usual t-tests and F-tests. The significance levels that you think you're using will actually be an understatement of the truth; and the tests' powers will be lower than they would be if the data were not temporally aggregated.


Giles, D. E., 2014. The econometrics of temporal aggregation: 1956 - 2014. The A. W. H. Phillips Memorial Lecture, New Zealand Association of Economists Annual Meeting, Auckland, July.

© 2015, David E. Giles


  1. Wow, very in depth yet still easy to understand. Thank you for this.

  2. Dear Dave,
    Thank you for this illustrative example. My question is not exactly related to the subject of your post. As you illustrated, the finite-sample properties of tests are studied by investigating their size and power properties. You reported size distortions to assess the size properties of the test. My first question is about the level of the size distortions. How much distortion is needed to conclude that a test is useless? Is there an interval that we can construct around a nominal size value to gauge the significance of distortions? The same type of question is also relevant for the power properties. The “size adjusted power” is simply the rejection rate obtained when the DGP satisfies an alternative hypothesis. Although the power property is used to compare alternative tests, we can still ask questions regarding the level of the power. As your power curve shows, the level of power also depends on the parameter value assumed under the alternative hypothesis. For example, when beta1=0.8 the power is around 80%, which means that the false null is rejected 80 times out of 100. Again, the question is: what should the level of the power be to conclude that the test has good finite-sample properties?
    Thank you.

    1. Osman - thanks for the comment and questions. I'm preparing a separate post to answer them more fully than would be the case here. Give me a day!

  3. Thank you. I am waiting for your post.

