## Tuesday, April 30, 2013

### Some Official Data Come With Standard Errors!

Without intending to, I seem to have been on a bit of a rant about data quality and reliability recently! For example, see here, here, and here.

This post is about a related topic that's bugged me for a long time. It's to do with the measures of uncertainty that some statistical agencies (e.g., Statistics Canada) correctly report with some of their survey-based statistics.

A good example of what I have in mind is the Labour Force Survey (LFS) from Statistics Canada.

The following passage is quoted from Section 7 (p.26) of the Guide to the Labour Force Survey:

"The Labour Force Survey collects information from a sample of households. Somewhat different figures might have been obtained if a complete census had been taken using the same questionnaires, interviewers, supervisors, processing methods, etc. as those actually used in the Labour Force Survey. The difference between the estimates obtained from the sample and those that would give a complete count taken under similar conditions is called the sampling error of the estimate, or sampling variability. Approximate measures of sampling error accompany Labour Force Survey products and users are urged to make use of them while analysing the data. Three interpretation methods can be used to evaluate the precision of the estimates: the standard error, and two other methods also based on standard error: confidence intervals and coefficients of variation."    [Underlined emphasis added, DG]

So, what's been bugging me?

It's just great that information about the sampling variability of the surveyed data is reported, and I do hope that users of these numbers take note of the highlighted sentence in the above passage. In truth, I suspect that many users are blissfully unaware of this! My question is:

"How can we use this information about the sampling errors when we use the LFS data in a regression model?"

Suppose that we're using a variable such as the size of the Canadian labour force (15 years and older) as an explanatory variable in a regression model. If we use the numbers reported by Statistics Canada, then we're really using just a point estimate of the regressor that we're actually interested in. These point estimates for the years 2008 to 2012 inclusive are 17,087.4; 16,813.1; 17,041.0; 17,306.2; and 17,507.7 (thousands of people), respectively. From the following table, we see that the largest coefficient of variation for all of Canada that is less than each of these figures is an approximate c.v. of 10%.

So, the way to think about the data for 2008 to 2012 is as follows. The numbers for each year are point estimates of the annual averages. The standard error associated with each value is (approximately) 10% of the point estimate in each case. With these standard errors in parentheses, the values for 2008 to 2012 inclusive should really be thought of as being 17,087.4 (1,708.7); 16,813.1 (1,681.3); 17,041.0 (1,704.1); 17,306.2 (1,730.6); and 17,507.7 (1,750.8) thousands of people, respectively.
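To make the arithmetic concrete, here is a minimal Python sketch that recovers the standard errors in parentheses from the point estimates and the approximate 10% c.v. The 95% intervals are my own addition: they assume a normal approximation (the 1.96 multiplier), which is not something taken from the LFS documentation, and the function name `with_uncertainty` is purely illustrative.

```python
# LFS point estimates quoted in the post (thousands of people), by year.
estimates = {
    2008: 17087.4,
    2009: 16813.1,
    2010: 17041.0,
    2011: 17306.2,
    2012: 17507.7,
}

CV = 0.10    # approximate coefficient of variation, as stated in the post
Z_95 = 1.96  # ASSUMPTION: normal approximation for a 95% interval


def with_uncertainty(estimate, cv=CV, z=Z_95):
    """Return (standard error, (lower, upper)) for a point estimate.

    The standard error is cv * estimate, by the definition of the c.v.;
    the interval is estimate +/- z * standard error.
    """
    se = cv * estimate
    return se, (estimate - z * se, estimate + z * se)


for year, est in estimates.items():
    se, (lower, upper) = with_uncertainty(est)
    print(f"{year}: {est:,.1f} ({se:,.1f})  95% CI: [{lower:,.1f}, {upper:,.1f}]")
```

With a c.v. of this size the implied intervals are very wide relative to the year-to-year changes in the point estimates, which is exactly why the question of how to use this information matters.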

Given that we have this information about the sampling errors associated with these numbers, it would be interesting to incorporate it into the estimation of the model. This should enable us to compute more reliable "standard errors" for the estimated regression coefficients, for example. To put it the other way around, ignoring this information undoubtedly results in standard errors that are biased downwards.

This sounds like an "errors in variables" situation to me. It's well known that there's an identification problem associated with a model of this type, but various solutions to this have been suggested in the vast literature on this topic. Examples include Bayesian methods, instrumental variables estimation, and method of moments using first, second, and third-order moments.
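To illustrate why knowing the sampling error helps with identification, here is a small simulation sketch of the textbook case. It is only one simple possibility, not any of the estimators listed above and not necessarily what would be appropriate for the LFS data. It assumes classical measurement error: the error in the regressor has a known, constant variance and is independent of both the true regressor and the regression disturbance. Under those assumptions the naive OLS slope is attenuated, and dividing it by the estimated reliability ratio undoes the attenuation. All names and parameter values are hypothetical.

```python
import random


def ols_slope(x, y):
    """Return the simple-regression OLS slope of y on x and the sample variance of x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx, sxx / (n - 1)


def simulate(n=20000, beta=2.0, sd_x=1.0, sd_u=0.5, sd_e=1.0, seed=0):
    """Compare the naive slope with a slope corrected using the known error variance.

    ASSUMPTIONS (illustrative only): classical measurement error with a
    known, constant standard deviation sd_u, playing the role of the
    reported survey standard error.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, sd_x) for _ in range(n)]
    # We only observe the regressor with sampling error added.
    x_obs = [x + rng.gauss(0.0, sd_u) for x in x_true]
    y = [beta * x + rng.gauss(0.0, sd_e) for x in x_true]

    naive, var_obs = ols_slope(x_obs, y)
    # Reliability ratio: share of the observed variance due to the true regressor.
    reliability = (var_obs - sd_u ** 2) / var_obs
    corrected = naive / reliability
    return naive, corrected


naive, corrected = simulate()
print(f"naive slope:     {naive:.3f}")
print(f"corrected slope: {corrected:.3f}")
```

In this setup the naive slope should settle near `beta` times the reliability ratio, while the corrected slope should be close to `beta` itself. The LFS case is harder, because the standard error differs from one observation to the next (it is proportional to the point estimate), so the constant-variance correction here is only a starting point.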

A former graduate student of mine, Chad Stroomer, did some work on this problem for his M.A. research project in 2008. It shouldn't be too hard to do something sensible with this information.

However, I can't recall ever seeing anyone use the sampling error information in a regression analysis in practice.

If anyone can point me to an empirical application where such information has been used in a formal way in a regression model, I'd love to know about it.