It's easy to forget that many of the standard results that we learn in statistics are based fairly and squarely on the assumption that the data are obtained by using simple random sampling.
In reality, this is rarely the case. Stratified and cluster sampling are routinely used by our statistical agencies, and frequently the sampling process is much more complicated than that. So-called "Complex Survey" designs, which involve multi-stage stratification and clustering, are very common.
This is something to watch out for in practice.
When it comes to regression analysis, modifying the standard errors for our estimated coefficients to allow for cluster sampling is pretty standard fare. However, often this is far from enough. It's not all that common to see empirical researchers really taking on board the full ramifications of the complex nature of the survey that yielded the data they're using.
Why is this? I'm sure there are several reasons, including:
- A lot of them simply aren't aware of these issues. They live in a state of blissful ignorance.
- Someone told them that they had to report modified standard errors. They do so, and they think that paying lip-service to the problem is sufficient to convince readers that they know what they're doing.
- Someone told them that they had to report modified standard errors. They do so, but they don't realize what they've done, or why, or that it probably isn't sufficient.
In some ways it's a bit like reporting het-consistent standard errors when estimating a non-linear model (e.g., logit or probit), and thinking that you've "fixed the problem". You haven't - in the presence of heteroskedasticity the estimators for the coefficients themselves are inconsistent in these models.
But I digress - you can read more of me griping about the latter point here.
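Coming back to the cluster adjustment for a moment, here is a minimal sketch of what "modifying the standard errors" amounts to in the simplest case. It is only an illustration under stated assumptions: the data are simulated, the function name `ols_cluster_se` is made up for this example, the sandwich estimator is the basic one with no small-sample correction, and there are no strata or sampling weights, so it falls well short of handling a genuinely complex survey.

```python
import numpy as np

def ols_cluster_se(X, y, clusters):
    """OLS with classical and cluster-robust (basic sandwich) standard errors.

    Hypothetical helper for illustration only: no small-sample correction,
    no strata, no sampling weights.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    clusters = np.asarray(clusters)
    n, k = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Classical standard errors: assume i.i.d. errors (simple random sampling).
    sigma2 = resid @ resid / (n - k)
    se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

    # Cluster-robust "meat": sum over clusters of the outer product
    # of each cluster's score vector X_g' e_g.
    meat = np.zeros((k, k))
    for g in np.unique(clusters):
        idx = clusters == g
        score = X[idx].T @ resid[idx]
        meat += np.outer(score, score)
    se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

    return beta, se_classical, se_cluster

# Simulated data (an assumption, not real survey data): 50 clusters of 20
# observations, a regressor that varies only across clusters, and a common
# cluster-level error component that induces within-cluster correlation.
rng = np.random.default_rng(0)
n_clusters, m = 50, 20
cluster_id = np.repeat(np.arange(n_clusters), m)
x = rng.normal(size=n_clusters)[cluster_id]
u = rng.normal(size=n_clusters)[cluster_id]
e = rng.normal(size=n_clusters * m)
y = 1.0 + 2.0 * x + u + e

X = np.column_stack([np.ones_like(x), x])
beta, se_classical, se_cluster = ols_cluster_se(X, y, cluster_id)

print("coefficients:  ", beta)
print("classical s.e.:", se_classical)
print("clustered s.e.:", se_cluster)
```

With a design like this one, the classical standard errors should understate the sampling variability, because they ignore the correlation among observations within a cluster. The adjustment changes only the standard errors; the coefficient estimates are untouched, which is part of why it may not be enough on its own.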
The use of complex survey data has widespread ramifications for econometricians - these go well beyond the estimation of regression models.
This basic point has been taken up by two of my colleagues here at UVic - Judith Clarke and Nilanjana Roy - in their recent and current research program.
For example, they have a really nice paper (Clarke and Roy, 2011), for which the first part of the abstract reads:
"We examine inference for Generalized Entropy and Atkinson inequality measures with complex survey data, using Wald statistics with variance–covariance matrices estimated from a linearization approximation method. Testing the equality of two or more inequality measures, including sub-group decomposition indices and group shares, are covered. We illustrate with Indian data from three surveys, examining pre-school children’s height, an anthropometric measure that can indicate long-term malnutrition. Sampling involved an urban/rural stratification with clustering before selection of households.........."
Notice the last sentence above - this is a good example of a complex survey.
If you're attending the 2012 Conference of the Canadian Economics Association, to be held in Calgary next month, you might be interested in a presentation by one of Judith's Ph.D. students, Ahmed Hoque. His talk is titled, "Straightforward Variance Estimation for the Gini Coefficient with Stratified and Clustered Survey Data".
I can recommend it - despite the 8:30 a.m. start time on 9 June!
Clarke, J. A. & N. Roy, 2011. On statistical inference for inequality measures calculated from complex survey data. Empirical Economics, in press. (W.P. version here.)
© 2012, David E. Giles