Econometrics Beat: Dave Giles' Blog: 01/01/2016

Sunday, January 24, 2016

(Legally) Free Books!

(An earlier version of this post inadvertently included links to "pirated" material. This has now been rectified, and the post has been completely re-written.)

There are several Econometrics books, and comprehensive sets of lecture notes, that can be accessed for free. These include a number of excellent books by world-class econometricians.

Here are a few that will get you started:

Diebold, Francis X.: Econometrics; Time Series Econometrics; Forecasting; Elements of Forecasting. (All are available at http://www.ssc.upenn.edu/~fdiebold/Textbooks.html)
Hyndman, R. and G. Athanasopoulos: Forecasting: Principles and Practice
Hansen, Bruce E.: Econometrics
MIT Open Courseware: Time Series Analysis
Sheppard, Kevin: Python for Econometrics
Shi, Xiaoxia: Econometrics Methods (Lectures)
Train, Kenneth: Discrete Choice Methods With Simulation
Verbic, Miroslav: Advances in Econometrics: Theory and Applications

Thanks to Donsker Class for supplying several of these links.

If you know of others I'd love to hear about them.

Friday, January 22, 2016

Modelling With the Generalized Hermite Distribution

"Count" data occur frequently in economics. These are simply data where the observations are integer-valued - usually 0, 1, 2, ....... . However, the range of values may be truncated (e.g., 1, 2, 3, ....).

To model data of this form we typically resort to distributions such as the Poisson, negative binomial, or variations of these. These variations may account for truncation or censoring of the data, or the over-representation of certain count values (e.g., the "zero-inflated" Poisson distribution).

Covariates (explanatory variables) can be included in the model by making the mean of the distribution a function of these variables. After all, that's exactly what we do in a linear regression model.
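Here's a minimal sketch of that idea for a Poisson model, written in Python purely for illustration. It assumes the common exponential (log-link) form for the mean, mu = exp(x'beta), which is one popular choice rather than anything specified above. The function name and the toy inputs are hypothetical.

```python
import math

def poisson_loglik(beta, X, y):
    """Poisson log-likelihood with conditional mean mu_i = exp(x_i' beta).

    beta : list of coefficients
    X    : list of covariate rows (include a 1.0 for the intercept)
    y    : list of non-negative integer counts
    """
    loglik = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))  # linear index x_i' beta
        mu = math.exp(eta)                           # mean is a function of the covariates
        loglik += y_i * eta - mu - math.lgamma(y_i + 1)
    return loglik

# Toy (made-up) data: an intercept and one covariate.
X_toy = [[1.0, 0.2], [1.0, 1.5], [1.0, 0.7]]
y_toy = [0, 3, 1]
print(poisson_loglik([0.1, 0.5], X_toy, y_toy))
```

Maximizing this function over beta (with any numerical optimizer) gives the maximum likelihood estimates of the coefficients.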

If the "count" data form a time-series, then there are other issues that have to be taken into account.

However, the discrete distributions that we typically use have a number of limitations. The fact that the Poisson distribution is, of necessity, "equi-dispersed" (its variance equals its mean) is a big limitation. This leads us to consider distributions such as the negative binomial, in which the variance exceeds the mean. This enables us to model "over-dispersed" data, which are encountered frequently in practice.
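A quick, informal check for over-dispersion is to compare the sample variance of the counts with their sample mean. The following is just a sketch in Python. The function name and the toy counts are made up for illustration.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean.

    A value near 1 is consistent with equi-dispersion (Poisson-like data);
    a value well above 1 points to over-dispersion.
    """
    m = mean(counts)
    if m <= 0:
        raise ValueError("the sample mean must be positive")
    return variance(counts) / m

# Hypothetical counts, not real data.
toy_counts = [0, 0, 1, 1, 2, 5, 7]
print(dispersion_index(toy_counts))
```

This is only a descriptive statistic, not a formal test, and it ignores any covariates.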

The standard distributions are also limited in terms of what they can model in terms of distributional shapes. In particular, there are limitations on modal values in the data.

For instance, in the case of the Poisson distribution, these limitations are the following. If the parameter (λ) of the Poisson distribution is an integer, then there are two adjacent modes with equal modal height, at x = λ and x = λ-1. If λ is non-integer, then there is a single mode at int(λ), the integer part of λ.
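If you'd like to convince yourself of this, the modes can be found by brute force. The Python sketch below simply evaluates the Poisson probabilities over a grid and picks out the largest ones. The function names, the grid limit, and the tolerance are my own choices for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed on the log scale."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def modes_by_scan(lam, k_max=200, rel_tol=1e-12):
    """Return every k in 0..k_max whose probability ties for the maximum."""
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    top = max(probs)
    return [k for k, p in enumerate(probs) if math.isclose(p, top, rel_tol=rel_tol)]

print(modes_by_scan(3.0))  # integer lambda: two adjacent modes, lambda - 1 and lambda
print(modes_by_scan(3.7))  # non-integer lambda: a single mode at the integer part
```

The tolerance is there because the two modal probabilities are equal in exact arithmetic, but may differ in the last few bits in floating point.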

In the case of the negative binomial distribution, there is a single mode.

This suggests that standard discrete distributions of the type that we typically use to model "count" data will not be very satisfactory if our data exhibit multi-modality.

We need to look to alternative distributions.

Here's an example of what I mean.

In an earlier post, I discussed some of my work involving the use of the so-called Hermite distribution, introduced by Kemp and Kemp (1965). As an example, I showed the distribution of data relating to the number of financial crises in various countries, as reproduced here:

You can see that, apart from being multi-modal, this empirical distribution is over-dispersed (its variance is approximately twice its mean).

In Giles (2010) I used the Hermite distribution, and various covariates, to model these data using maximum likelihood estimation.

The Hermite distribution can be generalized in various ways. Recently, Moriña et al. (2015) have released a terrific R package, called hermite, that makes it really easy to model "count data" using the Generalized Hermite distribution. We now have a convenient way of dealing with data that exhibit both over-dispersion and multi-modality.
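To see how this family can produce several modes, here's a small Python sketch (not the hermite package itself, which is written for R). It uses the standard representation of a generalized Hermite variable as X = Y1 + m*Y2, with Y1 ~ Poisson(a) and Y2 ~ Poisson(b) independent, and the probability recursion k*p(k) = a*p(k-1) + m*b*p(k-m). The function names and the parameter values are hypothetical, chosen only so that the variance is roughly twice the mean.

```python
import math

def generalized_hermite_pmf(a, b, m=2, k_max=40):
    """P(X = 0), ..., P(X = k_max) for X = Y1 + m*Y2,
    where Y1 ~ Poisson(a) and Y2 ~ Poisson(b) are independent."""
    if a < 0 or b < 0 or m < 2:
        raise ValueError("need a >= 0, b >= 0 and integer m >= 2")
    p = [math.exp(-(a + b))]
    for k in range(1, k_max + 1):
        term = a * p[k - 1]
        if k >= m:
            term += m * b * p[k - m]
        p.append(term / k)
    return p

def local_modes(p):
    """Indices whose probability exceeds that of both neighbours."""
    modes = []
    for k in range(len(p) - 1):
        left_ok = k == 0 or p[k] > p[k - 1]
        if left_ok and p[k] > p[k + 1]:
            modes.append(k)
    return modes

# Hypothetical parameter values, not estimates from any data set.
a, b, m = 0.1, 1.5, 2
p = generalized_hermite_pmf(a, b, m, k_max=60)
mean = sum(k * pk for k, pk in enumerate(p))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(p))
print(mean, var)            # theory: mean = a + m*b, variance = a + m^2 * b
print(local_modes(p)[:3])   # several local modes
```

With these values the distribution is both over-dispersed and multi-modal, which is exactly the combination that the Poisson and negative binomial distributions can't deliver.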

I strongly recommend this new addition to R.

References

Giles, D. E., 2010. Hermite regression analysis of multi-modal count data. Economics Bulletin, 30(4), 2936–2945.

Kemp, C. D. and A. W. Kemp, 1965. Some properties of the ‘Hermite’ distribution. Biometrika, 52, 381-394.

Moriña, D., M. Higueras, P. Puig, and M. Oliveira, 2015. Generalized Hermite distribution modelling with the R package hermite. The R Journal, 7(2), 263-274.

Saturday, January 16, 2016

Why Does "Pi" Appear in the Normal Density?

Every now and then a student will ask me why the formula for the density of a Normal random variable includes the constant, π, or more correctly (2π)^-½.

The answer is that this term ensures that the density function is "proper" - that is, the integral of the function over the full real line takes the value "1". The area under the density, or "total probability", is "1".

Some students are happy with this (partial) answer, but others want to see a proof. Fair enough!

However, there's a trick to proving that this integral (area) is "1" in value. Let's take a look at it.
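For reference, here is a sketch of the usual argument, on the assumption that the trick in question is the standard one: square the integral and switch to polar coordinates. The symbol I is introduced here just for the derivation.

```latex
\begin{align*}
I &= \int_{-\infty}^{\infty} e^{-z^{2}/2}\,dz , \\
I^{2} &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^{2}+y^{2})/2}\,dx\,dy \\
      &= \int_{0}^{2\pi}\int_{0}^{\infty} e^{-r^{2}/2}\, r\,dr\,d\theta
         \qquad (x = r\cos\theta,\; y = r\sin\theta) \\
      &= 2\pi\left[-e^{-r^{2}/2}\right]_{0}^{\infty} = 2\pi .
\end{align*}
```

Because the integrand is positive, I = (2π)^½, and so multiplying by (2π)^-½ makes the standard Normal density integrate to "1". For a general mean μ and variance σ², the change of variable z = (x - μ)/σ gives the same result, with the extra factor 1/σ appearing in the density.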

Difference-in-Differences With Missing Data

This brief post is a "shout out" for Irene Botosaru (Economics, Simon Fraser University) who gave a great seminar in our department yesterday.

The paper that she presented (co-authored with Federico Gutierrez) is titled "Difference-in-Differences When the Treatment Status is Observed in Only One Period". So, the title of this post is a bit of an abbreviation of what the paper is really about.

When we conduct DID analysis, we need to be able to classify information about the behaviour/characteristics of survey respondents into a 4-way matrix. Specifically we need to be able to observe the respondents before and after a "treatment"; and in each case we need to know which respondents were treated, and which ones were not.

Usually, a true panel of data, observed at two or more time-periods, facilitates this.

However, what if we simply have repeated cross-sections of data, taken at different time-periods? In this case we aren't necessarily observing exactly the same respondents when we look at the cross-sections for two different time-periods. Typically, in the cross-section after the treatment we'll know which respondents were treated and which ones weren't. However, there will be no way of partitioning the respondents in the pre-treatment cross-section into "subsequently treated" and "not treated" groups.

Two of the four cells in the matrix of information that we need will be missing, so conventional DID can't be performed.
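To make the "four cells" concrete, here's a minimal Python sketch of the conventional DID calculation from group-by-period means. It is not the estimator from the paper; it just shows why the usual calculation breaks down when the pre-treatment cells can't be filled in. The function name, the cell labels, and the numbers are all hypothetical.

```python
REQUIRED_CELLS = [
    ("treated", "pre"), ("treated", "post"),
    ("control", "pre"), ("control", "post"),
]

def did_estimate(cell_means):
    """Conventional DID from a dict mapping (group, period) to the mean outcome.

    DID = (treated post - treated pre) - (control post - control pre).
    """
    missing = [cell for cell in REQUIRED_CELLS if cell not in cell_means]
    if missing:
        raise ValueError(f"cannot compute DID, missing cells: {missing}")
    change_treated = cell_means[("treated", "post")] - cell_means[("treated", "pre")]
    change_control = cell_means[("control", "post")] - cell_means[("control", "pre")]
    return change_treated - change_control

# Made-up cell means for a true panel: all four cells are observed.
panel = {
    ("treated", "pre"): 10.0, ("treated", "post"): 15.0,
    ("control", "pre"): 8.0, ("control", "post"): 10.0,
}
print(did_estimate(panel))

# Repeated cross-sections: treatment status is known only after the treatment,
# so the two "pre" cells can't be formed and the function raises an error.
cross_sections = {("treated", "post"): 15.0, ("control", "post"): 10.0}
try:
    did_estimate(cross_sections)
except ValueError as err:
    print(err)
```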

This is the problem that Irene and Federico consider.

A natural response is to introduce some sort of proxy variable(s) to deal with the missing data, and of course this will introduce an estimation bias, even asymptotically. This paper basically takes this approach. The result is a GMM estimation strategy, together with a test that the underlying assumptions are satisfied.

This is a really nice paper - well motivated, technically solid, and with a nice empirical example and application. I urge you to take a look at it if DID is in your econometrics tool-kit (and even if it's not!)

I'm sure that Irene and Federico would appreciate hearing about situations where you've encountered this missing data problem, and how you've responded to it.
