Permutation tests, which I'll be discussing in this post, aren't that widely used by econometricians. However, they shouldn't be overlooked.

Let's begin with some background discussion to set the scene. This might seem a bit redundant, but it will help us to see how permutation tests differ from the sort of tests that we usually use in econometrics.

When you took your first course in economic statistics, or econometrics, no doubt you encountered some of the basic concepts associated with testing hypotheses. I'm sure that the first exposure that you had to this was actually in terms of "classical", Neyman-Pearson, testing.

It probably wasn't described to you in so many words. It would have just been "statistical hypothesis testing". The whole procedure would have been presented, more or less, along the following lines:

Now, you probably didn't need to be reminded of all of this. However, there are some key aspects to this hypothesis testing story that I've highlighted above, and which are worth remembering when we consider (below) a different possible way of proceeding.

Let's suppose that we want to test some hypothesis, and we have a sample of size n that we plan to use. This sample could be very small, and just how we obtained it is not that important. In particular,

This sounds different, already!

We also have in mind some alternative hypothesis, and we'll see shortly why this is important. Then, using our sample of data, {x

Now comes the interesting part. We "shuffle" (

Finally, we ask the question, "Is S

This is where the alternative hypothesis comes in. The meaning of "extreme" is determined by H

What we have here is a testing method that is "non-parametric", or "distribution free". However, it's more than that. We can show that the the test will be "exact", in the sense that

Now, let's be clear. All of this comes at a cost. Actually, there are three types of "cost" involved.

First, we're going to be making an assumption about our data. This is actually quite a mild assumption, especially when we compare it with the various strong assumptions that underlie classical hypothesis testing. And often it will clear from the context of our testing problem that this assumption is obviously satisfied.

This key assumption is usually described as "

Second, this procedure that I've outlined isn't going to be applicable to all testing problems, including some obvious and simple ones. For example, if we have a single sample (as above) and our test statistic is chosen to be the sample average, x* (or the sample median, say), then every S

Third, you can see immediately that even with very modest sample sizes, the number of possible permutations becomes extremely large, very quickly. For example, if n = 10 there are 10! = 3,628,800 permutations; and if n = 100 there are 9.33x10

Yikes!

We can't do much about the second of these "costs". We'll have to focus on testing problems that are suited to being solved using this particular tool. However, the

Now, why does a permutation test actually work at all?

Even with today's computing power, the large number of permutations associated with even modest sample sizes is still a challenge. For this reason, in practice we typically just randomly sample a large number of permutations from the total number that are available. This is easy to do, and although the resulting test is no longer (literally) "exact", we can make it as accurate as wish by choosing a suitable number of random permutations.

Let's begin with some background discussion to set the scene. This might seem a bit redundant, but it will help us to see how permutation tests differ from the sort of tests that we usually use in econometrics.

**Background Motivation**When you took your first course in economic statistics, or econometrics, no doubt you encountered some of the basic concepts associated with testing hypotheses. I'm sure that the first exposure that you had to this was actually in terms of "classical", Neyman-Pearson, testing.

It probably wasn't described to you in so many words. It would have just been "statistical hypothesis testing". The whole procedure would have been presented, more or less, along the following lines:

- We want to test the validity of statement ("null hypothesis") about a parameter associated with a well-defined underlying population.
- We also state clearly what situation will prevail if the hypothesis to be tested is
*not true*. We call this the "alternative hypothesis". - Then we take a
*carefully constructed*sample from the population of interest. For example, we might use simple random sampling, so that all sample values are mutually independent of each other. - We combine the sample values into a single statistic. Usually, this would be a statistic that had already been found to be a "good" estimator of the parameter under test. Typically, this estimator would have to be transformed (
*e.g.*, "standardized") to make it "pivotal" - that is, having a sampling distribution that does not depend on any other unknown parameters. - Then, we ask - "if in fact the null hypothesis
*were actually true*, is the value of our test statistic "surprising", or not? - If it
*is surprising*, we would*reject*the null hypothesis.*If not*, we would*not reject*the null hypothesis.

You would have met two ways of implementing step 6, above:

**- by checking if the test statistic's value was "more extreme" than a critical value (or values) obtained by choosing an acceptable significance level in advance; and using the sampling distribution of your test statistic,**

*Either**based on the assumptions about the underlying population and the sampling method that was used*.

**- by asking, "if the null hypothesis really were true, how likely is it that I'd see a value for my test statistic that's as extreme (or more extreme) than what I've actually observed here (again,**

*Or**taking into account*

*the assumptions about the underlying population and the sampling method that was used*). The answer to this question is, of course, the familiar p-value.

Now, you probably didn't need to be reminded of all of this. However, there are some key aspects to this hypothesis testing story that I've highlighted above, and which are worth remembering when we consider (below) a different possible way of proceeding.

- First, notice that there's this idea that we know (or can reasonably assume) quite a lot about the underlying population. For instance, we might assume that the population values follow a Normal distribution, with a known variance, but a mean that is unknown and whose value we're testing. That's quite a stretch, in many cases. (Yes, we can think about our test's "robustness" to these assumptions, but that's a different issue.)
- Second, the type of sampling that we use is crucial. Usually, simple random sampling is assumed. In reality, much (most) of the data that we use have been obtained using "complex surveys", involving multi-level cluster sampling and stratified sampling. This changes everything!

- Finally, we have to be able to precisely specify the null and alternative hypotheses (
*in advance of seeing the sample data*), and these hypotheses have to relate to some parameter(s) in the population.

Suppose that the underlying population is N[μ, σ^{2}], where σ^{2}is known, and we obtain a sample of n observations using simple random sampling. Then, you'll recall that the sample average, x* is the most efficient unbiased estimator of μ. Using x* as our test statistic sounds like a good idea, but its sampling distribution is N[μ, σ^{2}/n], so it depends on the unknown value of μ. This test statistic isn't "pivotal".

However, if we transform x* into z = (x* - μ) / (σ / √(n)), this new statistic has a sampling distribution that is N[0 , 1].It is now pivotal. (That's what matters - not that we have a standardized distribution!)

If all of the assumptions that we've made are correct, then we can easily test the null hypothesis H_{0}: μ = 1 (say) against the alternative hypothesis, H_{A}: μ ≠ 1 (say). If we assume that H_{0}is true, or μ = 1, then we can compute a value for z, because now it's as if everything is known.

Now we can proceed as outlined above, because we have aLet's think about a totally different way of testing some hypothesis that interests us. There are three "bulleted" points just before the example above. Let's toss them all out of the window and start afresh!valuefor our test statistic, and we know its sampling distribution if H_{0}were true.

**Permutation Tests**Let's suppose that we want to test some hypothesis, and we have a sample of size n that we plan to use. This sample could be very small, and just how we obtained it is not that important. In particular,

*. Moreover, the (null) hypothesis that we're wanting to test***we're not going to assume that it was obtained randomly***. It can just be some general statement. In fact, we're not going to that much interested in "populations" as such, so***doesn't have to be expressed in terms of a statement about some parameter associated with a population****.***we certainly don't have to make a whole bunch of heroic assumptions involving properties such as "normality", or "constant variance"*This sounds different, already!

We also have in mind some alternative hypothesis, and we'll see shortly why this is important. Then, using our sample of data, {x

_{1}, x_{2}, x_{3}, ...., x_{n}}, we calculate a statistic that seems to be relevant when it comes to examining the possible truth of our null hypothesis. For instance we may want it to be one that increases monotonically as we move further from H_{0}in the direction of H_{A}.*We'll call the value of our statistic from this original sample the "observed statistic", and label it S***This statistic does not have to be pivotal.**_{obs}.Now comes the interesting part. We "shuffle" (

*i.e*., permute) the n sample observations. Perhaps they're now ordered {x_{3}, x_{2}, x_{n},....., x_{1}}. We re-calculate the test statistic. Then we repeat the process for every possible permutation of the sample. This will give us n! test statistics (including S_{obs}). Let's call their values S_{π}, where π refers to a particular permutation of the sample.Finally, we ask the question, "Is S

_{obs}very different from the other S_{π}values?" Or, more particularly, what fraction of the n! values of S_{π}are as "extreme", or "more extreme", than our original S_{obs}?"This is where the alternative hypothesis comes in. The meaning of "extreme" is determined by H

_{A}being one-sided or two-sided.*The fraction of the S*_{π}values that are as extreme, or more extreme, than S_{obs}is the p-value for our test.What we have here is a testing method that is "non-parametric", or "distribution free". However, it's more than that. We can show that the the test will be "exact", in the sense that

*!***it will have exactly the significance level that we decide we want, no matter how small the sample is**Now, let's be clear. All of this comes at a cost. Actually, there are three types of "cost" involved.

First, we're going to be making an assumption about our data. This is actually quite a mild assumption, especially when we compare it with the various strong assumptions that underlie classical hypothesis testing. And often it will clear from the context of our testing problem that this assumption is obviously satisfied.

This key assumption is usually described as "

*exchangeability*" of the data. That is, shuffling the observed data points keeps the data-set just as likely as the original one._{π}will have the same value of S_{obs}, and we'll get nowhere.Third, you can see immediately that even with very modest sample sizes, the number of possible permutations becomes extremely large, very quickly. For example, if n = 10 there are 10! = 3,628,800 permutations; and if n = 100 there are 9.33x10

^{157}.Yikes!

*computational cost*can be handled quite easily, as we'll see below.Now, why does a permutation test actually work at all?

Permutation tests date back to the 1930's, and were first proposed by Fisher (1935), Pitman (1937a, 1937b, 1938) and others. This is certainly not a new idea! However, as you can imagine, in the 1930's these tests could be used only with very small samples and this limited their appeal to some degree.Basically, if the data are exchangeable then every permuted sample will be as likely to occur as any other one. So every S_{π}will be as likely to arise as any other one. Any observed differences will be small and random if H_{0}is true. If S_{obs}is "unusual", then something is amiss and we would be inclined to reject our null hypothesis.

Even with today's computing power, the large number of permutations associated with even modest sample sizes is still a challenge. For this reason, in practice we typically just randomly sample a large number of permutations from the total number that are available. This is easy to do, and although the resulting test is no longer (literally) "exact", we can make it as accurate as wish by choosing a suitable number of random permutations.

Usually (but not always), when we use a sample of all possible permutations the resulting test is referred to as a "randomization test". However, be aware that some authors use the terms "permutation" and "randomization" interchangeably in this context. And sometimes the term "exact test" is used in either case.

Obviously, permutation/randomization tests are still going to be computationally

As I've noted already, I think that students of econometrics need to be more aware of permutation tests, even though they aren't well-suited for some testing problems. A brief historical perspective is given by David (2008).

*intensive*. But they're very manageable. There are several R packages for various types of randomization tests. These include the "coin" package; the "jmuOutlier" package; and the "RATest" package.As I've noted already, I think that students of econometrics need to be more aware of permutation tests, even though they aren't well-suited for some testing problems. A brief historical perspective is given by David (2008).

Back in 1995, Peter Kennedy wrote a very useful "overview" paper about permutation tests (Kennedy, 1995) for an econometrics audience. His paper also contains a useful summary of the philosophical arguments for and against permutation tests - something that I won't go into here.

A lot has happened in the associated literature since 1995, of course, especially when it comes to applying these tests in the context of multiple regression analysis. I'll say a bit more about the latter in the follow-up post that I'm currently preparing.

**Some Examples**

**Let's take a look at a couple of simple examples that will illustrate the use of permutation/randomization tests. The first involves a "treatment effect" - a classic situation where permutation tests are used.**

Example 1: Suppose that two econometrics students, Jane and John, sit a pre-semester test and a final exam. Here are their % scores:

__Jane__

__John__

__(Sample) Average__

Pre-Semester Test: 70 75 72.5

Final Exam.: 76 72 74.0

Based on this (tiny) sample we're interested in whether or not the course improves students' knowledge of econometrics. You might think of conducting a two-sample t-test of the null hypothesis that the (population) mean scores are the same in the test and in the exam., against a one-sided (positive) alternative hypothesis.

However, nothing is known or has been assumed about the underlying population distributions; about how the sample was selected,

*etc*. So, the conditions that are needed for the t-test may not hold, and our inferences could be badly distorted. We certainly can't appeal to standard asymptotic results, given the very small sample size that we have here.
Can a permutation test help us?

Consider the null hypothesis that there is

Notice that there are 4 scores in the table above, and the 2 scores associated with the pre-semester test can be assigned in [4! / (2! 2!)] = 6 ways. Here are those 6 possible permutations:

1 P P F F 4.5

2 P F P F 1.5 (S

3 P F F P 0.5

4 F P P F -0.5

5 F P F P -1.5

6 F F P P -4.5

*no improvement*in knowledge, and the alternative hypothesis that there is a*positive*improvement. We don't need a fancy test statistic. For instance, we can simply use the difference between the "after" and "before" sample average scores. Then, the*observed*value of our test statistic is S_{obs}= (74.0 - 72.5) = 1.5. Yes, this is positive, but is it different from zero in any important sense? (I'm trying to avoid the term "statistically significant" here!)Notice that there are 4 scores in the table above, and the 2 scores associated with the pre-semester test can be assigned in [4! / (2! 2!)] = 6 ways. Here are those 6 possible permutations:

__70 72 75 76__S_{π}= (X^{*}_{F}- X^{*}_{P})**π**1 P P F F 4.5

2 P F P F 1.5 (S

_{obs})3 P F F P 0.5

4 F P P F -0.5

5 F P F P -1.5

6 F F P P -4.5

Here, "P" and "F" refer to the pre-semester test and final exam. respectively. X

^{*}_{P}and X^{*}_{F }are the sample average (permuted) scores in the test and in the final exam. S_{π}denotes our test statistic for permutation π (= 1 to 6).
Given that we have a one-sided alternative hypothesis, the question is, "what fraction of the six S

_{π}values is greater than or equal to S_{obs}?" The answer is (2/6) = 0.333.*associated with our permutation test. On the basis of its value I, personally,***This is the p-value***wouldn't reject*the null hypothesis. I'd conclude that the course led to*no improvement*in students' knowledge.
If we had a 2-sided alternative hypothesis - namely that the course led to

*no change*in performance, then the p-value would be 0.667. Yes, we're doubling the one-sided p-value in this case, because the distribution of the S_{π}values happens to be symmetric. More importantly, we get this answer by noting that four out of the six | S_{π}| values are greater than or equal to | S_{obs }|.
Now, you might wonder if the particular form of test statistic that we chose affected the result in any substantive way. Actually, the answer is "no". If we converted the S

_{π}statistics into (pseudo-) "t-statistics", using the usual standardization, and then proceeded as above, we'get exactly the same p-value for the permutation test. S should be monotone with respect to changes in the data that make the null less likely. The choice may (often will) affect power of the test.
In this first example it was a simple matter to spell out all of the possible permutations that we needed to consider, and then we could apply an exact permutation test. Now let's look at a second simple example which is also a classic permutation test. In this case, because of the sample size, random selection among all possible permutations has to be used.

**Example 2:**

Suppose that we want to test the hypothesis that there is no correlation between the supply (Q) of a particular good and its market price, P. We'd probably want to have a one-sided alternative hypothesis, saying that there is a positive such correlation.

Also, suppose that we have a sample of 20 prices and the corresponding 20 quantities supplied:

**P:**5.0, 4.8, 4.7, 4.0, 5.3, 4.1, 5.5, 4.7, 3.3, 4.0, 4.0, 4.6, 5.3, 3.0, 3.5, 3.9, 4.7, 5.0, 5.2, 4.6

**Q:**60, 59, 58, 47, 65, 48, 67, 70, 55, 63, 62, 65, 71, 56, 59, 60, 74, 77, 78, 62

^{18}possible permutations of the observations on P. So, we'll draw a random selection of 50,000 of these permutations in constructing our randomization test.

The sample correlation for the 20 original values of P and Q is r

_{obs}= 0.6183. We'll denote the sample correlation associated with Q and a particular set of permuted P observations by r

_{π}(π = 1, ..., 50000). The (one-sided) p-value is the fraction of these 50,000 values of r

_{π}that are greater than or equal to r

_{obs}. You'll find both EViews and R code for this example on the code page of this blog. (I could have used Steve Garren's jmuOutlier R package, but I wanted you to see the code details.)

The p-values that I got using EViews and R were 0.0017 and 0.0016 respectively. The distribution of the 50,000 correlation coefficients generated in R looks like this:

Again, if we wanted to compute a two-sided p-value this would be the fraction of the 50,000 values of | r

_{π}| that are greater than or equal to | r

_{obs }|. Modifying the R code, we get a value of 0.00348 for this p-value.

In any case, on the basis of the very small p-value I'd (personally)

*reject the hypothesis*that there is no correlation between price and quantity supplied.

Finally, although it was quite natural to use the sample correlation coefficient, r, as the test statistic for our hypothesis, again this is not actually necessary. Recalling the formula for r, we can use the test statistic, S

_{π}= Σ(P

_{i}Q

_{i}). (The range of summation is over the sample size, n.) It differs from r

_{π}, only by a location shift and a scaling. Constructing a permutation test of H

_{0}using S

_{π}instead of r

_{π}, we get

*exactly the same p-value*(s). The EViews program code that I've supplied illustrates this. (If you're not an EViews user, you can open the program file with any text editor, and there are enough comments in the code for you to see what's going on.)

The important things to note about these two examples are:

- The permutation/randomization tests were constructed and applied without the need for any particular knowledge or assumptions about some underlying population.
- There was no need for the sampling to be strictly random.
- It was legitimate to "shuffle" (permute) the observations, treating them as if they were "exchangeable". (This will not be the case for some testing problems.)
- The tests are very simple, conceptually, to construct.
- Although they are modestly computer-intensive, this "cost" is on a par with bootstrapping, for example.

By now, you're probably wanting to see how these tests can help in the situation that we're most familiar with as econometricians - regression analysis. My next post will illustrate this, and it will also demonstrate that permutation/randomization tests are

*exact*when it comes to their significance level.**References**

Fisher, R. A., 1935.

*The Design of Experiments*. Oliver & Boyd, Edinburgh.

Kennedy, P. E., 1995. Randomization tests in econometrics.

*Journal of Business and Economic Statistics*, 13, 85-94.
Pitman, E., 1937a. Significance tests which may be applied to samples from any populations: I.

*Journal of the Royal Statistical Society, B*, 4, 119-130.
Pitman, E., 1937b. Significance tests which may be applied to samples from any populations: II,.

*Journal of the Royal Statistical Society, B*, 4, 225-232.
Pitman, E., 1938. Significance tests which may be applied to samples from any populations: III.

*Biometrika*, 29, 322-335.
Great post. I recall hearing the term "exact test", but I never got interested enough to really understand it.

ReplyDeleteInteresting post but I think there is a mistake: In hypothesis testing we either test H0: No difference vs. H1: difference (in any direction) (two-sided testing) or we test one-sided: H0: mean of group 1 ≥ mean of group 2 vs. H1: mean of group 1 < mean of group 2 OR H0: mean of group 1 < mean of group 2 vs. H1: mean of group 1 ≥ mean of group 2. But we never test H0: μ = 1 (say) against the alternative hypothesis, HA: μ > 1 (say) because the two hypotheses would not"cover" the range of possible results.

ReplyDeleteThanks, I've amended the wording of the example.

DeleteThanks for the great post. I just wanted to ask if it is possible to add an option to the end of the post which makes PDF file which is suitable for printing which eliminates adds and other things so that it consumes less ink?

ReplyDeleteThanks!

You can download a pdf version that I've made for you from https://www.dropbox.com/s/akpy34s3uex9g6x/davegiles.blogspot.com-Econometrics%20Beat%20Dave%20Giles%20Blog.pdf?dl=0

DeletePlease note that this file is copyright and is not to be posted online without my prior written permission. DG