Tuesday, October 22, 2013

Solution to the Segmented Regression Problem

Here's my solution to the "segmented regression" problem that I posed yesterday. Thanks for the comments and suggestions!

You'll recall that what we wanted to do was to end up with a fitted least squares "line" looking like this:

In particular, the "kink" in the line is at a pre-determined point - in this example when x = 30.

Here's how we can achieve this:

The basic regression model is

              yi = α + β xi + εi    ;     i = 1, 2, ..., n .                                                              (1)

Suppose that we want the line segments to join when x = x*. Then, define a dummy variable, Di, such that:

             Di = 0      ;    if xi ≤ x*
             Di = 1      ;    if xi > x*

The two line segments in the graph above have different intercepts and different slopes, so we would probably think of modifying model (1) to become:

             yi = α + β xi + γ Di + δ (xiDi) + εi    ;     i = 1, 2, ..., n .                                    (2)

That's a good start, but we still have to force the join-point to be at x*.

This requirement amounts to the following restriction on the parameters of the model:

             γ + δ x* = 0,

where x* is just a known number (30 in my example above).
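
To see where this restriction comes from, compare the values that the two segments of model (2) take at the join-point. This is just a worked restatement of the argument, using the same symbols as equations (1) and (2):

```latex
\begin{align*}
\text{segment with } D_i = 0 \text{, evaluated at } x^*: \quad & \alpha + \beta x^* \\
\text{segment with } D_i = 1 \text{, evaluated at } x^*: \quad & (\alpha + \gamma) + (\beta + \delta)\, x^* \\
\text{the two values coincide if and only if} \quad & \gamma + \delta x^* = 0
\end{align*}
```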

Using this restriction to eliminate γ from equation (2), we get:

             yi = α + β xi + δ Di (xi - x*) + εi    ;     i = 1, 2, ..., n .                                     (3)
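
For readers who don't use EViews, here is a minimal sketch of how equation (3) could be estimated by OLS in Python with NumPy. It is not the code behind this post: the function name `fit_segmented`, the simulated data, and the "true" coefficient values are all assumptions made purely for illustration.

```python
import numpy as np

def fit_segmented(x, y, x_star):
    """OLS fit of y = alpha + beta*x + delta*D*(x - x_star) + error,
    where D = 1 if x > x_star and 0 otherwise (equation (3))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    D = (x > x_star).astype(float)            # dummy variable
    X = np.column_stack([np.ones_like(x),     # intercept
                         x,                   # slope before the kink
                         D * (x - x_star)])   # change in slope after the kink
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef                               # alpha, beta, delta

# Simulated data (hypothetical; NOT the data used in the post).
rng = np.random.default_rng(0)
x = np.arange(1, 101, dtype=float)
x_star = 30.0
y = 5.0 + 2.0 * x - 3.0 * np.where(x > x_star, x - x_star, 0.0)
y = y + rng.normal(scale=1.0, size=x.size)

alpha, beta, delta = fit_segmented(x, y, x_star)

# Within-sample fitted values for observations 1-30 and 30-100.
yforc1 = alpha + beta * x[:30]
yforc2 = alpha + beta * x[29:] + delta * (x[29:] - x_star)

yforc1_at_join = yforc1[-1]   # fitted value at x = 30 from the first segment
yforc2_at_join = yforc2[0]    # fitted value at x = 30 from the second segment
print(alpha, beta, delta)
print(yforc1_at_join, yforc2_at_join)
```

Because the regressor D(x - x*) is zero at x = x*, the two fitted segments agree at the join-point by construction, whatever the estimated coefficients turn out to be.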

Here is the EViews output for my estimated regression model:

The EViews workfile is on the code page for this blog, and the data I used are available on the data page.

We can then generate within-sample forecasts, separately, for observations 1 to 30, and observations 30 to 100. If these series are called YFORC1 and YFORC2, this is (part of) what we get:

Notice that YFORC1 = YFORC2 at observation 30, as required.

If we then gather X, Y, YFORC1 and YFORC2 into a group, and produce a scatter-plot, here's the result we wanted:

So, it all comes down to the use of a dummy variable and a restriction on the regression coefficients. One without the other won't work.

Ryan commented on the post in question, and suggested that (in my notation) we estimate the model:

                     yi = α (xi - 30) + β Di (xi - 30) + εi .

This produces the following results:

Ryan gets the join-point alright, but the fit over the first sub-sample doesn't look very convincing. Sorry, buddy!
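
One way to see what is going on: both regressors in Ryan's model are zero when x = 30, and there is no intercept, so the fitted line is forced through the point (30, 0). In other words, it is equation (3) with the extra restriction α + 30β = 0. The sketch below compares the two specifications on simulated data. The data, the seed, and the helper name `ssr_and_coef` are assumptions for illustration only, not the data or code from this post.

```python
import numpy as np

def ssr_and_coef(X, y):
    """Return the sum of squared residuals and the OLS coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid), coef

# Simulated data (hypothetical; NOT the data used in the post).
rng = np.random.default_rng(1)
x = np.arange(1, 101, dtype=float)
x_star = 30.0
D = (x > x_star).astype(float)
y = 5.0 + 2.0 * x - 3.0 * D * (x - x_star) + rng.normal(size=x.size)

# Equation (3): intercept, x, and D*(x - x*).
X_seg = np.column_stack([np.ones_like(x), x, D * (x - x_star)])
# Ryan's specification: (x - x*) and D*(x - x*), with no intercept.
X_ryan = np.column_stack([x - x_star, D * (x - x_star)])

ssr_seg, coef_seg = ssr_and_coef(X_seg, y)
ssr_ryan, coef_ryan = ssr_and_coef(X_ryan, y)

# Fitted values at the join-point x = x*.
fit_at_join_seg = coef_seg[0] + coef_seg[1] * x_star
fit_at_join_ryan = coef_ryan[0] * (x_star - x_star) + coef_ryan[1] * (x_star - x_star)

print("SSR, equation (3):", ssr_seg)
print("SSR, Ryan's model:", ssr_ryan)
print("Fitted value at join, equation (3):", fit_at_join_seg)
print("Fitted value at join, Ryan's model:", fit_at_join_ryan)
```

Since Ryan's regressors lie in the span of those in equation (3), his SSR can never be smaller than the SSR from equation (3).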

© 2013, David E. Giles


  1. thanks dave. very interesting. you allow for a changing slope at a known point using the introduction of a dummy variable and a second coefficient. ( dimitry: my mistake. it seems like you were close ) neat stuff. and it also seems like an approach that could be extended to multiple change points ( as long as you know what they are beforehand ) also.

    1. Mark - yes, this definitely extends to any number of (known) change-points.

  2. This paper may be of some interest.

    1. See pages 425-426

    2. The segmented regression idea has been around in the stats. literature since the 60's or 70's

    3. Dave, could you point out the first time that this particular trend segmented regression appeared in the literature, if you happen to know? I am very curious to know. Thanks!

    4. Not exactly sure, but I have put some early references in a new post (26 October 2013, here:

  3. I maintain that unless you are looking for a regression that doesn't give you garbage results, my method is the clear winner buddy.

    Thanks for looking at my attempt. I checked the EViews code and played around, very educational. A nice way of exploring the algebra of forcing the regression surface through a fixed point, with some dummy variables intuition in there. I think this would be a great problem in most econometrics texts. Have your students already been subjected to this? I wonder about bias, and am curious if you have a DGP in mind for this problem.

    1. Hah ! :-) :-)
      There's a restriction being imposed on the parameters, so if this is false, then the estimator will be biased (& inconsistent). Students - ECON 545, supp. exercises 2!

  4. Thanks for the awesome blog, Dave!
    If x* is unknown, one can find its least squares estimate by minimizing SSR over a set of candidate thresholds. I believe, in the current example such an estimate, i.e. argmin-SSR(x*), happens to be equal to 24. I hope I got this right.
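
A rough sketch of the grid search described in this comment: for each candidate threshold, fit equation (3) by OLS and keep the candidate with the smallest SSR. The function names, the candidate grid, and the simulated data are assumptions for illustration. This is not run on the post's data, so it says nothing about whether 24 is the right answer there.

```python
import numpy as np

def ssr_for_threshold(x, y, c):
    """SSR from fitting y = alpha + beta*x + delta*D*(x - c), D = 1 if x > c."""
    D = (x > c).astype(float)
    X = np.column_stack([np.ones_like(x), x, D * (x - c)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def estimate_threshold(x, y, candidates):
    """Return the candidate threshold with the smallest SSR, plus all SSRs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ssrs = [ssr_for_threshold(x, y, c) for c in candidates]
    best = candidates[int(np.argmin(ssrs))]
    return best, ssrs

# Simulated data (hypothetical; NOT the data used in the post).
rng = np.random.default_rng(2)
x = np.arange(1, 101, dtype=float)
y = 5.0 + 2.0 * x - 3.0 * np.where(x > 30.0, x - 30.0, 0.0)
y = y + rng.normal(scale=1.0, size=x.size)

# Keep candidates away from the ends so both segments contain several points.
candidates = list(range(5, 96))
x_star_hat, ssrs = estimate_threshold(x, y, candidates)
print("Estimated threshold:", x_star_hat)
```

Note that once x* is estimated rather than known, the usual standard errors from the final regression no longer account for the search over thresholds.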