
Friday, March 29, 2019

Infographics Parades

When I saw Myko Clelland's tweet this morning, my reaction was "Wow! Just, wow!"

Myko (@DapperHistorian) kindly pointed me to the source of this photo that he tweeted about:


It appears on page 343 of Willard Cope Brinton's book, Graphic Methods for Presenting Facts (McGraw-Hill, 1914).

Myko included a brief description in his tweet, but let me elaborate by quoting from pp.342-343 of Brinton's book, and you'll see why I liked the photo so much:
"Educational material shown in parades gives an effective way for reaching vast numbers of people. Fig. 238 illustrates some of the floats used in presenting statistical information in the municipal parade by the employees of the City of New York, May 17, 1913. The progress made in recent years by practically every city department was shown by comparative models, charts, or large printed statements which could be read with ease fro either side of the street. Even though the day of the parade was rainy, great crowds lined the sidewalks. There can be no doubt that many of the thousands who saw the parade came away with the feeling that much is being accomplished to improve the conditions of municipal management. A great amount of work was necessary to prepare the exhibits, but the results gave great reward."
Don't you just love it? A gigantic mobile poster session!

© 2019, David E. Giles

Sunday, January 13, 2019

Machine Learning & Econometrics

What is Machine Learning (ML), and how does it differ from Statistics (and hence, implicitly, from Econometrics)?

Those are big questions, but I think that they're ones that econometricians should be thinking about. And if I were starting out in Econometrics today, I'd take a long, hard look at what's going on in ML.

Here's a very rough answer - it comes from a post by Larry Wasserman on his (now defunct) blog, Normal Deviate:
"The short answer is: None. They are both concerned with the same question: how do we learn from data?
But a more nuanced view reveals that there are differences due to historical and sociological reasons. ...
If I had to summarize the main difference between the two fields I would say: 
Statistics emphasizes formal statistical inference (confidence intervals, hypothesis tests, optimal estimators) in low dimensional problems. 
Machine Learning emphasizes high dimensional prediction problems. 
But this is a gross over-simplification. Perhaps it is better to list some topics that receive more attention from one field rather than the other. For example: 
Statistics: survival analysis, spatial analysis, multiple testing, minimax theory, deconvolution, semiparametric inference, bootstrapping, time series.
Machine Learning: online learning, semisupervised learning, manifold learning, active learning, boosting. 
But the differences become blurrier all the time. ...
There are also differences in terminology. Here are some examples:
Statistics          Machine Learning
------------------  -------------------
Estimation          Learning
Classifier          Hypothesis
Data point          Example/Instance
Regression          Supervised Learning
Classification      Supervised Learning
Covariate           Feature
Response            Label
Overall, the two fields are blending together more and more and I think this is a good thing."
As I said, this is only a rough answer - and it's by no means a comprehensive one.

For an econometrician's perspective on all of this you can't do better than to take a look at Frank Diebold's blog, No Hesitations. If you follow up on his posts with the label "Machine Learning" - and I suggest that you do - then you'll find 36 of them (at the time of writing).

If (legitimately) free books are your thing, then you'll find some great suggestions for reading more about the Machine Learning / Data Science field(s) on the KDnuggets website - specifically, here in 2017 and here in 2018.

Finally, I was pleased that the recent ASSA Meetings (ASSA2019) included an important contribution by Susan Athey (Stanford), titled "The Impact of Machine Learning on Econometrics and Economics". The title page for Susan's presentation contains three important links to other papers and a webcast.

Have fun!

© 2019, David E. Giles

Tuesday, November 27, 2018

More Long-Run Canadian Economic Data

I was delighted to hear recently from former grad. student, Ryan Macdonald, who has worked at Statistics Canada for some years now. Ryan has been kind enough to draw my attention to all sorts of interesting items from time to time (e.g., see my earlier posts, here and here).

I always appreciate hearing from him.

His latest email was prompted by my post, A New Canadian Macroeconomic Database.

Ryan wrote:
"I saw your post on long run data and thought you might be interested in a couple of other long-run datasets for your research.  If I remember correctly you are familiar with the GDP/GNI series, Long-run Real Income EstimatesI also added the long-run Bank of Canada commodity price series that go back to 1870 to it.  There is also a dataset for the provinces with estimates going back to 1950 or 1926 depending on the variable: Long-run Provincial and Territorial Data ."
Thanks for this information, Ryan. This will be very helpful, and I'd be more than happy to publicize any further such developments.

© 2018, David E. Giles

Thursday, November 22, 2018

A New Canadian Macroeconomic Database

Anyone who's undertaken empirical macroeconomic research relating to Canada will know that there are some serious data challenges that have to be surmounted.

In particular, getting access to long-term, continuous, time series isn't as easy as you might expect.

Statistics Canada has been criticized frequently over the years by researchers who find that crucial economic series are suddenly "discontinued", or are re-defined in ways that make it extremely difficult to splice the pieces together into one meaningful time-series.

In recognition of these issues, a number of efforts have been made to provide Canadian economic data in forms that researchers need. These include, for instance, Boivin et al. (2010), Bedock and Stevanovic (2017), and Stephen Gordon's on-going "Project Link".

Thanks to Olivier Fortin-Gagnon, Maxime Leroux, Dalibor Stevanovic, and Stéphane Suprenant, we now have an impressive addition to the available long-term Canadian time-series data. Their 2018 working paper, "A Large Canadian Database for Macroeconomic Analysis", discusses their new database and illustrates its usefulness in a variety of ways.

Here's the abstract:
"This paper describes a large-scale Canadian macroeconomic database in monthly frequency. The dataset contains hundreds of Canadian and provincial economic indicators observed from 1981. It is designed to be updated regularly through (the) StatCan database and is publicly available. It relieves users to deal with data changes and methodological revisions. We show five useful features of the dataset for macroeconomic research. First, the factor structure explains a sizeable part of variation in Canadian and provincial aggregate series. Second, the dataset is useful to capture turning points of the Canadian business cycle. Third, the dataset has substantial predictive power when forecasting key macroeconomic indicators. Fourth, the panel can be used to construct measures of macroeconomic uncertainty. Fifth, the dataset can serve for structural analysis through the factor-augmented VAR model."
Note - these are monthly data! And they're freely available. Although the paper doesn't appear to provide the source for accessing the data, Dalibor kindly pointed out to me that there's a download link here, on his webpage. This link will give you the data in spreadsheet form, together with all of the necessary background information.

The only slight concern that I have about this resource - and I don't want to sound ungrateful - is the issue of the updating of the data over time. You'll note from the abstract that the database "...is designed to be updated regularly through (the) StatCan database...". Given my comments (above) about some of the issues that we've all faced for a very long time when it comes to StatCan data, I know that updating this new database on a regular basis is going to be a bit of a challenge.

Added 8 March 2019: I'm glad to learn that a new update of the database is now available here.

However, let's not let this concern detract from the considerable benefits that we'll all derive from having access to this rich set of Canadian macroeconomic time-series.

Thanks, again, to the authors for constructing this database, and for making it freely available!

References

Bedock, N. & D. Stevanovic, 2017. An empirical study of credit shock transmission in a small open economy. Canadian Journal of Economics, 50, 541–570.

Boivin, J., M. Giannoni, & D. Stevanovic, 2010. Monetary transmission in a small open economy: more data, fewer puzzles. Technical report, Columbia Business School, Columbia University.

Fortin-Gagnon, O., M. Leroux, D. Stevanovic, & S. Suprenant, 2018. A large Canadian database for macroeconomic analysis. CIRANO Working Paper 2018s-25.

Gordon, S., 2018. Project Link - Piecing together Canadian economic history. Département d'économique, Université Laval.

© 2018, David E. Giles

Sunday, September 2, 2018

September Reading List

This month's list of recommended reading includes an old piece by Milton Friedman that you may find interesting:
  • Broman, K. W. & K. H. Woo, 2017. Data organization in spreadsheets. American Statistician, 72, 2-10.
  • Friedman, M., 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32, 675-701.
  • Goetz, T. & A. Hecq, 2018. Granger causality testing in mixed-frequency VARs with (possibly) cointegrated processes. MPRA Paper No. 87746.
  • Güriş, B., 2018. A new nonlinear unit root test with Fourier function. Communications in Statistics - Simulation and Computation, in press.
  • Honoré, B. E. & L. Hu, 2017. Poor (Wo)man's bootstrap. Econometrica, 85, 1277-1301. (Discussion paper version.)
  • Peng, R. D., 2018. Advanced Statistical Computing. Electronic resource.
© 2018, David E. Giles

Monday, July 2, 2018

Some Reading Suggestions for July

Some summertime reading:
  • Chen, T., J. DeJuan, & R. Tian, 2018. Distributions of GDP across versions of the Penn World Tables: A functional data analysis approach. Economics Letters, in press.
  • Clements, K. W., H. Liu, & Y. Tarverdi, 2018. Alcohol consumption, censorship and misjudgment. Applied Economics, online.
  • Jin, H., S. Zhang, J. Zhang, & H. Hao, 2018. Modified tests for change points in variance in the possible presence of mean breaks. Journal of Statistical Computation and Simulation, online.
  • Pata, U. K., 2018. The Feldstein Horioka puzzle in E7 countries: Evidence from panel cointegration and asymmetric causality analysis. Journal of International Trade and Economic Development, online.
  • Sen, A., 2018. A simple unit root testing methodology that does not require knowledge regarding the presence of a break. Communications in Statistics - Simulation and Computation, 47, 871-889.
  • Wright, T., M. Klein, & K. Wieczorek, 2018. A primer on visualizations for comparing populations, including the issue of overlapping confidence intervals. American Statistician, online.

© 2018, David E. Giles

Friday, May 5, 2017

Here's What I've Been Reading

Here are some of the papers that I've been reading recently. Some of them may appeal to you, too:
© 2017, David E. Giles

Friday, September 9, 2016

Spreadsheet Errors

Five years ago I wrote a post titled, "Beware of Econometricians Bearing Spreadsheets". 

The take-away message from that post was simple: there's considerable, well-documented evidence that spreadsheets are very, very dangerous when it comes to statistical calculations. That is, if you care about getting the right answers!

Read that post, and the associated references, and you'll see what I mean.

(You might also ask yourself: why pay big bucks for commercial software of questionable quality, when you can use high-quality statistical software, such as R, for free?)

This week, a piece in The Economist looks at the shocking record of publications in genomics that fall prey to spreadsheet errors. It's a sorry tale, to be sure. I strongly recommend that you take a look.
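
For the record, the root of many of those genomics blunders is Excel silently auto-converting gene symbols such as SEPT2 or MARCH1 into dates. Purely as an illustrative sketch (the file name and column name here are hypothetical), this is how you might screen an imported file for that sort of damage in R:

    # Flag values that look like Excel date-mangled gene symbols,
    # e.g. "SEPT2" silently becoming "2-Sep" on the spreadsheet side.
    # The file and column names are hypothetical.
    genes <- read.csv("genomics_export.csv", stringsAsFactors = FALSE)

    date_like <- grepl("^[0-9]{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
                       genes$gene_symbol)

    if (any(date_like)) {
      cat(sum(date_like), "suspicious date-like 'gene symbols' found:\n")
      print(genes$gene_symbol[date_like])
    }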

Yes, any software can be mis-used. Anyone can make a mistake. We all know that. However, it's not a good situation when a careful and well-informed researcher ends up making blunders just because the software they trust simply isn't up to snuff!  


© 2016, David E. Giles

Wednesday, September 30, 2015

Reading List for October

Some suggestions for the coming month:

© 2015, David E. Giles

Wednesday, August 12, 2015

Classic Data Visualizations

My thanks to Veronica Johnson at Investech.com for drawing my attention to a recent piece of theirs relating to Classic Data Visualizations.

As they say:
"A single data visualization graphic can be priceless. It can save you hours of research. They’re easy to read, interpret, and, if based on the right sources, accurate, as well.  And with the highly social nature of the web, the data can be lighthearted, fun and presented in so many different ways. 
What’s most striking about data visualizations though is that they aren’t as modern a concept as we tend to think they are. 
In fact, they go back to more than 2,500 years—before computers and tools for easy visual representation of data even existed."
Here are the eleven graphics that they highlight:

Saturday, January 31, 2015

Some Suggested Reading

  • Bachoc, F., H. Leeb, and B. M. Pötscher, 2014. Valid confidence intervals for post-model-selection predictors. Working Paper, Department of Statistics, University of Vienna.
  • Baumeister, C. and J. D. Hamilton, 2014. Sign restrictions, structural vector autoregressions, and useful prior information. NBER Working Paper No. 20741.
  • Bjerkholt, O., 2015. Fellowship elections in the Econometric Society 1933-1948. Working Paper, Department of Economics, University of Oslo.
  • Deuchert, E. and M. Huber, 2014. A cautionary tale about control variables in IV estimation. Discussion Paper No. 2014-39, School of Economics and Political Science, University of St. Gallen.
  • Doornik, J. A. and D. F. Hendry, 2014. Statistical model selection with 'Big Data'. Discussion Paper 735, Department of Economics, University of Oxford.
  • Duvendack, M., R. W. Palmer-Jones, and W. R. Reed, 2014. Replications in economics: A progress report. Working Paper No. 26/2014, Department of Economics and Finance, University of Canterbury.

© 2015, David E. Giles

Tuesday, November 11, 2014

Read Before You Cite!

Note to self - file this post in the "Look Before You Leap" category!

Looking at The New Zealand Herald newspaper this morning, this headline caught my eye:

"How Did Sir Owen Glenn's Domestic Violence Inquiry Get $7 Billion Figure Wrong?"

$7 Billion? Even though that's (only) New Zealand dollars, it still sounds like a reasonable question to ask, I thought. And (seriously) this is a really important issue, so, I read on.

Here's part of what I found (I've added the red highlighting):

Tuesday, September 2, 2014

Getting Quandl Data Into EViews

I've sung the praises of Quandl before - e.g., see here. What's not to like about millions of free time series - especially when they're linked back to their original sources, so that updating and accuracy are the least of your worries?

If you can then get your favourite statistics/econometrics package or programming language to access and import these data seamlessly, so much the better. The less you "handle" the data (e.g., by copying and pasting), the less likely you are to introduce unwanted errors. 

One of the great strengths of Quandl is that it facilitates data importing very nicely indeed:

A case in point is with EViews, where it's achieved using an EViews "Add-in". I've recently put together a handout that deals with this for the students in one of the courses I'm teaching this term. There's nothing in it that you couldn't learn from the Quandl and EViews sites. However, it's a "step-by-step guided tour", and I thought that it might be of use more generally. 

You can download it here.
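
For those who work in R rather than EViews, Quandl provides an R package as well. Here's a minimal sketch, assuming you've registered for a (free) API key; the series code is just an illustrative example:

    # Pull a Quandl series directly into R - an alternative to the
    # EViews Add-in route described above.
    # install.packages("Quandl")    # one-time installation
    library(Quandl)

    Quandl.api_key("YOUR_API_KEY")  # placeholder - substitute your own key

    # "FRED/GDP" is just an illustrative series code (U.S. GDP, via FRED)
    gdp <- Quandl("FRED/GDP", type = "ts")

    plot(gdp, main = "U.S. GDP (via Quandl)", ylab = "$ billion")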

© 2014, David E. Giles

Tuesday, August 19, 2014

David Mimno on "Data Carpentry"

There's a post on David Mimno's blog today titled, "Data Carpentry".

I like it a lot, because it emphasises just how much effort, time and creativity can be required in order to get one's data in order before we can get on with the fun stuff - estimating models, testing hypotheses, making forecasts, and so on. I know that this was something that I didn't fully appreciate when I was starting my career. And when I did get the message, I found it rather irksome!

However, the message isn't going to change, so we just have to live with it, and accept the realities of working with "real" data.

In his post, David explains why he doesn't like the oft-used term "data cleaning" (which makes us sound like "data janitors"), and why he prefers the term "data carpentry". Certainly, the latter has more constructive overtones.

As he says:
"To me these imply that there is some kind of pure or clean data buried in a thin layer of non-clean data, and that one need only hose the dataset off to reveal the hard porcelain underneath the muck. In reality, the process is more like deciding how to cut into a piece of material, or how much to plane down a surface. It’s not that there’s any real distinction between good and bad, it’s more that some parts are softer or knottier than others. Judgement is critical.
The scale of data work is more like woodworking, as well. Sometimes you may have a whole tree in front of you, and only need a single board. There’s nothing wrong with the rest of it, you just don’t need it right now."
A nice post, and a very nice "take" on a crucial part of the work that we do.


© 2014, David E. Giles

Wednesday, May 28, 2014

June Reading List


Put away that novel! Here's some really fun June reading:
  • Berger, J., 2003. Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science, 18, 1-32.
  • Canal, L. and R. Micciolo, 2014. The chi-square controversy. What if Pearson had R? Journal of Statistical Computation and Simulation, 84, 1015-1021.
  • Harvey, D. I., S. J. Leybourne, and A. M. R. Taylor, 2014. On infimum Dickey-Fuller unit root tests allowing for a trend break under the null. Computational Statistics and Data Analysis, 78, 235-242.
  • Karavias, Y. and E. Tzavalis, 2014. Testing for unit roots in short panels allowing for a structural break. Computational Statistics and Data Analysis, 76, 391-407.
  • King, G. and M. E. Roberts, 2014. How robust standard errors expose methodological problems they do not fix, and what to do about it. Mimeo., Harvard University.
  • Kuroki, M. and J. Pearl, 2014. Measurement bias and effect restoration in causal inference. Biometrika, 101, 423-437.
  • Manski, C., 2014. Communicating uncertainty in official economic statistics. Mimeo., Department of Economics, Northwestern University.
  • Martinez-Camblor, P., 2014. On correlated z-values in hypothesis testing. Computational Statistics and Data Analysis, in press.

© 2014, David E. Giles

Friday, May 9, 2014

Replication in Economics

I was pleased to receive an email today, alerting me to the "Replication in Economics" wiki at the University of Göttingen:
"My name is Jan H. Höffler, I have been working on a replication project funded by the Institute for New Economic Thinking during the last two years and found your blog that I find very interesting. I like very much that you link to data and code related to what you write about. I thought you might be interested in the following:
 
We developed a wiki website that serves as a database of empirical studies, the availability of replication material for them and of replication studies: http://replication.uni-goettingen.de

It can help for research as well as for teaching replication to students. We taught seminars at several faculties internationally - also in Canada, at UofT - for which the information of this database was used. In the starting phase the focus was on some leading journals in economics, and we now cover more than 1800 empirical studies and 142 replications. Replication results can be published as replication working papers of the University of Göttingen's Center for Statistics.

Teaching and providing access to information will raise awareness for the need for replications, provide a basis for research about the reasons why replications so often fail and how this can be changed, and educate future generations of economists about how to make research replicable.

I would be very grateful if you could take a look at our website, give us feedback, register and vote which studies should be replicated – votes are anonymous. If you could also help us to spread the message about this project, this would be most appreciated."
I'm more than happy to spread the word, Jan. I've requested an account, and I'll definitely be getting involved with your project. This looks like a great venture!


© 2014, David E. Giles

Friday, April 4, 2014

There's an App for That

I was looking for econometrics-related "apps" for my Android tablet. Very little of interest came up for "Econometrics", but there certainly are some nice data-related apps.

Here are a few examples:



© 2014, David E. Giles

Sunday, March 23, 2014

Data Transfer Advice From Francis Smart

I always enjoy reading the posts by Francis Smart on his Econometrics by Simulation blog. A couple of days ago he wrote a nice piece titled, "It is Time for RData Files to Become the Standard for Data Transfer". 

Francis made some very good points about the handling of large amounts of data, and he provided some convincing examples regarding the compression rates and opening times for RData files as compared with other options. The comments to his post are also very relevant.
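
If you'd like to check the compression point for yourself, a minimal sketch along the following lines (using simulated data, and arbitrary file names) will do it:

    # Write the same data frame as a CSV file and as a compressed
    # .RData file, then compare the file sizes on disk.
    set.seed(123)
    big_df <- data.frame(id = 1:1e6, x = rnorm(1e6), y = rnorm(1e6))

    write.csv(big_df, "big_df.csv", row.names = FALSE)
    save(big_df, file = "big_df.RData")   # compressed by default

    file.info("big_df.csv")$size          # typically tens of MB
    file.info("big_df.RData")$size        # usually a small fraction of the CSV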

If you're using or exchanging large data files, you'll find Francis's post most helpful.


© 2014, David E. Giles

Monday, December 30, 2013

A Cautionary Bedtime Story

Once upon a time, when all the world and you and I were young and beautiful, there lived in the ancient town of Metrika a young boy by the name of Joe.

Saturday, December 28, 2013

Statistical Significance - Again

With all of this emphasis on "Big Data", I was pleased to see this post on the Big Data Econometrics blog, today.

When you have a sample that runs to the thousands (billions?), the conventional significance levels of 10%, 5%, 1% are completely inappropriate. You need to be thinking in terms of tiny significance levels.

I discussed this in some detail back in April of 2011, in a post titled, "Drawing Inferences From Very Large Data-Sets". If you're one of those (many) applied researchers who use large cross-sections of data, and then sprinkle their results tables with asterisks to signal "significance" at the 5%, 10% levels, etc., then I urge you to read that earlier post.
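
The mechanics are worth spelling out: the t-statistic on a regression coefficient grows roughly with the square root of the sample size, so at any fixed significance level an economically trivial effect will eventually show up as "significant". Here's a small simulated illustration in R (the slope and sample size are chosen purely for illustration):

    # An economically negligible effect becomes "significant" at
    # conventional levels once the sample is large enough.
    set.seed(42)
    n <- 1e6                       # a "Big Data" sample size
    x <- rnorm(n)
    y <- 0.005 * x + rnorm(n)      # true slope is tiny: 0.005

    fit <- summary(lm(y ~ x))
    fit$coefficients["x", ]        # t-statistic near 5; p-value of order 1e-7

    # Even this negligible slope easily clears the usual 5% (or 1%)
    # hurdle - which is why the significance level should be reduced
    # as n grows, not held fixed.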

It's sad to encounter so many papers and seminar presentations in which the results, in reality, are totally insignificant!


© 2013, David E. Giles