Want to get your hands on some data?


Even before I started my PhD, I knew that I wanted to focus on empirical work. While I have a deep respect for theorists, my comparative advantage and, more importantly, my interests lie in testing new theories and evaluating public policies.

Getting started was very difficult! I knew the topic that I wanted to research (actually, I knew what I did not want to research), I had a neophyte’s econometric skills, and I knew the software well. However, I realized that if you intend to go beyond an insipid descriptive analysis and produce publishable academic work, you also need a great idea. Although there truly is no replacement for the scientific method, one thing that helped and inspired me was seeing other people’s work and what data is out there. Navigating these waters might expand your vision to things that you did not think possible. Who knew that you could use fishing patterns to predict illegal fishing? Or that there is a link between a country’s corruption index and the propensity of its diplomats to pay parking tickets in New York?

To help you get started, I compiled some data sources that I have used in the past. The list is by no means comprehensive, but it should (at least) increase your resources and perhaps inspire your next AER paper. Please feel free to send me any suggestions. The list is organized by topic. I rate each source based on how easy it is to download and understand the data.

General


Chile has an amazing database of public data. From education to health and finances, the country provides an excellent tool for real-time analysis. There are several interesting papers using their data (here and here). More importantly, they have an established process for requesting more restricted data. (My rating 8.5/10)

Uruguay - Like Chile, Uruguay is a pioneer in data transparency in Latin America. They provide all of their data in different formats (xml, txt, csv), which makes it easy to get started. I used their data on speeding tickets for the city of Montevideo when I demonstrated how to use Python for data analysis. (My rating 8.5/10)
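
If you have never opened one of these CSV downloads in Python, here is a minimal sketch of a first pass. The sample rows and column names below are made up for illustration (they are not the actual Montevideo schema), and in practice you would read the downloaded file instead of an inline string.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV file. The column
# names and values are invented for illustration, not taken from a real portal.
SAMPLE_CSV = """date,neighborhood,speed_kmh
2015-03-01,Centro,72
2015-03-01,Pocitos,65
2015-03-02,Centro,80
"""

def count_by_column(csv_text, column):
    """Count how many rows fall under each distinct value of `column`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)

# Tally tickets per neighborhood and print them, most frequent first.
counts = count_by_column(SAMPLE_CSV, "neighborhood")
for name, n in counts.most_common():
    print(name, n)
```

A simple tally like this is often enough to spot odd values or missing categories before you move on to anything more ambitious.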

Procurement


Brazil provides data on its government purchases. Though several papers use this data (here Ferraz), I have not personally explored their datasets. (My rating X/10)

Ecuador. In 2008, Ecuador passed a law that required all public institutions to conduct all of their procurement through a centralized website. The site provides an easy way to download a high-level overview of all purchases made in the country since 2008. For a metadata analysis, there is a search tool that unfortunately only allows you to search for individual purchases (a good incentive to learn data scraping). Because it is relatively new, the site contains some mistakes and inaccuracies. (My rating 7/10)

US procurement data – I really like this site, as it provides several user-friendly ways to download and analyze the data. They have great APIs, and you can download the data in different formats. (My rating 9/10)

Miscellaneous


Chicago crime - Although several cities in the US provide crime data or crime maps, I think Chicago’s portal is excellent. It provides crime data from 2001 to the present (with a lag of only one week). Additionally, they offer several ways to download, visualize, and analyze the data. You can also combine this geolocated dataset with other sources from the site. (My rating 10/10).

Sports

Recently I came up with what I thought was a brilliant idea, only to find out later that I was about 10 years too late. The authors of this paper look at the relationship between American football losses and domestic violence. I won’t spoil the ending. A good source for data on sports is ShrpSports.

ICPSR has a great tool for social data. They have, among others, data on the census, legal systems, and social indicators. You can browse by geographic location, study, or researcher.

Big Data


Amazon has a great repository of BIG public databases. When I say big, I mean human genome and Wikipedia dumps big. For other big datasets, click here.

I hope this gives you a start; there is always plenty more to see. Feel free to send me any suggestions.

