Database of Data Sets

I would love to see a comprehensive database of data sets that is searchable by variable name, frequency, survey methodology, target population, etc. ICPSR is okay for finding data, but doesn’t do a good job of putting that data in context and doesn’t facilitate comparisons between similar-but-not-the-same data sets. The fact that notes like this have to exist is a sign that decentralized access to data poses an obstacle to research.

It doesn’t have to be like the St. Louis Fed’s FRED or IPUMS, where you can actually access the data. All it needs is a link to wherever the data is hosted, or contact info for the researchers who developed it in the case of one-off projects. Maybe even a link to the corresponding ICPSR page. What it does need is an exhaustive list of available data items in each set and a standard format for every summary entry. Look at the Wikipedia page for any country. On the right side of the page is a sidebar with major relevant features, and most sidebars have roughly the same structure. The data set equivalent would answer the obvious questions: How big is the panel? What’s the attrition rate? What’s the survey frequency? Right there, in your face.
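To make the "standard format for every summary entry" idea concrete, here is a rough sketch of what one sidebar record might look like. Every field name is my own guess at what belongs in it, and the example entry is made up, not a real survey.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetSummary:
    """One standardized summary entry. All field names are hypothetical."""
    name: str
    producer: str
    frequency: str                          # e.g. "monthly", "annual"
    target_population: str
    panel_size: Optional[int] = None        # None means "not recorded"
    attrition_rate: Optional[float] = None  # fraction lost between waves
    data_url: Optional[str] = None          # link to wherever the data is hosted
    contact: Optional[str] = None           # researcher contact for one-off projects
    variables: List[str] = field(default_factory=list)

    def sidebar(self) -> str:
        """Render the same fields in the same order for every data set."""
        rows = [
            ("Name", self.name),
            ("Producer", self.producer),
            ("Frequency", self.frequency),
            ("Target population", self.target_population),
            ("Panel size", self.panel_size),
            ("Attrition rate", self.attrition_rate),
            ("Data", self.data_url),
            ("Contact", self.contact),
            ("Variables", len(self.variables)),
        ]
        return "\n".join(
            f"{label}: {'unknown' if value is None else value}"
            for label, value in rows
        )


# Made-up entry for illustration only; not a real survey.
example = DatasetSummary(
    name="Example Household Panel",
    producer="Example Statistics Office",
    frequency="annual",
    target_population="households in Exampleland",
    panel_size=5000,
    data_url="https://example.org/ehp",
    variables=["real_income", "vocational_training"],
)
print(example.sidebar())
```

The point of forcing every entry through the same `sidebar` method is that a missing fact shows up as "unknown" instead of silently disappearing, so gaps in the documentation are as visible as the facts.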

The biggest problem I have with the search tools currently available is that the quantitative specifics of the data (i.e., the most important part) are never plainly available and are buried deep in the data file. The right data is always out there, but finding it is unacceptably laborious. I don’t want to have to spend a half hour with that horrible DataFerrett app only to find out that SIPP doesn’t ask people if they received vocational training. I don’t want to have to rely on NORC to post a chart of which variables are available in which years, because who knows where they posted it? I don’t want to have to pore over original survey questionnaires just to find out if the variable I need is even in there. The BLS tries admirably to keep their employment data sets organized, but why should we make them bend over backwards to do it? They even give me SAS, SPSS, and Stata files for importing and recoding data from the NLS Investigator. Dayenu, shouldn’t that be enough?

I should be able to type something like “real income” into a search box and immediately get a list of every data set that records real income by whatever unit of analysis. I should be able to specify date ranges and broad classifying categories. Let’s say the CPS comes up, and I click on it. Immediately I see the sample size, the frequency, the target population, who produces it, a few links to very similar data (think “related artists” on Spotify), a link to where the CPS data can be downloaded, and a brief description of the sample design. Maybe even a link to Scopus or Web of Knowledge with a search for “Current Population Survey” already punched in. Because I searched for “real income,” variables with description and tag matches appear first. I can click on “See All Variables” and have instant, searchable access to the full list of every variable available in every year. I do not need summary statistics, I do not need an FTP link, I do not need SAS code. I just want to know what’s in the damn data set, and I don’t want to have to actually download the data to find it.
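The search I have in mind is not complicated. Here is a minimal sketch, assuming each variable carries a description, some tags, and the years it is available. The `Variable` fields, the scoring weights, and the catalog entries are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variable:
    """A single variable in some data set. Field names are hypothetical."""
    dataset: str
    name: str
    description: str
    first_year: int
    last_year: int
    tags: List[str] = field(default_factory=list)


def search(variables: List[Variable], query: str,
           start: Optional[int] = None, end: Optional[int] = None) -> List[Variable]:
    """Return matching variables, best matches first.

    A description match scores 2 and a tag match scores 1; these weights
    are arbitrary. Variables outside the requested date range are dropped.
    """
    q = query.lower()
    hits = []
    for v in variables:
        if start is not None and v.last_year < start:
            continue  # variable ends before the requested range begins
        if end is not None and v.first_year > end:
            continue  # variable begins after the requested range ends
        score = 0
        if q in v.description.lower():
            score += 2
        if q in (t.lower() for t in v.tags):
            score += 1
        if score:
            hits.append((score, v))
    hits.sort(key=lambda h: -h[0])  # stable sort keeps catalog order on ties
    return [v for _, v in hits]


# Made-up catalog; "Survey A" and "Survey B" are not real data sets.
CATALOG = [
    Variable("Survey B", "earn_adj", "Inflation-adjusted earnings", 2000, 2020, ["real income"]),
    Variable("Survey A", "rinc", "Real income of household", 1990, 2010, ["real income"]),
    Variable("Survey A", "voctrain", "Received vocational training", 1990, 2010, ["training"]),
]

for v in search(CATALOG, "real income"):
    print(v.dataset, v.name, v.description)
```

Note that `earn_adj` only turns up because someone tagged it; its description never mentions "real income" at all.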

Everyone on the internet uses “tags” for everything (except me because I’m lazy). How stupidly easy is it to search StackExchange, Reddit, and Last.FM? Very, and it’s because people tag stuff. This database could add a tag to each variable in the case where two similar variables don’t have enough common words in their descriptions to come up together in searches. Tag different measurements of the same concept with the same tag: instant side-by-side comparison capability. Better yet, actually be able to query variables from different data sets and run a side-by-side comparison the way you can online with computers and cars.
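Here is roughly how the tag trick would work, as a sketch rather than a design. The records, tag names, and function names are all hypothetical. The two "real income" variables below share no words in their descriptions, which is exactly the case where a tag earns its keep.

```python
from collections import defaultdict

# Hypothetical variable records; nothing here describes a real data set.
VARIABLES = [
    {"dataset": "Survey B", "name": "earn_adj",
     "description": "Inflation-adjusted earnings", "tags": ["real income", "earnings"]},
    {"dataset": "Survey A", "name": "rinc",
     "description": "Real household income", "tags": ["real income"]},
    {"dataset": "Survey B", "name": "voctrain",
     "description": "Received vocational training", "tags": ["training"]},
]


def build_tag_index(variables):
    """Map each lowercased tag to the list of variables carrying it."""
    index = defaultdict(list)
    for var in variables:
        for tag in var["tags"]:
            index[tag.lower()].append(var)
    return index


def side_by_side(index, tag):
    """Return (dataset, name, description) rows for one tag, sorted by data set."""
    rows = sorted(index.get(tag.lower(), []), key=lambda v: v["dataset"])
    return [(v["dataset"], v["name"], v["description"]) for v in rows]


index = build_tag_index(VARIABLES)
for row in side_by_side(index, "real income"):
    print(" | ".join(row))
```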

This is a big project, but if enough effort goes into building the structure, very little effort will be needed to maintain it.

UPDATE: Someone in the government made a valiant attempt with http://www.data.gov/, but it’s buggy and none of the data download links seem to work. They went way too big way too soon.

UPDATE 2: I think SIPP actually does ask people if they received vocational training, but it’s buried somewhere in a many-dozen-page PDF file in a part of the website I will never be able to find again.
