There are some great industry standard datasets out there: Iris, the 20 newsgroups, anything from UCI, and the Yelp academic dataset come to mind. There are even some great non-traditional ML datasets and/or lists (we've probably tweeted them all out) that can be fun.
But I have a special place in my heart for funny, random data that you don't stumble across every day. It can be a little bit harder to find (I suppose this is sort of a self-fulfilling property) but in my experience, it's well worth the extra digging. In this post, I'm going to go over 7 datasets that I've found over the years that I think are worth sharing. They may be a little bit obscure, but I can assure you they are quite a bit of fun!
Pigeons were once a high-tech form of communication. Some were even used as spies during WWII (I'm currently reading about this in Double Cross: The True Story of the D-Day Spies by Ben Macintyre, which is excellent). While I'd guess that spy pigeons are probably going to remain a thing of the past, pigeon racing is alive and well.
Another great study. Officially named An Investigation for Determining the Optimum Length of Chopsticks, it follows a few researchers who set out to determine the optimal length for chopsticks. Lucky for us, they recorded their data fairly meticulously, and came up with some pretty hilarious units to measure just how effectively a pair of chopsticks performed (my personal favorite is "Food Pinching Efficiency"). One can only hope that chopstick designers worldwide have taken this research into account when developing new chopstick designs.
I have this strange fascination with old datasets. While these crop yields aren't exactly straight from the source (the project maintainer is using forecast/estimate models), it's still really interesting to get a view into the macro economy of the 13th century!
This dataset is surprisingly granular (see what I did there). You can cut things up by County, Estate, Manor, and even the type of crop that was being raised--19 are tracked! In addition, the database spans from 1210 all the way to 1500.
Thanks to Bruce Campbell, PhD for doing all the research and putting this great little database together.
Reference: Bruce M. S. Campbell (2007), Three centuries of English crops yields, 1211‑1491 [WWW document]. URL http://www.cropyields.ac.uk [15/06/2015]
The official name for this academic paper is Correlation of Performance Test Scores with Tissue Concentration of Lysergic Acid Diethylamide in Human Subjects. The official description of the study is absolutely classic:
Group of volunteers was given LSD, their mean scores on math exam and tissue concentrations of LSD were obtained at n=7 time points.
Or in other words, how good you are at math when you're on LSD.
As you might imagine, scores did not improve with usage....
So you might not be aware of this but cup stacking, or "Sport Stacking", is actually a sport. If you're wondering what exactly it is, it's pretty much what it sounds like: stacking cups. Fun anecdote: one of my friends was actually the Texas State Champion.
In any event, the data is available from the WSSA website (that's the World Sport Stacking Association) which allows you to search through different divisions, age groups, competitors, and even state/country records.
I'm not quite sure how this data gets collected, but it turns out there's a repository of historical marijuana prices. I suppose you can find pretty much anything on the Internet. Luckily someone already did the hard work of scraping the requisite data. All I had to do was combine and organize the CSV cesspool into one nice, neat data file.
This dataset turned out to be fairly interesting given the political aspects behind marijuana legalization. It turns out (I suppose unsurprisingly) that prices vary a ton from state to state.
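For anyone facing a similar pile of scraped files, here is a minimal sketch of the "combine the CSVs into one file" step. It is not the script used for this post: the function name `combine_csvs` and the added `source_file` column are assumptions made for illustration, and it assumes every file has a header row and well-formed rows.

```python
import csv
from pathlib import Path


def combine_csvs(input_paths, output_path):
    """Merge several CSV files into one, tagging each row with its source file.

    Hypothetical helper: columns are unioned across files, and cells that a
    given file does not provide are left empty.
    """
    rows = []
    fieldnames = ["source_file"]  # illustrative extra column, not in the raw data
    for path in map(Path, input_paths):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Collect any column names we have not seen yet, preserving order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                row["source_file"] = path.name
                rows.append(row)

    with Path(output_path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling something like `combine_csvs(sorted(Path("raw").glob("*.csv")), "combined.csv")` (the directory and file names are placeholders) would write one file and return the number of rows merged. Keeping the source file name on each row makes it easier to trace an odd value back to the scrape it came from.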
Here's another cool, historical dataset. Apparently the Spaniards kept fairly meticulous records of how much silver they were producing during the colonial era. Definitely a great opportunity for some historical fiscal/monetary policy analytics here!
Have any favorite, obscure datasets we've left out? Feel free to drop us a line at firstname.lastname@example.org and we'll add them to our list. Thanks!