Everything kind of sucks everywhere at the moment. One thing that doesn’t suck is the Facebook group called ‘Dogspotting’. In Dogspotting, members post pictures of dogs they’ve seen in the street and other players rate them with points, which are accrued in the hope of winning ‘The Big Prize’. The rules are strictly enforced by a team of dedicated admins, who know that they are under the scrutiny of not only the dogspotting people’s court but also the hacks at the dogspotting gossip and gab magazine.
It’s silly, but it’s the best kind of silly. It’s also extremely popular: the group has almost a quarter of a million members and it’s still growing. I thought I’d take a look at how and when people spot dogs, partly to help me on the way to winning The Big Prize and partly just for fun, so I took every single dogspotting post from the group’s inception to now: 229,971 posts, scraped using this code by GitHub user minimaxir.
What I’ve got: every single dogspot ever made
What I’m going to do with it: See what, when and why people spot dogs, in order to win the Big Prize
I do quite a few projects which get a few cool graphs in them but no interesting conclusions or discoveries. Instead of just leaving them to rot in my ‘odd_projects’ folder, I thought I’d start publishing some short posts outlining what I did (like really just an outline, I probably won’t go very deep into the theory) and sharing the graphs, so here goes:
What I have: The MNIST database, a database of 70,000 handwritten digits labelled by what number they’re meant to be
What I’m going to do with it: Use principal component analysis to compare relative difficulties of classifying handwritten digits
My gut instinct tells me that NFL running backs are some of the most poorly treated athletes in the world of sport. The rules against hurting running backs are significantly less strict than those for wide receivers or quarterbacks, which leads to a significantly larger number of career-ending injuries. Teams know the fragility of running backs and are less inclined to offer them guaranteed money on their contracts (money which will be paid even if a player cannot keep playing due to an injury), which significantly lowers the career earnings of an unlucky running back. Further to that, coaches treat running backs as expendable due to the simplicity of their task and will often drop an injured one for a healthier model, of which, due to the pyramid-scheme nature of the NFL, there will always be one available. Given enough data and enough time I would like to prove that all of the above is true.
However, for this post I want to show that a running back’s age affects their ability to play in the league: not only are they less likely to get a job as they get older, but simply being on the wrong side of 30 dramatically reduces their chance of having one.
What I have: A database of all players currently active in the league, and a historic database of all drafted players.
What I’m going to do with it: See that running backs over 30 are disproportionately cut from NFL rosters compared to other skill positions.
Do you think if you flipped a coin in a mint, it would show heads more than tails? Imagine if we set up a small coin-stadium in or adjacent to the mint where the coin was made, where other coins would sit around watching the coin get flipped. Say we flipped the coin outside of the stadium first a bunch of times and showed that it was relatively 50/50 whether it was going to be heads or tails, but then we went back to this mint-stadium and flipped the coin 3,879 times, and it turned up heads 2,219 times. With a simple statistical test, you can show that the probability of a 50/50 coin giving this result in the stadium is 0.000000000256%.
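The post doesn’t say which ‘simple statistical test’ was used, so here is a rough sketch of one candidate: a one-sided exact binomial test, written in Python. The function name `binomial_upper_tail` is made up for illustration, and the choice of test and tail is an assumption, so the figure it prints need not match the one quoted above.

```python
import math

def binomial_upper_tail(n, k):
    """Exact P(X >= k) for X ~ Binomial(n, 0.5), i.e. a fair coin."""
    # Count the head/tail sequences with at least k heads...
    favourable = sum(math.comb(n, i) for i in range(k, n + 1))
    # ...and divide by the total number of equally likely sequences.
    return favourable / 2 ** n

flips, heads = 3879, 2219  # the mint-stadium numbers from the text

p_value = binomial_upper_tail(flips, heads)

# Normal-approximation z-score: how many standard deviations
# the head count sits above the fair-coin expectation of n / 2.
z_score = (heads - flips / 2) / (math.sqrt(flips) / 2)

print(f"one-sided p-value: {p_value:.3g}, z-score: {z_score:.2f}")
```

Summing exact integer counts avoids the rounding trouble you would get from adding up thousands of tiny floating-point probabilities.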
Football is not a coin. However, every team – no matter how good or bad – plays 16 games in the regular season: 8 of those at their own stadium and 8 of those at an opponent’s stadium, so a good team will play at home as much as a bad team will. Yet when you run through the stats, the ‘home field advantage’, i.e. that the home team are more likely to win than the away team, is more statistically significant than the detection of the Higgs boson.
What I’ve got: 14 years of regular season NFL data (2000-2014) – a few thousand games, half a million plays.
What I’m going to do with it: Try and find which bits of a football game are affected by ‘home field advantage’ in a (fairly) rigorous manner.
Andy Dalton is actually very good this season! Johnny Football is actually going to play football! A divisional matchup that the NFL thought would be less one-sided at the start of the season when they decided the TNF games!
Amongst all these heart-wrenching storylines, weren’t you desperate for someone to post some simple (and some less simple) graphs to clear the air? Well, look no further.
Today I’m starting up the first of hopefully many On Any Given Axes features, where I take a game that I’m watching and share graphs that I’ve made about it. I’ll share the graphs on Twitter and copy the tweets here, and will try to respond to any interesting comments on either, so do keep in touch!
What I’ve got: A divisional matchup with two maverick quarterbacks
What I’m going to do with it: Watch it and graph it.
[Disclaimer: I’m British and trying to talk about the NFL, so it’s pretty likely I’m going to sound like either an idiot or an alien while trying to describe what’s going on here. My only request is that you send abuse using the anonymous field at the bottom, which goes straight to my email, instead of the comment box, which everyone can see]
Imagine walking down the street and someone with a clipboard and a bored expression asks you the question “How many glasses of water did you have in the last week?”. You probably don’t really know the answer, and the person asking doesn’t really give too much of a crap. Maybe you could guess at any number between 30 and 40 glasses of water with an equal amount of belief, but you have to choose a number – are you just as likely to choose 32 as 35?
Maybe not. And in the NFL, when the guy with the ball gets tackled or stopped at the end of a run and the officials only get a few seconds and a compromised view to decide where it stops, will every yard line be treated equally?
What I’ve got: A spreadsheet containing every single play run in the NFL from 2000-2014 (500,000 in all)
What I’m going to do with it: Show that the referees subconsciously change the outcome of a play based on where the painted lines are on a field, and subsequently show that it doesn’t matter.
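One way to sketch the kind of check described here, assuming the painted lines sit every 5 yards and that an unbiased official would spot the ball on any whole yard equally often: compare the share of spots landing on a painted line with the 1-in-5 you’d expect by chance. The function `painted_line_share` and the toy list of spots are invented for illustration, not taken from the real play-by-play data.

```python
import math

def painted_line_share(spots, spacing=5):
    """Share of spots on a painted line, plus a z-score against chance."""
    n = len(spots)
    on_line = sum(1 for yard in spots if yard % spacing == 0)
    observed = on_line / n
    # Assumption: with no bias, each yard is equally likely,
    # so 1 in `spacing` spots should land on a painted line.
    expected = 1 / spacing
    standard_error = math.sqrt(expected * (1 - expected) / n)
    return observed, (observed - expected) / standard_error

# Hypothetical yard lines where the ball was spotted (toy data only).
toy_spots = [20, 23, 25, 31, 35, 35, 40, 42, 45, 47,
             50, 18, 15, 30, 29, 10, 36, 40, 44, 25]

observed, z_score = painted_line_share(toy_spots)
print(f"share on a painted line: {observed:.2f}, z-score: {z_score:.2f}")
```

A large positive z-score would suggest the painted lines are pulling the spots towards them. With real data you would also want to account for plays that genuinely start or end on a line, such as touchbacks.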
Doing stats during an election is like talking in a crowded room. Me doing stats during an election is like whispering lines from the phonebook in a crowded room while everyone else talks through megaphones giving away Amazon gift card codes. I may not have banks of telethonners calling and polling choice constituencies and asking questions like “which party leader would you most like to meet in the smoking area of a club?” but one thing I do have is 47 hours of people tweeting about hating Nigel Farage et al.
What I’ve got: 47 hours of tweets mentioning any of the parties or party leaders with the words ‘hate’ or ‘love’.
What I’m going to do with it: Construct a way of quantifying the hate of each party. Find out each party’s hate factor, each leader’s hate factor and the difference between the two. Plot a graph of hate against time over the full 47 hours.
I’ve got no business writing a blog about statistics. This isn’t going to be zeitgeisty and impactful because I’m neither of those things, and it isn’t going to ‘make statistics fun’ because statistics generally isn’t fun. However, since we’ve accepted that, we can have some fun by asking questions that the smart people don’t have time for. That’s what gutterstats is about: stupid questions with stupid answers. Today’s question is: who’s hungry? We’re gonna look at the kind of person who tweets about their stomach and see how they differ from the ‘average’ tweeter.