I think my new thing is starting my posts with disclaimers, so:
DISCLAIMER: The technical background to this post is a bit too technical for a general interest blog. Because of that combined with my desire to get this post out to a general audience (specifically the readership of dogspotting) the results have been sat in my ‘odd_projects’ folder for months as I’ve tried to work out how to make this both technically correct and interesting to a general reader. Luckily in those months, I have read a good blogpost on word embeddings written by someone much smarter than me which has taken the time to explain thoroughly what word embeddings are and why they are interesting. So I have decided that instead of trying to be rigorous with this post and prove everything I am just going to assert to you what is going on with the technical side and hope that you believe me. I guess the disclaimer then is I can’t explain in detail how this works, but if you take my word for it this will be super SUPER fun. Another one is that this post will work almost infinitely better on a laptop/PC than it will on your phone! Sorry if you’re on your phone and it kinda sucks!
What I have: literally every god damn dogspotting comment there ever was, EVER. 1.2 MILLION DOGGONE DOG COMMENTS.
What I’m going to do with it: Make my computer learn a vector representation of dogspotting language, including DoggoLingo words. Using this data, my computer taught itself that doggo – dog = pupper – puppy = bork – bark. It can also plot the positions of words in DoggoLingo space, find that all swearwords are clustered together in that space, and find that the dogspotting admins’ names are used in different contexts than everyone else’s names. I sound fully unhinged, right? Please bear with me, I almost promise that it’s worth it.
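To give a flavour of what "doggo – dog = pupper – puppy" means, here is a minimal sketch using made-up 4-dimensional vectors. Everything in it is an assumption for illustration: the numbers in `TOY_VECTORS` are invented by hand rather than learned from the dogspotting comments, and the `cosine` and `analogy` helpers are hypothetical names. Real embeddings have far more dimensions and are much messier.

```python
import math

# Hypothetical toy vectors, invented for illustration (NOT learned from data).
# Axes, roughly: dog-ness, youth, DoggoLingo-ness, noise-ness.
TOY_VECTORS = {
    "dog":    [1.0, 0.0, 0.0, 0.0],
    "doggo":  [1.0, 0.0, 1.0, 0.0],
    "puppy":  [1.0, 1.0, 0.0, 0.0],
    "pupper": [1.0, 1.0, 1.0, 0.0],
    "bark":   [0.0, 0.0, 0.0, 1.0],
    "bork":   [0.0, 0.0, 1.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Return the word closest to (b - a + c), ignoring the three input words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))
```

With these toy numbers, `analogy("dog", "doggo", "puppy")` picks out `"pupper"`: the step from dog to doggo is the same direction as the step from puppy to pupper.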
DISCLAIMER 1: This is the first of many disclaimers on this post: I am terribly placed to write this. First up, I’m a guy. So although I care about sexism I’m probably not the best person to talk about it. Also, I genuinely do not like movies – I think since TV has made good long form series the standard it doesn’t make sense for me to pay £10 to go to see some superhero origin story with some rich white dudes who I’m supposed to know the names of. I’m sure the medium has its benefits and I’m being unfair but in the spirit of journalistic integrity before I write this post I need to admit that over the last year I’ve watched the entire series of Peep Show through more times than I’ve gone to the cinema.
Anyway, me being a philistine aside, I found the Kaggle dataset of IMDB movies and wanted to see whether we could also see the sexism in Hollywood in the data.
What I’ve got: IMDB data for 5000 movies
What I’m going to do with it: Show that movies featuring women are rated lower, have lower budgets, but are more profitable than movies featuring men. Also that films with men in have titles which are incredibly phallic.
[NOTE: I wrote this blogpost ages ago to pitch to another website, for whatever reason it fell through but I feel the need to point out a couple of things:
1. Since writing this, it turns out that Byron is a really nasty company, so if you take anything from this it is DO NOT BUY FROM BYRON, the burgers ain’t that good anyway. As a result I have replaced all use of the word byron with CRAPPY BURGER JOINT.
2. Since I was expecting it to be on another site, the style of it is a bit more sweary; that’s probably just a one-off.
3. My friends aren’t crappy and actually I don’t know anyone who does this so don’t think this is aimed at y’all.]
It is 2016 and we still have major issues dealing with the restaurant bill. Too many times you have 10 people sat around a table in Zizzi who each either have to rationalise that “£20≈£18.95 with a tip right?” or sit there for several excruciating minutes waiting for the card machine to go around each person while the dad from the next family up angrily catches your eye from the “Please wait to be seated” sign. Then, in this time crisis enforced upon you by the social pressure of being in eyesight of ‘the sign’, you have a major decision to make: either try and relearn how to use your calculator app to work out how much your meal was or split the bill evenly. What I’m here to show you is that because of this option, it’s very easy for your crappy friends to take your money.
What I’ve got: The ability to simulate random meals drawn from the CRAPPY BURGER JOINT menu
What I’m going to do with it: Prove that having a bad friend can cost you money
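As a rough sketch of the kind of simulation this involves: the menu items and prices below are invented placeholders (not the real menu), and the "bad friend" is assumed to simply order the most expensive thing in every course while everyone else orders at random. The function names are hypothetical too.

```python
import random

# Hypothetical menu with made-up prices in pounds (NOT the real menu).
MAINS = {"classic burger": 7.50, "cheeseburger": 8.75, "double bacon burger": 11.50}
SIDES = {"nothing": 0.00, "fries": 3.25, "onion rings": 3.50}
DRINKS = {"tap water": 0.00, "soft drink": 2.75, "milkshake": 4.50, "beer": 4.95}
COURSES = (MAINS, SIDES, DRINKS)


def random_meal(rng):
    """Price of a meal where each course is picked uniformly at random."""
    return sum(rng.choice(list(course.values())) for course in COURSES)


def greedy_meal():
    """Price of the bad friend's meal: the priciest option in every course."""
    return sum(max(course.values()) for course in COURSES)


def average_overpayment(n_diners, n_trials=10_000, seed=0):
    """Average extra (in pounds) an honest diner pays when the bill is split
    evenly and exactly one diner at the table orders greedily."""
    rng = random.Random(seed)
    total_extra = 0.0
    for _ in range(n_trials):
        honest = [random_meal(rng) for _ in range(n_diners - 1)]
        even_share = (sum(honest) + greedy_meal()) / n_diners
        fair_share = sum(honest) / len(honest)  # what honest diners ate, on average
        total_extra += even_share - fair_share
    return total_extra / n_trials
```

Calling `average_overpayment(4)` and `average_overpayment(10)` shows the damage is spread thinner at a bigger table, which is exactly why the bad friend gets away with it.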
Everything kind of sucks everywhere at the moment. One thing that doesn’t suck is the Facebook group called ‘Dogspotting’. In Dogspotting, members post pictures of dogs they’ve seen in the street and other players rate them with points. These points are accrued all in the hope of winning ‘The Big Prize’. The rules are strictly enforced by a team of dedicated admins, knowing that they are under the scrutiny of not only the dogspotting people’s court but also the hacks at the dogspotting gossip and gab magazine.
It’s silly but it’s the best kind of silly. It’s also extremely popular: the group has almost a quarter of a million members and it’s still growing. I thought I’d take a look at how and when people spot dogs, partly to help me on the way to win The Big Prize and partly just for fun, so I took every single dogspotting post from the group’s inception to now: 229,971 posts scraped using this code by GitHub user minimaxir.
What I’ve got: every single dogspot ever made
What I’m going to do with it: See what, when and why people spot dogs to win the Big Prize
I do quite a few projects which get a few cool graphs in them but no interesting conclusions or discoveries, so instead of just leaving them to rot in my ‘odd_projects’ folder, I thought I’d start publishing some short posts outlining what I did (like really just an outline, I probably won’t go very deep into the theory) and sharing the graphs. So here goes:
What I have: The MNIST database, a database of 70,000 handwritten digits labelled by what number they’re meant to be
What I’m going to do with it: Use principal component analysis to compare relative difficulties of classifying handwritten digits
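For anyone who hasn't met principal component analysis, here is a bare-bones sketch of its first step on four made-up 2D points rather than real digits. It finds the first principal component by power iteration on the covariance matrix, which is a simple stand-in chosen for readability (in practice you would reach for a library). `POINTS` and all the function names are hypothetical, and this is not necessarily how the analysis in the post was done.

```python
import math

# Made-up 2D points lying roughly along a diagonal line (NOT MNIST data).
POINTS = [[2.0, 1.9], [0.0, 0.1], [1.0, 1.1], [3.0, 2.9]]


def covariance(rows):
    """Sample covariance matrix of the rows (features are columns)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    d = len(means)
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, n_iter=100):
    """Top eigenvector and eigenvalue of a covariance matrix by power iteration."""
    d = len(cov)
    v = [1.0] + [0.0] * (d - 1)  # arbitrary starting direction
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: variance of the data along direction v
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue


def explained_variance_ratio(rows):
    """Fraction of the total variance captured by the first component."""
    cov = covariance(rows)
    _, top = first_component(cov)
    total = sum(cov[i][i] for i in range(len(cov)))
    return top / total
```

For these points the first component points along the diagonal and soaks up nearly all of the variance. With images of digits the same idea applies, just with one dimension per pixel instead of two.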
Doing stats during an election is like talking in a crowded room. Me doing stats during an election is like whispering lines from the phonebook in a crowded room while everyone else talks through megaphones giving away Amazon gift card codes. I may not have banks of telethonners calling and polling choice constituencies and asking questions like “which party leader would you most like to meet in the smoking area of a club?” but one thing I do have is 47 hours of people tweeting about hating Nigel Farage et al.
What I’ve got: 47 hours of tweets mentioning any of the parties or party leaders with the words ‘hate’ or ‘love’.
What I’m going to do with it: Construct a way of quantifying the hate of each party. Find out each party’s hate factor, each leader’s hate factor and the difference between the two. Plot a graph of hate against time over the full 47 hours.
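One simple way a "hate factor" could be defined, purely as an assumption for illustration, is the share of love/hate tweets about a target that contain the word ‘hate’. The sketch below uses that assumed definition with invented tweets and placeholder party names; `hate_factor` and `SAMPLE_TWEETS` are hypothetical.

```python
from collections import Counter

# Invented example tweets with placeholder party names (NOT real data).
SAMPLE_TWEETS = [
    "I hate Party A so much",
    "love Party A tbh",
    "Party A, hate them",
    "I love Party B",
    "nothing about parties here",
]


def hate_factor(tweets, target):
    """Assumed definition: hate mentions / (hate + love mentions) for `target`.

    Returns None if no tweet mentions the target alongside 'hate' or 'love'.
    """
    counts = Counter()
    for text in tweets:
        lowered = text.lower()
        if target.lower() not in lowered:
            continue
        if "hate" in lowered:
            counts["hate"] += 1
        if "love" in lowered:
            counts["love"] += 1
    total = counts["hate"] + counts["love"]
    return counts["hate"] / total if total else None
```

Under this definition a score near 1 means almost pure hate and a score near 0 means almost pure love. Plain substring matching like this is crude (it would count "I don't hate them" as hate), so treat it as a starting point only.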
I’ve got no business writing a blog about statistics. This isn’t going to be zeitgeisty and impactful because I’m neither of those things, and it isn’t going to ‘make statistics fun’ because statistics generally isn’t fun. However, since we’ve accepted that, we can have some fun by asking questions that the smart people don’t have time for. That’s what gutterstats is about, stupid questions with stupid answers. Today’s question is: who’s hungry? We’re gonna look at the kind of person who tweets about their stomach and see how they differ from the ‘average’ tweeter.