# pupper2vec: analysing internet dog slang with machine learning

I think my new thing is starting my posts with disclaimers, so:

DISCLAIMER:
The technical background to this post is a bit too technical for a general interest blog. Because of that combined with my desire to get this post out to a general audience (specifically the readership of dogspotting) the results have been sat in my ‘odd_projects’ folder for months as I’ve tried to work out how to make this both technically correct and interesting to a general reader. Luckily in those months, I have read a good blogpost on word embeddings written by someone much smarter than me which has taken the time to explain thoroughly what word embeddings are and why they are interesting. So I have decided that instead of trying to be rigorous with this post and prove everything I am just going to assert to you what is going on with the technical side and hope that you believe me. I guess the disclaimer then is I can’t explain in detail how this works, but if you take my word for it this will be super SUPER fun. Another one is that this post will work almost infinitely better on a laptop/PC than it will on your phone! Sorry if you’re on your phone and it kinda sucks!

What I have: literally every god damn dogspotting comment there ever was, EVER. 1.2 MILLION DOGGONE DOG COMMENTS.

What I’m going to do with it: Make my computer learn a vector representation of dogspotting language, including DoggoLingo words. Using this data, my computer taught itself that doggo – dog = pupper – puppy = bork – bark. It can also plot the positions of words in DoggoLingo space, find that all swearwords are clustered together in this space, and show that the dogspotting admins’ names are used in different contexts than everyone else’s names. I sound fully unhinged, right? Please bear with me, I almost promise that it’s worth it.
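To give a flavour of what "doggo – dog = pupper – puppy" means in practice, here is a minimal sketch of word-vector arithmetic. The vectors below are hand-made, hypothetical 4-dimensional toys (roughly "dogness", "youngness", "slanginess", "noise"), not the embeddings learned from the comments, and the `analogy` helper is just an illustrative name. The idea is that the difference between two word vectors captures a relationship, and adding that difference to a third word should land you near the word that completes the analogy.

```python
import math

# Hypothetical toy vectors, made up for illustration only. Real embeddings are
# learned from text and have far more dimensions than this.
toy_vectors = {
    "dog":    [1.0, 0.0, 0.0, 0.0],
    "doggo":  [1.0, 0.0, 1.0, 0.0],
    "puppy":  [1.0, 1.0, 0.0, 0.0],
    "pupper": [1.0, 1.0, 1.0, 0.0],
    "bark":   [0.0, 0.0, 0.0, 1.0],
    "bork":   [0.0, 0.0, 1.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words so the answer is not trivially one of them.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("dog", "doggo", "puppy", toy_vectors))
print(analogy("dog", "doggo", "bark", toy_vectors))
```

With real embeddings the match is never exact, so you take the nearest word by cosine similarity rather than expecting the sums to line up perfectly.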

# Sexism in films in data: IMDB, Hollywood and Language

First up:

DISCLAIMER 1: This is the first of many disclaimers on this post: I am terribly placed to write this. First up, I’m a guy. So although I care about sexism I’m probably not the best person to talk about it. Also, I genuinely do not like movies – I think since TV has made good long-form series the standard it doesn’t make sense for me to pay £10 to go to see some superhero origin story with some rich white dudes who I’m supposed to know the names of. I’m sure the medium has its benefits and I’m being unfair, but in the spirit of journalistic integrity, before I write this post I need to admit that over the last year I’ve watched the entire series of Peep Show through more times than I’ve gone to the cinema.
Anyway, me being a philistine aside, I found the Kaggle dataset of IMDB movies and wanted to see whether we could also see the sexism in Hollywood in the data.

What I’ve got: IMDB data for 5000 movies

What I’m going to do with it: Show that movies featuring women are rated lower, have lower budgets, but are more profitable than movies featuring men. Also that films with men in have titles which are incredibly phallic.

# TeBOW Week 11: Playoffs, Picks and Power Rankings

I’m back to give some more uninformed picks! I’m currently in my office trying to get my code to recognise the large-scale structure of the universe (which is easier than it sounds, but I’m finding it harder than it probably is). So I don’t quite have the time to go over last week’s picks. They seemed to do alright; my only worry was that my desire for the model to work was making me support teams I didn’t like in the hopes that the status quo was preserved. TeBOW has turned me into a monster.

This week I have added in the capacity for the model to simulate the rest of the season, which means that I can start to give percentage chances for teams to get to the playoffs. Very literally I am coding these features minutes before I put them up here, so if something weird happens then blame me, but also include a bit of pity in your scorn. I had to get this out before NO/CAR! The battle of the “should be in playoffs but pretty unlucky”.
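For the curious, simulating the rest of a season can be sketched roughly like this. This is not TeBOW’s actual code: the function name, the four made-up teams, the coin-flip `win_prob` and the "top N teams by wins make it" rule are all assumptions for illustration (the real NFL uses divisions, wildcards and a long list of tiebreakers). The idea is just to play out the remaining fixtures thousands of times and count how often each team ends up in a playoff spot.

```python
import random


def simulate_playoff_odds(current_wins, remaining_games, win_prob,
                          playoff_spots, n_sims=10_000, seed=0):
    """Estimate each team's chance of finishing in the top `playoff_spots`.

    current_wins:    dict of team -> wins so far
    remaining_games: list of (home, away) fixtures still to be played
    win_prob:        function (home, away) -> probability the home team wins
    """
    rng = random.Random(seed)
    made_playoffs = {team: 0 for team in current_wins}
    for _ in range(n_sims):
        wins = dict(current_wins)
        for home, away in remaining_games:
            if rng.random() < win_prob(home, away):
                wins[home] += 1
            else:
                wins[away] += 1
        # Simplification: rank purely by wins, breaking ties at random.
        ranked = sorted(wins, key=lambda t: (wins[t], rng.random()),
                        reverse=True)
        for team in ranked[:playoff_spots]:
            made_playoffs[team] += 1
    return {team: count / n_sims for team, count in made_playoffs.items()}


# Hypothetical four-team league, for illustration only.
current_wins = {"A": 6, "B": 5, "C": 5, "D": 3}
remaining_games = [("A", "B"), ("C", "D"), ("B", "C"), ("D", "A")]
odds = simulate_playoff_odds(current_wins, remaining_games,
                             win_prob=lambda home, away: 0.5,
                             playoff_spots=2)
for team, chance in odds.items():
    print(f"{team}: {chance:.1%}")
```

In a real version `win_prob` would come from the model’s team ratings rather than being a flat coin flip.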

Anyway, enough of the foreshadowing, let’s go for the 1000th power rankings you’ve read this week!

# (Introducing) TeBOW’s NFL PICKS: Week 10

America, statisticians and the world at large have had a pretty crappy week. What better week, then, to introduce my overly simplistic statistical model to attempt to predict the outcome of American Football games, TeBOW!

TrueSkill (extended) Based OWins.

The model takes only the outcome of games that have happened and manages to calculate the rating and consistency of a given team. This allows us to do two things: firstly, we can power rank the teams based on their games so far, and secondly, we can make predictions about the games that are still to come. Every week until the end of the season I will publish the power rankings on a Monday, and then the predictions on a Thursday.

TeBOW is so-called as not only is Tim Tebow a meme and I’m addicted to those page views, but also the model completely ignores any potentially relevant information about the performance of the team, pass yardage, interceptions, etc. All TeBOW cares about is wins no matter what, and I think this is fair to his legacy.

# The Spreadsheet Offense: Analysing historical Fantasy Football data

[Typical disclaimer: I’m British and I just like making graphs, I don’t know as much about NFL as my wild assertions might imply. I’ve played fantasy football for one year now and I nearly got beat by someone who drafted Aaron Rodgers and all kickers, so take this advice with a large helping of salt]

It is with a heavy heart that I am about to reveal the basis of my fantasy draft strategy to the 13 other members of the Edinburgh nerds fantasy football league. My squad ‘THE LEGION OF BABY BOOM’ had a troubled season last year, as I picked Eddie Lacy with the second pick of the draft, only for him to drop from 230 points in the 2014 season to 120 points in 2015. I also held out until the later rounds to take a Quarterback, picking Sam Bradford and Teddy Bridgewater in successive rounds. I actually remember taking Teddy and seeing pick after pick not taking Bradford thinking “God what losers, I’m going to get both of them! #1, let’s go boomers!”. Subsequently I had a circus show at Quarterback, starting at points Josh McCown, Brian Hoyer etc. If you don’t have context to anything I’ve said above and I’m just naming random millionaires then let it be known that every name I said above played as if they were deliberately trying to disappoint me. I was not the victor of “VONTASY MILLERBALL”.

Anyway, the 2015 season was a clear sign to me that I am not a great NFL scout. Going on pure feeling again is going to get me embarrassed, especially since I spend far too long in a day reading about NFL to lose so badly again. So I decided to use what I have, a huge dataset of NFL players and a love of scattergraphs and histograms, to try and override my awful instincts on draft day.

What I’ve got: The fantasy record of every player playing in the NFL from 2000-2015

What I’m going to do with it: Dump a load of graphs which attempt to make the readers of this blog win their fantasy league*, GUARANTEED**

*Assuming NFL.com Classic scoring

**The attempt is guaranteed, nothing else

# Crappy friends at crappy burger joints: A statistical analysis of 10 million meals

[NOTE: I wrote this blogpost ages ago to pitch to another website, for whatever reason it fell through but I feel the need to point out a couple of things:
1. Since writing this, it turns out that Byron is a really nasty company, so if you take anything from this it is DO NOT BUY FROM BYRON, the burgers ain’t that good anyway. As a result I have replaced all use of the word Byron with CRAPPY BURGER JOINT.
2. Since I was expecting it to be on another site, the style of it is a bit more sweary, probably just a one off.
3. My friends aren’t crappy and actually I don’t know anyone who does this so don’t think this is aimed at y’all.]

It is 2016 and we still have major issues dealing with the restaurant bill. Too many times you have 10 people sat around a table in Zizzi who each either have to rationalise that “£20≈£18.95 with a tip right?” or sit there for several excruciating minutes waiting for the card machine to go around each person while the dad from the next family up angrily catches your eye from the “Please wait to be seated” sign. Then, in this time crisis enforced upon you by the social pressure of being in eyesight of ‘the sign’, you have a major decision to make: either try and relearn how to use your calculator app to work out how much your meal was or split the bill evenly. What I’m here to show you is that because of this option, it’s very easy for your crappy friends to take your money.

What I’ve got: The ability to simulate random meals drawn from the CRAPPY BURGER JOINT menu

What I’m going to do with it: Prove that having a bad friend can cost you money
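As a rough idea of what "simulating random meals" looks like, here is a minimal sketch. The menu and prices below are entirely made up (they are not the real CRAPPY BURGER JOINT menu), and the function names and the "bad friend orders the most expensive thing in every category" rule are my assumptions for illustration. Everyone else picks at random, the bill is split evenly, and we measure how much extra an honest diner pays on average compared with paying for their own meal.

```python
import random

# Hypothetical menu with made-up prices in pounds, for illustration only.
# A price of 0.00 stands for "didn't order anything from this category".
MENU = {
    "mains":  [7.50, 8.95, 9.75, 11.50],
    "sides":  [0.00, 2.95, 3.50, 3.95],
    "drinks": [0.00, 2.50, 3.95, 4.75],
}


def random_meal(rng):
    """An honest diner picks one option at random from each category."""
    return sum(rng.choice(prices) for prices in MENU.values())


def greedy_meal():
    """The bad friend orders the most expensive option in every category."""
    return sum(max(prices) for prices in MENU.values())


def average_extra_cost(n_diners, n_sims=20_000, seed=0):
    """Mean extra amount one honest diner pays under an even split
    when exactly one person at the table orders greedily."""
    rng = random.Random(seed)
    total_extra = 0.0
    for _ in range(n_sims):
        honest_meals = [random_meal(rng) for _ in range(n_diners - 1)]
        bill = sum(honest_meals) + greedy_meal()
        even_share = bill / n_diners
        # Compare the first honest diner's even share with their own meal.
        total_extra += even_share - honest_meals[0]
    return total_extra / n_sims


for table_size in (2, 4, 10):
    extra = average_extra_cost(table_size)
    print(f"Table of {table_size}: honest diner overpays about £{extra:.2f}")
```

The bigger the table, the thinner the bad friend’s extra spend gets spread, so each honest diner loses less individually while the bad friend saves more in total.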

# “WE ARE NOT A CULT”: Analysing the lifestyle of Dogspotting through data

Everything kind of sucks everywhere at the moment. One thing that doesn’t suck is the Facebook group called ‘Dogspotting’. In Dogspotting, members post pictures of dogs they’ve seen in the street and other players rate them with points. These points are accrued all in the hope of winning ‘The Big Prize’. The rules are strictly enforced by a team of dedicated admins, knowing that they are under the scrutiny of not only the dogspotting people’s court but also the hacks at the dogspotting gossip and gab magazine.

It’s silly but it’s the best kind of silly. It’s also extremely popular: the group has almost a quarter of a million members and it’s still growing. I thought I’d take a look at how and when people spot dogs, partly to help me on the way to winning The Big Prize and partly just for fun, so I took every single dogspotting post from the group’s inception to now: 229,971 posts scraped using this code by GitHub user minimaxir.

What I’ve got: every single dogspot ever made

What I’m going to do with it: See what, when and why people spot dogs to win the Big Prize

# Analysing Handwritten Digits Using Principal Component Analysis

I do quite a few projects which get a few cool graphs in them but no interesting conclusions or discoveries, and so instead of just leaving them to rot in my ‘odd_projects’ folder, I thought I’d start publishing some short posts outlining what I did (like really just an outline, I probably won’t go very deep into the theory) and sharing the graphs, so here goes:

What I have: The MNIST database, a database of 70,000 handwritten digits labelled by what number they’re meant to be

What I’m going to do with it: Use principal component analysis to compare relative difficulties of classifying handwritten digits

# Is the NFL ageist against running backs?

My gut instinct tells me that NFL running backs are some of the most poorly treated athletes in the world of sport. The rules against hurting running backs are significantly less strict than those for wide receivers or quarterbacks, which leads to a significantly larger number of career-ending injuries. Teams know the fragility of running backs and are less inclined to offer them guaranteed money on their contract (money which will be given even in the case that a player cannot keep playing due to an injury), which significantly lowers the career earnings of an unlucky running back. Further to that, coaches treat running backs as expendable due to the simplicity of their task and will often drop an injured one for a healthier model, of which, due to the pyramid scheme nature of the NFL, there will always be a supply. Given enough data and enough time I would like to prove all of the above is true.

However, for this post I want to show that a running back’s age affects their ability to play in the league. Not only are running backs less likely to get a job as they get older, but simply being on the wrong side of 30 dramatically reduces their chance of having one.

What I have: A database of all players currently active in the league, and a historic database of all drafted players.

What I’m going to do with it: See that running backs over 30 are disproportionately cut from NFL rosters compared to other skill positions.

# Is ‘home field advantage’ worth taking down a banner for?

Do you think if you flipped a coin in a mint, it would show heads more than tails? Imagine if we set up a small coin-stadium in or adjacent to the mint where the coin was made, where other coins would sit around watching the coin get flipped. Say we flipped the coin outside of the stadium first a bunch of times and showed that it was relatively 50/50 whether it was going to be heads or tails, but then we went back to this mint-stadium and flipped the coin 3,879 times, and it turned up heads 2,219 times. With a simple statistical test, you can show that the probability of a 50/50 coin giving this result in the stadium is 0.000000000256%.
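If you want to poke at the coin example yourself, here is a minimal sketch of one possible "simple statistical test": a one-sided z-test using the normal approximation to the binomial. I’m assuming this form of test purely for illustration, and the function name is made up. The exact number you get depends on which test you use (exact binomial or approximation, one-sided or two-sided), so this will not necessarily reproduce the figure quoted above.

```python
import math


def one_sided_z_test(heads, flips, p_fair=0.5):
    """Normal approximation to the binomial: how surprising is `heads`
    out of `flips` if the coin really lands heads with probability p_fair?

    Returns (z, p) where z is the number of standard deviations above the
    expected count and p is the one-sided tail probability.
    """
    expected = flips * p_fair
    std_dev = math.sqrt(flips * p_fair * (1 - p_fair))
    z = (heads - expected) / std_dev
    # Upper tail of the standard normal, written in terms of erfc.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p


z, p = one_sided_z_test(heads=2219, flips=3879)
print(f"z = {z:.2f} standard deviations, one-sided p = {p:.3g}")
```

Whichever flavour of test you pick, the conclusion is the same: a fair coin essentially never does this.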

Football is not a coin. However, every team – no matter how good or bad – plays 16 games in the regular season: 8 of those at their own stadium and 8 of those at an opponent’s stadium, so a good team will play at home as much as a bad team will. Yet when you run through the stats, the ‘home field advantage’, i.e. that the home team are more likely to win than the away team, is more statistically significant ($8\sigma$) than the detection of the Higgs boson ($5\sigma$).

What I’ve got: 14 years of regular season NFL data (2000-2014) – a few thousand games, half a million plays.

What I’m going to do with it: Try and find which bits of a football game are affected by ‘home field advantage’ in a (fairly) rigorous manner.