pupper2vec: analysing internet dog slang with machine learning

I think my new thing is starting my posts with disclaimers, so:

 The technical background to this post is a bit too technical for a general interest blog. Because of that combined with my desire to get this post out to a general audience (specifically the readership of dogspotting) the results have been sat in my ‘odd_projects’ folder for months as I’ve tried to work out how to make this both technically correct and interesting to a general reader. Luckily in those months, I have read a good blogpost on word embeddings written by someone much smarter than me which has taken the time to explain thoroughly what word embeddings are and why they are interesting. So I have decided that instead of trying to be rigorous with this post and prove everything I am just going to assert to you what is going on with the technical side and hope that you believe me. I guess the disclaimer then is I can’t explain in detail how this works, but if you take my word for it this will be super SUPER fun. Another one is that this post will work almost infinitely better on a laptop/PC than it will on your phone! Sorry if you’re on your phone and it kinda sucks!

What I have: literally every god damn dogspotting comment there ever was, EVER. 1.2 MILLION DOGGONE DOG COMMENTS.

What I’m going to do with it: Make my computer learn a vector representation of dogspotting language, including DoggoLingo words. Using this data, my computer taught itself that doggo – dog = pupper – puppy = bork – bark, it can plot the positions of words in DoggoLingo space, find that all swearwords are clustered together in this DoggoLingo space and that the dogspotting admins’ names are used in different contexts than everyone else’s names. I sound fully unhinged right? Please bare with me I almost promise that it’s worth it.

Continue reading