I think my new thing is starting my posts with disclaimers, so:
DISCLAIMER: The technical background to this post is a bit too technical for a general-interest blog. Because of that, combined with my desire to get this post out to a general audience (specifically the readership of dogspotting), the results have sat in my ‘odd_projects’ folder for months while I’ve tried to work out how to make this both technically correct and interesting to a general reader. Luckily, in those months I have read a good blog post on word embeddings, written by someone much smarter than me, which takes the time to explain thoroughly what word embeddings are and why they are interesting. So I have decided that instead of trying to be rigorous with this post and prove everything, I am just going to assert to you what is going on with the technical side and hope that you believe me. I guess the disclaimer, then, is that I can’t explain in detail how this works, but if you take my word for it, this will be super SUPER fun. One more: this post will work almost infinitely better on a laptop/PC than it will on your phone! Sorry if you’re on your phone and it kinda sucks!
What I have: literally every god damn dogspotting comment there ever was, EVER. 1.2 MILLION DOGGONE DOG COMMENTS.
What I’m going to do with it: make my computer learn a vector representation of dogspotting language, including DoggoLingo words. From this data, my computer taught itself that doggo – dog = pupper – puppy = bork – bark. It can plot the positions of words in DoggoLingo space, find that all the swearwords cluster together in this DoggoLingo space, and discover that the dogspotting admins’ names are used in different contexts from everyone else’s names. I sound fully unhinged, right? Please bear with me, I almost promise that it’s worth it.
Ok so in my last post about dogspotting, I talked about some of the language which dogspotters created and use, such as “pupper” and “doggo”, “cloud”, “bork” etc. In the meantime, a fellow dogspotter, Shirley Lee, did some awesome research into the origins of ‘DoggoLingo’, including a community survey on reactions to and adoption of the words. The excitement around that post, and the internet’s excitement about DoggoLingo in general, made me think that I should try to do a bit more work on this.
I then had the idea to use the results of an awesome machine learning paper: Efficient Estimation of Word Representations in Vector Space. In this paper, Mikolov et al. introduce new ways of learning what we will refer to as word vectors. I’ll first explain what a word vector is before moving on to how we learn them, and I will intersperse dog pics to make it all go a bit smoother.
How would you explain a word to a computer? Computers work purely using numbers and relations between numbers, so any representation of a word by a computer has to be a set of numbers. Let’s talk about a few options before we talk about what we’re doing here.
The first idea would be for a computer to treat every single word separately. You could tell a computer “right, every time I input the number 1, it means I’ve said the word ‘aardvark’; when I input the number 2, it means I’ve said the word ‘aardvarks’; … number 123454455237 means I’ve said the word ‘tree’; number 123454455238 means I’ve said the word ‘trees'” and so on, so that each word gets its own index. (In practice each index is stored as a vector that is all zeros except for a single 1 in that word’s slot, which is why this is called one hot encoding.) It has several drawbacks: first of all, there are an absolute crapload of words, and it is unnecessarily cumbersome to have to represent every single one of them separately. Another drawback is that this representation does not capture any relations between words: tree and leaves are as different as tree and flying.
But we don’t represent words like that! We represent words with characters, so let’s try a character-based representation. We could give our computer each word as a vector (a fancy word for some numbers stacked on top of each other): give each letter from a–z the numbers 1–26 and represent:
dig = [d, i, g] = [4, 8, 6]
dog = [d, o, g] = [4, 15, 6]
puppy = [ p, u, p, p, y] = [16, 21, 16, 16, 25]
This is more efficient than the one hot encoding, as we only have to remember 26 numbers and then how those numbers are combined to make the words, but it has its own downsides. If you were to look at [4, 8, 6], [16, 21, 16, 16, 25] and [4, 15, 6] and were asked to choose the two most similar vectors, which would you choose? Probably the first and the third; they look very similar. But dig and dog are much closer in character space than dog and puppy, and that is not how our brains represent these words. We want a vector representation which captures the relations between words that we expect: one that knows a puppy is a young dog and that a tree has leaves.
This is where Mikolov et al. come in.
The way we create word representations is called word2vec. The algorithm essentially takes a bunch of text and creates word vectors: every word in your vocabulary gets a 100-dimensional vector which represents its meaning to the computer.
All this means is that if you have the word dog, word2vec would represent it as dog = [0.543, 0.321, -0.537, …, 0.645] <- this having 100 elements, all just decimal numbers. Every other word then has a different set of 100 numbers describing it.
I hear you cry “fgs how can you represent meaning as numbers?! you’re stupid”. Firstly, ouch; secondly, let’s use a little example. Say we had the words puppy, dog, toddler, teenager, person, and we wanted to represent each one of these words with a 2-dimensional vector (two numbers). We could say “right, the first number is how human the subject of the word is, and the second number is how grown up the subject of the word is” and try to make a representation of that.
In this case we would have something like:
dog = [0, 1]
puppy = [0, 0]
toddler = [1, 0]
teenager = [1, 0.5]
person = [1, 1]
Then if someone was like “oh i saw the cutest [0, 0.6] yesterday”, you would be able to infer that they had seen a cute dog that was no longer quite a puppy but not yet fully grown. Or they might say “OH GOD THE WORLD IS BEING TAKEN OVER BY [0.7, 1]”, warning you that some horrible scientist had created a human–dog hybrid and it was taking over the world.
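Decoding a mystery vector like [0, 0.6] just means finding the vocabulary word whose vector is closest to it. A minimal sketch of that lookup, using the toy vectors above:

```python
import math

# The toy 2-D space from the post: [humanness, grown-up-ness]
words = {
    "dog":      (0, 1),
    "puppy":    (0, 0),
    "toddler":  (1, 0),
    "teenager": (1, 0.5),
    "person":   (1, 1),
}

def nearest_word(vec):
    """Return the vocabulary word whose vector is closest to `vec`."""
    return min(words, key=lambda w: math.dist(words[w], vec))

print(nearest_word((0, 0.6)))  # 'dog' -- the cutest mystery animal
```

This nearest-vector lookup is exactly what we will do later with the real 100-dimensional vectors.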
This leads on to something very cool and important to what we are doing later. In this space:
puppy – dog = [0-0, 0 – 1] = [0, -1]
toddler – person = [1 – 1, 0 – 1] = [0, -1]
That means that the difference between puppy and dog is the same as the difference between toddler and person: you can do addition and subtraction of words in these vector representations. The same is true in the representations we are going to make. Word2vec is god damn crazy.
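The subtraction above is just done element by element, which you can verify in a couple of lines:

```python
# Word arithmetic in the toy space: element-wise vector subtraction.
def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

puppy, dog = [0, 0], [0, 1]
toddler, person = [1, 0], [1, 1]

print(subtract(puppy, dog))      # [0, -1]
print(subtract(toddler, person)) # [0, -1] -- the same "young of the species" direction
```

Both differences land on the same vector, so "make it a baby" is a single consistent direction in this space.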
So we have introduced an example where we could believe that these vectors could represent meaning, but how do we actually make the vectors? This is where the creators of word2vec really changed the game.
First of all, we do not want to be involved in the creation of these vectors. How annoying would it be if you were just on facebook or something and your computer suddenly asked you “hey bud, know you’re just chilling atm, but can you please explain to me what a pupper is?”. So we want the creation of these vectors to be fully automated. What word2vec proposed was learning them from a guessing game (the paper actually has two flavours: the “continuous bag of words” version, which guesses a word from its surroundings as described below, and the “skip-gram” version, which flips it and guesses the surroundings from the word). The idea (which is quite common in machine learning) is to effectively make a game for your computer to try to get good at.
The game is to take a sentence “thats such a cute dog”, blank out one of the words:
“thats such a ____ dog” and get the computer to guess the word that has been blanked out. If the computer is correct, it is rewarded; if it is incorrect, it is told what the word was and it learns to guess better. The way it gets better at guessing is by adjusting the vectors for each of the words: if it gets it wrong and says amplifier, then it learns that the vectors for amplifier and dog probably aren’t very complementary, but the vectors for cute and dog are!
You can start to see how similar words get grouped together. If you have the problem:
“thats such a ___ dog”, good fillings would be: cute, big, angry, funny, bad, good,
which would pull those vectors closer together and away from amplifier. But if you had
“OMG WHAT A ___ LITTLE DOG”, this is even more specific, and only words such as cute, tiny, funny, lovely will fit, which means that cute, tiny, funny and lovely will end up closer to each other than to words like big.
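Generating the game's puzzles is mechanical: for each word in a sentence, blank it out and keep the surrounding words as the clue. A minimal sketch, assuming a context window of two words on each side (the window size is my own choice for illustration):

```python
def training_examples(sentence, window=2):
    """For each word, pair the surrounding words (the clue) with the
    blanked-out word the computer has to guess."""
    tokens = sentence.lower().split()
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        examples.append((context, target))
    return examples

for context, target in training_examples("thats such a cute dog"):
    print(context, "->", target)
# one of the puzzles printed is: ['such', 'a', 'dog'] -> cute
```

Millions of comments produce millions of these puzzles, and the vectors get nudged a tiny bit after every single one.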
I am so so many words in and I haven’t even really talked about dogspotting yet so I think thats enough backstory LETS GO:
What I did was take every dogspotting comment ever made and get the computer to make word-vector representations of every word it found. While this included normal words like “hello” and “food”, it also covered DoggoLingo words like “pupper” and “bork”. I used a really cool python library called gensim, which can take a huge amount of text and implements a very fast learning of these word vectors. This means that if you have ever commented on dogspotting, then you helped my computer not only learn the basics of language but also probably taught it a bit about dogs, so thanks!
I feel the need to reiterate this because it’s so cool, but EVERY RESULT BELOW THIS POINT WAS LEARNED BY MY COMPUTER FROM DOGSPOTTING COMMENTS
First off, I tried the arithmetic thing and found these relations, all of which we expect but all of which were heckin’ cool!
This means that I would take the vectors for each word, add or subtract them, then look for the closest vector to the result. One interesting thing is that purely by looking through the corpus (as seen in the gentleman example), word2vec can pick up on inherent biases in text corpora, including ones to do with gender and race. This has been in the news recently, specifically to do with word embeddings, and is very concerning.
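The "add, subtract, then find the closest vector" recipe can be sketched end-to-end. The 3-dimensional vectors here are invented purely for illustration (real learned vectors have 100 dimensions and different values), and closeness is measured with cosine similarity, the usual choice for word vectors:

```python
import math

# Hypothetical 3-D word vectors, made up for illustration only.
vectors = {
    "doggo":  [0.9, 0.1, 0.8],
    "dog":    [0.9, 0.1, 0.1],
    "puppy":  [0.2, 0.8, 0.1],
    "pupper": [0.2, 0.8, 0.8],
    "tree":   [-0.5, -0.5, 0.0],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Answer 'a - b + c = ?' by finding the closest remaining word."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("doggo", "dog", "puppy"))  # 'pupper'
```

With these toy vectors, doggo – dog + puppy lands exactly on pupper, mirroring the doggo – dog = pupper – puppy relation the real model learned. (gensim exposes the same query as `model.wv.most_similar`.)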
OK so basically everything above is giving you context for this next graph, which I believe is the coolest thing I have ever produced or will ever produce. I took the most common words in the corpus and did what is known as a t-SNE projection to take the vectors down from 100 dimensions to 2 dimensions. What that means is that instead of living in some insane 100-dimensional space, the words are brought down into 2-dimensional space while keeping as much information as possible about their relative positions, which means we can plot them on a scatter graph!! Forgive me, but this is so cool that I learned how to use a new plotting library just so I could make a graph for you guys, so here goes:
(In short, this is a plot showing similarities between different word vectors: words that are closer together are words the computer thinks are related!)
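If you're curious how a projection like this might be computed, here is a minimal sketch using scikit-learn's TSNE on random stand-in vectors. The post's actual projection and plotting code isn't shown, so treat the library choice and every parameter here as an assumption:

```python
# Sketch of squashing 100-D word vectors down to 2-D with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins for 20 of the learned 100-dimensional word vectors.
word_vectors = rng.normal(size=(20, 100))

# perplexity must be smaller than the number of points.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(word_vectors)
print(coords.shape)  # (20, 2): one (x, y) point per word, ready to scatter-plot
```

Each row of `coords` becomes one dot on the scatter graph, with nearby dots meaning the original 100-dimensional vectors were similar.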
You can mouse around on this as much as you want (I have spent hours and hours doing that), but I’ve labelled some particularly interesting bits which illustrate the power of this model.
The fact that the word vectors had such a solid cluster at A surprised me, but when I found out that it was people’s names, it made more sense. The context in which you use someone’s name in a dogspotting comment is fairly uniform: if you’re tagging someone so they will see the picture, you’ll either just put their name there or write “John Smith look at this cute dog!!”. Because names are used in such a standard context, they cluster together. However, there are a few names which aren’t in A: the cluster next to C is full of names like (jeff, coco, molly). Why are they apart from A? Dogspotting has a team of dedicated admins who are tagged in whenever anything is bad or breaks the rules, so their names are used in a different context and sit away from the main block of names.
I thought that was super interesting and specific to dogspotting, but then I also found some rules-related business around D. If you hover over the area around there you get words such as (refer, rules, low, hanging, fruit), which all relate to the rules of dogspotting and are normally used when someone has posted something which isn’t eligible for points. You also see some words like (points, bonuses, ronin) to the right-hand side of H, which are to do with the rules of dogspotting. To the left-hand side of H we have a cluster which is just (fuck, hell, shit, heck), which is what I’d like to call “swear island”: most swear words are used in similar contexts, so they’re all localised in a similar area.
Cluster I has words such as (babies, pups, doggos, puppers): this is pupper island, home to the generic words for dogs. If you want more specific words such as breeds (corgi, huskey, lab), they sit above cluster G, alongside more descriptive words such as (bear, seal, floof, fluff, blep).
The group to the left of F is my favourite cluster, as it is the compliment cluster, (majestic, handsome, sweet, cutie, lovely).
There is almost literally nothing to conclude from this; I just wanted to find a way to present this scatter graph to a general audience. I’d implore you to just hover over it and try to find relations that I may have missed. There are a lot! Let me know in the comments if you find anything cool on the graph, and tell us the x,y reference! 🙂