Sexism in films in data: IMDB, Hollywood and Language

First up:

DISCLAIMER 1: This is the first of many disclaimers on this post: I am terribly placed to write this. First up, I’m a guy, so although I care about sexism I’m probably not the best person to talk about it. Also, I genuinely do not like movies – I think that since TV has made good long-form series the standard, it doesn’t make sense for me to pay £10 to go to see some superhero origin story with some rich white dudes who I’m supposed to know the names of. I’m sure the medium has its benefits and I’m being unfair, but in the spirit of journalistic integrity, before I write this post I need to admit that over the last year I’ve watched the entire series of Peep Show through more times than I’ve gone to the cinema.
Anyway, me being a philistine aside, I found the Kaggle dataset of IMDB movies and wanted to see whether we could also see the sexism in Hollywood in the data.

What I’ve got: IMDB data for 5000 movies
What I’m going to do with it: Show that movies featuring women are rated lower, have lower budgets, but are more profitable than movies featuring men. Also that films with men in have titles which are incredibly phallic.


DISCLAIMER 2: Ok, here is my second disclaimer of this post. This one regards the way I treat gender here, which isn’t great. The dataset I took from Kaggle has information about the movies, including the names of the (up to) 3 main actors in each film, but the data itself does not include the gender of those people. For this analysis I therefore needed a proxy: a way to take the name of a celebrity and try to infer their gender. The way I did this was to use the wikipedia Python library to download the bio for each person, and then look for the first gendered pronoun out of “he”, “she”, “him” or “her” that appears in it (I limited it to these for the sake of this analysis, though I know the list is not exhaustive). I have assumed that this first pronoun represents their gender. For many reasons this sucks: for one thing, your gender isn’t defined by which pronouns some anonymous person on the internet has given you, and for another, it could just be wrong because of the way some bios are constructed. However, I couldn’t think of a better way to do it besides attempting to label ≈10,000 people by hand, and for the test cases my method seemed to work fine.
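
For the curious, here is a rough sketch of what that "first gendered pronoun" rule could look like in Python. This is not the code used for the post: the function name, the labels and the regular expression are all made up for illustration, and it takes the bio as a plain string rather than fetching it, so downloading the text (for example with the wikipedia library) is left out.

```python
import re

# Hypothetical mapping from the four pronouns the post lists to a label.
PRONOUN_LABELS = {"he": "male", "him": "male", "she": "female", "her": "female"}

# \b word boundaries stop words like "the" or "where" from matching "he" / "her".
PRONOUN_PATTERN = re.compile(r"\b(he|she|him|her)\b", re.IGNORECASE)


def infer_gender_from_bio(bio_text):
    """Return 'male', 'female', or None based on the first pronoun in the text."""
    match = PRONOUN_PATTERN.search(bio_text)
    if match is None:
        # No pronoun from the list appears, so no guess is made.
        return None
    return PRONOUN_LABELS[match.group(1).lower()]
```

A bio with none of the four pronouns returns None here instead of a guess, which is one way of handling the "it could just be wrong" cases; how the original analysis treated them is not stated.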

films with all men vs films with not all men

What are we doing here? So in this dataset I have split the films into two categories:
Category AllMen: Films where all of the actors named (where at least 1 and at most 3 are named) are men

Category NotAllMen: Films where a majority (or exactly half) of the actors named are women

Anyway, now that we’ve ascertained that I’m not the right person to have done this analysis and that I’ve done it in a pretty ham-fisted way, let’s see what the data tells us.

films with women in are rated worse


Let’s look at our first graph.

Right, this might look confusing, so let me give a quick explanation. This graph is known as a cumulative histogram. What that means is that if you pick a number at the bottom, say a rating of 5, then go up to the line you’re interested in, the y-value of the line at that point is the percentage of films that have a rating above that star rating. That is why the distribution starts at 100% and ends at 0%, as you can’t have a rating higher than 10 stars, no matter what sequel of Captain America (or Iron Man or whatever films they’re still making) it is.
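
If it helps, each point on one of those lines can be sketched in a few lines of Python. This is a minimal illustration and not the plotting code behind the graph: the function name is hypothetical and the ratings are invented, not taken from the dataset.

```python
def share_above(ratings, threshold):
    """Percentage of ratings strictly greater than the threshold."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    above = sum(1 for r in ratings if r > threshold)
    return 100.0 * above / len(ratings)


# Invented ratings purely for illustration, not values from the Kaggle data.
example_ratings = [3.2, 5.1, 6.4, 6.4, 7.0, 7.8, 8.3, 9.1]

# One (rating, percentage above it) point per whole star from 0 to 10.
curve = [(t, share_above(example_ratings, t)) for t in range(0, 11)]
```

Computing one such curve per category and drawing both on the same axes gives the kind of picture described above: each line starts at 100% on the left and falls to 0% at a rating of 10.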

We see that nearly across the board the blue line is significantly lower than the red line, which shows a clear difference in the way films are rated.

I can put it another way, which is simpler, less pretty but just as devastating:

rating | percentage of AllMen films with ratings higher than this | percentage of NotAllMen films with ratings higher than this | how much more likely an AllMen film is to get higher than this rating

So straight away you may notice a few things: a film is 7% more likely to be rated higher than 5 stars if all of the main characters are male. With this amount of data, this is a statistically significant deviation – and maybe you’ll say “hey, who cares, it’s just a bunch of nerds on IMDb, who cares what they think”, but how many times have you decided between two films by their comparative IMDb ratings? This analysis shows that by using this method you are preferentially picking male-dominated films.

films with women in have lower budgets

We already know that women are paid less in terms of fees for films, but it turns out that the films they’re in are also less well funded. Let’s get our old friend the cumulative histogram out.

Looks kind of similar, right? Well, that’s because it is! Just like films which aren’t entirely male led are rated lower on IMDb, films which aren’t entirely male led have lower budgets than ones that are. There is an interesting feature to this graph though: past the dotted line, where films start to cost more than a million dollars (let’s call it the ‘Hollywood region’), the gap between the two categories really begins to widen. To me that implies an endemic belief in big budget films that you just have to have men leading them, otherwise you’re going to get nerds online rating them badly.

Here is the same thing in tabular form:

budget | percentage of AllMen films with budgets higher than this | percentage of NotAllMen films with budgets higher than this | how much more likely an AllMen film is to get higher than this budget

TEN PERCENT. If your film has women in it you have a TEN PERCENT lower chance of getting over $4 million in funding. That is absolutely wild. That’s enough to turn a 59% into a less-than-coin-flip 48%.

films dominated by men have stupid Freudian titles

Right, ok, let’s get on to something funny. I took all of the titles of the films with men in and all the titles of films with women in and counted the words in each. I then took the individual counts for each word and compared them to each other, i.e. saw how many times a word occurred in films dominated by men compared to those which have women in them. The result is unsurprising but genuinely hilarious. Like something out of Gillette’s advertising meeting.
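
Here is a rough sketch of that kind of word counting in Python. It is an illustration under assumptions and not the code behind the lists in this post: the function names are hypothetical, and "compared them to each other" is interpreted here as a plain difference in counts, which may not be the comparison actually used.

```python
from collections import Counter
import re


def word_counts(titles):
    """Count lowercase words across a list of film titles."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"[a-z']+", title.lower()))
    return counts


def distinctive_words(titles_a, titles_b, top_n=5):
    """Words ranked by (count in titles_a) minus (count in titles_b).

    Assumption: a raw count difference; ties are broken alphabetically.
    Only words that are more common in titles_a are returned.
    """
    counts_a = word_counts(titles_a)
    counts_b = word_counts(titles_b)
    diffs = {word: counts_a[word] - counts_b[word] for word in counts_a}
    ranked = sorted(diffs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, diff in ranked[:top_n] if diff > 0]
```

With groups of very different sizes a raw difference favours the bigger group, so comparing each word's share of all title words would probably be a fairer variant.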

Words that feature most prominently in AllMen films and not in NotAllMen films: BIG, HARD, BLOOD, RISE, PLANET
Words that feature most prominently in NotAllMen films and not in AllMen films: love, it, girl, house, with

I mean I try to keep it civil on this blog but the male dominated films have words in them which are basically talking about penises and war and then the films with women in are ones about domestication and love. Again, this is Gillette’s marketing strategy: give bloody war stuff to men and give pink stuff to women.

Fuck movies, and fuck Gillette. Anyway, if you don’t already hate movies then here’s the god damn kicker:


So as a society we’re rating women’s films worse and we’re giving them less money, though we’re at least giving them better names than BLOOD RISE PLANET. All the while, though, as we’re doing this, the films with women in are more profitable! Cumulative histogram, let’s go:

It’s a little different this time, huh! It looks exactly the same except the colors are switched! Well, finally we’ve found an example of something that men in Hollywood are underrepresented in: the amount of money their terrible, terrible films make!

So what I’ve done here is taken the gross revenue that a film has made and divided it by the amount of money it cost to make, i.e. divided the gross by the budget. What that means is, say a new film like “Spiderman 8: This Time He Has Wings Too, Also We’ll Talk About The Dog He Had As A Kid Which Is Kind Of Emotional” has $100 million in budget, and then 100 million poor saps go and watch that shit and it makes $800 million, then the effective money-getting potential is 8x. It has multiplied its input money by 8 in return.

What that means is that for a film to even make its own money back it has to have a ratio of at least 1. And oh lord, films need to step up their game. AllMen films only break even 49.7% of the time; NotAllMen films do significantly better (but still fairly badly) at 55.3%. But across the board the films that people are going to see, and that are making money, are films that aren’t male dominated. So although all of the people logging into IMDb and rating films seem to think these films suck, the people actually going to the movies don’t seem to think so, and even though Hollywood executives are spending millions of pounds funding BLOOD RISE PLANET, they’re getting nothing back. Finally some god damn justice.
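
The ratio and the break-even percentage can be sketched like this. It is a minimal illustration with hypothetical function names, and the only number taken from this post is the made-up Spiderman example.

```python
def return_ratio(gross, budget):
    """Gross revenue divided by budget; 1.0 means the film made its money back."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return gross / budget


def break_even_share(films):
    """Percentage of (gross, budget) pairs whose ratio is at least 1."""
    if not films:
        raise ValueError("films must not be empty")
    ratios = [return_ratio(gross, budget) for gross, budget in films]
    return 100.0 * sum(1 for r in ratios if r >= 1) / len(ratios)


# The worked example from the text: $800 million gross on a $100 million budget.
spiderman_ratio = return_ratio(800_000_000, 100_000_000)  # 8.0
```

Running break_even_share separately on the AllMen films and the NotAllMen films is the kind of calculation that would give two percentages to compare, like the two quoted above.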


I’m too ranty to have any conclusions, I’m too angry to conclude anything and I’m two hours late to eat lunch. That’s how mad I am. Also this is 1500 words so I should probably stop. Also you don’t come here to read my opinions, you come here to see line graphs with poorly annotated axes. So I’ll keep it brief:

Movies suck, Hollywood sucks, Gillette sucks, society sucks.

One thought on “Sexism in films in data: IMDB, Hollywood and Language”

  1. Erica says:

    check the third graph. i’m not entirely sure but it could be misleading due to the second graph.

    I think higher budget films have a tendency towards lower ratios than lower budget films. or in other words “it’s harder for higher budget film to have an equal ratio to a lower budget film”. If this is true (you’d need to check this) then your third graph might just be showing that phenomenon given that NotAllMen movies have a lower budget (due to your graph 2)

    Here’s an example of this phenomenon which may not hold for all budgets: Given 2 films, Titanic and Night of the living Dead.
    Titanic – 2 billion gross / 200 million = ratio 10
    Night of the living dead – 30 million / 113k = ratio of about 265
    If this phenomenon holds true for smaller differences like the difference in budget between NotAllMen and AllMen movies (as noted in graph 2), then it would have an effect on graph 3 (though possibly not a significant effect…i don’t know. but i’m guessing it’s likely).

    You’d have to make a graph of the ratio on the y axis and the budget on the x axis. If there’s an obvious downward trend as the budget increases then the implications about gender and graph 3 might be baseless. You could also just plot both NotAllMen and AllMen on this graph and see if there’s a noticeable difference in the trendlines throughout. Not as fancy as the histogram but it also doesn’t have the flaw of ignoring the effect that budget might have on the ratio.

    I’m not a statistician so I might be wrong or didn’t think my comment through well enough. but it’s worth checking out. What do you think?
