For my meditation this week, I decided to take a look at sexism in movie scripts. I came to this topic after a long deliberation. When I read that we were to tap into the unconscious, the subconscious, and the collective, my thoughts turned immediately to the latent biases everyone has. But there are many datasets I could’ve used, and many biases that could be examined. I eventually settled on movies because it’s a topic I know very well, and one whose history of sexism has been well documented. It’s also a topic that’s been studied before: this study looked at how much dialogue different genders have on screen, and Ross Putman maintains a fantastic Twitter account with female script intros here. I wanted to see if I could generate sexist introductions of the kind that Mr. Putman has documented thoroughly, using a dataset culled from famous movies.

I started by downloading a dataset I found from UC Santa Cruz, which is culled from the Internet Movie Script Database. I then wrote a Ruby script to find every sentence that surrounded certain keywords. Here, I look for sentences that contain the word ‘sexy’:

File.open('corpus.txt', 'w') do |writer|
  Dir.glob('imsdb_scenes_dialogs_nov_2015/scenes/**/*').each do |file|
    next unless File.file?(file)
    # The right-hand side was lost from the original listing; File.read is assumed.
    content = File.read(file)
    # Capture each whole line that contains the keyword followed by a space.
    content.scan(/.*sexy .*/) do |match|
      writer.write match
      writer.write "\n"
    end
  end
end

I could’ve saved all of my keywords in one file, but I wanted to inspect each keyword individually to see whether I should use it. Some words were more fruitful than others. For instance, I started by looking at ‘pretty’, ‘beautiful’, and other variations on that theme:

She looks up. She is beautiful and terrified.

Points to a gorgeous dead woman with the word GRIFTER on her stomach.

But besides examples like those, I also turned up a lot of uses to describe environments and scenery. That wouldn’t do.
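Checking one keyword at a time, as described above, could be sketched roughly like this. It is only an illustration, not the script I actually ran: the helper names `lines_matching` and `matches_by_keyword` are made up, the second and third sample lines are invented to show a scenery false positive, and the word-boundary, case-insensitive pattern is an assumption that differs from the simpler pattern in my script.

```ruby
# Hypothetical helper: return every line of `text` that contains `keyword`
# as a whole word, ignoring case.
def lines_matching(text, keyword)
  pattern = /\b#{Regexp.escape(keyword)}\b/i
  text.each_line.map(&:strip).select { |line| line.match?(pattern) }
end

# Hypothetical helper: group the matching lines by keyword so that each
# keyword can be reviewed on its own before it is added to the corpus.
def matches_by_keyword(text, keywords)
  keywords.to_h { |keyword| [keyword, lines_matching(text, keyword)] }
end

# Sample text: the first line is quoted from a script in the dataset,
# the other two are invented for illustration.
sample = <<~SCRIPT
  She looks up. She is beautiful and terrified.
  A beautiful valley stretches to the horizon.
  He pours a drink.
SCRIPT

matches_by_keyword(sample, %w[beautiful pretty]).each do |keyword, lines|
  puts "#{keyword}: #{lines.length} match(es)"
  lines.each { |line| puts "  #{line}" }
end
```

Reading the grouped output by eye is what makes the scenery problem obvious: the valley line matches ‘beautiful’ just as readily as the character description does.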

I then looked at hair colours, noticing a pattern in which women were consistently described by their hair. (For anybody who’s taken a screenwriting class, this is apparently a terrible practice, but it’s been very successful with these produced and famous scripts.)

TAMARA, 28, blonde, wears the shortest mini one can imagine,

We barely notice the redhead kneeling between his legs, face buried in his crotch.

I noticed that too many instances of ‘blonde’, despite the word being gendered, described men; the same goes for redheads. The corpus can tolerate some false positives, but not nearly as many as these words produced.

I then switched to looking at ‘sexy’, ‘naked’, and ‘nude’. These were generally great: even when they described men, the descriptions were typically sexualized. These keywords became part of the corpus that I eventually settled on.

She’s naked but for the tiny silver crucifix she wears around her neck.

MODEL, lying nude in a pool of blue paint.

Go boards, pachinko machines, sexy little MANGA WAIFS in schoolgirl outfits doling out drinks.

Finally, here are a few of my favourite quotes that my little website has generated. It’s only a Markov chain, and it would likely have worked out better if I had created a ruleset. I like the unpredictability of it, though, and creating a ruleset for sexism would’ve been emotionally exhausting.
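For anyone curious what a Markov chain over this kind of corpus looks like, here is a minimal sketch. It assumes a word-level chain of order one, where each word maps to the list of words that followed it. That is one common way to do it, not necessarily how the website does it, and the names `build_chain` and `generate` are made up for this example.

```ruby
# Build a word-level, order-1 Markov chain. Each word maps to the list of
# words that followed it in the corpus; :start and :stop mark line boundaries.
def build_chain(lines)
  chain = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    words = line.split
    next if words.empty?
    ([:start] + words + [:stop]).each_cons(2) do |current, following|
      chain[current] << following
    end
  end
  chain
end

# Walk the chain from :start, picking a random successor at each step,
# until :stop is drawn or max_words words have been produced.
def generate(chain, rng: Random.new, max_words: 30)
  words = []
  current = :start
  max_words.times do
    candidates = chain.fetch(current, [])
    break if candidates.empty?
    current = candidates.sample(random: rng)
    break if current == :stop
    words << current
  end
  words.join(' ')
end

# Two lines quoted from the corpus; a real run would read corpus.txt instead.
corpus = [
  "She’s naked but for the tiny silver crucifix she wears around her neck.",
  "MODEL, lying nude in a pool of blue paint."
]
puts generate(build_chain(corpus))
```

Because duplicate successors stay in each list, common word pairs are picked more often, which is where both the plausibility and the unpredictability come from. With only two lines the output will mostly repeat the input; the chain gets interesting once many lines share words.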




Click here for the GitHub repository

Click here for the finished product!