Summary:

The question that keeps me awake at night: Is Die Hard a Christmas movie? It takes place during Christmas. Is that enough to count as a Christmas movie?

Let me admit right now that I thoroughly believe Die Hard to be a Christmas movie. But I promise not to let that get in the way of objective analysis.

Dependencies:

gensim (for doc2vec). The t-SNE plots further down also assume scikit-learn and matplotlib.

Process:

So just how can we go about providing an objective analysis of this quintessential question? A natural place to start is with the movies' scripts.

Hypothesis:

If Die Hard is a Christmas movie, its script will be similar to the scripts of established Christmas movies.

One way to compare words objectively is to turn the words into numbers. This is the main idea behind word embeddings, and the doc2vec model extends it to the paragraph and document level. It lets us take a collection of movie scripts and convert each one into a vector.
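As a quick illustration of the idea: once a model like the one in the Train section below has been trained, any piece of text can be mapped to a fixed-length numeric vector. The quoted line here is just an illustrative input, not part of the pipeline:

# Assumes `model` is the trained 700-dimensional Doc2Vec model from the Train section below.
vec = model.infer_vector(["now", "i", "have", "a", "machine", "gun", "ho", "ho", "ho"])
print(vec.shape)  # (700,) -- the words have become 700 numbers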

Data:

Movie scripts for 74 Christmas movies were gathered in plain text, including Die Hard and Die Hard 2, based on lists such as this. Many such lists also include controversial choices, such as Batman Returns and Gremlins.
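The training and evaluation code below expects these scripts as a train_corpus of gensim TaggedDocument objects. Here is a minimal sketch of how that corpus can be built; the scripts/ directory and one-file-per-script layout are assumptions about how the plain-text files are stored:

import os
import gensim

def read_corpus(script_dir="scripts"):
    # Assumed layout: one plain-text script per file in a scripts/ directory.
    for doc_id, fname in enumerate(sorted(os.listdir(script_dir))):
        with open(os.path.join(script_dir, fname), encoding="utf-8") as f:
            tokens = gensim.utils.simple_preprocess(f.read())
        # Tag each script with its integer index so it can be looked up again later.
        yield gensim.models.doc2vec.TaggedDocument(tokens, [doc_id])

train_corpus = list(read_corpus())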

Train:

Here is an example of what the training code looks like:


model = gensim.models.doc2vec.Doc2Vec(vector_size=700, min_count=2, epochs=40)
model.build_vocab(train_corpus)
# Training itself, so the evaluation below has learned vectors to compare against.
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

vector_size sets the dimension of the generated vectors, and min_count excludes extremely rare words from the vocabulary.

Evaluate:

The code below takes each movie script, infers what its vector would look like, and compares that inferred vector against every trained vector to find the most similar script. If the model has learned anything useful, the most similar script should be the movie itself, so we keep track of where each movie's own vector lands in that ranking.


import collections

ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    # Infer a fresh vector for this script and rank every trained vector by similarity.
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    # Where does the script's own trained vector land in that ranking? 0 is best.
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])

# How often each rank occurs across all of the scripts.
print(collections.Counter(ranks))

Yippee-ki-yay, Melon Farmer

Conclusive results are unfortunately not here yet.

[figure: ranks]

Above are the ranks of the most similar documents for each script, based on inferred vectors with a vector_size of 800. By this metric it is hard to tell just how much the shallow neural network underlying doc2vec has actually learned.

Out of curiosity, I used t-SNE to plot these high-dimensional vectors labeled with their movie names:

[figure: t-SNE plot of the 800-dimensional script vectors]

For comparison, here are the t-SNE results from a run with only 50-dimensional vectors:

[figure: t-SNE plot of the 50-dimensional script vectors]
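For reference, here is a minimal sketch of how a plot like this can be produced, assuming scikit-learn and matplotlib, and a hypothetical movie_titles list holding the title for each script in corpus order:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Collect the trained vector for every script, then squash them to 2-D with t-SNE.
vectors = np.array([model.docvecs[doc_id] for doc_id in range(len(train_corpus))])
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)

plt.figure(figsize=(12, 8))
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), title in zip(coords, movie_titles):
    plt.annotate(title, (x, y), fontsize=7)
plt.show()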

Future Work

More movie scripts are needed, especially non-Christmas movie scripts. That would make it possible to use the tag features of doc2vec to label each script by category, which should produce more reasonable results.
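Purely as an illustration of that idea, here is a hedged sketch of what the tagging might look like. It reuses the hypothetical scripts/ layout from earlier; christmas_titles and the label names are assumptions, not part of the current code:

import os
import gensim

def read_labeled_corpus(script_dir, christmas_titles):
    # christmas_titles: hypothetical set of filenames known to be Christmas movies.
    for fname in sorted(os.listdir(script_dir)):
        with open(os.path.join(script_dir, fname), encoding="utf-8") as f:
            tokens = gensim.utils.simple_preprocess(f.read())
        label = "CHRISTMAS" if fname in christmas_titles else "NOT_CHRISTMAS"
        # Two tags per script: one per movie, one shared category label.
        yield gensim.models.doc2vec.TaggedDocument(tokens, [fname, label])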

The quest continues.