Some Superficial Text Mining on the National Security Strategy

Arnon Kohavi | September 12, 2018

On Tuesday, Jay Ulfelder came up with a cool idea: mining the text of the new National Security Strategy. He gave it the old college try, but it didn’t end up where he would have liked; such is life. So I’m going to step in!

The first step in a linguistic analysis is obtaining the raw text. Luckily for us, Ulfelder did complete the annoying but useful busywork of scraping the text out of the original PDFs, so I’m just going to use those text files.

Some PDFs give you the ability to select text, like you would with a Word document. These are typically machine-generated; i.e. somebody typeset their document and then compiled or “printed” it directly to PDF. That gives users of the PDF more-or-less perfect access to the text content of the document.

Other PDFs are basically just pictures of printed pages. This means that the text had to be scraped out using optical character recognition (OCR). In general, OCR software is very good at what it does, but we can expect it to make a meaningful number of mistakes over a corpus of hundreds of thousands of characters. As such, I’m going to use an aggressive filtering technique: I’m going to make the entire corpus lower-case, and then remove all of the non-alphabetic characters.
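As a minimal sketch of that filtering step (the `clean` helper and the sample sentence are my own inventions, not part of the original script):

```python
import re

def clean(text):
    """Lower-case the text, then strip every character that is not a
    letter or whitespace. Punctuation vanishes (so "U.S." collapses to
    "us"), but words separated by spaces or newlines stay separate."""
    return re.sub(r'[^a-z\s]', '', text.lower())

print(clean("In 2018, the U.S. published\na 50-page Strategy.").split())
```

Note that keeping whitespace in the character class matters: if newlines were stripped along with the punctuation, the last word of each line would get glued to the first word of the next.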

This will generate a fairly “clean” dataset by NLP standards, but at the expense of broad-spectrum accuracy (e.g. we lose things like Arabic-numeral references to years) and flexibility (e.g. this data is useless for sentence-level analysis).

Since I’m only doing superficial data play, I’m fine with working at the word level (though a deeper dive might be a good topic for another day). For my purposes, that means counting words. The simplest way I know to tokenize words is the Python NLTK library; it has an adequate tokenizer with a simple API. Potential gotcha! If you want to run this code, you’re going to need to install nltk, and then also install the nltk data.
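For reference, the one-time setup looks roughly like this (the package identifiers are NLTK’s standard ones; ‘punkt’ is the model behind word_tokenize):

```python
import nltk

# One-time downloads of the NLTK data the script relies on:
# the 'punkt' tokenizer models and the stopword lists.
nltk.download('punkt')
nltk.download('stopwords')
```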

Here’s my (heavily annotated) Python Script:

import re
from os import listdir
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Print out a header for our CSV
print("Year,Word,Count")

# A handy list of words that carry almost no semantic content on their own
stpwrds = stopwords.words('english')

# Loop over all the files in my txts directory
for f in listdir('txts'):

    # This ugly thing opens a text file, demotes all of its uppercase characters,
    # removes every character that isn't a letter or whitespace, and tokenizes the result
    f2words = word_tokenize(re.sub(r'[^a-z\s]', '', open('txts/' + f, 'r').read().lower()))

    # 'words' is a Python dictionary (similar to a named list in R) that maps
    # from a word to the number of occurrences we've counted
    words = {}

    # Loop over every word in the text file
    for word in f2words:

        # Make sure we don't have any spaces floating around our word
        word = word.strip()

        # Skip all of those stopwords
        if word in stpwrds: continue

        # And write down a tickmark for the words we have left
        if word in words: words[word] += 1
        else: words[word] = 1

    # Loop over the words we recorded, and print out their counts
    for word, count in words.items():
        print(",".join([f, word, str(count)]))

This prints a CSV to stdout containing every distinct word and its respective count for each NSS, designated by the year in which it was published. (I captured the CSV to a file by redirecting the output.) From here, I’m going to switch to R for the analysis.

One thing we can look at very easily is the approximate length of the documents.
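I did this part in R, but the aggregation itself is easy to sketch in Python; the `summarize` helper and the toy rows below are illustrative stand-ins, not part of the original analysis:

```python
from collections import defaultdict

def summarize(rows):
    """Given (year, word, count) rows, return per-year word totals
    and per-year distinct-word counts."""
    total = defaultdict(int)
    distinct = defaultdict(int)
    for year, word, count in rows:
        total[year] += int(count)   # approximate document length
        distinct[year] += 1         # one row per distinct word
    return total, distinct

# In practice the rows come from the CSV the script printed, e.g.:
#   import csv
#   rows = list(csv.reader(open('counts.csv')))[1:]  # skip header; filename is hypothetical
# Toy data for illustration:
rows = [("1994", "security", "300"), ("1994", "nation", "120"),
        ("2017", "security", "250")]
total, distinct = summarize(rows)
print(total["1994"], distinct["1994"])  # 420 2
```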

Bill Clinton oversaw the most National Security Strategies (almost one for each year he was in office). He also oversaw the longest, weighing in at more than 25,000 words. Presidents Bush and Obama, by contrast, produced only two each, or one per term. There isn’t any evident trend in the relationship between time and length; the documents hover around 13,000 words no matter how many there have been, how long they’ve been appearing, or who’s in office.

Another tack we can take against the dataset is the count of distinct words.

Note how the data is best fitted by two separate regression lines, with the transition around 10,000 on the x-axis.

So, what are these super-common words? Why do they bubble up to the top of the ranking? If we take a look at the top scorers, the reasons should be pretty obvious:


These are exactly the words you should expect to see in every sentence of a US national security policy document. “States” pulls duty in a wide variety of contexts: as the “United States”, in “failed states”, or “state” as in a status. “Security” happens to be in the title: the National Security Strategy. “us” is the stripped-down version of “U.S.”, in addition to absorbing instances of its more pedestrian synonym. Et cetera, et cetera. My one final note on this table is that, were I not using NLTK’s list, I would classify “also” as a stopword.
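Pulling the top of the ranking out of the per-document counts is a short exercise; the `top_words` helper and the toy tallies here are illustrative:

```python
def top_words(counts, n=5):
    """Return the n highest-count (word, count) pairs, best first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy counts standing in for one NSS's tally:
counts = {"states": 120, "security": 95, "us": 88, "also": 40, "world": 60}
print(top_words(counts, 3))  # [('states', 120), ('security', 95), ('us', 88)]
```

And if, like me, you’d rather treat “also” as a stopword, appending it to the NLTK list (`stpwrds + ['also']`) would drop it before counting.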

Like Jay Ulfelder, I also lack the time to devote to this project the attention it deserves. However, also like Jay Ulfelder, all my work is preserved in a Git Repo. Perhaps someone else can pick up the gauntlet from here.
