Content Analysis in Python

This page is currently not much more than an extended advertisement for doing content analysis in Python. In time it might expand to a full tutorial, should anyone express interest in reading one. In the meantime it’ll hopefully just whet your appetite.

The scripts presented here are not intended to teach programming; I assume you have at least a vague idea about that already. Nor are they intended to exemplify fine coding style. The point is to show how easy things can be, if you pick the right tools. Now, to business…

Keyword in Context in 21 lines

Consider the following text snippet taken from the front page of the old Identity Project homepage at Harvard:

The concept of identity seems to be all the rage now in the social sciences. A critical focus of process oriented scholarship concerns why and how the social groups to which we belong whether ethnic, national, or transnational influence the knowledge, interpretations, beliefs, preferences, and strategies, that underlie both our individual and collective behavior…

The full (but still only three paragraph) text can be downloaded as idtext.txt. That’s what we’ll work with.

Very often we want to see a keyword in context (KWIC), e.g. the word and the three words either side. With the following script we will be able to type:

python idtext.txt identity 3

and get back all the instances of the word ‘identity’ in the following document. When it’s working, output will arrive in the console, and should look like this:

               The concept of [identity] seems to be
        is, broadly speaking, [identity] discourse; that is,
                how to define [identity;] nor is there
                 and scope of [identity;] nor is there
            for evidence that [identity] indeed affects knowledge,
             agreement on how [identity] affects these components
                 how to treat [identity] as a variable.
      conceptualize and study [identity.] We prefer to
                 this way: If [identity] is a key
develop conceptualizations of [identity] and, more importantly,
   technologies for observing [identity] and identity change
       observing identity and [identity] change that will
     techniques for analyzing [identity] have consisted of
      researchers to approach [identity] research with a
         methods of analyzing [identity] as an independent

The script is reproduced below. We’ll go through it line by line, since there aren’t very many.

import sys, re

# command line arguments
file = sys.argv[1]
target = sys.argv[2]
window = int(sys.argv[3])

with open(file) as a:
    text = a.read() # read the whole file into one string

tokens = text.split() # split on whitespace
keyword = re.compile(target, re.IGNORECASE)

for index in range( len(tokens) ):
    if keyword.match( tokens[index] ):
        start = max(0, index-window) # never before the first token
        finish = min(len(tokens), index+window+1) # never past the last
        lhs = " ".join( tokens[start:index] )
        rhs = " ".join( tokens[index+1:finish] )
        conc = "%s [%s] %s" % (lhs, tokens[index], rhs)
        print(conc)

First, we import some modules that provide useful functions. Next we get the command line arguments. (Any text after # is a comment.) sys.argv is an array containing everything on the command line. Thus, sys.argv[0], which we ignore, is the script name (computers count from zero), sys.argv[1] is the filename, sys.argv[2] is the keyword, and sys.argv[3] is the context window size. sys.argv[3] is treated as a string by default, so we convert it to an integer with the int function.
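To make the argument handling concrete, here is a minimal sketch that wraps it in a function. The function name parse_args and the script name kwic.py are hypothetical; they are used only for illustration and do not come from the script above.

```python
def parse_args(argv):
    """Split an argv-style list into (filename, keyword, window)."""
    # argv[0] is the script name, which we ignore
    filename = argv[1]
    target = argv[2]
    window = int(argv[3])  # arrives as a string, so convert it
    return filename, target, window

# Stand-in for sys.argv; "kwic.py" is an invented script name
example_argv = ["kwic.py", "idtext.txt", "identity", "3"]
print(parse_args(example_argv))
```

Passing a list in rather than reading sys.argv directly makes the function easy to try out in an interactive session.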

Having got the relevant information, we open the file and read its contents into the variable text. Next we split the text into words using the string’s split method. split assumes that words are anything separated by whitespace. This won’t work generally (punctuation stays attached to the words, for instance), but it’ll do for now.
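To see what whitespace splitting does and does not handle, the sketch below compares it with one possible alternative: a regular expression that keeps only runs of letters, digits and apostrophes. The sample sentence is adapted from the output above, and the pattern is an assumption, not a recommendation for every text.

```python
import re

sample = "how to define identity; nor is there agreement (yet)."

# Whitespace splitting leaves punctuation attached to the words
whitespace_tokens = sample.split()

# One alternative: keep only runs of letters, digits and apostrophes
word_tokens = re.findall(r"[A-Za-z0-9']+", sample)

print(whitespace_tokens)
print(word_tokens)
```

The first list contains tokens such as 'identity;' and '(yet).', while the second contains 'identity' and 'yet'.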

We could simply look for exact copies of the keyword, but often a looser match is more useful: if we only require a word to begin with the keyword, trailing bits of punctuation won’t spoil our match. Also, we don’t care about case. To make this all happen we compile a regular expression from the target and use its match method, which tests for a match at the start of each word.
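A quick way to check how the compiled pattern behaves is to try it on a few tokens. This sketch uses only the standard re module; the example tokens are made up.

```python
import re

keyword = re.compile("identity", re.IGNORECASE)

# match() only looks at the start of the token
starts_with = bool(keyword.match("Identity;"))      # case ignored, punctuation trails
not_at_start = bool(keyword.match("non-identity"))  # keyword is not at the start

# search() looks anywhere in the token
anywhere = bool(keyword.search("non-identity"))

print(starts_with, not_at_start, anywhere)
```

Because the target is compiled as a regular expression, characters such as . or * in a keyword are treated as pattern syntax; re.escape can be applied to the target first if a literal match is wanted.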

Finally we walk through the array of words, looking for our matches. If keyword matches the array element at the current index, we want to print out the matching word, surrounded by its context. We compute start and finish indices of the context explicitly to ensure we don’t ask for a negative index or one past the end of our array. Then we join the words on the left and right hand sides of the match into two strings, combine them with the matching word using a simple % template, and print out the result.
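The sample output earlier lines the keywords up in a column. One way to get that effect, sketched here as an assumption rather than as the method used to produce that output, is to collect the matches first and pad every left hand side to the same width. The function name kwic is hypothetical.

```python
import re

def kwic(tokens, target, window):
    """Return keyword-in-context lines with the keywords aligned."""
    keyword = re.compile(target, re.IGNORECASE)
    rows = []
    for index, token in enumerate(tokens):
        if keyword.match(token):
            start = max(0, index - window)                 # never below zero
            finish = min(len(tokens), index + window + 1)  # never past the end
            lhs = " ".join(tokens[start:index])
            rhs = " ".join(tokens[index + 1:finish])
            rows.append((lhs, token, rhs))
    # pad the left hand sides to a common width so the keywords line up
    width = max((len(lhs) for lhs, _, _ in rows), default=0)
    return ["%s [%s] %s" % (lhs.rjust(width), tok, rhs) for lhs, tok, rhs in rows]

sample = "The concept of identity seems to be all the rage now".split()
for line in kwic(sample, "identity", 3):
    print(line)
```

Collecting rows before printing costs a little memory, but it is the simplest way to know the widest left context in advance.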

There are no doubt hundreds of ways to improve and extend this script, but it does what it is meant to. So, on to more interesting tasks.

Dictionary-based content analysis in 38 lines

The heart of most content analyses is a dictionary that assigns words to categories. In its simplest form a dictionary is just a set of words under different headings, e.g. this one, which is saved in a file called egdict.txt.




In this file each line starting with >> signs indicates the name of a category and every word beneath the category name is a category member. Simple, but adequate for basic dictionary-based content analysis.
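To illustrate the format, the sketch below groups the words of a small dictionary under their headings. The entries are invented for the example and are not the contents of egdict.txt.

```python
# Hypothetical dictionary text in the ">>" format; these entries are invented
example = """>>science
scien
research
>>group
group
ethnic
"""

categories = {}
current = "Default"  # used for any words that come before the first heading
for line in example.splitlines():
    line = line.strip()
    if line.startswith(">>"):
        current = line[2:].strip()  # a new category name
        categories[current] = []
    elif line:
        categories.setdefault(current, []).append(line)

print(categories)
```

Each heading becomes a key, and the words beneath it become that key’s list of members.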

We’d like to be able to read in this dictionary file, and analyse a document with it by saying:

python egdict.txt idtext.txt is the script, egdict.txt is the dictionary, and idtext.txt is the text. From this line we get the number of times words from each dictionary category appeared in the text:

Default : 0 
science : 13
self : 7 
group : 7

The Default category here collects any dictionary words that appear before the first heading. There are none in this dictionary, so its count is zero. The code for this is shown below:

import sys, re

# command line arguments
dictfile = sys.argv[1]
textfile = sys.argv[2]

with open(textfile) as a:
    text = a.read().lower().split() # lowercase the text and split into words

with open(dictfile) as d:
    lines = d.readlines()

dic = {}
scores = {}

current_category = "Default"
scores[current_category] = 0

# inhale the dictionary
for line in lines:
    if line[0:2] == '>>':
        current_category = line[2:].strip() # a new category name
        scores[current_category] = 0
    else:
        line = line.strip()
        if len(line) > 0: # skip blank lines
            pattern = re.compile(line, re.IGNORECASE)
            dic[pattern] = current_category

# examine the text
for token in text:
    for pattern in dic.keys():
        if pattern.match( token ):
            categ = dic[pattern]
            scores[categ] = scores[categ] + 1

for key in scores.keys():
    print(key, ":", scores[key])

Once again, we’ll take it from the top. Much should be familiar from the previous script. We import some useful stuff, parse the command line arguments, and read in the text. Then we read in the dictionary file. This time we use readlines() rather than read() because we want to process it line by line.

Next we set up some data structures to represent the content dictionary. We shall make use of two hashtables (called dictionaries in python): dic and scores.

For those who have not met a hashtable before, it is a mapping from keys to values. Given a key, a hashtable returns the single object that is associated with it. Hashtables lie at the heart of most scripting languages such as perl and python.
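For readers new to hashtables, here is a short sketch of the basic operations on a Python dictionary. The key names are chosen only for illustration.

```python
scores = {}                                 # an empty dictionary (hashtable)
scores["science"] = 0                       # associate a key with a value
scores["science"] = scores["science"] + 1   # update the value for that key

current = scores["science"]                 # look the value up by its key
has_group = "group" in scores               # test whether a key is present
fallback = scores.get("group", 0)           # default for a missing key

print(current, has_group, fallback)
```

Looking up a missing key with square brackets raises a KeyError, which is why the scripts on this page set every category’s count to zero before counting.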

The first hashtable, dic, will be a mapping from word patterns to category names. The second, scores, will map category names to the number of times a member of that category has been recognized in the text. The first thing to do is to initialize a working category name, here the default category, and set its count to zero. Then we start reading the dictionary file.

We work through the lines in the dictionary file, checking to see if we’ve met another category name (beginning with >>). If we haven’t, then we compile the current line into pattern (so we can do case-insensitive matching at the start of each word), and add it as a key to dic. The value that this key will retrieve is set to the working category name. When we meet another category name, we switch the working category name to that, and carry on filling dic.

With the most important hashtable constructed, we can run through the text computing frequency statistics. Each time we see a word we check which, if any, of dic’s keys match it. Whenever a key matches, we find out which category dic maps the key to. We then add one to the count in scores indexed by the category’s name. Finally, we cycle through the keys of scores (the category names), and print out their values.
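The counting step can be tried on its own with a tiny hand-built dictionary. In this sketch the function name score_tokens, the patterns and the sample text are all hypothetical.

```python
import re

def score_tokens(tokens, dic):
    """Count, per category, how often one of its patterns matches a token."""
    scores = {category: 0 for category in dic.values()}
    for token in tokens:
        for pattern, category in dic.items():
            if pattern.match(token):  # match at the start of the token
                scores[category] += 1
    return scores

# Invented patterns and text, for illustration only
dic = {
    re.compile("scien", re.IGNORECASE): "science",
    re.compile("group", re.IGNORECASE): "group",
}
tokens = "social science and scientists study groups".split()
print(score_tokens(tokens, dic))
```

Note that a token is counted once for every pattern it matches, so a word that matches two patterns contributes two counts.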

There is certainly more to dictionary-based content analysis than this, but there’s only so much we can show in a few lines of code. And there’s certainly more to python than this, e.g. functions, modules, classes, and some great built-in libraries; we just didn’t need them.

Having a Go

If these simple scripts have tempted you to try this at home, then you’ll want to know how to install python, learn more of the language, and make use of the many excellent libraries available.

Installing Python

If you run an older version of Mac OSX, python (version 2) is already installed, but you probably want python (version 3); recent versions of macOS no longer bundle python 2. You can download version 3 from the python homepage or use brew. Then every time you see python in the commands above, replace it with python3. (Nothing terrible will happen if you forget, but your print output may look a little wonky.)

Windows users can also get a recent python version from the python homepage.

Naturally, everything mentioned above is free.

Learning Python

The python homepage has a tutorial and lots of documentation. When I was learning python, a very long time ago now, I found the best book was Mark Lutz and David Ascher’s ‘Learning Python’, published by O’Reilly. (Avoid the similarly titled but much larger ‘Programming Python’ by Mark Lutz.) These days I’m sure there’s a lovely online tutorial. The free and amusingly titled ebook/pdf Learn Python the Hard Way looks pretty good to me.

Finding Libraries

It’s quite possible, and potentially rather fun, to roll your own text processing code in python. The language does a lot for you already, from downloading pages from websites to processing xml and dealing with databases. However, some things move faster with a good targeted library.

Many useful libraries are linked from the python homepage. Of particular relevance to text processing applications is the Natural Language Toolkit (NLTK). NLTK implements a wide range of models from the natural language processing literature. If this aspect of content analysis interests you, you may want to have Manning and Schütze’s classic but very readable text Foundations of Statistical Natural Language Processing to hand.

Happy programming.
