Content Analysis in Python

This page is currently not much more than an extended advertisement for doing content analysis in Python. In time it might expand to a full tutorial, should anyone express interest in reading one. In the meantime it’ll hopefully just whet your appetite.

The scripts presented here are not intended to teach programming; I assume you have at least a vague idea about that already. Nor are they intended to exemplify fine coding style. The point is to show how easy things can be, if you pick the right tools. Now, to business…

Keyword in Context in 21 lines

Consider the following text taken from the front page of the old Identity Project homepage at Harvard.

The concept of identity seems to be all the rage now in the social
sciences. A critical focus of process oriented scholarship
concerns why and how the social groups to which we belong 
whether ethnic, national, or transnational influence the
knowledge, interpretations, beliefs, preferences, and strategies, that
underlie both our individual and collective behavior. In addition, it
appears that scholars have come to recognize that much discourse by
actors is, broadly speaking, identity discourse; that is, actors use
particular adjectives that describe the self and others in order to
achieve goals, and these articulated self descriptions also serve as
motivations for behavior.

It is accurate to say, however, that there is not much consensus on
how to define identity; nor is there consistency in the procedures
used for determining the content and scope of identity; nor is there
agreement on where to look for evidence that identity indeed affects
knowledge, interpretations, beliefs, preferences, and strategies; nor
is there agreement on how identity affects these components of
action. At its simplest, the problem is that in social science there
is no consensus on how to treat identity as a variable. Not that we
should fetishize consensus but its absence reflects the dearth
of work on some basic questions about how to conceptualize and study
identity. We prefer to put the problem this way: If identity is a key
independent variable explaining political, economic, and social
behavior, how does it vary, why does it vary, and how would one know
variation if one saw it?

The aim of this project is to develop conceptualizations of identity
and, more importantly, to develop technologies for observing identity
and identity change that will have application in the social
sciences. Heretofore the usual techniques for analyzing identity have
consisted of hard-to-replicate discourse analysis or lengthy
individual interviews, at one extreme, or the use of large-N surveys
at the other. Yet, much social science research relies on historical
and contemporaneous texts. We hope to develop computer-aided
quantitative and qualitative methods for analyzing a large number of
textual sources in order to determine the content, intensity, and
contestation of individual and collective identities at any particular
point in time and space. These methods will add to the portfolio of
existing methods. It will allow researchers to approach identity
research with a wider range of tools, including more rigorous and
replicable methods of analyzing identity as an independent (and
dependent) variable.

Very often we want to see a keyword in context (KWIC), e.g. the word and the three words either side.
With the following script, saved as kwic.py (say), we will be able to type:

python kwic.py idtext.txt identity 3

and get back all the instances of the word ‘identity’ in the document above.
When it’s working, output will arrive in the console, and should look like this:

The concept of [identity] seems to be
is, broadly speaking, [identity] discourse; that is,
how to define [identity;] nor is there
and scope of [identity;] nor is there
for evidence that [identity] indeed affects knowledge,
agreement on how [identity] affects these components
how to treat [identity] as a variable.
conceptualize and study [identity.] We prefer to
this way: If [identity] is a key
develop conceptualizations of [identity] and, more importantly,
technologies for observing [identity] and identity change
observing identity and [identity] change that will
techniques for analyzing [identity] have consisted of
researchers to approach [identity] research with a
methods of analyzing [identity] as an independent

The script is reproduced below. We’ll go through it line by line, since there aren’t very many.

import sys, re

# command line arguments
file = sys.argv[1]
target = sys.argv[2]
window = int(sys.argv[3])

a = open(file)
text = a.read()

tokens = text.split() # split on whitespace
keyword = re.compile(target, re.IGNORECASE)

for index in range( len(tokens) ):
    if keyword.match( tokens[index] ):
        start = max(0, index-window)
        finish = min(len(tokens), index+window+1)
        lhs = " ".join( tokens[start:index] )
        rhs = " ".join( tokens[index+1:finish] )
        print("%s [%s] %s" % (lhs, tokens[index], rhs))

First, we import some modules that provide useful functions. Next we get the command line arguments. (Any text after ‘#’ is a comment.) sys.argv is a list containing everything on the command line. Thus sys.argv[0], which we ignore, is the script name (computers count from zero), sys.argv[1] is the filename, sys.argv[2] is the keyword, and sys.argv[3] is the context window size. sys.argv[3] arrives as a string, so we convert it to an integer with int().
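As a quick sketch of what the script sees, we can simulate the command line by assigning to sys.argv (with kwic.py as a stand-in script name of my own choosing):

```python
import sys

# pretend the script was invoked as: python kwic.py idtext.txt identity 3
sys.argv = ["kwic.py", "idtext.txt", "identity", "3"]

file = sys.argv[1]         # "idtext.txt"
target = sys.argv[2]       # "identity"
window = int(sys.argv[3])  # the string "3" becomes the integer 3

print(file, target, window)  # → idtext.txt identity 3
```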

Having got the relevant information, we open the file and read its contents into the variable text. Next we split the text into words using the split method of strings. split assumes that words are anything separated by whitespace. This won’t work in general, but it’ll do for now.
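A sketch of what split does to a fragment of the text above; note that punctuation stays attached to the neighbouring word, which is why ‘identity;’ shows up in the concordance output:

```python
fragment = "how to define identity; nor is there"
tokens = fragment.split()  # split on runs of whitespace

print(tokens)
# → ['how', 'to', 'define', 'identity;', 'nor', 'is', 'there']
```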

We could simply look for exact copies of the keyword, but often a substring match is more useful; trailing bits of punctuation won’t spoil our match. Also, we don’t care about case. To make this all happen we compile a regular expression from the target.
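A small sketch of the behaviour this buys us: match() anchors at the start of the token, so trailing punctuation and case differences don’t matter, but a prefix does:

```python
import re

keyword = re.compile("identity", re.IGNORECASE)

print(bool(keyword.match("identity;")))    # True: trailing punctuation ignored
print(bool(keyword.match("Identity")))     # True: case-insensitive
print(bool(keyword.match("nonidentity")))  # False: match() anchors at the start
```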

Finally we walk through the list of words, looking for our matches. If keyword matches the list element at the current index, we want to print out the matching word, surrounded by its context. We compute start and finish indices of the context explicitly to ensure we don’t ask for a negative index or one past the end of the list. Then we construct the left and right hand sides of the concordance, and print the result using a simple template.

There are no doubt hundreds of ways to improve and extend this script, but it does what it is meant to. So, on to more interesting tasks.
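One such improvement, sketched here under my own naming (tokenize and kwic are not part of the script above), is to strip punctuation during tokenization and package the whole thing as a reusable function:

```python
import re

def tokenize(text):
    # keep only runs of word characters, dropping punctuation
    return re.findall(r"\w+", text)

def kwic(text, target, window):
    """Return (left-context, keyword, right-context) triples."""
    tokens = tokenize(text)
    keyword = re.compile(target, re.IGNORECASE)
    results = []
    for index, token in enumerate(tokens):
        if keyword.match(token):
            start = max(0, index - window)
            finish = min(len(tokens), index + window + 1)
            results.append((" ".join(tokens[start:index]),
                            token,
                            " ".join(tokens[index + 1:finish])))
    return results

for lhs, word, rhs in kwic("The concept of identity seems to be all the rage", "identity", 3):
    print("%s [%s] %s" % (lhs, word, rhs))
    # → The concept of [identity] seems to be
```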

Dictionary-based content analysis in 41 lines

The heart of most content analyses is a dictionary that assigns words to categories. In its simplest form a dictionary is just a set of words under different headings, e.g. the following (the word stems here are my own illustrations; any you like will do), which we might save as egdict.txt.

>>science
scien
knowledge
method
>>self
self
individual
>>group
group
collective
ethnic
nation

In this file each line starting with ‘>>’ indicates the name of a category, and every word beneath the category name is a category member. Simple, but adequate for basic dictionary-based content analysis.

We’d like to be able to read in this dictionary file, and analyse a document with it by saying:

python dictcount.py egdict.txt idtext.txt

where dictcount.py is the script (saved under that name, say), egdict.txt is the dictionary, and idtext.txt is the text. From this line we get the number of times words from each dictionary category appeared in the text.

Default : 0 
science : 13
self : 7 
group : 7

The Default category here collects any dictionary words listed before the first category heading. The code for this is shown below:

import sys, re

# command line arguments
dictfile = sys.argv[1]
textfile = sys.argv[2]

a = open(textfile)
text = a.read().lower().split() # lowercase the text and split into words

a = open(dictfile)
lines = a.readlines()

dic = {}
scores = {}

# a default category for simple word lists
current_category = "Default"
scores[current_category] = 0

# inhale the dictionary
for line in lines:
    if line[0:2] == '>>':
        current_category = line[2:].strip()
        scores[current_category] = 0
    else:
        line = line.strip()
        if len(line) > 0:
            pattern = re.compile(line, re.IGNORECASE)
            dic[pattern] = current_category

# examine the text
for token in text:
    for pattern in dic:
        if pattern.match( token ):
            categ = dic[pattern]
            scores[categ] = scores[categ] + 1

for key in scores:
    print(key, ":", scores[key])

Once again, we’ll take it from the top. Much should be familiar from the previous script. We import some useful modules, parse the command line arguments, and read in the text. Then we read in the dictionary file. This time we use readlines() rather than read() because we want to process it line by line.

Next we set up some data structures to represent the content dictionary. We shall make use of two hashtables (called dictionaries in python), dic and scores.

For those who have not met a hashtable before, it is a mapping from keys to values. Given a key, a hashtable returns the single object that is associated with it. Hashtables lie at the heart of most scripting languages such as perl and python.
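For example, in python:

```python
scores = {}             # an empty hashtable (a python dict)
scores["science"] = 0   # associate the key "science" with the value 0
scores["science"] += 1  # look the value up and increment it

print(scores["science"])  # → 1
```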

The first hashtable, dic, will be a mapping from word patterns to category names. The second, scores, will map category names to the number of times a member of that category has been recognized in the text. The first thing to do is to initialize a working category name, here the default category, and set its count to zero. Then we start reading the dictionary file.

We work through the lines in the dictionary file, checking to see if we’ve met another category name (beginning with ‘>>’). If we haven’t, then we compile the current line into pattern (so we can do case invariant substring matching), and add it as a key to dic. The value that this key will retrieve is set to the working category name. When we meet another category name, we switch the working category name to that, and carry on filling dic.

With the most important hashtable constructed, we can run through the text computing frequency statistics. Each time we see a word we check which, if any, of dic’s keys matches it. As soon as a key matches, we find out which category dic maps the key to. We then add one to the count in scores indexed by the category’s name. Finally, we cycle through the keys of scores (the category names), and print out their values.
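The same logic can be sketched end to end on inline strings instead of files (the category names follow the example above, but the word stems and the sample sentence are my own illustrations):

```python
import re

dict_lines = """>>science
scien
variable
>>self
self
""".splitlines()

text = "social science treats identity as a variable of the self".lower().split()

dic = {}
scores = {"Default": 0}
current_category = "Default"

# inhale the dictionary: headings switch the category, other lines become patterns
for line in dict_lines:
    if line[0:2] == ">>":
        current_category = line[2:].strip()
        scores[current_category] = 0
    else:
        line = line.strip()
        if len(line) > 0:
            dic[re.compile(line, re.IGNORECASE)] = current_category

# examine the text: each matching token increments its category's count
for token in text:
    for pattern in dic:
        if pattern.match(token):
            scores[dic[pattern]] += 1

print(scores)  # → {'Default': 0, 'science': 2, 'self': 1}
```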

There is certainly more to dictionary-based content analysis than this, but there’s only so much we can show in a few lines of code. And there’s certainly more to python than this, e.g. functions, modules, classes, and some great built-in libraries; we just didn’t need them.

Having a Go

If these simple scripts have tempted you to try this at home, then you’ll want to know how to install python, learn more of the language, and make use of the many excellent libraries available.

Installing Python

If you run Mac OS X, python is already installed. If not, you can download the latest version from the python homepage.

Naturally, everything mentioned above is free.

Learning Python

The python homepage has a tutorial and lots of documentation. Although we have made no use of it here, python has an interactive interpreter; just type python at your system prompt and do some exploring.

I found the best book for learning python is Mark Lutz and David Ascher’s Learning Python, published by O’Reilly. (Avoid the similarly titled but much larger Programming Python by Mark Lutz.)

Finding Libraries

It’s quite possible, and potentially rather fun to roll your own text processing code in python. The language does a lot for you already, from downloading pages from websites to processing xml and dealing with databases. However, some things move faster with a good targeted library.

Many useful libraries are linked from the python homepage. Of particular relevance to text processing applications is the Natural Language Processing Toolkit. NLTK implements a wide range of models from the natural language processing literature. If this aspect of content analysis interests you, you may want to have Manning and Schütze’s classic but very readable text Foundations of Statistical Natural Language Processing to hand.

Happy programming.