Content Analysis in Python
This page is currently not much more than an extended advertisement for doing content analysis in Python. In time it might expand to a full tutorial, should anyone express interest in reading one. In the meantime it’ll hopefully just whet your appetite.
The scripts presented here are not intended to teach programming; I assume you have at least a vague idea about that already. Nor are they intended to exemplify fine coding style. The point is to show how easy things can be, if you pick the right tools. Now, to business…
Keyword in Context in 21 lines
Consider the following text snippet taken from the front page of the old Identity Project homepage at Harvard:
The concept of identity seems to be all the rage now in the social sciences. A critical focus of process oriented scholarship concerns why and how the social groups to which we belong whether ethnic, national, or transnational influence the knowledge, interpretations, beliefs, preferences, and strategies, that underlie both our individual and collective behavior…
The full (but still only three paragraph) text can be downloaded as idtext.txt. That’s what we’ll work with.
Very often we want to see a keyword in context (KWIC), e.g. the word and the three words either side. With the following script we will be able to type:
python kwic1.py idtext.txt identity 3
and get back all the instances of the word ‘identity’ in the specified document. When it’s working, output will arrive in the console and should look like this:
The concept of [identity] seems to be
is, broadly speaking, [identity] discourse; that is,
how to define [identity;] nor is there
and scope of [identity;] nor is there
for evidence that [identity] indeed affects knowledge,
agreement on how [identity] affects these components
how to treat [identity] as a variable.
conceptualize and study [identity.] We prefer to
this way: If [identity] is a key
develop conceptualizations of [identity] and, more importantly,
technologies for observing [identity] and identity change
observing identity and [identity] change that will
techniques for analyzing [identity] have consisted of
researchers to approach [identity] research with a
methods of analyzing [identity] as an independent
The script is reproduced below. We’ll go through it line by line, since there aren’t very many.
import sys, re

# command line arguments
file = sys.argv[1]
target = sys.argv[2]
window = int(sys.argv[3])

with open(file) as a:
    text = a.read()

tokens = text.split()  # split on whitespace
keyword = re.compile(target, re.IGNORECASE)

for index in range(len(tokens)):
    if keyword.match(tokens[index]):
        start = max(0, index - window)
        finish = min(len(tokens), index + window + 1)
        lhs = " ".join(tokens[start:index])
        rhs = " ".join(tokens[index+1:finish])
        conc = "%s [%s] %s" % (lhs, tokens[index], rhs)
        print(conc)
First, we import some modules that provide useful functions. Next we get the command line arguments. (Any text after # is a comment.) sys.argv is an array containing everything on the command line. Thus sys.argv[0], which we ignore, is the script name (computers count from zero), sys.argv[1] is the filename, sys.argv[2] is the keyword, and sys.argv[3] is the context window size. sys.argv[3] is treated as a string by default, so we convert it to an integer with the int function.
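As a quick illustration (not part of the script), this is roughly what sys.argv holds for the command shown above:

# invoked as: python kwic1.py idtext.txt identity 3
import sys
print(sys.argv)          # ['kwic1.py', 'idtext.txt', 'identity', '3'] -- all strings
print(int(sys.argv[3]))  # 3 -- now an integer we can do arithmetic with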
Having got the relevant information, we open the file and read its contents into the variable text. Next we split the text into words using the string’s split method, which assumes that words are anything separated by whitespace. This won’t work in general, but it’ll do for now.
We could simply look for exact copies of the keyword, but often a substring match is more useful; trailing bits of punctuation won’t spoil our match. Also, we don’t care about case. To make this all happen we compile a regular expression from the target.
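To see why a compiled pattern with IGNORECASE does the job here, consider this small check (illustrative only, not part of kwic1.py):

import re
keyword = re.compile("identity", re.IGNORECASE)
print(bool(keyword.match("identity;")))  # True -- trailing punctuation doesn't spoil the match
print(bool(keyword.match("Identity")))   # True -- case is ignored
print("identity" == "identity;")         # False -- an exact comparison would miss this token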
Finally we walk through the array of words, looking for our matches. If keyword matches the array element at the current index, we want to print out the matching word surrounded by its context. We compute the start and finish indices of the context explicitly to ensure we don’t ask for a negative index or one past the end of our array. Then we construct the left and right hand sides of the concordance using a simple % template, and print out the result.
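If the windowing arithmetic seems fiddly, here is the same calculation on a toy token list (purely illustrative):

tokens = ["the", "concept", "of", "identity", "seems"]
index, window = 1, 3
start = max(0, index - window)                 # max(0, -2) = 0, so no negative index
finish = min(len(tokens), index + window + 1)  # min(5, 5) = 5, so we stay inside the list
print(tokens[start:index], tokens[index+1:finish])  # ['the'] ['of', 'identity', 'seems']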
There are no doubt hundreds of ways to improve and extend this script, but it does what it is meant to. So, on to more interesting tasks.
Dictionary-based content analysis in 38 lines
The heart of most content analyses is a dictionary that assigns words to categories. In its simplest form a dictionary is just a set of words under different headings, e.g. this one, which is saved in a file called egdict.txt.
>>group
collective
consensus
agreement
>>self
self
preferences
individual
>>science
research
science
variable
analyzing
In this file each line starting with >> gives the name of a category, and every word beneath the category name is a category member. Simple, but adequate for basic dictionary-based content analysis.
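If it helps to see the same information in python terms, the dictionary could equally be written as a literal like the sketch below; the script that follows instead reads it from the file.

egdict = {
    "group":   ["collective", "consensus", "agreement"],
    "self":    ["self", "preferences", "individual"],
    "science": ["research", "science", "variable", "analyzing"],
}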
We’d like to be able to read in this dictionary file, and analyse a document with it by saying:
python dict1.py egdict.txt idtext.txt
dict1.py is the script, egdict.txt is the dictionary, and idtext.txt is the text. From this command we get the number of times words from each dictionary category appeared in the text:
Default : 0
science : 13
self : 7
group : 7
The Default category here contains all words that don’t appear under any heading. The code for this is shown below:
import sys, re

# command line arguments
dictfile = sys.argv[1]
textfile = sys.argv[2]

with open(textfile) as a:
    text = a.read().split()  # split the text on whitespace

with open(dictfile) as d:
    lines = d.readlines()

dic = {}
scores = {}
current_category = "Default"
scores[current_category] = 0

# inhale the dictionary
for line in lines:
    if line[0:2] == '>>':
        current_category = line[2:].strip()
        scores[current_category] = 0
    else:
        line = line.strip()
        if len(line) > 0:
            pattern = re.compile(line, re.IGNORECASE)
            dic[pattern] = current_category

# examine the text
for token in text:
    for pattern in dic.keys():
        if pattern.match(token):
            categ = dic[pattern]
            scores[categ] = scores[categ] + 1

# print the results
for key in scores.keys():
    print(key, ":", scores[key])
Once again, we’ll take it from the top. Much should be familiar from the previous script. We import some useful stuff, parse the command line arguments, and read in the text. Then we read in the dictionary file. This time we use readlines() rather than read() because we want to process it line by line.

Next we set up some data structures to represent the content dictionary. We shall make use of two hashtables (called dictionaries in python), dic and scores.
For those who have not met a hashtable before, it is a mapping from keys to values. Given a key, a hashtable returns the single object that is associated with it. Hashtables lie at the heart of most scripting languages such as perl and python.
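A minimal illustration, in case you haven’t met a python dictionary before:

scores = {}                            # an empty dictionary
scores["group"] = 0                    # associate the key "group" with the value 0
scores["group"] = scores["group"] + 1
print(scores["group"])                 # 1 -- the key retrieves the value stored under it
print("self" in scores)                # False -- no such key yet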
The first hashtable, dic, will be a mapping from word patterns to category names. The second, scores, will map category names to the number of times a member of that category has been recognized in the text. The first thing to do is to initialize a working category name, here the default category, and set its count to zero. Then we start reading the dictionary file.

We work through the lines in the dictionary file, checking to see whether we’ve met another category name (beginning with >>). If we haven’t, then we compile the current line into pattern (so we can do case-invariant substring matching) and add it as a key to dic. The value that this key will retrieve is set to the working category name. When we meet another category name, we switch the working category name to that and carry on filling dic.
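If you want to see what has been built, you could temporarily add something like this after the dictionary-reading loop (not part of dict1.py as listed):

for pattern, category in dic.items():
    print(pattern.pattern, "->", category)  # e.g. collective -> group, research -> science, ...
print(scores)  # every category starts at zero, e.g. {'Default': 0, 'group': 0, 'self': 0, 'science': 0}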
With the most important hashtable constructed, we can run through the text computing frequency statistics. Each time we see a word we check which, if any, of dic’s keys matches it. As soon as a key matches, we find out which category dic maps the key to. We then add one to the count in scores indexed by the category’s name. Finally, we cycle through the keys of scores (the category names) and print out their values.
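As an aside, the counting could also be done with the Counter class from python’s standard collections module. A sketch of that variant, reusing the dic built above:

from collections import Counter

scores = Counter()                     # missing keys count as zero
for token in text:
    for pattern, categ in dic.items():
        if pattern.match(token):
            scores[categ] += 1

for categ, count in scores.items():
    print(categ, ":", count)

One difference: a category that never matches (like Default here) won’t appear in the output, since Counter only records keys it has actually seen.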
There is certainly more to dictionary-based content analysis than this, but there’s only so much we can show in a few lines of code. And there’s certainly more to python than this, e.g. functions, modules, classes, and some great built-in libraries; we just didn’t need them.
Having a Go
If these simple scripts have tempted you to try this at home, then you’ll want to know how to install python, learn more of the language, and make use of the many excellent libraries available.
Installing Python
If you run Mac OS X, python (version 2) is already installed, but you probably want python (version 3). You can download that from the python homepage or use brew. Then every time you see python in the commands above, replace it with python3. (Nothing terrible will happen if you forget, but your print output may look a little wonky.)
Windows users can also get a recent python version from the python homepage.
Naturally, everything mentioned above is free.
Learning Python
The python homepage has a tutorial and lots of documentation. When I was learning python - a very long time ago now - I found the best book was Mark Lutz and David Ascher’s ‘Learning Python’, published by O’Reilly. (Avoid the similarly titled but much larger ‘Programming Python’ by Mark Lutz.) These days I’m sure there’s a lovely online tutorial. The amusingly titled free ebook/pdf Learn Python the Hard Way looks pretty good to me.
Finding Libraries
It’s quite possible, and potentially rather fun, to roll your own text processing code in python. The language does a lot for you already, from downloading pages from websites to processing xml and dealing with databases. However, some things move faster with a good targeted library.
Many useful libraries are linked from the python homepage. Of particular relevance to text processing applications is the Natural Language Processing Toolkit. NLTK implements a wide range of models from the natural language processing literature. If this aspect of content analysis interests you, you may want to have Manning and Schütze’s classic but very readable text Foundations of Statistical Natural Language Processing to hand.
Happy programming.