JCA is a suite of tools to count words, apply content analysis dictionaries, examine keywords and categories in local document context (KWIC), and a few other things.

It is available in three forms: jca, which is both a Java library and a set of associated command line tools, and rjca, which is an R package.  You probably want the R package version.

Features

  • Creates word frequency matrices in sparse formats (LDA-C or Matrix Market) with minimal memory usage. Supports stop word, currency, and number removal, and stemming.
  • Reads content analysis dictionaries in Yoshikoder, Lexicoder, LIWC, VBPro, and Wordstat formats. Supports multi-word pattern matching.
  • Creates concordances (keywords in context) for categories, dictionaries, words or phrases in text or HTML.
  • Turns text documents into single files suitable for Mallet.
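
For readers who want to inspect the sparse output outside of R, here is a minimal Python sketch of reading the LDA-C layout, in which each line describes one document as a count of unique terms followed by `term_index:count` pairs. The function names and the sample data are made up for illustration; this is not part of JCA or rjca, and it assumes a well-formed file.

```python
def parse_ldac_line(line):
    """Parse one LDA-C document line into a {term_index: count} dict."""
    fields = line.split()
    n_unique = int(fields[0])  # first field: number of unique terms
    counts = {}
    for pair in fields[1:]:
        term, count = pair.split(":")
        counts[int(term)] = int(count)
    # The declared number of unique terms should match the pairs found.
    if len(counts) != n_unique:
        raise ValueError("declared %d terms, found %d" % (n_unique, len(counts)))
    return counts


def parse_ldac(text):
    """Parse LDA-C text into a list of per-document dicts, skipping blank lines."""
    return [parse_ldac_line(line) for line in text.splitlines() if line.strip()]


# Hypothetical two-document sample, not real JCA output.
sample = "3 0:2 4:1 7:5\n2 1:1 4:3\n"
docs = parse_ldac(sample)
```

Each dict holds only the terms that occur in a document, so documents with small vocabularies stay small in memory.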

rjca

The rjca package drives the JCA tools and offers some convenience functions for dealing with the output. Here’s the package vignette that works through some of them.

Installation

You’ll need Java 1.8 and a recent version of R. R is available from a mirror site here. Installation instructions for Java and rjca are here and source code for the R part is here.

If you would rather work from the command line, use jca instead. Installation instructions for that are here, releases are here and source code is here.

License

Everything here is open source and distributed under the GNU General Public License (GPL).

Citation

If you’d like to refer to this software in written work, you can use one of these:

Lowe W. (2015) ‘rjca: An R package to drive Java Content Analysis tools’. R software version 0.2, URL https://github.com/conjugateprior/rjca

Lowe W. (2015) ‘jca: Java Content Analysis tools’. Java software version 0.2.4.1, URL https://github.com/conjugateprior/jca