YKConverter

Turn documents into texts

YKConverter is a utility that tries to extract the text from documents in various formats (HTML, Word, PDF, PowerPoint, Excel) and save it as UTF-8 encoded text. You might do this to prepare for a subsequent content analysis.

The software is a thin wrapper around the awesome Apache POI, PDFBox, and TagSoup libraries. They’re why it’s so big (and also why it works).
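To make the "thin wrapper" idea concrete, here is a minimal sketch of how a document might be routed to one of the three libraries. It is an illustration only, not YKConverter's actual code: the class name FormatGuess, the Library enum, and the choice to decide by file extension are all assumptions, and the extension list is deliberately short.

```java
import java.util.Locale;

// Hypothetical illustration: none of these names come from YKConverter itself.
class FormatGuess {

    // The three libraries named above, plus a fallback for unrecognised files.
    enum Library { APACHE_POI, PDFBOX, TAGSOUP, UNKNOWN }

    // Guess which library would handle a file, judging only by its extension.
    // Assumption: a real converter may inspect file contents instead.
    static Library libraryFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = (dot >= 0)
                ? fileName.substring(dot + 1).toLowerCase(Locale.ROOT)
                : "";
        switch (ext) {
            case "doc":
            case "ppt":
            case "xls":
                return Library.APACHE_POI; // Word, PowerPoint, Excel
            case "pdf":
                return Library.PDFBOX;
            case "htm":
            case "html":
                return Library.TAGSOUP;
            default:
                return Library.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        System.out.println(libraryFor("minutes.doc"));
        System.out.println(libraryFor("paper.pdf"));
        System.out.println(libraryFor("index.html"));
    }
}
```

Bundling three full libraries for what is, on the surface, a small routing job like this is what makes the download large.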

Just drag your documents into the window and they will be converted. Some formats take longer than others, and the results are never guaranteed. When conversion is complete you can adjust the resulting text if you like. When you save all the documents in the window, they will arrive either next to the documents they were converted from or in a folder of your choice, depending on how you set the Preferences.
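The two save locations described above can be sketched in a few lines of Java. This is a hedged illustration rather than the program's real implementation: the class OutputLocation, its method names, and the ".txt" naming rule are assumptions made for the example. Only the behaviour (save beside the original or in a chosen folder, as UTF-8) comes from the description.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: names and the ".txt" convention are assumptions.
class OutputLocation {

    // Swap the last extension for ".txt", e.g. "report.pdf" -> "report.txt".
    static String textNameFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String stem = (dot > 0) ? fileName.substring(0, dot) : fileName;
        return stem + ".txt";
    }

    // If outputDir is null, place the text next to the source document;
    // otherwise place it inside the chosen folder.
    static Path targetFor(Path source, Path outputDir) {
        String name = textNameFor(source.getFileName().toString());
        if (outputDir != null) {
            return outputDir.resolve(name);
        }
        return source.resolveSibling(name);
    }

    // Write the extracted text as UTF-8 bytes.
    static void saveUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Path source = Paths.get("docs", "report.pdf");
        System.out.println(targetFor(source, null));
        System.out.println(targetFor(source, Paths.get("texts")));
    }
}
```

Passing null for the folder stands in for the "next to the original" preference; a real application would more likely read that choice from its stored settings.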

The screenshot shows the application on a Mac with two converted files and the help open.

Download

You can download the latest release (0.5) of the software for your operating system below. Note that you must have Java installed first. (Mac users should be prompted if it is not already there.)

Mac OS X: YKConverter-0.5.dmg
Any OS: YKConverter-0.5.jar

Windows users may need to right-click the link and choose "Save As" in order to download the jar version without problems. If Java is installed, a window should pop up after double-clicking the jar file.

The source code for this release is on GitHub. Source code for previous releases is on SourceForge.

License

YKConverter is open source software distributed under the GNU General Public License (GPL).

Citation

If you’d like to refer to the package in written work, you can use this:

Lowe W. (2010) ‘YKConverter: Turn documents into texts’. Java software version 0.5, URL http://www.conjugateprior.org/software/ykconverter/