A conversion to Yoshikoder format

A couple of months ago somebody asked how to convert a new dictionary file so that it would run in the Yoshikoder. The original file had two parts. The first half looked like this:

1	funct							
2	pronoun							
3	ppron							
4	i							
5	we							

This part assigns numeric identifiers to category labels. The second half looked like this:

a	1	10
abandon*	125	127	130	131	137
abdomen*	146	147			
abilit*	355

Here words or wildcarded patterns are assigned to categories via their identifiers. So how do we get this into a Yoshikoder-readable format?

The target XML format used by the Yoshikoder, in contrast, looks like this (a snippet from the Laver and Garry dictionary):

<dictionary style="050805" patternengine="substring">
<cnode name="Laver and Garry">
 <cnode name="Economy">
  <cnode name="+State+">
<pnode name="accommodation"/>
<pnode name="age"/>
<pnode name="ambulance"/>

The Yoshikoder format can only represent nested category/pattern structures, so the ‘tag’ style had to be flattened out and some words repeated: a word tagged with several identifiers appears once under each of its categories.
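To make the flattening concrete, here is a minimal sketch, separate from the conversion script. It uses the five category identifiers from the sample above; the two word entries and the `flatten` helper are made up for illustration and are not taken from the real dictionary.

```python
# Illustrative only: the ids and labels come from the sample above,
# but the two word entries are invented for this example.
id2cat = {"1": "funct", "2": "pronoun", "3": "ppron", "4": "i", "5": "we"}
entries = ["we\t1\t2\t3\t5", "i\t1\t2\t3\t4"]


def flatten(entries, id2cat):
    """Map each category label to the set of patterns tagged with it."""
    cat2wd = {}
    for entry in entries:
        # Drop empty fields produced by repeated or trailing tabs.
        fields = [f for f in entry.split("\t") if f]
        word, ids = fields[0], fields[1:]
        for ident in ids:
            cat2wd.setdefault(id2cat[ident], set()).add(word)
    return cat2wd


flat = flatten(entries, id2cat)
for cat in sorted(flat):
    print(cat, sorted(flat[cat]))
```

In this toy input the pattern `we` carries four identifiers, so it ends up listed under four separate categories. That is the repetition the nested format forces.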

The following Python script does the conversion:

import sys, re
from xml.sax.saxutils import escape

pm = re.compile(r"(\d+)\t(\w+)\t+")    ## first half: id, tab, category label, trailing tabs
en = re.compile(r"([\w*]+)([\t\d+]+)") ## second half: word or wildcard pattern, then tab-separated ids

if __name__=='__main__':
    fname = sys.argv[1]
    lines = open(fname).readlines()
    id2cat = {}
    cat2wd = {}    
    for line in lines:
        mm = re.match(pm, line)
        se = re.match(en, line)
        if mm:
            id2cat[mm.group(1)] = mm.group(2)
        elif se:
            wd = se.group(1)
            ids = filter(None, se.group(2).split('\t'))
            for id in ids:
                catname = id2cat[id]
                if catname not in cat2wd:
                    cat2wd[catname] = set()
                cat2wd[catname].add(wd)

    quot = {'"': "&quot;"}  ## also escape double quotes, since names go inside attributes
    print("<?xml version=\"1.0\" encoding=\"UTF-8\"?>")
    print("<dictionary style=\"050805\" patternengine=\"substring\">")
    print("<cnode name=\"imported-%s\">" % escape(fname, quot))
    for ent in sorted(cat2wd):
        print("<cnode name=\"%s\">" % escape(ent, quot))
        for el in sorted(cat2wd[ent]):
            print("<pnode name=\"%s\"/>" % escape(el, quot))
        print("</cnode>")
    print("</cnode>")
    print("</dictionary>")

On the command line this runs as

python trans.py filetoconvert.txt > convertedfile.ykd

Seems to work too.
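For a quick check that does not involve opening the Yoshikoder, something like the sketch below could confirm that the output is well-formed XML and count the patterns per category. The `summarize` helper and the sample string are assumptions made for illustration; the sample only mimics the shape of the script's output. It does not check that the Yoshikoder accepts the file, only that the XML parses.

```python
import xml.etree.ElementTree as ET


def summarize(xml_text):
    """Parse converted XML and return {category name: pattern count}."""
    # fromstring raises ET.ParseError if the text is not well-formed.
    root = ET.fromstring(xml_text.encode("utf-8"))
    top = root.find("cnode")  # the single wrapping "imported-..." node
    return {c.get("name"): len(c.findall("pnode")) for c in top.findall("cnode")}


# Hypothetical sample shaped like the script's output for a tiny input.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<dictionary style="050805" patternengine="substring">
<cnode name="imported-filetoconvert.txt">
<cnode name="funct">
<pnode name="a"/>
<pnode name="we"/>
</cnode>
<cnode name="we">
<pnode name="we"/>
</cnode>
</cnode>
</dictionary>"""

counts = summarize(sample)
print(counts)
```

To run it on a real conversion, read the contents of convertedfile.ykd as UTF-8 text and pass them to `summarize` in place of `sample`.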
