conjugateprior

A conversion to Yoshikoder format

A couple of months ago somebody asked how to convert a new dictionary file so that it could be used in the Yoshikoder. The original file had two parts. The first half looked like this:

%                   
1   funct               
2   pronoun             
3   ppron               
4   i               
5   we              
...

This half assigns identifiers to category labels. The second half looked like this:

%                  
a   1   10
abandon*    125 127 130 131 137
abdomen*    146 147     
abilit* 355
...

Here words or wildcarded patterns are assigned to categories via their identifiers. The question was how to get this into a format the Yoshikoder can read.

In contrast, the target XML format used by the Yoshikoder looks like this (a snippet from the Laver and Garry dictionary):

<dictionary style="050805" patternengine="substring">
  <cnode name="Laver and Garry">
    <cnode name="Economy">
      <cnode name="+State+">
         <pnode name="accommodation"/>
         <pnode name="age"/>
         <pnode name="ambulance"/>

The Yoshikoder format can only represent nested category/pattern structures, so the 'tag' style had to be flattened out, with any word that carries several tags repeated under each of its categories.
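The flattening step can be sketched on its own as follows. This is a minimal illustration rather than the conversion script itself: the category labels come from the sample above, but the word-to-identifier assignments in `entries` and the helper name `flatten` are invented for the example.

```python
from collections import defaultdict

# Category labels from the first half of the sample file.
id2cat = {"1": "funct", "2": "pronoun", "3": "ppron", "4": "i", "5": "we"}

# Hypothetical word entries: these id assignments are invented for illustration.
entries = [
    ("we", ["1", "2", "3", "5"]),
    ("i", ["1", "2", "3", "4"]),
    ("abilit*", ["1"]),
]

def flatten(id2cat, entries):
    """Turn word -> [category ids] tags into category name -> set of words."""
    cat2wd = defaultdict(set)
    for word, ids in entries:
        for cat_id in ids:
            # A word with several tags is repeated under every category it has.
            cat2wd[id2cat[cat_id]].add(word)
    return dict(cat2wd)

flat = flatten(id2cat, entries)
for cat in sorted(flat):
    print(cat, sorted(flat[cat]))
```

Each word ends up once under every category it was tagged with, which is the repetition the flat category/pattern structure requires.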

The following Python script does the conversion. (The __future__ import on its first line should let it run under Python 3 as well as Python 2.)

from __future__ import print_function

import sys, os, re
from xml.sax.saxutils import escape

pm = re.compile(r"(\d+)\t(\w+)\t+") ## first set of entries
en = re.compile(r"([\w*]+)([\t\d+]+)") ## second set of entries

if __name__=='__main__':
    fname = sys.argv[1]
    with open(fname) as f:
        lines = f.readlines()
    id2cat = {}  ## category id -> category label
    cat2wd = {}  ## category label -> set of words/patterns
    for line in lines:
        mm = re.match(pm, line)
        se = re.match(en, line)
        if mm:
            id2cat[mm.group(1)] = mm.group(2)
        elif se:
            wd = se.group(1)
            ## drop the empty strings left by repeated or trailing tabs
            ids = filter(None, se.group(2).split('\t'))
            for id in ids:
                catname = id2cat[id]
                ## create the category's set on first use, then add the word
                cat2wd.setdefault(catname, set()).add(wd)

    print('<?xml version="1.0" encoding="UTF-8"?>')
    print('<dictionary style="050805" patternengine="substring">')
    print('<cnode name="imported-%s">' % escape(fname))
    for ent in cat2wd.keys():
        print('<cnode name="%s">' % escape(ent))
        for el in cat2wd[ent]:
            print('<pnode name="%s"/>' % escape(el))
        print('</cnode>')
    print('</cnode>')
    print('</dictionary>')
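
As an alternative to printing tags by hand, the same structure could be built with the standard library's xml.etree.ElementTree, which escapes attribute values (including double quotes) itself. This is a sketch under assumptions, not part of the original script: it targets Python 3 only, the function name `build_dictionary` and the sample `cat2wd` contents are made up, and the serialized string carries no XML declaration, so whether the Yoshikoder needs one added is left unverified.

```python
import xml.etree.ElementTree as ET

def build_dictionary(name, cat2wd):
    """Build a Yoshikoder-style tree from a category -> set-of-patterns dict."""
    root = ET.Element("dictionary", style="050805", patternengine="substring")
    top = ET.SubElement(root, "cnode", name=name)
    for cat in sorted(cat2wd):
        cnode = ET.SubElement(top, "cnode", name=cat)
        for pattern in sorted(cat2wd[cat]):
            ET.SubElement(cnode, "pnode", name=pattern)
    return root

# Hypothetical input, including characters that need escaping in attributes.
cat2wd = {"funct": {"a", "abilit*"}, 'odd "name" & co': {"x<y"}}

root = build_dictionary("imported-example.txt", cat2wd)
xml_text = ET.tostring(root, encoding="unicode")  # Python 3: returns a str
print(xml_text)
```

Sorting the categories and patterns is a design choice made here so that repeated runs produce identical files; the original script emits them in whatever order the dict and sets yield.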

On an OS X or Unix command line, run it as

python trans.py filetoconvert.txt > convertedfile.ykd
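
Before loading the result into the Yoshikoder, it may be worth checking that the converted file is well-formed XML. The sketch below parses it and counts the patterns in each category; the helper name `summarize` is hypothetical, and it assumes the two-level layout the script above produces (one wrapping cnode containing one cnode per category).

```python
import xml.etree.ElementTree as ET

def summarize(path):
    """Parse a converted file and return {category name: number of patterns}.

    ET.parse raises xml.etree.ElementTree.ParseError if the file is not
    well-formed, so a successful return doubles as a basic sanity check.
    """
    root = ET.parse(path).getroot()
    top = root.find("cnode")  # the wrapping "imported-..." node
    if top is None:
        return {}
    return {c.get("name"): len(c.findall("pnode")) for c in top.findall("cnode")}

# Example use (file name taken from the command above):
# for name, count in sorted(summarize("convertedfile.ykd").items()):
#     print(name, count)
```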
