The Right Way to count lexc stems

From Apertium
Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.

Challenges

  • Unlike the dix format, lexc is not XML. ...
  • Entries might be repeated

Multiple entries for a single stem.

  • ...

How we're doing it now

The current approach is to walk the LEXC file starting from the Root lexicon, considering only the lexica that Root points to and the lexica that those point to in turn, and counting only the unique entries (identified by stem and continuation lexicon) in each one.

  1. Iterate through each line of the LEXC file, maintaining a record of the current lexicon.
    • Upon encountering an entry, attempt to parse the entry and add a tuple of the lemma and continuation lexicon to a set labeled by the current lexicon name.
    • Upon encountering a pointer to another lexicon, add the pointer to a list labeled by the current lexicon name.
  2. After processing the entire file, look up the Root lexicon in the stored data and iterate through its pointers.
    • Add the entries of each lexicon that the Root lexicon points to into a global entries set.
    • If any of these lexica contain pointers of their own, follow them as well and add their entries to the global entries set.
    • ASSUMPTION: Root lexicon points to exactly the set of lexica with lexical items
  3. Output the size of the accumulated entries set, which is taken as the number of unique stems present in the LEXC dictionary.
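
The three steps above can be sketched roughly as follows. This is an illustration only, not the code of lexccounter.py: the names parse_lexc, count_stems and SAMPLE are hypothetical, and the sample lexicon is made up. It assumes one entry per line, no whitespace or escaped colons inside stems, and that an unescaped "!" starts a comment. Lines it cannot handle are skipped. It uses an explicit stack rather than recursion, and it also adds any entries found directly in Root, which is a small extension of step 2.

```python
import re
from collections import defaultdict

# Assumption: an unescaped "!" starts a comment ("%!" is a literal "!").
COMMENT = re.compile(r"(?<!%)!.*")


def parse_lexc(text):
    """Step 1: collect entries and pointers, keyed by lexicon name."""
    entries = defaultdict(set)    # lexicon -> {(lemma, continuation)}
    pointers = defaultdict(list)  # lexicon -> [pointed-to lexicon, ...]
    current = None
    for raw in text.splitlines():
        line = COMMENT.sub("", raw).strip()
        if line.startswith("LEXICON "):
            current = line.split()[1]
            continue
        if current is None or not line.endswith(";"):
            continue  # header material, blank lines, unsupported syntax
        fields = line[:-1].split()
        if len(fields) == 1:
            # A bare continuation such as "Nouns ;" is a pointer.
            pointers[current].append(fields[0])
        elif len(fields) >= 2:
            # "lemma:surface Cont ;" or "lemma Cont ;" is an entry.
            lemma = fields[0].split(":")[0]
            entries[current].add((lemma, fields[1]))
    return entries, pointers


def count_stems(text, root="Root"):
    """Steps 2 and 3: follow pointers from Root and count unique entries."""
    entries, pointers = parse_lexc(text)
    seen, stack, stems = set(), [root], set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # guards against lexica that point at each other
        seen.add(name)
        stems |= entries.get(name, set())
        stack.extend(pointers.get(name, []))
    return len(stems)


# Made-up sample data, for illustration only.
SAMPLE = """\
LEXICON Root
Nouns ;
Verbs ;

LEXICON Nouns
kitap:kitap N1 ; ! "book"
kitap:kitap N1 ; ! repeated entry, counted once
ev:ev N1 ;

LEXICON Verbs
gel:gel V1 ;

LEXICON N1
%<n%>:0 # ;
"""

print(count_stems(SAMPLE))
```

In this sketch the continuation lexicon of an entry (N1 above) is recorded as part of the entry but never followed, so affix lexica do not inflate the count. Only bare pointers are traversed, which relies on the assumption stated in step 2.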

Ready-made script

lexccounter.py

See Also