The Right Way to count lexc stems

From Apertium
Jump to navigation Jump to search

Challenges

  • Unlike dix format, lexc is not xml. ...

Multiple entries for a single stem.

How we're doing it now

  1. Iterate through each line of the LEXC file, maintaining a record of the current lexicon.
    • Upon encountering an entry, attempt to parse the entry and add a tuple of the lemma and continuation lexicon to a set labeled by the current lexicon name.
    • Upon encountering a pointer to another lexicon, add the pointer to a list labeled by the current lexicon name.
  2. After processing the entire file, lookup the Root lexicon in the stored data and iterate through its pointers
    • Add the entries of each lexicon the Root lexicon contains a pointer towards to a global entries set.
    • If there are pointers present in these lexicon, follow them as well and add their entries to the global entries set.
  3. Output the length of the accumulated entries set, a number representative of the unique stems present in the LEXC dictionary.

Ready-made script

lexccounter.py

See Also