The Right Way to count lexc stems
Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.
Challenges
- Unlike the dix format, lexc is not XML. ...
- Entries might be repeated
Multiple entries for a single stem.
- ...
How we're doing it now
The current approach is to recurse through the LEXC file, considering only the lexica that Root points to and the lexica those point to in turn, and counting only the unique entries (identified by stem and continuation lexicon) in each one.
- Iterate through each line of the LEXC file, maintaining a record of the current lexicon.
- Upon encountering an entry, attempt to parse the entry and add a tuple of the lemma and continuation lexicon to a set labeled by the current lexicon name.
- Upon encountering a pointer to another lexicon, add the pointer to a list labeled by the current lexicon name.
- After processing the entire file, look up the Root lexicon in the stored data and iterate through its pointers.
- Add the entries of each lexicon that the Root lexicon points to to a global entries set.
- If any of these lexica contain pointers of their own, follow them as well and add their entries to the global entries set.
- ASSUMPTION: Root lexicon points to exactly the set of lexica with lexical items
- Output the size of the accumulated entries set, which is taken as the number of unique stems present in the LEXC dictionary.
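
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the actual tool: the function name `count_stems`, the regular expressions, and the `SAMPLE` dictionary are all hypothetical, and the parsing is deliberately simplified. It assumes one entry per line, treats an unescaped `!` as the start of a comment, and does not handle escaped spaces or colons in stems, quoted glosses, or entries spread over several lines.

```python
import re
from collections import defaultdict

# Simplified patterns (assumptions, not a full lexc grammar).
COMMENT = re.compile(r"(?<!%)!.*")          # unescaped "!" starts a comment
LEXICON = re.compile(r"^LEXICON\s+(\S+)")   # "LEXICON Name"
ENTRY = re.compile(r"^(\S+)\s+(\S+)\s*;")   # "lemma:surface Cont ;"
POINTER = re.compile(r"^(\S+)\s*;")         # "Cont ;"


def count_stems(text, root="Root"):
    """Count unique (lemma, continuation) pairs reachable from `root`."""
    entries = defaultdict(set)    # lexicon name -> {(lemma, continuation)}
    pointers = defaultdict(list)  # lexicon name -> [pointed-to lexicon names]
    current = None

    for raw in text.splitlines():
        line = COMMENT.sub("", raw).strip()
        if not line:
            continue
        match = LEXICON.match(line)
        if match:
            current = match.group(1)
            continue
        if current is None:
            continue  # header material before the first LEXICON
        match = ENTRY.match(line)
        if match:
            lemma = match.group(1).split(":", 1)[0]
            entries[current].add((lemma, match.group(2)))
            continue
        match = POINTER.match(line)
        if match:
            pointers[current].append(match.group(1))

    # Follow bare pointers from Root; continuation lexica of entries
    # are not followed, so affix lexica are not counted.
    found, seen = set(), set()
    stack = list(pointers.get(root, []))
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        found |= entries.get(name, set())
        stack.extend(pointers.get(name, []))
    return len(found)


SAMPLE = """
Multichar_Symbols
%<n%>

LEXICON Root
Nouns ;
Verbs ;

LEXICON Nouns
house:house N1 ; ! a comment
cat:cat N1 ;
cat:cat N1 ;
Proper ;

LEXICON Proper
Paris:Paris NP ;

LEXICON Verbs
run:run V1 ;

LEXICON N1
%<n%>: # ;

LEXICON Orphan
ghost:ghost N1 ;
"""

if __name__ == "__main__":
    print(count_stems(SAMPLE))
```

In this sample the repeated `cat` entry is counted once, `Proper` is reached through the pointer in `Nouns`, and neither `N1` (reached only as a continuation lexicon) nor `Orphan` (never pointed to) contributes, so the sketch reports 4.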