The Right Way to count lexc stems
Challenges
- Unlike the dix format, lexc is not XML. ...
- Multiple entries for a single stem.
How we're doing it now
- Iterate through each line of the LEXC file, maintaining a record of the current lexicon.
- Upon encountering an entry, attempt to parse the entry and add a tuple of the lemma and continuation lexicon to a set labeled by the current lexicon name.
- Upon encountering a pointer to another lexicon, add the pointer to a list labeled by the current lexicon name.
- After processing the entire file, look up the Root lexicon in the stored data and iterate through its pointers.
- For each lexicon that the Root lexicon points to, add that lexicon's entries to a global entries set.
- If these lexicons contain pointers of their own, follow them as well and add the entries of the lexicons they point to to the global entries set.
- Output the size of the accumulated entries set. This is the number of unique (lemma, continuation lexicon) pairs reachable from Root, which we take as the number of unique stems in the LEXC dictionary.
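
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the actual script: the function name `count_stems`, the `SAMPLE` text, and the parsing shortcuts are all assumptions. In particular, it treats everything after a `!` as a comment (so an escaped `%!` would be mishandled), assumes one entry per line, and treats a line with a single field before the `;` as a pointer to another lexicon.

```python
import re

LEXICON_RE = re.compile(r"^LEXICON\s+(\S+)")


def count_stems(lexc_text, root="Root"):
    """Count unique (lemma, continuation) pairs reachable from `root` via pointers."""
    entries = {}   # lexicon name -> set of (lemma, continuation lexicon)
    pointers = {}  # lexicon name -> list of lexicon names it points to
    current = None

    for raw in lexc_text.splitlines():
        # Naive comment stripping: assumes "!" never appears escaped in an entry.
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        match = LEXICON_RE.match(line)
        if match:
            current = match.group(1)
            continue
        if current is None or not line.endswith(";"):
            continue  # header material such as Multichar_Symbols
        fields = line[:-1].split()
        if len(fields) == 1:
            # Only a continuation class: a pointer to another lexicon.
            pointers.setdefault(current, []).append(fields[0])
        elif len(fields) >= 2:
            # "upper:lower Continuation" or "stem Continuation".
            lemma = fields[0].split(":", 1)[0]
            entries.setdefault(current, set()).add((lemma, fields[1]))

    # Follow pointers starting from Root; `seen` guards against cycles.
    stems = set()
    seen = {root}
    stack = list(pointers.get(root, []))
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stems |= entries.get(name, set())
        stack.extend(pointers.get(name, []))
    return len(stems)


# Hypothetical sample input, made up for this illustration.
SAMPLE = """\
Multichar_Symbols
%<n%>

LEXICON Root
Nouns ;
Verbs ;

LEXICON Nouns
cat:cat N ; ! a comment
dog N ;
cat:cat N ;
ProperNouns ;

LEXICON ProperNouns
Paris:Paris NP ;

LEXICON Verbs
run V ;
dog V ;

LEXICON Unreached
ghost N ;
"""

print(count_stems(SAMPLE))  # prints 5
```

In the sample, the duplicate `cat:cat N ;` line is counted once, `dog` is counted twice because it appears with two different continuation lexicons, and `ghost` is not counted because the `Unreached` lexicon is never pointed to from Root.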