Difference between revisions of "The Right Way to count lexc stems"
Firespeaker (talk | contribs) (Created page with " == Challenges == * Unlike dix format, lexc is not xml. ... === Multiple entries for a single stem. === == How we're doing it now == == Ready-made script == [https:...") |
Line 8:
== How we're doing it now ==
+ # Iterate through each line of the lexc file, keeping track of the current lexicon.
+ #* On encountering an entry, attempt to parse it and add a tuple of the lemma and continuation lexicon to a set keyed by the current lexicon name.
+ #* On encountering a pointer to another lexicon, add the pointer to a list keyed by the current lexicon name.
+ # After processing the entire file, look up the Root lexicon in the stored data and iterate through its pointers.
+ #* Add the entries of each lexicon that the Root lexicon points to to a global entries set.
+ #* If these lexicons contain pointers of their own, follow them as well and add their entries to the global entries set.
+ # Output the size of the accumulated entries set, which represents the number of unique stems in the lexc dictionary.
== Ready-made script ==
Revision as of 03:04, 8 January 2014
Challenges
- Unlike dix format, lexc is not xml. ...
Multiple entries for a single stem.
How we're doing it now
1. Iterate through each line of the lexc file, keeping track of the current lexicon.
   - On encountering an entry, attempt to parse it and add a tuple of the lemma and continuation lexicon to a set keyed by the current lexicon name.
   - On encountering a pointer to another lexicon, add the pointer to a list keyed by the current lexicon name.
2. After processing the entire file, look up the Root lexicon in the stored data and iterate through its pointers.
   - Add the entries of each lexicon that the Root lexicon points to to a global entries set.
   - If these lexicons contain pointers of their own, follow them as well and add their entries to the global entries set.
3. Output the size of the accumulated entries set, which represents the number of unique stems in the lexc dictionary.
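The steps above can be sketched roughly as follows. This is only an illustration, not the ready-made script: the function name `count_stems`, the `SAMPLE` data, and the parsing rules are assumptions made for the example. In particular, it assumes one entry per line, treats a line with a single field before `;` as a pointer, takes the lemma to be whatever precedes the first `:`, treats every `!` as the start of a comment (so escaped `%!` is not handled), and ignores quoted glosses. The `seen` set that guards against cyclic pointers is also an addition not mentioned in the steps.

```python
from collections import defaultdict


def count_stems(lexc_text, root="Root"):
    """Count unique (lemma, continuation) pairs reachable from the root lexicon."""
    entries = defaultdict(set)    # lexicon name -> {(lemma, continuation lexicon)}
    pointers = defaultdict(list)  # lexicon name -> [names of lexicons pointed to]
    current = None                # name of the lexicon we are currently inside

    for raw in lexc_text.splitlines():
        # Assumption: any "!" starts a comment (escaped "%!" is not handled).
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        if line.startswith("LEXICON "):
            current = line.split()[1]
            continue
        # Skip anything before the first LEXICON (e.g. Multichar_Symbols)
        # and anything that is not a complete "... ;" line.
        if current is None or not line.endswith(";"):
            continue
        fields = line[:-1].split()
        if len(fields) == 1:
            # A bare lexicon name: a pointer to another lexicon.
            pointers[current].append(fields[0])
        elif len(fields) >= 2:
            # "lemma:surface Continuation": keep the lemma and the continuation.
            lemma = fields[0].split(":", 1)[0]
            entries[current].add((lemma, fields[-1]))

    stems = set()   # global entries set
    seen = set()    # lexicons already visited (guards against cycles)
    stack = list(pointers.get(root, []))
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stems |= entries.get(name, set())
        stack.extend(pointers.get(name, []))
    return len(stems)


SAMPLE = """\
Multichar_Symbols
%<n%>

LEXICON Root
Nouns ;
Verbs ;

LEXICON Nouns
alma:alma N1 ; ! apple
alma:alma N1 ;
kitap:kitap N1 ;
Propers ;

LEXICON Verbs
bar:bar V-IV ;

LEXICON Propers
Alma:Alma NP ;

LEXICON Unreached
foo:foo N1 ;
"""

print(count_stems(SAMPLE))  # prints 4
```

In the sample, the duplicated `alma` entry is counted once because entries are stored in sets, and `foo` is not counted because no chain of pointers from Root reaches the `Unreached` lexicon.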