Difference between revisions of "The Right Way to count lexc stems"

From Apertium

Revision as of 03:04, 8 January 2014

Challenges

  • Unlike the dix format, lexc is not XML. ...

Multiple entries for a single stem.

How we're doing it now

  1. Iterate through each line of the LEXC file, keeping track of the current lexicon.
    • On encountering an entry, attempt to parse it and add a tuple of the lemma and continuation lexicon to a set keyed by the current lexicon name.
    • On encountering a pointer to another lexicon, add the pointer to a list keyed by the current lexicon name.
  2. After processing the entire file, look up the Root lexicon in the stored data and iterate through its pointers.
    • Add the entries of each lexicon that the Root lexicon points to into a global entries set.
    • If these lexicons contain pointers of their own, follow them as well and add their entries to the global entries set.
  3. Output the size of the accumulated entries set, which represents the number of unique stems in the LEXC dictionary.
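
The steps above can be sketched in Python roughly as follows. This is a minimal illustration and not the code of lexccounter.py: the function names parse_lexc and count_stems, the SAMPLE text, and the simplified line parsing are all assumptions made for the example. It treats "!" as a comment marker, ignores "%" escapes and quoted glosses, and takes any line with a single field before ";" to be a pointer.

```python
import re
from collections import defaultdict

LEXICON_RE = re.compile(r"^LEXICON\s+(\S+)")


def parse_lexc(text):
    """Step 1: collect entries and pointers, keyed by lexicon name (hypothetical helper)."""
    entries = defaultdict(set)    # lexicon name -> {(lemma, continuation), ...}
    pointers = defaultdict(list)  # lexicon name -> [target lexicon name, ...]
    current = None
    for raw in text.splitlines():
        # Simplification: everything after "!" is a comment (escaped "%!" is not handled).
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        header = LEXICON_RE.match(line)
        if header:
            current = header.group(1)
            continue
        if current is None or ";" not in line:
            continue  # e.g. the Multichar_Symbols block before the first lexicon
        fields = line.split(";", 1)[0].split()
        if len(fields) == 1:
            # A bare lexicon name is a pointer to another lexicon.
            pointers[current].append(fields[0])
        elif len(fields) >= 2:
            # "lemma:surface Continuation": keep only the lemma side.
            lemma = fields[0].split(":", 1)[0]
            entries[current].add((lemma, fields[1]))
    return entries, pointers


def count_stems(text, root="Root"):
    """Steps 2 and 3: follow pointers from the root lexicon and count unique entries."""
    entries, pointers = parse_lexc(text)
    stems = set()
    seen = set()  # guards against pointer cycles
    stack = list(pointers.get(root, []))
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stems |= entries.get(name, set())
        stack.extend(pointers.get(name, []))
    return len(stems)


# Made-up sample data for illustration only.
SAMPLE = """\
Multichar_Symbols
%<n%> %<v%>

LEXICON Root
Nouns ;
Verbs ;

LEXICON Nouns
kitap:kitap N1 ; ! "book"
alma:alma N1 ;
alma:alma N1 ;
MoreNouns ;

LEXICON MoreNouns
su:su N2 ;

LEXICON Verbs
al:al V1 ;
"""

if __name__ == "__main__":
    print(count_stems(SAMPLE))
```

Because each entry is stored as a (lemma, continuation) tuple in a set, the repeated alma line in the sample is counted once. As in the description above, only bare pointers are followed; the continuation lexicon of an entry is recorded but not traversed.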

Ready-made script

lexccounter.py

See Also