Difference between revisions of "The Right Way to count lexc stems"

From Apertium
Revision as of 03:09, 8 January 2014

Challenges

  • Unlike the dix format, lexc is not XML. ...

Multiple entries for a single stem.

How we're doing it now

The current approach walks the LEXC file starting from the Root lexicon, following only the lexica that Root points to and the lexica those point to in turn. Each entry is counted once, where an entry is identified by its stem and continuation lexicon.

  1. Iterate through each line of the LEXC file, maintaining a record of the current lexicon.
    • Upon encountering an entry, attempt to parse the entry and add a tuple of the lemma and continuation lexicon to a set labeled by the current lexicon name.
    • Upon encountering a pointer to another lexicon, add the pointer to a list labeled by the current lexicon name.
  2. After processing the entire file, look up the Root lexicon in the stored data and iterate through its pointers.
    • For each lexicon that Root points to, add its entries to a global entries set.
    • If any of these lexica contain pointers, follow them as well and add their entries to the global entries set.
  3. Output the size of the accumulated entries set, which represents the number of unique stems in the LEXC dictionary.
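
The steps above can be sketched roughly as follows. This is an illustrative sketch only, not the code of lexccounter.py. The function name count_stems is hypothetical, and the parsing is deliberately simplified: it assumes comments start with an unescaped "!", that every entry or pointer sits on one line ending in ";", that an entry has the form "stem Continuation ;" or "upper:lower Continuation ;", and that a pointer has the form "Continuation ;". Quoted strings, escaped characters and multi-line entries are not handled. It also assumes that pointers are followed transitively, with a visited set to guard against cycles.

```python
import re
from collections import defaultdict


def count_stems(lexc_text, root="Root"):
    """Count unique (lemma, continuation) pairs reachable from the root lexicon."""
    entries = defaultdict(set)    # lexicon name -> {(lemma, continuation lexicon)}
    pointers = defaultdict(list)  # lexicon name -> [names of lexica pointed to]
    current = None                # name of the lexicon currently being read

    # Step 1: read the file line by line, tracking the current lexicon.
    for raw in lexc_text.splitlines():
        line = raw.split("!", 1)[0].strip()  # assumption: "!" is never escaped
        if not line:
            continue
        header = re.match(r"LEXICON\s+(\S+)", line)
        if header:
            current = header.group(1)
            continue
        if current is None or not line.endswith(";"):
            continue  # material before the first lexicon, e.g. Multichar_Symbols
        parts = line[:-1].split()
        if len(parts) == 1:
            # A bare continuation: a pointer to another lexicon.
            pointers[current].append(parts[0])
        elif len(parts) == 2:
            # An entry: keep the upper side of "upper:lower" as the lemma.
            lemma = parts[0].split(":", 1)[0]
            entries[current].add((lemma, parts[1]))
        # Anything else is outside this simplified grammar and is skipped.

    # Step 2: follow pointers from the root and collect entries globally.
    collected = set()
    visited = set()
    stack = list(pointers[root])
    while stack:
        name = stack.pop()
        if name in visited:
            continue
        visited.add(name)
        collected |= entries[name]
        stack.extend(pointers[name])

    # Step 3: the size of the set is the stem count.
    return len(collected)
```

Called on the contents of a LEXC file, for instance count_stems(open("some.lexc", encoding="utf-8").read()), where the file name is a placeholder, this returns the stem count. Lexica that are never reached from Root contribute nothing, and identical stem and continuation pairs are counted once.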

Ready-made script

lexccounter.py

See Also