The Right Way to count lexc stems

From Apertium

Latest revision as of 02:42, 10 March 2018

Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.

Challenges

  • Unlike the dix format, lexc is not XML. ...
  • Entries might be repeated.

Multiple entries for a single stem

  • ...

How we're doing it now

The current approach is to parse the LEXC file and then follow lexicon pointers recursively, starting from Root: only lexica pointed to from Root, and lexica pointed to by those, are considered, and only unique entries (identified by stem and continuation lexicon) are counted.

  1. Iterate through each line of the LEXC file, maintaining a record of the current lexicon.
    • Upon encountering an entry, attempt to parse the entry and add a tuple of the lemma and continuation lexicon to a set labeled by the current lexicon name.
    • Upon encountering a pointer to another lexicon, add the pointer to a list labeled by the current lexicon name.
  2. After processing the entire file, look up the Root lexicon in the stored data and iterate through its pointers.
    • For each lexicon that Root points to, add its entries to a global entries set.
    • If any of these lexica contain pointers themselves, follow those as well and add the entries found there to the global entries set.
    • ASSUMPTION: Root lexicon points to exactly the set of lexica with lexical items
  3. Output the length of the accumulated entries set, a number representative of the unique stems present in the LEXC dictionary.
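
The steps above can be sketched in Python roughly as follows. This is a minimal illustration and not the actual lexccounter.py: the function names `parse_lexc` and `count_stems`, the sample data, and the simplified line handling are all assumptions. In particular, the sketch treats any `!` as the start of a comment, expects one entry per line, and does not handle `%`-escaped spaces or quoted strings that contain whitespace.

```python
import re
from collections import defaultdict

LEXICON_RE = re.compile(r"^LEXICON\s+(\S+)")


def parse_lexc(text):
    """Step 1: collect entries and pointers, keyed by lexicon name."""
    entries = defaultdict(set)    # lexicon -> {(lemma, continuation), ...}
    pointers = defaultdict(list)  # lexicon -> [pointed-to lexicon, ...]
    current = None
    for raw in text.splitlines():
        # Simplification: everything after "!" is a comment (ignores "%!").
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        match = LEXICON_RE.match(line)
        if match:
            current = match.group(1)
            continue
        # Skip anything before the first LEXICON (e.g. Multichar_Symbols)
        # and lines that do not look like a complete entry.
        if current is None or not line.endswith(";"):
            continue
        fields = line[:-1].split()
        if len(fields) == 1:
            # A bare continuation lexicon is a pointer.
            pointers[current].append(fields[0])
        elif len(fields) >= 2:
            # "lemma:surface Continuation" -> keep lemma and continuation.
            lemma = fields[0].split(":", 1)[0]
            entries[current].add((lemma, fields[1]))
    return entries, pointers


def count_stems(text, root="Root"):
    """Steps 2 and 3: follow pointers from Root, count unique entries."""
    entries, pointers = parse_lexc(text)
    stems = set()
    seen = {root}  # Root's own entries are not counted; also guards cycles
    stack = list(pointers.get(root, []))
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stems |= entries.get(name, set())
        stack.extend(pointers.get(name, []))
    return len(stems)


# Hypothetical sample dictionary, for illustration only.
SAMPLE = """\
Multichar_Symbols
%<n%> %<v%>

LEXICON Root
Nouns ;
Verbs ;

LEXICON Nouns
kitap:kitap N1 ; ! "book"
kitap:kitap N1 ; ! repeated entry, counted once
alma:alma N1 ;
ProperNouns ;

LEXICON ProperNouns
Almaty:Almaty NP ;

LEXICON Verbs
al:al V-TV ;

LEXICON N1
%<n%>: CASES ;
"""

print(count_stems(SAMPLE))  # 4
```

Note that `N1` is only reached as the continuation of an entry, not through a pointer, so its affix entry is not counted. This is where the assumption that Root points to exactly the lexica holding lexical items matters.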

Ready-made script

lexccounter.py

See Also