The Right Way to count lexc stems
		
		
		
		
		
		
Page created by Firespeaker; 8 intermediate revisions by 4 users are not shown.
Latest revision as of 02:42, 10 March 2018
Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.
Challenges
- Unlike the dix format, lexc is not XML. ...
- Entries might be repeated.
Multiple entries for a single stem
- ...
How we're doing it now
The current approach is to recurse through the LEXC file starting from Root, considering only the lexica that Root points to and the lexica that those point to in turn, and counting only the unique entries in each one. Two entries are treated as the same if they share both stem and continuation lexicon.
1. Iterate through each line of the LEXC file, keeping track of the current lexicon.
   - On encountering an entry, attempt to parse it and add a tuple of the lemma and continuation lexicon to a set keyed by the current lexicon name.
   - On encountering a pointer to another lexicon, add the pointer to a list keyed by the current lexicon name.
2. After processing the entire file, look up the Root lexicon in the stored data and iterate through its pointers.
   - Add the entries of each lexicon that the Root lexicon points to into a global entries set.
   - If these lexica contain pointers of their own, follow them as well and add their entries to the global entries set.
   - ASSUMPTION: the Root lexicon points to exactly the set of lexica with lexical items.
3. Output the size of the accumulated entries set, a number representative of the unique stems present in the LEXC dictionary.
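The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the ready-made script itself. The names `parse_lexc`, `count_stems` and `SAMPLE` are hypothetical, and the parser is deliberately simplified. It assumes one entry per line, treats a line with a single token before `;` as a pointer and a line with two or more tokens as an entry, and does not handle escaped spaces or colons (`% `, `%:`) inside stems.

```python
import re
from collections import defaultdict

LEXICON_RE = re.compile(r"^LEXICON\s+(\S+)")
COMMENT_RE = re.compile(r"(?<!%)!.*")  # "!" starts a comment unless escaped as "%!"


def parse_lexc(text):
    """Step 1: collect entries and pointers per lexicon (simplified parser)."""
    entries = defaultdict(set)    # lexicon name -> {(lemma, continuation)}
    pointers = defaultdict(list)  # lexicon name -> [pointed-to lexicon names]
    current = None
    for raw in text.splitlines():
        line = COMMENT_RE.sub("", raw).strip()
        if not line:
            continue
        match = LEXICON_RE.match(line)
        if match:
            current = match.group(1)
            continue
        # Skip anything before the first LEXICON and lines that are not entries.
        if current is None or not line.endswith(";"):
            continue
        tokens = line[:-1].split()
        if len(tokens) == 1:
            # A bare lexicon name: a pointer to another lexicon.
            pointers[current].append(tokens[0])
        elif len(tokens) >= 2:
            # "lemma:surface Continuation ;" or "lemma Continuation ;"
            lemma = tokens[0].split(":", 1)[0]
            entries[current].add((lemma, tokens[1]))
    return entries, pointers


def count_stems(text, root="Root"):
    """Steps 2 and 3: follow pointers from Root and count unique entries."""
    entries, pointers = parse_lexc(text)
    stems = set()
    seen = set()
    stack = list(pointers.get(root, []))
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # guard against cycles and repeated pointers
        seen.add(name)
        stems |= entries.get(name, set())
        stack.extend(pointers.get(name, []))
    return len(stems)


# Hypothetical sample dictionary, for illustration only.
SAMPLE = """\
LEXICON Root
Nouns ;
Verbs ;

LEXICON Nouns
cat N ;          ! a comment
cat N ;          ! repeated entry, counted once
dog:dog N ;
Proper ;

LEXICON Proper
Paris:paris N ;

LEXICON Verbs
run V ;

LEXICON N
%<n%>: # ;

LEXICON Unreachable
ghost N ;
"""

if __name__ == "__main__":
    print(count_stems(SAMPLE))  # prints 4
```

In this sketch only pointer lines are followed, not the continuation lexica of entries, so the `N` lexicon (which holds only a tag) and the `Unreachable` lexicon do not contribute to the count. This relies on the assumption stated in step 2 that Root points to exactly the lexica containing lexical items.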

