User:Mlforcada/Sandbox/basque
Contents
How to improve Apertium-eu-es 0.3
These are some notes on how to improve apertium-eu-es 0.3 so that its performance improves for assimilation purposes and its maintenance is easier for future developers.
Lexical coverage
Lexical coverage may be improved in different ways:
Regular vocabulary
- Collect large corpora of basque news text and search for unknown words (as has been done for version 0.3)
- Using possible new vocabulary from the new version of Matxin (extracting it and converting it to our format).
- Using existing vocabulary (esp. multiword lexical units or MWLUs) in current dictionaries of apertium-eu-es, especially tagging and activating untagged MWLUs.
Proper names
- Including massive lists of proper names (place names "gazeteer", person names, etc.).
- Using some kind of guesser for proper names so that we don't have to include them in the dictionary. Apertium-cy-en uses a guesser for proper names. We can look at endings. For instance, something like
<e> <re>[A-Z]([a-z]*)</re> <p> <l>tik</l> <r><s n="np"/><s n="top"/></r> </p> </e>
could detect a place name such as Tuscaloosa if the text contains Tuscaloosatik, with a regular expression entry (well, we would also have other endings like etik and dik)
Structural transfer
Verb chunks
We need to have paradigms for the potential ("ezan") and other verb structures. Perhaps we can use information in Matxin for this and other analytical verb forms.
Noun phrases and prepositional phrases
Naming conventions
The naming of NP and PP pseudo-lemmas should be systematized. If these pseudolemmas are not used by t2x, they could be arbitrarily long and descriptive.
For instance, we have D_n_pr_d_n<SN>
for "gizonaren etxea" but Det_nom<SN>
for "gizona".
Regarding pseudolemmas, I think we have to review what to use as chunk categories and what to use as chunk pseudolemmas. In a quick visit to the current .t2x I have seen cases where we try to detect lemmas when categories could have equally been used.
What is a chunk?
The idea of having Apertium 2.0 was to curtail the proliferation of long, flat patterns, by defining chunks as building bricks for an interchunk operation.