Difference between revisions of "User:Mlforcada/Sandbox/basque"

From Apertium
Revision as of 09:03, 19 November 2008

How to improve Apertium-eu-es 0.3

These are some notes on how to improve apertium-eu-es 0.3 so that it performs better for assimilation purposes and is easier for future developers to maintain.

Lexical coverage

Lexical coverage may be improved in different ways:

Regular vocabulary

  • Collecting large corpora of Basque news text and searching them for unknown words (as has been done for version 0.3).
  • Using possible new vocabulary from the new version of Matxin.
  • Using existing vocabulary (esp. multiword lexical units or MWLUs) in the current dictionaries of apertium-eu-es, especially by tagging and activating untagged MWLUs.
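
The corpus search in the first item could be sketched roughly as follows. This is only an illustration, assuming the corpus has already been run through the morphological analyser and that unknown words appear in its stream output as ^surface/*surface$. The function name count_unknown_words and the sample line (including its tags) are made up for the example and are not part of apertium-eu-es.

```python
import re
from collections import Counter

# Assumption: an unknown word in the analyser output looks like ^surface/*surface$
UNKNOWN_RE = re.compile(r"\^([^/^$]+)/\*[^$]*\$")

def count_unknown_words(analysed_text):
    """Count the surface forms (lowercased) that the analyser marked as unknown."""
    return Counter(m.group(1).lower() for m in UNKNOWN_RE.finditer(analysed_text))

# Hypothetical analysed text; the tags are illustrative only.
sample = "^etxea/etxe<n><sg>$ ^blogaria/*blogaria$ ^da/izan<vbsint>$ ^Blogaria/*Blogaria$"

# List unknown words, most frequent first, as candidates for the dictionaries.
for word, freq in count_unknown_words(sample).most_common():
    print(freq, word)
```

Sorting by frequency puts the words whose addition would raise coverage most at the top of the list. Lowercasing merges sentence-initial and mid-sentence occurrences, which is convenient for regular vocabulary but would need revisiting if the same list is used to look for proper names.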

Proper names

  • Including massive lists of proper names (a place-name gazetteer, person names, etc.).
  • Using some kind of guesser for proper names so that we don't have to include them in the dictionary.
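
One very simple form such a guesser could take is sketched below. It is a hypothetical heuristic, not something that exists in Apertium: a token is guessed to be a proper name if it is capitalised, is not the first word of a sentence, and is not in the list of known words. The names guess_proper_names and known_words are invented for the example.

```python
def guess_proper_names(tokens, known_words):
    """Return tokens that are capitalised, not sentence-initial, and unknown."""
    guesses = []
    sentence_start = True
    for tok in tokens:
        if tok in {".", "!", "?"}:
            # The token after sentence-final punctuation starts a new sentence.
            sentence_start = True
            continue
        is_candidate = (
            tok[:1].isupper()
            and not sentence_start
            and tok.lower() not in known_words
        )
        if is_candidate:
            guesses.append(tok)
        sentence_start = False
    return guesses

# Hypothetical input: a tokenised sentence and a tiny set of known words.
tokens = "Atzo Mikel Bilbora joan zen .".split()
known = {"atzo", "joan", "zen"}
print(guess_proper_names(tokens, known))
```

A real guesser would also have to deal with case endings attached to names (as in "Bilbora") and with sentence-initial names, neither of which this sketch attempts.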


Structural transfer

Verb chunks

We need to have paradigms for the potential ("ezan") and other verb structures. Perhaps we can use information in Matxin for this and other analytical verb forms.

Noun phrases and prepositional phrases

Naming conventions