Difference between revisions of "User:Mlforcada/Sandbox/basque"

From Apertium
(19 intermediate revisions by the same user not shown)

Latest revision as of 10:21, 19 November 2008

How to improve Apertium-eu-es 0.3

These are some notes on how to improve apertium-eu-es 0.3 so that it performs better for assimilation purposes and is easier for future developers to maintain.

Lexical coverage

Lexical coverage may be improved in different ways:

Regular vocabulary

  • Collect large corpora of Basque news text and search them for unknown words (as was done for version 0.3). Newcomers can use the spreadsheets designed by Mireia to do so.
  • Use possible new vocabulary from the new version of Matxin (extracting it and converting it to our format).
  • Use existing vocabulary (esp. multiword lexical units, or MWLUs) in the current apertium-eu-es dictionaries, especially by tagging and activating untagged MWLUs.

Proper names

  • Include massive lists of proper names (place names ("gazetteer"), person names, etc.).
  • Use some kind of guesser for proper names so that we don't have to include them in the dictionary. Apertium-cy-en uses a guesser for proper names. We can look at endings. For instance, something like
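    <e>
      <!-- capitalized stem of the unknown name, e.g. Tuscaloosa -->
      <re>[A-Z]([a-z]*)</re>
      <p>
        <!-- surface ending (ablative) -->
        <l>tik</l>
        <!-- analysed as a proper noun, toponym -->
        <r><s n="np"/><s n="top"/></r>
      </p>
    </e>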

could detect a place name such as Tuscaloosa, by means of a regular expression entry, when the text contains Tuscaloosatik (we would also need other endings such as etik and dik; thanks, Fran).
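The ending-based idea can be tried outside the dictionary with a few lines of Python. This is only a sketch under stated assumptions, not how Apertium or apertium-cy-en implements its guesser: the function name `guess_place_name` is hypothetical, the ending list contains just the three endings mentioned above, and the returned tag string simply mirrors the `np` and `top` symbols of the entry.

```python
import re

# Endings mentioned in the note above; a real guesser would need many more.
ABLATIVE_ENDINGS = ("etik", "tik", "dik")

# Capitalized stem (kept as short as possible) followed by one known ending.
_GUESS_RE = re.compile(r"^([A-Z][a-z]*?)(" + "|".join(ABLATIVE_ENDINGS) + r")$")


def guess_place_name(token):
    """Return (stem, ending, tags) if the token looks like an inflected
    place name, or None otherwise. Hypothetical helper for illustration."""
    match = _GUESS_RE.match(token)
    if match is None:
        return None
    stem, ending = match.groups()
    return stem, ending, "<np><top>"


if __name__ == "__main__":
    for word in ("Tuscaloosatik", "Parisetik", "etxetik"):
        print(word, guess_place_name(word))
```

Because the stem is matched non-greedily, the longer ending wins when two endings are possible (Parisetik is split as Paris + etik rather than Parise + tik). Whether that is the right preference is a design choice that would have to be checked against real text.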

Structural transfer

Verb chunks

We need to have paradigms for the potential ("ezan") and other verb structures. Perhaps we can use information in Matxin for this and other analytical verb forms.

Having "verb chunks" (when they are continuous, which is sometimes not the case for negation) and explicitly marked ergative NPs could allow us to generate a correct Spanish word order for some short sentences using interchunk operations. For instance, NP-erg NP-abs VP-nor-nork → NP-erg VP-nor-nork NP-abs. There is currently no such rule, so "Gizonek ogia erosi dute" → "Los hombres el pan han comprado".
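The effect of such an interchunk rule can be illustrated with a small Python sketch. It does not use Apertium's transfer formalism: chunks are modelled as plain (tag, text) pairs, the tag names are the ones used in the paragraph above, and the function name `reorder_erg_abs_verb` is hypothetical.

```python
def reorder_erg_abs_verb(chunks):
    """Rewrite every NP-erg NP-abs VP-nor-nork sequence as
    NP-erg VP-nor-nork NP-abs; leave all other chunks untouched.

    chunks is a list of (tag, text) pairs (an assumed toy representation).
    """
    pattern = ["NP-erg", "NP-abs", "VP-nor-nork"]
    out = []
    i = 0
    while i < len(chunks):
        window = [tag for tag, _ in chunks[i:i + 3]]
        if window == pattern:
            subject, obj, verb = chunks[i:i + 3]
            out.extend([subject, verb, obj])  # move the verb before the object
            i += 3
        else:
            out.append(chunks[i])
            i += 1
    return out


if __name__ == "__main__":
    sentence = [
        ("NP-erg", "Los hombres"),
        ("NP-abs", "el pan"),
        ("VP-nor-nork", "han comprado"),
    ]
    print(" ".join(text for _, text in reorder_erg_abs_verb(sentence)))
```

The sketch only handles the contiguous case; discontinuous verb chunks, as in negation, would need a different treatment.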

Noun phrases and prepositional phrases

Naming conventions

The naming of NP and PP pseudolemmas should be systematized. If these pseudolemmas are not used by t2x, they could be arbitrarily long and descriptive.

For instance, we have for "gizonaren etxea" but Det_nom<SN> for "gizona".

Regarding pseudolemmas, I think we have to review what to use as chunk categories and what to use as chunk pseudolemmas. A quick look at the current .t2x shows cases where pseudolemmas are matched although categories could equally have been used, with no "lexicalization".

What should constitute a chunk?

The idea behind Apertium 2.0 was to curtail the proliferation of long, flat patterns by defining chunks as building blocks, or factors, for a later interchunk operation. This does not increase the computational power of our structural transfer, but it allows long Apertium 1 patterns to be factored into shorter chunks and interchunk operations.

Having long chunks extends the range of interchunk operations, but at the cost of writing many long chunks. Focussing on those long chunks that appear frequently in corpora could be a compromise. I would not remove any existing chunk now.

Having short chunks and rich interchunk operations makes the description of chunks simpler but reduces the range (length) of structural transfer operations.

We have to find a way to reconcile both, writing our interchunk operations in the most general way, so that they can operate both on short NP/PP chunks and on frequent but long NP/PP chunks. This could help reduce the size of the structural transfer files (their size causes problems with the current interpreter).

For instance, we currently have "complex" chunks such as D_n_pr_d_n<SN> for "gizonaren etxea" (a genitive structure treated at the chunk stage, without an interchunk operation), but "simple" chunks followed by an interchunk operation in cases such as "gizon zaharraren etxea": two chunks are detected (Pr_det_nom_adj<SPGEN> and det_nom<SN>) and then reordered by an interchunk rule. Similarly, we have very long chunks such as D_n_a_pr_d_n<SN>, ^D_n_pr_d_n_pr_d_n<SN> and ^Pr_d_n_a_pr_d_n<SPGEN>. The criteria for doing things in the chunk phase or as interchunk operations should be explicitly formulated and reviewed:

  • if a long chunk is included, it should be for frequency reasons, and be added in a special part of the rule file
  • long chunks should be ready for use with interchunk rules (I think they are now)
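One way to keep interchunk rules general enough for both short and long chunks is to match on the chunk category only and ignore the pseudolemma. The following Python sketch illustrates that idea. It is not the t2x formalism: the `Chunk` type, the function name `move_genitive_after_head` and the Spanish strings are assumptions made for illustration, while the pseudolemmas and categories are the ones quoted above.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    pseudolemma: str  # e.g. "Pr_det_nom_adj"; deliberately ignored by the rule
    category: str     # e.g. "SPGEN" or "SN"; the only thing the rule looks at
    text: str         # illustrative target-language words


def move_genitive_after_head(chunks):
    """Swap every SPGEN chunk with the SN chunk that immediately follows it."""
    out = list(chunks)
    i = 0
    while i + 1 < len(out):
        if out[i].category == "SPGEN" and out[i + 1].category == "SN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip the pair just reordered
        else:
            i += 1
    return out


if __name__ == "__main__":
    short = [
        Chunk("Pr_det_nom_adj", "SPGEN", "del hombre viejo"),
        Chunk("det_nom", "SN", "la casa"),
    ]
    print(" ".join(c.text for c in move_genitive_after_head(short)))
```

Because the rule never inspects the pseudolemma, the same function applies unchanged when the genitive is a long chunk such as Pr_d_n_a_pr_d_n<SPGEN>.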


Simple NP chunks should be as complete as possible to minimize "word salad"

Here is a list of "simple" NP chunks of systematically increased complexity. Those leading to wrong translations are marked (*):

  • Softwarea.
  • Software traketsa.
  • Software traketsagoa.
  • Azken softwarea.
  • Azken software traketsa.
  • Azken software traketsagoa. (**)
  • Hizkuntz softwarea.
  • Azken hizkuntz softwarea.
  • Hizkuntz software traketsa. (*)
  • Hizkuntz software traketsagoa. (*)
  • Azken hizkuntz software traketsa. (*)
  • Azken hizkuntz software traketsagoa. (*)

The corresponding genitives are tested in the context "& etorkizuna", where the "&" is ignored by the system. Note that some NPs that are translated correctly above are not translated correctly here, as genitives:

  • Softwarearen & etorkizuna.
  • Software traketsaren & etorkizuna.
  • Software traketsagoaren & etorkizuna. (*)
  • Azken softwarearen & etorkizuna.
  • Azken software traketsaren & etorkizuna.
  • Azken software traketsagoaren & etorkizuna. (*)
  • Hizkuntz softwarearen & etorkizuna.
  • Azken hizkuntz softwarearen & etorkizuna.
  • Hizkuntz software traketsaren & etorkizuna. (*)
  • Hizkuntz software traketsagoaren & etorkizuna. (*)
  • Azken hizkuntz software traketsaren & etorkizuna. (*)
  • Azken hizkuntz software traketsagoaren & etorkizuna. (*)
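Test lists like the two above can be generated rather than typed by hand, which makes it easier to rerun them after each change to the chunk rules. The sketch below is an assumption about how one might do that, not an existing Apertium tool: the function name `build_test_nps` is hypothetical, and the word forms are copied from the lists above.

```python
from itertools import product


def build_test_nps(genitive=False):
    """Build the 12 test NPs: optional "azken", optional "hizkuntz",
    and a head with no adjective, "traketsa" or "traketsagoa".

    With genitive=True the NP takes the ending "ren" and is placed
    in the test context "& etorkizuna".
    """
    heads = ["softwarea", "software traketsa", "software traketsagoa"]
    phrases = []
    for det, modifier, head in product(["", "azken"], ["", "hizkuntz"], heads):
        words = " ".join(w for w in (det, modifier, head) if w)
        if genitive:
            words += "ren & etorkizuna"
        # Capitalize the first letter only and close the sentence.
        phrases.append(words[0].upper() + words[1:] + ".")
    return phrases


if __name__ == "__main__":
    for phrase in build_test_nps() + build_test_nps(genitive=True):
        print(phrase)
```

The output order follows the nesting of the loops, so it differs slightly from the order of the lists above. Each line could then be piped through the translator and checked by hand.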

Other things missing from chunks

Some errors can easily be corrected by extending category definitions:

  • Gizon zahar baten etxea → La casa de un hombre viejo.
  • Gizon zaharrago baten etxea → *De un hombre más viejo la casa.