Basque and Spanish

Latest revision as of 13:22, 10 December 2010

Mireia Ginestí is recycling Matxin (http://www.sf.net/projects/matxin/) data to build an Apertium-based system that would allow Spanish speakers to read Basque newspapers.

Some of the morphological choices in Matxin will be revised.

This document keeps track of decisions and raises open questions.

Deklinabidea?

For instance, "declension" (deklinabidea) will be treated as postpositions:

gizonentzat : gizon.n + a.det.pl + tzat.post

In principle, the absolutive will not be marked:

gizonak : gizon.n + a.det.pl

Determiners and postpositions will be given mnemonic lemmas, one per case.

gizonei : gizon.n + a.det.pl + i.post


Mirenekin : Miren.NP + kin.post
katuarentzat : katu.n + a.det.sg + tzat.post

Postpositions which can modify a noun phrase will be marked explicitly with ko:

etxeetako: etxe.n + a.det.pl + ko.post.ko
Mikelekin : Mikel.NP + kin.post
Mikelekiko : Mikel.NP + kin.post.ko
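
The dotted notation used in the examples above can be read mechanically. The following is a minimal sketch, assuming only the layout shown on this page ("surface : lemma.tag.tag + lemma.tag"); the names LexicalForm, parse_decomposition and modifies_noun_phrase are hypothetical helpers, not part of Apertium or Matxin.

```python
from typing import List, NamedTuple, Tuple


class LexicalForm(NamedTuple):
    lemma: str
    tags: Tuple[str, ...]


def parse_decomposition(line: str) -> Tuple[str, List[LexicalForm]]:
    """Split 'surface : lemma.tag + lemma.tag' into surface form and lexical forms."""
    surface, _, analysis = line.partition(":")
    forms = []
    for chunk in analysis.split("+"):
        # The first dotted field is the lemma, the rest are tags.
        lemma, *tags = chunk.strip().split(".")
        forms.append(LexicalForm(lemma, tuple(tags)))
    return surface.strip(), forms


def modifies_noun_phrase(forms: List[LexicalForm]) -> bool:
    """True when the last lexical form is a postposition carrying the 'ko' mark."""
    last = forms[-1]
    return "post" in last.tags and "ko" in last.tags


if __name__ == "__main__":
    for example in ("Mikelekin : Mikel.NP + kin.post",
                    "Mikelekiko : Mikel.NP + kin.post.ko"):
        surface, forms = parse_decomposition(example)
        print(surface, forms, modifies_noun_phrase(forms))
```

With this reading, Mikelekin and Mikelekiko differ only in the final ko mark on the postposition, which is the distinction the last rule above is meant to capture.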

Possessives?

A problem appears with "possessives" like 'nire', 'gure', 'zuen', 'haien', 'bere'. Should they be treated as preadjectives ('izenlagun') or as genitive constructs such as the following?

nire: ni.pron.sg + ren.post.ko
haien : hura.pron.pl + ren.post.ko

Undefined determiners (or quantifiers)

There are some words in Basque that could be considered either adjectives or quantifiers (asko, gehiegi, nahiko, etc.).

Like determiners and unlike adjectives, they can signal the end of a noun phrase (NP), which is one reason why they should not be tagged as adjectives.

They can also be followed by another determiner ('etxe askoa').

Matxin dictionaries tag them as undefined determiners. We decided to tag them the same way, adding a distinction for those that usually come before the noun, as 'izenlagun' adjectives do (for example, 'nahiko').

Distinction plural determiner/ergative

The plural article and the ergative postposition have the same form (are ambiguous):

^liburuak/liburu<n>+a<det><art><pl>/liburu<n>+a<det><art><sg>+k<post>$

The tagger will not, in principle, be able to solve this ambiguity, as it depends on long-distance characteristics (whether the verb is NOR-NORK/NOR-NORI-NORK or NOR/NOR-NORI).
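
To make the ambiguity concrete, here is a minimal sketch that splits a lexical unit in the stream format shown above into its competing analyses. It assumes the simple layout used on this page (no escaped characters); parse_lexical_form and parse_unit are hypothetical names, not Apertium functions.

```python
import re
from typing import List, Tuple

LexicalForm = Tuple[str, Tuple[str, ...]]  # (lemma, tags)


def parse_lexical_form(text: str) -> LexicalForm:
    """Turn 'a<det><art><pl>' into ('a', ('det', 'art', 'pl'))."""
    lemma = text.split("<", 1)[0]
    tags = tuple(re.findall(r"<([^>]+)>", text))
    return lemma, tags


def parse_unit(unit: str) -> Tuple[str, List[List[LexicalForm]]]:
    """Parse '^surface/analysis/analysis$' into the surface form and its analyses."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("expected a unit of the form ^surface/analysis$")
    surface, *analyses = body[1:-1].split("/")
    # Lexical forms inside one analysis are joined with '+'.
    return surface, [[parse_lexical_form(lf) for lf in a.split("+")] for a in analyses]


if __name__ == "__main__":
    surface, analyses = parse_unit(
        "^liburuak/liburu<n>+a<det><art><pl>/liburu<n>+a<det><art><sg>+k<post>$"
    )
    for analysis in analyses:
        print(surface, len(analysis), analysis)
```

The two analyses differ in length (two lexical forms for the plural reading, three for the ergative one), and nothing inside the word itself says which is right.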

The ambiguity should therefore be resolved in the transfer module. It would help if such a word had only one analysis (plural-ergative), so that no choice is made until the word reaches transfer. The drawback is that the morphological analysis would then no longer be fully accurate.

Solution still pending.


Tagger categories

The way Basque is analysed in our system causes some problems for the tagger. Postpositions and determiners are attached to the main word with a '+' (<j/> in XML). The tagger treats the resulting lexical forms (LFs) as separate forms unless the tagger file defines a def-mult element covering two or more lexical forms. This is the analysis of 'liburuak':

^liburuak/liburu<n>+a<det><art><pl>/liburu<n>+a<det><art><sg>+k<post>$

Unless a multiple category is defined in the tagger file, the third LF ('k<post>') will always be output, because the first analysis has no third LF to compare it with (2 vs. 3 LFs). The solution is to define a multiple category in the tagger for 'a<det><art><sg>+k<post>', so that it can be compared with 'a<det><art><pl>' in the first analysis:

<def-mult name="DETERG" closed="true">
  <sequence>
    <tags-item tags="det.art.sg"/>
    <tags-item lemma="k" tags="post"/>
  </sequence>
</def-mult>
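
The effect of such a multiple category can be illustrated with a small sketch. It is only a model of the idea, not the tagger's actual algorithm: it assumes that a sequence matches adjacent lexical forms with exactly the listed tags, and the names matches, group and DETERG (as a Python list) are hypothetical.

```python
from typing import List, Optional, Tuple

LexicalForm = Tuple[str, Tuple[str, ...]]        # (lemma, tags)
Pattern = Tuple[Optional[str], Tuple[str, ...]]  # (lemma or None for any, tags)


def matches(lf: LexicalForm, pattern: Pattern) -> bool:
    lemma, tags = pattern
    return (lemma is None or lf[0] == lemma) and lf[1] == tags


def group(analysis: List[LexicalForm], sequence: List[Pattern]) -> List[List[LexicalForm]]:
    """Merge adjacent lexical forms that match the sequence into one unit."""
    units, i, n = [], 0, len(sequence)
    while i < len(analysis):
        window = analysis[i:i + n]
        if len(window) == n and all(matches(lf, p) for lf, p in zip(window, sequence)):
            units.append(window)          # the matched LFs now count as one unit
            i += n
        else:
            units.append([analysis[i]])   # unmatched LF stays a unit of its own
            i += 1
    return units


# Model of the DETERG category: singular article followed by the lemma 'k' as postposition.
DETERG: List[Pattern] = [(None, ("det", "art", "sg")), ("k", ("post",))]

plural = [("liburu", ("n",)), ("a", ("det", "art", "pl"))]
ergative = [("liburu", ("n",)), ("a", ("det", "art", "sg")), ("k", ("post",))]

if __name__ == "__main__":
    print(len(group(plural, DETERG)), len(group(ergative, DETERG)))
```

After grouping, both analyses of 'liburuak' consist of two units, so the second unit of one can be compared with the second unit of the other.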

Then we found another problem with the tagger. The analysis of 'neska' and other nouns ending in 'a' is:

^neska/neska<n>+a<det><art><sg>/neska<n>$

For the two analyses to be compared as two different LFs in the tagger, we would need to define a def-mult:

<def-mult name="NOMDET" closed="true">
  <sequence>
    <tags-item tags="n"/>
    <tags-item tags="det.art.sg"/>
  </sequence>
</def-mult>

But this conflicts with the multiple category defined above: 'a<det><art><sg>' is joined to the noun to build a multiple unit, and therefore cannot build a unit with 'k<post>' (the DETERG category).

The solution to this conflict was to define a separate category for nouns ending in 'a', and a multiple category only for these nouns when attached to a singular determiner:

<def-label name="NOMA">
  <tags-item lemma="*a" tags="n"/>
  <tags-item lemma="*a" tags="IZE.ARR"/>
</def-label>

<def-mult name="NOMA_DET">
  <sequence>
    <label-item label="NOMA"/>
    <tags-item tags="det.art.sg"/>
  </sequence>
</def-mult>
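
The conflict and its resolution can be sketched as follows. This is a simplified model under stated assumptions, not the tagger's real matching procedure: sequences are tried greedily from left to right, lemma patterns use shell-style wildcards via fnmatch, and only the "n" alternative of the NOMA label is modelled. The function names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

LexicalForm = Tuple[str, Tuple[str, ...]]  # (lemma, tags)
Pattern = Tuple[str, Tuple[str, ...]]      # (lemma wildcard, tags)
Category = Tuple[str, List[Pattern]]       # (category name, sequence)


def matches(lf: LexicalForm, pattern: Pattern) -> bool:
    lemma_pattern, tags = pattern
    return fnmatchcase(lf[0], lemma_pattern) and lf[1] == tags


def group(analysis: List[LexicalForm], categories: List[Category]) -> List[str]:
    """Return one label per unit: a category name, or the tags of a lone LF."""
    units, i = [], 0
    while i < len(analysis):
        for name, seq in categories:
            window = analysis[i:i + len(seq)]
            if len(window) == len(seq) and all(matches(lf, p) for lf, p in zip(window, seq)):
                units.append(name)
                i += len(seq)
                break
        else:
            # No multiple category starts here: the LF stays on its own.
            units.append(".".join(analysis[i][1]))
            i += 1
    return units


NOMDET: Category = ("NOMDET", [("*", ("n",)), ("*", ("det", "art", "sg"))])
DETERG: Category = ("DETERG", [("*", ("det", "art", "sg")), ("k", ("post",))])
NOMA_DET: Category = ("NOMA_DET", [("*a", ("n",)), ("*", ("det", "art", "sg"))])

liburuak_erg = [("liburu", ("n",)), ("a", ("det", "art", "sg")), ("k", ("post",))]
neska_det = [("neska", ("n",)), ("a", ("det", "art", "sg"))]

if __name__ == "__main__":
    print(group(liburuak_erg, [NOMDET, DETERG]))    # article absorbed by the noun
    print(group(liburuak_erg, [NOMA_DET, DETERG]))  # article free to join 'k'
    print(group(neska_det, [NOMA_DET, DETERG]))
```

In this model the general NOMDET category takes the article away from DETERG, whereas restricting the noun pattern to lemmas ending in 'a' leaves 'liburu' untouched and still lets 'neska' form a single unit with its article.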


Positional nouns

Basque has constructions to express positions relative to an object which are based around what we could call positional nouns. For instance the positional noun 'aurre' (front part) is used in 'etxearen aurrean' (in front of the house) or 'etxearen aurretik' (starting at the front of the house). Here is a non-exhaustive list of these positional nouns:

  • aurre (front)
  • atze (back)
  • ondo (side, back)
  • albo (side)
  • azpi (below)
  • gain (on)
  • alde (side)
  • inguru (around)
  • barru (in)
  • pare (front of)

These nouns can take the cases -tik, -ra, -rantz/-runtz, -raino, -an and -ko.

When these nouns appear with one of these postpositions, they have the function of an adverb (aurrean -> in front of) and the preceding noun appears in genitive (etxearen aurrean).

Adverbs used as postpositions

A similar construct takes an NP or a PP (with a particular postposition) followed by a special word which works like an adverb (but cannot be considered a positional noun). The whole construct can work as an adverb, or as an adjective if the -ko form is used:

following an NP in genitive:

  • GEN kontra[ko] (against)
  • GEN aurka[ko] (against)
  • GEN alde (for)
  • GEN arabera (according to)

following an NP in absolutive or other cases:

  • ABS|PART gabe[ko] (without)
  • ABS|ERG| salbu (except)
  • INSTR gain (in addition to)
  • DAT esker (thanks to)
  • ADL (ABS) arte (until)
  • ABS inguru (around)
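
The two lists above amount to a small lookup table from each adverb to the cases it governs. A minimal sketch of such a table follows; the structure and the names Governs, ADVERBIAL_POSTPOSITIONS and licensed are hypothetical, and the entries simply transcribe the lists (the trailing "|" after ERG for salbu is left unfilled, and the parenthesised ABS for arte is recorded as a second option).

```python
from typing import Dict, NamedTuple, Tuple


class Governs(NamedTuple):
    cases: Tuple[str, ...]  # cases accepted on the preceding NP
    takes_ko: bool          # whether the list shows a [ko] form
    gloss: str


ADVERBIAL_POSTPOSITIONS: Dict[str, Governs] = {
    "kontra": Governs(("GEN",), True, "against"),
    "aurka": Governs(("GEN",), True, "against"),
    "alde": Governs(("GEN",), False, "for"),
    "arabera": Governs(("GEN",), False, "according to"),
    "gabe": Governs(("ABS", "PART"), True, "without"),
    # The source lists 'ABS|ERG|' with a trailing bar; no further case is assumed here.
    "salbu": Governs(("ABS", "ERG"), False, "except"),
    "gain": Governs(("INSTR",), False, "in addition to"),
    "esker": Governs(("DAT",), False, "thanks to"),
    # ABS appears in parentheses in the source list.
    "arte": Governs(("ADL", "ABS"), False, "until"),
    "inguru": Governs(("ABS",), False, "around"),
}


def licensed(word: str, case: str) -> bool:
    """True when the table allows `word` after an NP in `case`."""
    entry = ADVERBIAL_POSTPOSITIONS.get(word)
    return entry is not None and case in entry.cases


if __name__ == "__main__":
    print(licensed("esker", "DAT"), licensed("esker", "GEN"))
```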

Questions about word detection and LRLM

In Romance languages, the left-to-right-longest-match (LRLM) system is a good solution for multiword detection. For example, if you define the Spanish multiword de nuevo as an adverb, the morphological dictionary will deliver this analysis and not the word-by-word analysis 'de<pr> nuevo<adj>'. Other examples are hoy en día and con las manos en la masa, where there is no conflicting analysis, since the longest match from left to right is the only analysis delivered.

The way Basque is analysed in our system causes some problems in this respect. Since lexical forms are agglutinated to form a 'word', and the analyses of the lexical forms that make up a word are joined with '+' in the stream, the LRLM system is not applied in this case.

For example: berri means new (adjective and noun). Berriz is an adverb meaning again or on the other hand, which can be split into berri<adj> and z<post>. According to the method of analysis described above for Romance languages, the analysis berriz(adv) should override the analysis berri(adj) + z(post). But this does not happen in this system: all possible analyses are delivered to the tagger, which has to decide which one is better (and in most cases does not have enough information to do so):

berriz = ^berriz/berriz<adv>/berri<n>+z<post>/berri<adj><izo>+z<post>$

Other examples of conflicting analyses are:

joan = ^joan/joan<vblex><inf>/jo<vblex><pp>+a<det><art><sg>+an<post>/joan<vblex><pp>$

aterako = ^aterako/atera<vblex><pfut>/ate<n>+a<det><art><sg>+ra<post><ko>$

erakutsiko = ^erakutsiko/erakutsi<vblex><pfut>/erakutsi<vblex><pp>+a<det><art><sg>+ko<post>/erakutsi<n>+a<det><art><sg>+ko<post>$

ohiko = ^ohiko/ohiko<adj><izl>/ohi<adj><izo>+a<det><art><sg>+ko<post>$

[Does this one fail? Wouldn't it always choose the first form?]

hartzen = ^hartzen/hartu<vblex><ger>/hartz<n>+en<post>/hartz<n>+a<det><art><pl>+en<post>$

[Ugh!]

neurrizko = ^neurrizko/neurri<n>+z<post><ko>/neurrizko<adj><izl>$

[One idea to avoid some of these is to drop some "cases" from the "declension" of participles. For instance, the analysis joan = jo + a + an is uncommon. Ate joan (on the hit door) would most likely be said jotako atean. The same goes for etxe erakutsiko, which would more likely be rendered as erakutsitako etxeko. We seem to have an overgenerating dictionary, which is quite nice when translating into Basque but contains many low-frequency entries which are useless when analysing Basque.]

When no multiple category is defined in the tagger file, the tagger simply chooses the longest analysis (the one with the most lexical forms), and sometimes it combines the first LF of one analysis with the second and third LFs of another.

Wrong translations due to this problem:

    -hartzen dut:
     tagger = 
    ^hartz<n>+a<det><art><pl>+en<post>$ ^ukan<vbsint><pri><NR_HU><NK_NI>$
     output = de los osos tengo ("of the bears I have") (vs. yo cojo, "I take", from the verb coger)

Proposal: I think a mechanism similar to LRLM could be useful here: reading from left to right, the analysis with the fewest lexical forms should be selected. For example, given the three analyses hartu<vblex><ger>; hartz<n>+en<post>; hartz<n>+a<det><art><pl>+en<post>, the first one should be chosen, as it would be on a left-to-right-longest-match basis. Could this be done at the morphological analysis level, or later in the tagger?

I am not sure whether this could add problems in other cases. In cases such as:

liburuak = ^liburuak/liburu<n>+a<det><art><pl>/liburu<n>+a<det><art><sg>+k<post>$

both analyses should be kept. In this case, a def-mult is defined in the tagger for 'a<art>+k<post>' so that it can be matched and compared with 'a<art>'. Here no analysis consists of a single lexical form, which suggests a possible condition: apply the LRLM-like choice only when one of the analyses of a surface form (SF) is a single LF.
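
The proposal and its condition can be sketched in a few lines. This is an illustration of the idea only, written as a filter over the analyses of one lexical unit; where it would actually live (analyser or tagger) is the open question above, and the names split_unit and prefer_single_lf are hypothetical.

```python
from typing import List


def split_unit(unit: str) -> List[List[str]]:
    """Return the analyses of '^surface/a1/a2$', each as a list of lexical forms."""
    body = unit.strip()[1:-1]          # drop '^' and '$'
    analyses = body.split("/")[1:]     # the first field is the surface form
    return [a.split("+") for a in analyses]


def prefer_single_lf(analyses: List[List[str]]) -> List[List[str]]:
    """Keep only single-LF analyses when there are any; otherwise keep everything."""
    single = [a for a in analyses if len(a) == 1]
    return single if single else analyses


if __name__ == "__main__":
    hartzen = "^hartzen/hartu<vblex><ger>/hartz<n>+en<post>/hartz<n>+a<det><art><pl>+en<post>$"
    liburuak = "^liburuak/liburu<n>+a<det><art><pl>/liburu<n>+a<det><art><sg>+k<post>$"
    joan = "^joan/joan<vblex><inf>/jo<vblex><pp>+a<det><art><sg>+an<post>/joan<vblex><pp>$"
    for unit in (hartzen, liburuak, joan):
        print(prefer_single_lf(split_unit(unit)))
```

Under this condition hartzen is reduced to the verb reading, liburuak keeps both analyses, and joan still leaves two single-LF readings for the tagger to choose between.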

Any ideas?

[The problem seems hard to solve; perhaps the tagger could be instructed to preserve the chosen path instead of reassembling LFs from different paths into an impossible analysis.]
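
A check for "preserving the chosen path" could look like the sketch below. It is not an existing tagger feature: source_analyses and follows_one_path are hypothetical helpers, and the mixed output used in the example is constructed from the description in the Tagger categories section (the plural article followed by a stray 'k<post>'), not taken from a real run.

```python
from typing import List, Set


def split_unit(unit: str) -> List[str]:
    """Return the analyses of '^surface/a1/a2$' as strings."""
    return unit.strip()[1:-1].split("/")[1:]


def source_analyses(chosen: str, unit: str) -> List[Set[int]]:
    """For each LF of the tagger output, list which analyses have it at that position."""
    analyses = [a.split("+") for a in split_unit(unit)]
    return [
        {i for i, a in enumerate(analyses) if pos < len(a) and a[pos] == lf}
        for pos, lf in enumerate(chosen.split("+"))
    ]


def follows_one_path(chosen: str, unit: str) -> bool:
    """True when the tagger output is exactly one of the delivered analyses."""
    return chosen in split_unit(unit)


if __name__ == "__main__":
    unit = "^liburuak/liburu<n>+a<det><art><pl>/liburu<n>+a<det><art><sg>+k<post>$"
    mixed = "liburu<n>+a<det><art><pl>+k<post>"  # hypothetical reassembled output
    print(source_analyses(mixed, unit), follows_one_path(mixed, unit))
```

An output whose lexical forms cannot all be traced to the same analysis index is the kind of impossible analysis the note above wants to rule out.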

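See also

  • Some tools used (Basque to Spanish Tools)
  • /informe 2008
  • Euskaltzaindiaren arauak (http://www.euskaltzaindia.net/arauak) contains lists of Basque placenames and person names which could be added to dictionaries relatively easily.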

Further reading

  • Ginestí-Rosell, M. and Ramírez-Sánchez, G. and Ortiz-Rojas, S. and Tyers, F. M. and Forcada, M. L. (2009) "Development of a free Basque to Spanish machine translation system". Procesamiento de Lenguaje Natural No. 43, pp. 185--197