Basque and Spanish


The idea

Mireia Ginestí is recycling Matxin data to build an Apertium-based system that would allow Spanish speakers to read Basque newspapers.

Some of the morphological choices in Matxin will be revised.

This document keeps track of decisions and raises open questions.

Deklinabidea?

For instance, "declension" will be treated as postpositions:

gizonentzat : gizon.n + a.det.pl + tzat.post

In principle, the absolutive will not be marked:

gizonak : gizon.n + a.det.pl

Determiners and postpositions will be given mnemonic lemmas, one per case.

gizonei : gizon.n + a.det.pl + i.post


Mirenekin : Miren.NP + kin.post
katuarentzat : katu.n + a.det.sg + tzat.post

Postpositions that can modify a noun phrase will be marked explicitly with the tag ko:

etxeetako : etxe.n + a.det.pl + ko.post.ko
Mikelekin : Mikel.NP + kin.post
Mikelekiko : Mikel.NP + kin.post.ko
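
The dotted notation used in this section maps mechanically onto the angle-bracket tags used in the analyses later in this document. The following is a minimal sketch in Python and not part of the project: the function name to_stream is hypothetical, and it emits the tags exactly as written, so it does not add tags such as art that appear in the fuller analyses.

```python
def to_stream(analysis: str) -> str:
    """Convert 'gizon.n + a.det.pl' into 'gizon<n>+a<det><pl>'.

    Assumes lemmas contain no '.' or '+' characters (true for the
    examples in this section, not necessarily in general).
    """
    parts = []
    for morpheme in analysis.split("+"):
        # First dotted field is the lemma, the rest are tags.
        lemma, *tags = morpheme.strip().split(".")
        parts.append(lemma + "".join(f"<{tag}>" for tag in tags))
    return "+".join(parts)


print(to_stream("gizon.n + a.det.pl + tzat.post"))
print(to_stream("Mikel.NP + kin.post.ko"))
```

The output keeps one '+'-joined unit per morpheme, which is the shape the tagger sections below work with.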

Possessives?

A problem appears with "possessives" like 'nire', 'gure', 'zuen', 'haien', 'bere'. Should they be treated as preadjectives ('izenlagun') or as genitive constructs such as the following?

nire : ni.pron.sg + ren.post.ko
haien : hura.pron.pl + ren.post.ko

Undefined determiners (or quantifiers)

There are some words in Basque that could be considered either adjectives or quantifiers (asko, gehiegi, nahiko, etc.).

Like determiners and unlike adjectives, they can signal the end of a noun phrase (NP). This is one reason why they should not be tagged as adjectives.

They can also be followed by another determiner ('etxe askoa').

Matxin dictionaries tag them as undefined determiners. We decided to tag them the same way, with a distinction for those that usually come before the noun, as 'izenlagun' adjectives do (for example, 'nahiko').

Distinction plural determiner/ergative

The plural article and the ergative postposition share the same surface form, so the analysis is ambiguous:

^liburuak/liburu<n>+a<det><art><pl>/liburu<n>+a<det><art><sg>+k<post>$

The tagger will not, in principle, be able to resolve this ambiguity, as it depends on long-distance information (whether the verb is NOR-NORK/NOR-NORI-NORK or NOR/NOR-NORI).

The solution should therefore take place in transfer. It would be useful for the word above to have only one analysis (plural-ergative), so that no choice is made until the word reaches the transfer module. However, the morphological analysis would then not be fully adequate.

Solution still pending.


Tagger categories

The way Basque is analysed in our system causes some problems for the tagger. Postpositions and determiners are attached to the main word with a '+' (<j/> in XML). The tagger treats the resulting lexical forms (LFs) as separate forms unless a def-mult element covering two or more lexical forms is defined in the tagger file. This is the analysis of 'liburuak':

^liburuak/liburu<n>+a<det><art><pl>/liburu<n>+a<det><art><sg>+k<post>$

Unless a multiple category is defined in the tagger file, the third LF ('k<post>') will always be output, as it has no other LF to be compared with (2 vs. 3 LFs). The solution is to define a multiple category in the tagger for 'a<det><art><sg>+k<post>' so that it can be compared with 'a<det><art><pl>' in the first analysis:

<def-mult name="DETERG" closed="true">
  <sequence>
    <tags-item tags="det.art.sg"/>
    <tags-item lemma="k" tags="post"/>
  </sequence>
</def-mult>
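
The 2 vs. 3 mismatch described above can be made concrete by splitting the analysis of 'liburuak' into its lexical forms. This is only an illustrative sketch in Python: the helper names parse_lexical_unit and split_analysis are hypothetical, and the parsing ignores escaped characters, which real stream-format input may contain.

```python
import re


def split_analysis(analysis: str):
    """Split one analysis on '+' into (lemma, tags) pairs."""
    forms = []
    for part in analysis.split("+"):
        lemma = part.split("<", 1)[0]
        tags = re.findall(r"<([^>]+)>", part)
        forms.append((lemma, tags))
    return forms


def parse_lexical_unit(unit: str):
    """Split '^surface/analysis1/analysis2$' into surface form and analyses.

    Simplification: escaped '/', '+', '^' or '$' are not handled.
    """
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: " + unit)
    surface, *analyses = body[1:-1].split("/")
    return surface, [split_analysis(a) for a in analyses]


surface, analyses = parse_lexical_unit(
    "^liburuak/liburu<n>+a<det><art><pl>/liburu<n>+a<det><art><sg>+k<post>$"
)
counts = [len(a) for a in analyses]
print(surface, counts)  # liburuak [2, 3]
```

The first analysis has two lexical forms and the second has three, which is why the trailing 'k<post>' has nothing to be compared with unless it is grouped with the preceding determiner.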

Then we found another problem with the tagger. The analysis of 'neska' and other nouns ending in 'a' is:

^neska/neska<n>+a<det><art><sg>/neska<n>$

In order for the two analyses to be compared as two different LFs in the tagger, we would need to define a def-mult:

<def-mult name="NOMDET" closed="true">
  <sequence>
    <tags-item tags="n"/>
    <tags-item tags="det.art.sg"/>
  </sequence>
</def-mult>

But this conflicts with the multiple category defined above, as 'a<det><art><sg>' will be joined to the noun to build a multiple unit and therefore cannot build a unit with 'k<post>' (DETERG category).

The solution to this conflict was to define a different category for nouns ending with 'a', and a multiple category only for these nouns attached to a singular determiner:

<def-label name="NOMA">
  <tags-item lemma="*a" tags="n"/>
  <tags-item lemma="*a" tags="IZE.ARR"/>
</def-label>

<def-mult name="NOMA_DET">
  <sequence>
    <label-item label="NOMA"/>
    <tags-item tags="det.art.sg"/>
  </sequence>
</def-mult>
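
To illustrate what the NOMA and NOMA_DET definitions are intended to select, here is a toy check in Python. It is not the tagger's actual matching algorithm: the function names are hypothetical, analyses are given as hand-written (lemma, tags) pairs, and it assumes that lemma="*a" behaves like a glob pattern and that the tags attribute must match the tag sequence exactly.

```python
from fnmatch import fnmatchcase


def is_noma(lemma, tags):
    """True if a lexical form would fall under the NOMA label (assumed semantics)."""
    return fnmatchcase(lemma, "*a") and tags in (["n"], ["IZE", "ARR"])


def is_noma_det(analysis):
    """True if an analysis is a NOMA noun followed by a singular article."""
    if len(analysis) != 2:
        return False
    (noun_lemma, noun_tags), (_, det_tags) = analysis
    return is_noma(noun_lemma, noun_tags) and det_tags == ["det", "art", "sg"]


# 'neska' ends in 'a', so noun + singular article forms a NOMA_DET unit.
neska_det = [("neska", ["n"]), ("a", ["det", "art", "sg"])]
# 'liburu' does not, so its article stays free to join 'k<post>' (DETERG).
liburu_det = [("liburu", ["n"]), ("a", ["det", "art", "sg"])]

print(is_noma_det(neska_det))   # True
print(is_noma_det(liburu_det))  # False
```

Under these assumptions, only nouns ending in 'a' absorb the singular determiner, so the DETERG category remains available for other nouns such as 'liburu'.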


Positional nouns

Basque has constructions to express positions relative to an object which are based around what we could call positional nouns. For instance the positional noun 'aurre' (front part) is used in 'etxearen aurrean' (in front of the house) or 'etxearen aurretik' (starting at the front of the house). Here is a non-exhaustive list of these positional nouns:

  • aurre (front)
  • atze (back)
  • ondo (side, back)
  • albo (side)
  • azpi (below)
  • gain (on)
  • ...

These nouns can take the cases -ra, -rantz/-runtz, -raino, ...

Adverbs used as postpositions

A similar construct takes a NP or a PP (with a particular postposition) and a special word which works like an adverb (but can't be considered a positional noun). The whole construct can work as an adverb or as an adjective, if the -ko form is used:

  • GEN 'kontra[ko]', GEN 'aurka' (against)
  • GEN 'alde' (for)
  • GEN 'arabera' (according to)
  • DAT 'esker' (thanks to)
  • ABS|PART 'gabe' (without)
  • ABS|ERG|... '
  • INSTR 'gain' (in addition to)

See also