Difference between revisions of "Basque and Spanish"
Revision as of 08:26, 26 July 2007
The idea
Mireia Ginestí is recycling Matxin data to build an Apertium-based system that would allow Spanish speakers to read Basque newspapers.
Some of the morphological choices in Matxin will be revised.
This document keeps track of decisions and raises open questions.
Deklinabidea?
For instance, "declension" (case marking) will be treated as postpositions:
gizonentzat : gizon.n + a.det.pl + tzat.post
In principle, the absolutive will not be marked:
gizonak : gizon.n + a.det.pl
Determiners and postpositions will be given mnemonic lemmas, one per case.
gizonei : gizon.n + a.det.pl + i.post
Mirenekin : Miren.NP + kin.post
katuarentzat : katu.n + a.det.sg + tzat.post
Postpositions which can modify a noun phrase will be marked explicitly as ko:
etxeetako : etxe.n + a.det.pl + ko.post.ko
Mikelekin : Mikel.NP + kin.post
Mikelekiko : Mikel.NP + kin.post.ko
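The dotted notation used in the examples above maps mechanically onto the lemma-plus-tags form used later on this page. The following is a minimal sketch of that mapping, assuming every segment is written as lemma.tag.tag and segments are separated by " + ". The function names (parse_segment, parse_analysis, to_stream) are hypothetical helpers invented for illustration, not part of Apertium or Matxin.

```python
def parse_segment(segment):
    """Split a segment such as 'a.det.pl' into ('a', ['det', 'pl'])."""
    lemma, *tags = segment.strip().split(".")
    return lemma, tags


def parse_analysis(line):
    """Parse 'surface : seg + seg + ...' into (surface, [(lemma, tags), ...])."""
    surface, _, analysis = line.partition(":")
    segments = [parse_segment(s) for s in analysis.split("+")]
    return surface.strip(), segments


def to_stream(segments):
    """Render segments as lemma<tag><tag>+lemma<tag>..., joined with '+'."""
    return "+".join(
        lemma + "".join(f"<{tag}>" for tag in tags) for lemma, tags in segments
    )


surface, segments = parse_analysis("etxeetako: etxe.n + a.det.pl + ko.post.ko")
print(surface, "->", to_stream(segments))
```

The tag names come straight from the notation on this page; the sketch does not add tags such as <art> that appear in the actual analyser output.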
Possessives?
A problem arises with "possessives" such as 'nire', 'gure', 'zuen', 'haien', 'bere'. Should they be treated as preadjectives ('izenlagun') or as genitive constructs, as follows?
nire : ni.pron.sg + ren.post.ko
haien : hura.pron.pl + ren.post.ko
Tagger categories
The way Basque is analysed in our system causes some problems for the tagger. Postpositions and determiners are attached to the main word with a + (<j/> in XML). The tagger sees the resulting lexical forms (LFs) as separate forms unless a def-mult element covering 2 or more lexical forms is defined in the tagger.
Analysis of 'liburuak':
^liburuak/liburu<n>+a<det><art><pl>/liburu<n>+a<det><art><sg>+k<post>$
Unless a multiple category is defined in the tagger file, the third LF of the second analysis ('k<post>') will always be output, because it has no counterpart LF to be compared with (the first analysis has 2 LFs, the second has 3).
The solution is to define a multiple category in the tagger for 'a<det> + k<post>' so that it can be compared to 'a<det>' of the first analysis:
<def-mult name="DETERG" closed="true">
  <sequence>
    <tags-item tags="det.art.sg"/>
    <tags-item lemma="k" tags="post"/>
  </sequence>
</def-mult>
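The 2 vs 3 LF mismatch for 'liburuak' can be made concrete by splitting the analyser output into analyses and lexical forms. This is a minimal sketch, assuming the simple case where neither lemmas nor tags contain '/', '+' or escaped characters. parse_unit is a hypothetical helper, not an Apertium API.

```python
import re


def parse_unit(unit):
    """Parse '^surface/analysis/...$' into (surface, analyses).

    Each analysis is a list of lexical forms (lemma, tags), split on '+'.
    Assumption: no escaped or literal '/', '+', '<' inside lemmas.
    """
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("expected a unit of the form ^...$")
    surface, *raw_analyses = body[1:-1].split("/")
    analyses = []
    for raw in raw_analyses:
        lfs = []
        for lf in raw.split("+"):
            lemma = lf.split("<", 1)[0]
            tags = tuple(re.findall(r"<([^>]+)>", lf))
            lfs.append((lemma, tags))
        analyses.append(lfs)
    return surface, analyses


unit = "^liburuak/liburu<n>+a<det><art><pl>/liburu<n>+a<det><art><sg>+k<post>$"
surface, analyses = parse_unit(unit)
print([len(a) for a in analyses])  # [2, 3]
```

The two analyses have different numbers of lexical forms, which is the situation the DETERG category is meant to resolve.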
Then we found another problem with the tagger. The analysis of 'neska' and other nouns ending in 'a' is:
^neska/neska<n>+a<det><art><sg>/neska<n>$
In order for the two analyses to be compared as two different LFs in the tagger, we should define a def-mult:
<def-mult name="NOMDET" closed="true">
  <sequence>
    <tags-item tags="n"/>
    <tags-item tags="det.art.sg"/>
  </sequence>
</def-mult>
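The effect of the two multiple categories can be illustrated with a rough model that merges adjacent lexical forms matching a def-mult sequence into a single unit. This is only a sketch of the idea, not how apertium-tagger is implemented. It assumes exact tag matching (no wildcards) and greedy left-to-right merging, and item_matches and apply_def_mult are hypothetical names.

```python
def item_matches(item, lf):
    """item is (lemma or None, tags); lf is (lemma, tags). Exact match only."""
    want_lemma, want_tags = item
    lemma, tags = lf
    return (want_lemma is None or want_lemma == lemma) and want_tags == tags


def apply_def_mult(sequence, lfs):
    """Merge each run of adjacent LFs matching `sequence` into one unit."""
    units, i, n = [], 0, len(sequence)
    while i < len(lfs):
        window = lfs[i:i + n]
        if len(window) == n and all(
            item_matches(item, lf) for item, lf in zip(sequence, window)
        ):
            units.append(tuple(window))  # matched run becomes one unit
            i += n
        else:
            units.append((lfs[i],))      # unmatched LF stays on its own
            i += 1
    return units


# The two sequences defined on this page (None means "any lemma").
DETERG = [(None, ("det", "art", "sg")), ("k", ("post",))]
NOMDET = [(None, ("n",)), (None, ("det", "art", "sg"))]

liburuak = [
    [("liburu", ("n",)), ("a", ("det", "art", "pl"))],
    [("liburu", ("n",)), ("a", ("det", "art", "sg")), ("k", ("post",))],
]
neska = [
    [("neska", ("n",)), ("a", ("det", "art", "sg"))],
    [("neska", ("n",))],
]

print([len(apply_def_mult(DETERG, a)) for a in liburuak])  # [2, 2]
print([len(apply_def_mult(NOMDET, a)) for a in neska])     # [1, 1]
```

Under these assumptions, both analyses of each word end up with the same number of units, so they can be compared position by position.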