Difference between revisions of "User:Xavivars/Overlapping language variants"

Latest revision as of 11:22, 24 February 2017

Problem

Language variants are a great feature of Apertium. They allow you to create slightly different language pairs in which the core of the translation is the same.

However, the way variants are implemented is not well suited to overlapping variants, that is, cases where a language has more than two variants that are not completely disjoint.

Let's take Catalan as an example:

  • Currently, we have the base Catalan translation (cat), plus a Valencian variant (cat_valencia) based on the Universities language model.
  • There used to exist an "economic focused" variant (cat_eco).
  • Work is being done by Joan Moratinos to add support for a Balearic variant (cat_balear, tentative name).

Let's imagine that we want to add another Valencian variant, based on the Generalitat (Valencian Government) or the AVL (Valencian Language Academy). I'll use the names cat_gva or cat_avl here to support the example.

Let's also define a set of "language features" whose options do not overlap:

  • Verbs
    • CAT verb terminations (penso)
    • VAL verb terminations (pense)
    • BAL verb terminations (pens)
  • Lexical choice
    • CAT lexical (sortir)
    • VAL lexical (eixir)
  • Accents
    • CAT accents (interès)
    • VAL accents (interés)
  • Demonstratives
    • Weak (aquest)
    • Strong (este)
  • Possessives
    • U (meua, teua)
    • V (meva, teva)
  • Economics lexicon
    • Generic
    • Economics focused

The problem lies in how the different variants use each of the language features. As an example, the following chart shows only two of the features: verbs and accents.

[Figure: Apertium-verbs.png]

For this example, if we just keep the current approach to variants, we would need four variants to be present in the dictionaries. All four variants would need to define their own version of each "feature", even if some of them are identical:

  • cat_balear would need to have all the same verb entries as cat, and the same would happen for cat_valencia and cat_gva.
  • In the case of accents, cat, cat_balear and cat_valencia share the same accents, so the same entries would be created three times in the dictionary, while a different entry would be added to support cat_gva.
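The duplication described above can be made concrete with a small sketch. It assumes only what the list says (which variants share verb entries and which share accents); the option labels such as "verbs-a" are placeholders invented for this illustration, not names used in any dictionary.

```python
from collections import defaultdict

# Which option each variant uses for each feature, following the list above.
# The option labels ("verbs-a", "accents-b", ...) are hypothetical placeholders.
USAGE = {
    "cat":          {"verbs": "verbs-a", "accents": "accents-a"},
    "cat_balear":   {"verbs": "verbs-a", "accents": "accents-a"},
    "cat_valencia": {"verbs": "verbs-b", "accents": "accents-a"},
    "cat_gva":      {"verbs": "verbs-b", "accents": "accents-b"},
}


def copies_per_feature(usage):
    """Return {feature: (copies when every variant has its own, distinct options)}."""
    options = defaultdict(set)
    for features in usage.values():
        for feature, option in features.items():
            options[feature].add(option)
    return {feature: (len(usage), len(opts)) for feature, opts in options.items()}


COUNTS = copies_per_feature(USAGE)
print(COUNTS)
```

With one copy per variant, each feature is written four times, although only two distinct versions of each actually exist.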


Proposed approaches

Metadix and variant definition

This is a very naive approach, but it could work without requiring many changes to Apertium as a framework.

The idea is to create a new "metadix" for the language (and for the language pairs that use it) so that a proper dix is generated when the pair is built. In the metadix, entries are tagged with "language features" instead of the current variants, and each variant is defined in a "config" file as a set of language features.

So, as an example, instead of

 <e v="cat"><p><l>la<b/>meva</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
 <e v="val"><p><l>la<b/>meua</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
 ...
 <e>       <p><l>ès</l>        <r>ès<s n="n"/><s n="m"/><s n="sg"/></r></p></e>
 <e r="LR"><p><l>és</l>        <r>ès<s n="n"/><s n="m"/><s n="sg"/></r></p></e>

we will have something closer to


 ...
 <e lf="possessive-v"><p><l>la<b/>meva</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
 <e lf="possessive-u"><p><l>la<b/>meua</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
 ...
 <e lf="accents-cat"><p><l>ès</l><r>ès<s n="n"/><s n="m"/><s n="sg"/></r></p></e>
 <e lf="accents-val"><p><l>és</l><r>ès<s n="n"/><s n="m"/><s n="sg"/></r></p></e>
 ...	

and a config file like this one (expressed in JSON, but it could be any format):

{
  "groups": {
     "possessive": ["possessive-v", "possessive-u"],
     "accents":    ["accents-val", "accents-cat"],
     ...
  },
  "variants": {
     "cat":          ["possessive-v", "accents-cat", "verbs-cat"],
     "cat_valencia": ["possessive-v", "accents-cat", "verbs-val"],
     "cat_balear":   ["possessive-v", "accents-cat", "verbs-bal"],
     "cat_gva":      ["possessive-u", "accents-val", "verbs-val"]
  }
}
Seems like a good solution! If you make the config XML too, I think it'd be quite possible to do using xsltproc, as with the other metadix features. --unhammer (talk) 20:16, 19 February 2017 (CET)
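
As a rough illustration of the build step, the sketch below filters a metadix section by the features a variant enables. It is only a sketch under assumptions: the lf attribute and the config layout follow the examples above, while the function names, the tiny inline metadix and the untagged entry for "i" are invented here. A real implementation could equally be done with xsltproc, as suggested in the comment above.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical config, reduced to two variants and two feature groups.
CONFIG = json.loads("""
{
  "variants": {
    "cat":     ["possessive-v", "accents-cat"],
    "cat_gva": ["possessive-u", "accents-val"]
  }
}
""")

# Hypothetical metadix fragment: entries tagged with lf plus one untagged entry.
METADIX = """
<section id="main" type="standard">
  <e lf="possessive-v"><p><l>la<b/>meva</l><r>el<b/>meu<s n="det"/></r></p></e>
  <e lf="possessive-u"><p><l>la<b/>meua</l><r>el<b/>meu<s n="det"/></r></p></e>
  <e><p><l>i</l><r>i<s n="cnjcoo"/></r></p></e>
</section>
"""


def build_section(metadix_xml, config, variant):
    """Keep untagged entries and entries whose lf feature the variant enables."""
    enabled = set(config["variants"][variant])
    root = ET.fromstring(metadix_xml.strip())
    for entry in list(root.findall("e")):
        feature = entry.get("lf")
        if feature is None:
            continue  # shared entry, present in every variant
        if feature in enabled:
            del entry.attrib["lf"]  # lf exists only in the metadix
        else:
            root.remove(entry)
    return root


def surface(left):
    """Render an <l> element as text, turning <b/> into a space."""
    text = left.text or ""
    for child in left:
        if child.tag == "b":
            text += " "
        text += child.tail or ""
    return text


def left_sides(section):
    return [surface(e.find("p/l")) for e in section.findall("e")]


CAT = left_sides(build_section(METADIX, CONFIG, "cat"))
GVA = left_sides(build_section(METADIX, CONFIG, "cat_gva"))
print(CAT)
print(GVA)
```

The "groups" part of the config is not needed for filtering; it could be used to check that each variant picks exactly one option per group.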

Id-based variants

The idea is to mark entries that are variant-specific alternatives to a "general" form with a unique ID, as follows.

In the monolingual dictionary we have:

<e eid="la-meva-det"><p><l>la<b/>meva</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
<e eid="la-meua-det"><p><l>la<b/>meua</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>

 ...
<e eid="cafè-n"><p><l>è</l><r>è<s n="n"/><s n="m"/><s n="sg"/></r></p></e>
 ...
<e><i>poal</i><par n="abric__n"/></e>
<e><i>galleda</i><par n="abella__n"/></e>


In a separate file, we state explicitly, for each variant, what to do with the alternatives marked with an ID, using sections like:

<?xml version="1.0"?>
<variant name="valencia_uni">
  <replace>
    <e eid="la-meva-det"/>
    <e eid="la-meua-det"/>
  </replace>
  <replace>
    <e eid="cafè-n"/>
    <e eid="café-n"/>
  </replace>
</variant>

In this file, we maintain all the info related to variants in the monolingual package.
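
The page does not spell out the exact semantics of <replace>, so the following is only a sketch under an explicit assumption: each <replace> lists the default entry first and the variant's alternative second, and building the variant drops the default while building the base dictionary drops the alternative. It also assumes the variant file is well-formed XML with self-closing <e/> references. The function names and the inline data are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical monolingual fragment with two ID-marked alternatives.
DIX = """
<section id="main" type="standard">
  <e eid="la-meva-det"><p><l>la<b/>meva</l><r>el<b/>meu<s n="det"/></r></p></e>
  <e eid="la-meua-det"><p><l>la<b/>meua</l><r>el<b/>meu<s n="det"/></r></p></e>
  <e><p><l>i</l><r>i<s n="cnjcoo"/></r></p></e>
</section>
"""

# Variant file, assumed to be well-formed XML.
VARIANT = """
<variant name="valencia_uni">
  <replace>
    <e eid="la-meva-det"/>
    <e eid="la-meua-det"/>
  </replace>
</variant>
"""


def replaced_pairs(variant_xml):
    """Return (default_eid, alternative_eid) pairs from the <replace> elements."""
    root = ET.fromstring(variant_xml.strip())
    pairs = []
    for replace in root.findall("replace"):
        default, alternative = [e.get("eid") for e in replace.findall("e")]
        pairs.append((default, alternative))
    return pairs


def build(dix_xml, variant_xml, use_variant):
    """Assumed semantics: drop defaults for the variant, alternatives otherwise."""
    pairs = replaced_pairs(variant_xml)
    drop = {default if use_variant else alternative for default, alternative in pairs}
    root = ET.fromstring(dix_xml.strip())
    for entry in list(root.findall("e")):
        if entry.get("eid") in drop:
            root.remove(entry)
    return root


def eids(section):
    return [e.get("eid") for e in section.findall("e")]


VARIANT_EIDS = eids(build(DIX, VARIANT, use_variant=True))
BASE_EIDS = eids(build(DIX, VARIANT, use_variant=False))
print(VARIANT_EIDS)
print(BASE_EIDS)
```

Entries without an eid are shared and survive both builds. With several variant files, the base build would have to collect the alternatives from all of them.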


In the bilingual dictionary we have:

...
<e eid="galleda-n"><p><l>cubo<s n="n"/><s n="m"/></l><r>galleda<s n="n"/><s n="f"/></r></p></e>
<e eid="poal-n"><p><l>cubo<s n="n"/></l><r>poal<s n="n"/></r></p></e>
...

In a separate file, used only for bilingual lexical choices, we state explicitly what to do with the alternatives marked with IDs:

<?xml version="1.0"?>
<variant name="valencia_uni">
  <replace>
    <e eid="galleda-n"/>
    <e eid="poal-n"/>
  </replace>
</variant>

In this file, we maintain all the info related to variants in the bilingual package.

Multiple variants per entry

The idea is to extend the v attribute of entries so that it can carry more than one value.


In the pardef section (orthographic changes):

 <e v="cat"><p><l>la<b/>meva</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
 <e v="cat_uni cat_gva"><p><l>la<b/>meua</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
 ...

 <e v="cat cat_uni"><p><l>ès</l><r>ès<s n="n"/><s n="m"/><s n="sg"/></r></p></e>
 <e v="cat_gva"><p><l>és</l><r>ès<s n="n"/><s n="m"/><s n="sg"/></r></p></e>
...

In the main section (lexical changes), I think entries must also be marked, even if the choice is made in the bilingual dix, so that the information is available in a monolingual view. It does no harm.

 <e v="cat_uni cat_gva"><i>poal</i><par n="abric__n"/></e>
 <e v="cat"><i>galleda</i><par n="abella__n"/></e>
...	

In the bilingual dictionary:

<e v="cat_uni cat_gva"><p><l>cubo<s n="n"/></l><r>poal<s n="n"/></r></p></e>
<e v="cat"><p><l>cubo<s n="n"/><s n="m"/></l><r>galleda<s n="n"/><s n="f"/></r></p></e>


As we do now, we would create the appropriate morphological generator for each variant during compilation, in the Makefile:

...
VAR1=# leave blank
VAR2=_valencia-uni
VAR3=_valencia-gva
...
$(LANG1)$(VAR1).autogen.bin: $(BASENAME).$(LANG1).dix .deps/.d
        apertium-validate-dictionary $<
        lt-comp -v cat rl $< $@ $(BASENAME).$(LANG1).acx

$(LANG1)$(VAR2).autogen.bin: $(BASENAME).$(LANG1).dix .deps/.d
        apertium-validate-dictionary $<
        lt-comp -v cat_uni rl $< $@ $(BASENAME).$(LANG1).acx

$(LANG1)$(VAR3).autogen.bin: $(BASENAME).$(LANG1).dix .deps/.d
        apertium-validate-dictionary $<
        lt-comp -v cat_gva rl $< $@ $(BASENAME).$(LANG1).acx

This compiles all variants in the LR direction (analysis) and applies the variant choice in the RL direction (generation).
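
The intended selection rule can be modelled in a few lines. This is a sketch of the proposed behaviour, not lt-comp's actual implementation: it assumes that v holds a whitespace-separated list of variant names, that every entry is kept for analysis (lr), and that for generation (rl) only untagged entries and entries listing the requested variant are kept. The function names and the inline dictionary are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment using a multi-valued v attribute.
DIX = """
<section id="main" type="standard">
  <e v="cat"><p><l>la<b/>meva</l><r>el<b/>meu<s n="det"/></r></p></e>
  <e v="cat_uni cat_gva"><p><l>la<b/>meua</l><r>el<b/>meu<s n="det"/></r></p></e>
  <e><p><l>i</l><r>i<s n="cnjcoo"/></r></p></e>
</section>
"""


def keep_entry(entry, variant, direction):
    """Assumed rule: lr keeps everything, rl keeps only the variant's entries."""
    v = entry.get("v")
    if v is None or direction == "lr":
        return True
    return variant in v.split()  # v is a whitespace-separated list of names


def select(dix_xml, variant, direction):
    """Return the v attribute of each entry that would be compiled."""
    root = ET.fromstring(dix_xml.strip())
    return [e.get("v") for e in root.findall("e") if keep_entry(e, variant, direction)]


ANALYSIS = select(DIX, "cat_gva", "lr")
GENERATION = select(DIX, "cat_gva", "rl")
print(ANALYSIS)
print(GENERATION)
```

Splitting on whitespace, rather than testing for a substring, keeps a name such as cat from matching inside cat_gva.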

Moreover, we propose a change to lt-comp so that variants can be excluded from the morphological analyser. For example, Balearic variants could be excluded from the val_uni or cat morphological analysers to avoid ambiguity.

The bilingual transducer is then also created in the Makefile, so that translation from Spanish to Catalan makes the lexical choice that depends on the language variant.


Benefits: it is simple; all the info stays in the dix files; it is consistent and easy to understand.

Requirements: all entries must be marked explicitly, and no grouping can be done; the v= attribute must be modified to carry more than one value (NMTOKENS), and lt-comp must be modified accordingly.