Difference between revisions of "User:Xavivars/Overlapping language variants"
: Seems like a good solution! If you make the config XML too, I think it'd be quite possible to do using xsltproc, as with the other metadix features. --[[User:Unhammer|unhammer]] ([[User talk:Unhammer|talk]]) 20:16, 19 February 2017 (CET)
Revision as of 19:18, 19 February 2017
Problem
Language variants are a great feature of Apertium. They allow you to create slightly different language pairs that share the same core translations.
However, the way variants are implemented is not well suited to overlapping variants, that is, cases where a language has more than two variants that are not completely disjoint.
Let's take Catalan as an example:
- Currently, we have the base Catalan translation (cat), plus a Valencian variant (cat_valencia) based on the Universities language model.
- There used to be an "economics-focused" variant (cat_eco).
- Joan Moratinos is working on adding support for a Balearic variant (cat_balear, tentative name).
Let's imagine that we want to add another Valencian variant, based on the model of the Generalitat (Valencian Government) or the AVL (Valencian Language Academy). I'll use cat_gva or cat_avl as names here to support the example.
Let's also define a set of non-overlapping "language features":
- Verbs
  - CAT verb terminations (penso)
  - VAL verb terminations (pense)
  - BAL verb terminations (pens)
- Lexical choice
  - CAT lexical (sortir)
  - VAL lexical (eixir)
- Accents
  - CAT accents (interès)
  - VAL accents (interés)
- Demonstratives
  - Weak (aquest)
  - Strong (este)
- Possessives
  - U (meua, teua)
  - V (meva, teva)
- Economics lexicon
  - Generic
  - Economics focused
The problem arises from how the different variants use each of the language features. As an example, the following chart shows only two of the features: verbs and accents.
For this example, if we just keep the current variants approach, we would need four variants to be present in the dictionaries. All four variants would need to define their own version of each "feature", even where some of those versions are identical:
- cat_balear would need to have all the same verb entries as cat, and the same thing would happen for cat_valencia and cat_gva.
- In the case of accents, cat, cat_balear and cat_valencia share the same accents, so they would create the same entries three times in the dictionary, while another, different entry would be added to support cat_gva.
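The duplication described in the two bullets above can be put into numbers with a small sketch. The table of which variant uses which feature value is illustrative only: it is transcribed from those bullets, not read from any real dictionary, and the names `VARIANT_FEATURES` and `entries_per_word` are made up for this example.

```python
# Which value of each feature every variant uses, transcribed from the
# two bullets above (illustrative data, not taken from a real dictionary).
VARIANT_FEATURES = {
    "cat":          {"verbs": "cat", "accents": "cat"},
    "cat_balear":   {"verbs": "cat", "accents": "cat"},
    "cat_valencia": {"verbs": "val", "accents": "cat"},
    "cat_gva":      {"verbs": "val", "accents": "val"},
}


def entries_per_word(variant_features, feature):
    """Count the entries one word affected by `feature` needs.

    Returns (variant_tagged, feature_tagged):
    - variant_tagged: one copy of the entry per variant (current approach)
    - feature_tagged: one copy per distinct feature value
    """
    values = [features[feature] for features in variant_features.values()]
    return len(values), len(set(values))


for feature in ("verbs", "accents"):
    per_variant, per_feature = entries_per_word(VARIANT_FEATURES, feature)
    print(f"{feature}: {per_variant} variant-tagged vs {per_feature} feature-tagged entries")
```

With four variants and two values per feature, every affected word is written four times when tagged by variant but only twice when tagged by feature value, and the gap widens with each variant that is added.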
Proposed approaches
Metadix and variant definition
This is a very naive approach, but it could work without requiring much change to Apertium as a framework.
The idea would be to create a new "metadix" for that language (and for the language pairs that use it), so that a proper dix is generated when building the pair. In the metadix, entries would be tagged with "language features" instead of the current variants, and each variant would be defined in a "config" file as a set of language features.
So, as an example, instead of
<e v="cat"><p><l>la<b/>meva</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
<e v="val"><p><l>la<b/>meua</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
...
<e>       <p><l>ès</l><r>ès<s n="n"/><s n="m"/><s n="sg"/></r></p></e>
<e r="LR"><p><l>és</l><r>ès<s n="n"/><s n="m"/><s n="sg"/></r></p></e>
we will have something closer to
...
<e lf="possessive-v"><p><l>la<b/>meva</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
<e lf="possessive-u"><p><l>la<b/>meua</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
...
<e lf="accents-cat"><p><l>ès</l><r>ès<s n="n"/><s n="m"/><s n="sg"/></r></p></e>
<e lf="accents-val"><p><l>és</l><r>ès<s n="n"/><s n="m"/><s n="sg"/></r></p></e>
...
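To make the build step concrete, here is a minimal sketch of how feature-tagged entries could be turned into a plain dix section for one variant. It is written in Python with the standard library rather than with xsltproc, purely for illustration. The `lf` attribute is the one proposed above and is not an existing Apertium attribute. The function name `select_entries`, the wrapping `<section>` element and the rule that untagged entries are kept for every variant are all assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# The four feature-tagged entries from the example above, wrapped in a
# section element so that the snippet parses on its own.
METADIX_SNIPPET = """<section id="main" type="standard">
<e lf="possessive-v"><p><l>la<b/>meva</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
<e lf="possessive-u"><p><l>la<b/>meua</l><r>el<b/>meu<s n="det"/><s n="pos"/><s n="f"/><s n="sg"/></r></p></e>
<e lf="accents-cat"><p><l>ès</l><r>ès<s n="n"/><s n="m"/><s n="sg"/></r></p></e>
<e lf="accents-val"><p><l>és</l><r>ès<s n="n"/><s n="m"/><s n="sg"/></r></p></e>
</section>"""


def select_entries(section_xml, active_features):
    """Keep only the entries whose `lf` tag is in `active_features`.

    Entries without an `lf` attribute are kept unchanged (assumption:
    untagged entries belong to every variant). The `lf` attribute is
    removed from kept entries so the result looks like a plain dix.
    """
    section = ET.fromstring(section_xml)
    for entry in list(section):  # copy the child list: we remove while looping
        feature = entry.get("lf")
        if feature is None:
            continue
        if feature in active_features:
            del entry.attrib["lf"]
        else:
            section.remove(entry)
    return section


# Hypothetical feature set for one variant.
gva_section = select_entries(METADIX_SNIPPET, {"possessive-u", "accents-val"})
print(ET.tostring(gva_section, encoding="unicode"))
```

Running the same function once per variant, each time with that variant's feature set, would yield one generated dix per variant from a single metadix.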
and a config file like this one (expressed in JSON here, but any format would do)
{
  "groups": {
    "possessive": ["possessive-v", "possessive-u"],
    "accents": ["accents-val", "accents-cat"],
    ...
  },
  "variants": {
    "cat": ["possessive-v", "accents-cat", "verbs-cat"],
    "cat_valencia": ["possessive-v", "accents-cat", "verbs-val"],
    "cat_balear": ["possessive-v", "accents-cat", "verbs-bal"],
    "cat_gva": ["possessive-u", "accents-val", "verbs-val"]
  }
}
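One benefit of declaring groups in the config is that it can be checked mechanically before any dictionary is built. The sketch below assumes the config is written as valid JSON with arrays, and that every variant must pick exactly one feature from every group; neither rule is stated in the proposal, so treat both as assumptions. The `verbs` group is also an assumption, filled in only so that the example is self-contained, and `validate` is a hypothetical helper name.

```python
import json

# The config from above as strict JSON. The "verbs" group is an
# assumption added for this sketch; the original elides the remaining groups.
CONFIG_JSON = """
{
  "groups": {
    "possessive": ["possessive-v", "possessive-u"],
    "accents": ["accents-val", "accents-cat"],
    "verbs": ["verbs-cat", "verbs-val", "verbs-bal"]
  },
  "variants": {
    "cat": ["possessive-v", "accents-cat", "verbs-cat"],
    "cat_valencia": ["possessive-v", "accents-cat", "verbs-val"],
    "cat_balear": ["possessive-v", "accents-cat", "verbs-bal"],
    "cat_gva": ["possessive-u", "accents-val", "verbs-val"]
  }
}
"""


def validate(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    groups = config["groups"]
    known = set().union(*groups.values())
    for variant, features in config["variants"].items():
        chosen = set(features)
        # Assumed rule: exactly one feature per group for every variant.
        for group, members in groups.items():
            picked = sorted(chosen & set(members))
            if len(picked) != 1:
                problems.append(
                    f"{variant}: expected exactly one feature from group "
                    f"{group!r}, got {picked}"
                )
        # Features that no group declares are most likely misspellings.
        for feature in sorted(chosen - known):
            problems.append(f"{variant}: unknown feature {feature!r}")
    return problems


config = json.loads(CONFIG_JSON)
for problem in validate(config):
    print(problem)
```

A check like this catches a misspelled feature name in either the groups or the variants, which would otherwise silently drop entries from the generated dix.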