Dialectal or standard variation

From Apertium
Revision as of 08:35, 20 March 2021 by Unhammer (talk | contribs)

Non-overlapping variants

Some languages have variants that differ in lexis and grammar, yet are best treated as one side of a language pair, either because their orthography and lexis are largely similar, or for historical reasons.

For example:

  • Portuguese, Brazilian Portuguese
  • Occitan, Aranese
  • Serbo-Croatian (Bosnia, Croatia, Serbia)

The variants are so similar that duplicating the work across many separate systems is wasteful. In cases such as these, where there are a few well-defined norms/standards for each "macrolanguage", the common method is to define one mode (pipeline) per language norm, with its own set of morphological generators etc. There is built-in support in lttoolbox for marking entries as only relevant for a certain variant, so one can use the same .dix file to create several compiled FSTs.

For analysers/generators, we can use the alt attribute to say that an entry should only be included when the dictionary is compiled with a matching -a/--alt option to lt-comp. If we want to choose between the letters "a" and "e" when generating, but allow both when analysing, we could do

   <pardef n="a_vs_e">
     <e r="LR"><p><l>e</l><r></r></p></e>
     <e r="LR"><p><l>a</l><r></r></p></e>
     <e r="RL" alt="var1"><p><l>e</l><r></r></p></e>
     <e r="RL" alt="var2"><p><l>a</l><r></r></p></e>
   </pardef>

(and call that paradigm from other entries), and compile the files with

   $ lt-comp --alt=var1 rl foo.dix foo_var1.autogen.bin
   $ lt-comp --alt=var2 rl foo.dix foo_var2.autogen.bin
   $ lt-comp            lr foo.dix foo.automorf.bin
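The effect of these three compilations can be illustrated with a small Python model. This is only a sketch of the filtering described above, not lttoolbox code: the `Entry` class and the `is_compiled` and `compiled_surfaces` helpers are hypothetical names, and the model assumes that an entry carrying an alt attribute is dropped whenever no matching --alt value is given.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    surface: str                 # content of <l>
    r: Optional[str] = None      # "LR" (analysis only) or "RL" (generation only)
    alt: Optional[str] = None    # variant this entry is restricted to


def is_compiled(entry: Entry, direction: str, alt: Optional[str] = None) -> bool:
    """Return True if the entry survives compilation in this simplified model."""
    # A direction restriction must match the direction being compiled.
    if entry.r is not None and entry.r.lower() != direction:
        return False
    # Assumption: alt="x" entries are skipped unless --alt=x is given.
    if entry.alt is not None and entry.alt != alt:
        return False
    return True


# The a_vs_e paradigm from the example above.
A_VS_E = [
    Entry("e", r="LR"),
    Entry("a", r="LR"),
    Entry("e", r="RL", alt="var1"),
    Entry("a", r="RL", alt="var2"),
]


def compiled_surfaces(direction: str, alt: Optional[str] = None) -> List[str]:
    return [e.surface for e in A_VS_E if is_compiled(e, direction, alt)]


print(compiled_surfaces("lr"))          # ['e', 'a']  analyser accepts both
print(compiled_surfaces("rl", "var1"))  # ['e']       generator for var1
print(compiled_surfaces("rl", "var2"))  # ['a']       generator for var2
```

In this model the analyser is shared by all variants, while each generator only contains the spelling of its own variant.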

There is also a v attribute that treats an entry as left-to-right if the -v/--var option to lt-comp is unset.


For bidix, we can use attributes vr and vl to a similar effect, with the lt-comp options -r/--var-right and -l/--var-left, respectively.

The entries below, from swe-nor.dix,

   <e vr="nno"><p><l>hamna</l><r>hamne</r></p><par n="vblex"/></e>
   <e vr="nob"><p><l>hamna</l><r>havne</r></p><par n="vblex"/></e>

mean that both "hamne" and "havne" translate to "hamna" when going right-to-left, while in the left-to-right direction "hamne" is chosen in the FST compiled with -r nno, and "havne" in the FST compiled with -r nob.
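The behaviour of vr can be sketched in a few lines of Python. This is a simplified model rather than the lt-comp implementation: `BidixEntry` and `translations` are hypothetical names, and the model assumes that vr only restricts the left-to-right direction and vl only the right-to-left direction, as the description above suggests.

```python
from typing import List, NamedTuple, Optional


class BidixEntry(NamedTuple):
    left: str
    right: str
    vl: Optional[str] = None   # variant restriction on the left side
    vr: Optional[str] = None   # variant restriction on the right side


# The two swe-nor.dix entries from the example above (paradigms omitted).
SWE_NOR = [
    BidixEntry("hamna", "hamne", vr="nno"),
    BidixEntry("hamna", "havne", vr="nob"),
]


def translations(word: str, direction: str,
                 var_left: Optional[str] = None,
                 var_right: Optional[str] = None) -> List[str]:
    """Look up word in the toy bidix, compiled for one direction and variant."""
    out = []
    for e in SWE_NOR:
        if direction == "lr":
            # Assumption: vr only filters entries in the left-to-right FST.
            if e.vr is not None and e.vr != var_right:
                continue
            if e.left == word:
                out.append(e.right)
        else:
            # Assumption: vl only filters entries in the right-to-left FST.
            if e.vl is not None and e.vl != var_left:
                continue
            if e.right == word:
                out.append(e.left)
    return out


print(translations("hamna", "lr", var_right="nno"))  # ['hamne']
print(translations("hamna", "lr", var_right="nob"))  # ['havne']
print(translations("havne", "rl"))                   # ['hamna']
```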


The options are also documented at Compiling_dictionaries#Compilation_options_and_attributes

Overlapping variants

For some languages, we also have a wide variety of overlapping style preferences, where the above method of one pipeline per "norm" would lead to an explosion of pipelines. Here we need a different method.

For example, in Norwegian Nynorsk all the below are acceptable translations of Bokmål "vi kan søke forskjeller"; the differences are purely stylistic:

  • vi kan søke forskjellar
  • vi kan søka forskjellar
  • vi kan søkje forskjellar
  • vi kan søkja forskjellar
  • vi kan søke skilnader
  • vi kan søka skilnader
  • vi kan søkje skilnader
  • vi kan søkja skilnader
  • me kan søke forskjellar
  • me kan søka forskjellar
  • me kan søkje forskjellar
  • me kan søkja forskjellar
  • me kan søke skilnader
  • me kan søka skilnader
  • me kan søkje skilnader
  • me kan søkja skilnader

(actually "skilnadar" is also a possibility …)
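The list above is simply the product of three independent choices, which a short Python sketch makes explicit. The variable names are illustrative only.

```python
from itertools import product

# Three independent stylistic choices taken from the list above.
PRONOUNS = ["vi", "me"]
NOUNS = ["forskjellar", "skilnader"]
VERBS = ["søke", "søka", "søkje", "søkja"]

variants = [f"{p} kan {v} {n}" for p, n, v in product(PRONOUNS, NOUNS, VERBS)]

print(len(variants))  # 16
for sentence in variants:
    print(sentence)
```

Each additional independent preference multiplies the count, so allowing "skilnadar" as a third noun would already give 2 × 3 × 4 = 24 variants.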

There may be some correlation between people writing "me" and "søkja" vs "vi" and "søke", but in practice there are too many possibilities to create one pipeline per set of preferences. The method we use to solve this in apertium-nno-nob is to

  1. only compile a single pipeline for the Bokmål to Nynorsk direction,
  2. generate "ambiguous" output with all possibilities,
  3. disambiguate between possibilities, picking which to use based on stream variables.

A stream variable here is a little preference cookie inserted at the start of the translation input, understood by cg-proc.

Bidix preferences

Some preference choices are defined in bidix, e.g. those where we choose between synonymous but different lemmas. When translating from right to left, we remove the "LR" restriction on the entries in question, so that both entries apply in that direction:

   <e><p><l>skilnad</l><r>forskjell</r></p></e>
   <e><p><l>forskjell</l><r>forskjell</r></p></e>

We now get ambiguous output from biltrans:

   $ echo forskjell|apertium -f none -d . nob-nno_e-biltrans
   ^forskjell<n><m><sg><ind>/forskjell<n><m><sg><ind>/skilnad<n><m><sg><ind>$

This may look like a lexical selection problem to be handled by lrx-proc, but in this case the words have a purely stylistic difference. Before the lexical selection stage we run the CG file apertium-nno-nob.nob-nno.biprefs.rlx, which matches on lemmas and stream variables. The rules below say that if the variable forskjell_skilnad is set, we choose "skilnad", otherwise we fall back to "forskjell":

   SELECT ("skilnad"i) IF (0 ("forskjell"i) + (VAR:forskjell_skilnad));
   REMOVE ("skilnad"i) IF (0 ("forskjell"i));

The preference variable here is named forskjell_skilnad since the default is "forskjell", but if that option is ticked / variable is set, we choose "skilnad".
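The logic of this SELECT/REMOVE pair can be mimicked in Python. This is only a sketch of the decision, not of how cg-proc works internally: `apply_bipref` is a hypothetical helper, readings are reduced to bare lemmas, and the set of variables stands in for the stream variables.

```python
from typing import Iterable, List, Set


def apply_bipref(readings: Iterable[str], variables: Set[str]) -> List[str]:
    """Choose between "forskjell" and "skilnad" the way the two rules do."""
    readings = list(readings)
    lemmas = {r.casefold() for r in readings}  # the rules match case-insensitively
    if "forskjell" in lemmas and "skilnad" in lemmas:
        if "forskjell_skilnad" in variables:
            # SELECT ("skilnad"i): keep only the non-default lemma.
            return [r for r in readings if r.casefold() == "skilnad"]
        # REMOVE ("skilnad"i): fall back to the default lemma.
        return [r for r in readings if r.casefold() != "skilnad"]
    return readings


print(apply_bipref(["forskjell", "skilnad"], set()))                  # ['forskjell']
print(apply_bipref(["forskjell", "skilnad"], {"forskjell_skilnad"}))  # ['skilnad']
```

When only one of the two lemmas is present, nothing is removed, which mirrors the condition on "forskjell" in both rules.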

Generator preferences

There are quite a few style preferences that have to do with different spellings of the same word, or alternative paradigms. For example, in Nynorsk words containing -ggj- can also be spelled with -gg-, so "byggje" and "bygge" are both valid spellings of the verb meaning "build". Specifying such things in bidix would lead to a lot of redundancy both within a bidix and across pairs, so we prefer to define these in the generator.

By running the generator with lt-proc -b (normally used for bidix), we get form-ambiguous output that we can apply a disambiguator to. The -g option to cg-proc outputs the selected form without surrounding ^$ or tags (and turns @ into #), and -n suppresses printing the input analysis (what CG calls the "word form"), so instead of ending the pipeline with

   … | lt-proc -g nob-nno.autogen.bin

we do

   … | lt-proc -b nob-nno.autogen.bin | cg-proc -n -g nob-nno.genprefs.rlx.bin

Now we could do it as with bidix, getting ^byggje<vblex><inf>/byggje/bygge$ out of the generator and disambiguating that (to byggje by default) with rules like

   SELECT ("bygge"i) IF (0 ("byggje"i) + (VAR:byggje_bygge));
   REMOVE ("bygge"i) IF (0 ("byggje"i));

or even use some regexes to handle all ggj/gg pairs, but that would quickly get unwieldy in the CG file. We also wouldn't be taking advantage of the information that's already in the nno.dix file at the point of the style ambiguity. So to make this more manageable, we output a tag when generating the non-default reading, and put it in a pardef that we can place in all the style-ambiguous spots:

   <pardef n="v:ggj_gg">
     <e>       <p><l>ggj</l><r>ggj</r></p></e>
     <e r="LR"><p><l>gg</l><r>ggj</r></p></e>
     <e r="RL"><p><l>gg<s n="v:ggj_gg"/></l><r>ggj</r></p></e>
   </pardef>

This does unfortunately mean we get a tag in the middle of the form: /byggje/bygg<v:ggj_gg>e$, but as mentioned above, cg-proc -g will strip tags, so after disambiguation with the following rules:

   SELECT (v:ggj_gg)      IF (0 (VAR:ggj_gg)) ;
   REMOVE (v:ggj_gg) ;

we end up with either byggje or bygge. Now all the style ambiguities are defined in the paradigms of the words in the dictionary, and the CG file only has to do mechanical selection of tags based on stream variables (the generator CG file could technically be created automatically).
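The mechanical nature of this selection can be sketched in Python. This is a toy model under stated assumptions, not cg-proc itself: `pick_form` is a hypothetical helper, each style tag v:NAME is assumed to be controlled by a stream variable called NAME, and a REMOVE is assumed never to delete the last remaining reading.

```python
import re
from typing import Iterable, Set

TAG = re.compile(r"<([^<>]+)>")


def pick_form(forms: Iterable[str], variables: Set[str]) -> str:
    """Pick one generated form based on style tags, then strip all tags."""
    remaining = list(forms)
    style_tags = {t for f in remaining for t in TAG.findall(f) if t.startswith("v:")}
    for tag in sorted(style_tags):
        tagged = [f for f in remaining if tag in TAG.findall(f)]
        untagged = [f for f in remaining if tag not in TAG.findall(f)]
        name = tag[len("v:"):]          # v:ggj_gg is controlled by VAR:ggj_gg
        if name in variables and tagged:
            remaining = tagged          # SELECT (v:ggj_gg) IF (0 (VAR:ggj_gg))
        elif untagged:
            remaining = untagged        # REMOVE (v:ggj_gg)
    # Like cg-proc -g, output the surviving form without tags.
    return TAG.sub("", remaining[0])


forms = ["byggje", "bygg<v:ggj_gg>e"]
print(pick_form(forms, set()))        # byggje
print(pick_form(forms, {"ggj_gg"}))   # bygge
```

Because the function only looks at tag names and variable names, the same loop covers every style-ambiguous paradigm, which is why the rules could in principle be generated automatically.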

See also