Difference between revisions of "Dialectal or standard variation"

From Apertium
Latest revision as of 10:33, 1 November 2022

Non-overlapping variants

Some language varieties differ in lexis and grammar, yet it is still desirable to treat them as one side of a language pair, either because they share largely the same orthography and lexis, or for historical reasons.

For example:

  • Portuguese, Brazilian Portuguese
  • Occitan, Aranese
  • Serbo-Croatian (Bosnia, Croatia, Serbia)

The languages are so similar that duplicating the work in many separate systems is wasteful. In cases such as these, where there are a few well-defined norms/standards for each "macrolanguage", the common method is to define one mode (pipeline) per language norm, each with its own morphological generator etc. lttoolbox has built-in support for marking entries as only relevant for a certain variant, so one can use the same .dix file to create several compiled FSTs.

For analysers/generators, we can use the alt attribute to say that an entry should only be included when compiling with a matching -a/--alt option to lt-comp. If we want to choose between the letters "a" and "e" when generating, but allow both when analysing, we could write

    <pardef n="a_vs_e">
      <e r="LR">           <p><l>e</l>    <r></r></p></e>
      <e r="LR">           <p><l>a</l>    <r></r></p></e>
      <e r="RL" alt="var1"><p><l>e</l>    <r></r></p></e>
      <e r="RL" alt="var2"><p><l>a</l>    <r></r></p></e>
    </pardef>

(and call that paradigm from other entries), and compile the files with

   $ lt-comp --alt=var1 rl foo.dix foo_var1.autogen.bin
   $ lt-comp --alt=var2 rl foo.dix foo_var2.autogen.bin
   $ lt-comp            lr foo.dix foo.automorf.bin
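
The effect of the r and alt attributes on the three compilations above can be modelled in a few lines of Python. This is only a sketch of the inclusion rule as described here, not lttoolbox's actual code; the Entry class and the compiled_entries function are hypothetical names made up for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    surface: str
    r: Optional[str] = None    # direction restriction: "LR", "RL" or None
    alt: Optional[str] = None  # variant restriction, or None for "always"


# The four entries of the a_vs_e paradigm shown above.
A_VS_E = [
    Entry("e", r="LR"),
    Entry("a", r="LR"),
    Entry("e", r="RL", alt="var1"),
    Entry("a", r="RL", alt="var2"),
]


def compiled_entries(entries, direction, alt=None):
    """Return the surface strings that survive a given compilation (toy model)."""
    direction = direction.upper()
    return [
        e.surface
        for e in entries
        # An entry is kept if its direction restriction (if any) matches ...
        if (e.r is None or e.r == direction)
        # ... and its alt restriction (if any) matches the --alt value.
        and (e.alt is None or e.alt == alt)
    ]


print(compiled_entries(A_VS_E, "lr"))          # analyser: both letters
print(compiled_entries(A_VS_E, "rl", "var1"))  # generator for var1
print(compiled_entries(A_VS_E, "rl", "var2"))  # generator for var2
```

Under this model the analyser accepts both "e" and "a", while each generator produces exactly one of them.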

There is also a v attribute that treats an entry as left-to-right if the -v/--var option to lt-comp is unset.


For bidix, we can use attributes vr and vl to a similar effect, with the lt-comp options -r/--var-right and -l/--var-left, respectively.

The below entries from swe-nor.dix

    <e vr="nno"><p><l>hamna</l>	<r>hamne</r></p><par n="vblex"/></e>
    <e vr="nob"><p><l>hamna</l>	<r>havne</r></p><par n="vblex"/></e>

mean that both "hamne" and "havne" translate "hamna" when going right-to-left, while in the left-to-right direction, "hamne" is chosen in the fst compiled with -r nno, while "havne" is chosen in the fst compiled with -r nob.


The options are also documented at Compiling_dictionaries#Compilation_options_and_attributes

Overlapping variants

For some languages, we also have a wide variety of overlapping style preferences, where the above method of one pipeline per "norm" would lead to an explosion of pipelines. Here we need a different method.

For example, in Norwegian Nynorsk all the below are acceptable translations of Bokmål "vi kan søke forskjeller"; the differences are purely stylistic:

  • vi kan søke forskjellar
  • vi kan søka forskjellar
  • vi kan søkje forskjellar
  • vi kan søkja forskjellar
  • vi kan søke skilnader
  • vi kan søka skilnader
  • vi kan søkje skilnader
  • vi kan søkja skilnader
  • me kan søke forskjellar
  • me kan søka forskjellar
  • me kan søkje forskjellar
  • me kan søkja forskjellar
  • me kan søke skilnader
  • me kan søka skilnader
  • me kan søkje skilnader
  • me kan søkja skilnader

(actually "skilnadAr" is also a possibility …)
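
To see why one pipeline per preference set does not scale, the sixteen variants above can be enumerated as the product of three independent choices. This is only an illustrative Python sketch; the variable names are made up for the example and are not part of Apertium.

```python
from itertools import product

# Three independent stylistic choices from the example sentence.
pronouns = ["vi", "me"]
nouns = ["forskjellar", "skilnader"]
infinitives = ["søke", "søka", "søkje", "søkja"]

# Under the one-mode-per-norm method, each combination would need its own pipeline.
variants = [
    f"{pronoun} kan {verb} {noun}"
    for pronoun, noun, verb in product(pronouns, nouns, infinitives)
]

print(len(variants))  # 16
```

Every additional independent choice multiplies the count: allowing "skilnadar" as a third noun form would already give 2 × 3 × 4 = 24 combinations.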

There may be some correlation of people writing "me" and "søkja" vs "vi" and "søke", but in practice there are too many possibilities to create one pipeline per set of preferences. The method we use to solve this in apertium-nno-nob is to

  1. only compile a single pipeline for the Bokmål to Nynorsk direction,
  2. generate "ambiguous" output with all possibilities,
  3. disambiguate between possibilities, picking which to use based on stream variables.

A stream variable here is a little preference cookie inserted at the start of the translation input and understood by cg-proc. The /usr/bin/apertium script knows how to insert the preference variables at the right spot and remove them again from the output; the user just has to export the variable AP_SETVAR with a comma-separated list of preferences:

$ export AP_SETVAR='' ; echo 'Vi kan søke forskjeller' | apertium -d . nob-nno
Me kan søkja forskjellar
$ export AP_SETVAR='me_vi' ; echo 'Vi kan søke forskjeller' | apertium -d . nob-nno
Vi kan søkja forskjellar
$ export AP_SETVAR='me_vi,kj_k' ; echo 'Vi kan søke forskjeller' | apertium -d . nob-nno
Vi kan søka forskjellar
$ export AP_SETVAR='me_vi,kj_k,forskjell_skilnad' ; echo 'Vi kan søke forskjeller' | apertium -d . nob-nno
Vi kan søka skilnader
$ export AP_SETVAR='me_vi,kj_k,forskjell_skilnad,infa_infe' ; echo 'Vi kan søke forskjeller' | apertium -d . nob-nno
Vi kan søke skilnader
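
The way each preference acts as an independent switch on top of a default can be mimicked with a small Python function. This is a toy model that only reproduces the five example outputs above for this one sentence; toy_nob_nno is a hypothetical name and nothing here calls Apertium.

```python
def toy_nob_nno(setvar: str) -> str:
    """Mimic the example outputs above for 'Vi kan søke forskjeller'."""
    # AP_SETVAR is a comma-separated list; the empty string means "all defaults".
    prefs = {p for p in setvar.split(",") if p}

    # Each variable is named default_override and flips exactly one feature.
    pronoun = "Vi" if "me_vi" in prefs else "Me"
    stem = "søk" if "kj_k" in prefs else "søkj"
    ending = "e" if "infa_infe" in prefs else "a"
    noun = "skilnader" if "forskjell_skilnad" in prefs else "forskjellar"

    return f"{pronoun} kan {stem}{ending} {noun}"


print(toy_nob_nno(""))
print(toy_nob_nno("me_vi,kj_k,forskjell_skilnad,infa_infe"))
```

With four on/off variables this single function covers all sixteen combinations, which is the point of using stream variables instead of one pipeline per combination.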

Bidix preferences

Some preference choices are defined in bidix, e.g. those where we choose between synonymous but different lemmas. So that both lemmas are available when translating from right to left, we remove the r="LR" restriction from the entries in question:

    <e> <p><l>skilnad<s n="n"/><s n="m"/>  </l><r>forskjell<s n="n"/><s n="m"/></r></p></e>
    <e> <p><l>forskjell<s n="n"/><s n="m"/></l><r>forskjell<s n="n"/><s n="m"/></r></p></e>

We now get ambiguous output from biltrans:

    $ echo forskjell|apertium -f none -d . nob-nno_e-biltrans
    ^forskjell<n><m><sg><ind>/forskjell<n><m><sg><ind>/skilnad<n><m><sg><ind>$

This may look like a lexical selection problem to be handled by lrx-proc, but in this case the difference between the words is purely stylistic. Before the lexical selection stage we run the CG file apertium-nno-nob.nob-nno.biprefs.rlx, which matches on lemmas and stream variables. The rule below says that if the variable forskjell_skilnad is set, we choose "skilnad"; otherwise we fall back to "forskjell":

   SELECT ("skilnad"i) IF (0 ("forskjell"i) + (VAR:forskjell_skilnad));
   REMOVE ("skilnad"i) IF (0 ("forskjell"i));

The preference variable here is named forskjell_skilnad since the default is "forskjell", but if that option is ticked / variable is set, we choose "skilnad".
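
The combined effect of the SELECT/REMOVE pair can be sketched in Python as follows. This is a simplified model of the outcome for one cohort, not of how cg-proc evaluates rules; apply_bipref and its parameters are hypothetical names for the illustration.

```python
def apply_bipref(readings, variables,
                 default="forskjell", override="skilnad",
                 var="forskjell_skilnad"):
    """Return the readings left after the two rules above (toy model)."""
    # Both rules only fire when the cohort contains the default lemma,
    # and there is nothing to decide unless the override is present too.
    if default not in readings or override not in readings:
        return list(readings)
    if var in variables:
        # SELECT: the variable is set, keep only the override lemma.
        return [r for r in readings if r == override]
    # REMOVE: the variable is not set, drop the override lemma.
    return [r for r in readings if r != override]


print(apply_bipref(["forskjell", "skilnad"], set()))
print(apply_bipref(["forskjell", "skilnad"], {"forskjell_skilnad"}))
```

The first call leaves only "forskjell" (the default), the second only "skilnad".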


See nno-nob commit f448046 (you can ignore the dev/ changes) for an example of going from bidix LR's to variants. Here the word "motsetnad" used to be marked LR in bidix and never chosen when translating right-to-left. After the change, bidix outputs both "motsetnad" and "motsetning", and the CG rule picks one of them based on what stream variable is set.

Generator preferences

There are quite a few style preferences that have to do with different spellings of the same word, or alternative paradigms. For example, Nynorsk words containing -ggj- can also be spelled with -gg-, so "byggje" and "bygge" are alternative spellings of the verb "build". Specifying such things in bidix would lead to a lot of redundancy both within a bidix and across pairs, so we prefer to define these in the generator. By running the generator with lt-proc -b (normally used for bidix), we can apply a disambiguator to the form-ambiguous output. The -g option to cg-proc will output the selected form without surrounding ^$ or tags (and turn @ into #), and -n will suppress printing the input analysis (what CG calls "word form"), so instead of ending the pipeline with

   … | lt-proc -g nob-nno.autogen.bin

we do

   … | lt-proc -b nob-nno.autogen.bin | cg-proc -n -g nob-nno.genprefs.rlx.bin

Now we could do it as with bidix, getting ^byggje<vblex><inf>/byggje/bygge$ out of the generator and disambiguating that to byggje with a rule like

   SELECT ("bygge"i) IF (0 ("byggje"i) + (VAR:byggje_bygge));
   REMOVE ("bygge"i) IF (0 ("byggje"i));

or even use some regexes to handle all ggj/gg pairs, but that would quickly get unwieldy in the CG file. We also wouldn't be taking advantage of the information that's already in the nno.dix file at the point of the style ambiguity. So to make this more manageable, we output a tag when generating the non-default reading, and put it in a pardef that we can place in all the style-ambiguous spots:

    <pardef n="v:ggj_gg">
      <e>          <p><l>ggj</l>                  <r>ggj</r></p></e>
      <e r="LR">   <p><l>gg</l>                   <r>ggj</r></p></e>
      <e r="RL">   <p><l>gg<s n="v:ggj_gg"/></l>  <r>ggj</r></p></e>
    </pardef>

This does unfortunately mean we get a tag in the middle of the form: /byggje/bygg<v:ggj_gg>e$, but as mentioned above, cg-proc -g will strip tags, so after disambiguation with the following rules:

   SELECT (v:ggj_gg)      IF (0 (VAR:ggj_gg)) ;
   REMOVE (v:ggj_gg) ;

we end up with either byggje or bygge. Now all the style ambiguities are defined in the paradigms of the words in the dictionary, and the CG file only has to do mechanical selection of tags based on stream variables (the generator CG file could technically be created automatically).
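
The mechanical nature of this tag-based selection can be illustrated with a short Python sketch. It is a toy model under simplifying assumptions (forms never contain "/", tags are visited in alphabetical order, at least one form is always kept) and does not reproduce cg-proc's actual rule application; pick_form is a hypothetical name.

```python
import re

# Preference tags look like <v:ggj_gg> and sit inside the generated form.
TAG = re.compile(r"<(v:[^>]+)>")


def pick_form(generator_output, variables):
    """Pick one surface form from '^analysis/form1/form2$' (toy model)."""
    _analysis, *forms = generator_output.strip("^$").split("/")
    # Pair each form with the set of preference tags embedded in it.
    readings = [(form, set(TAG.findall(form))) for form in forms]
    # Visit tags in a fixed order so the result is deterministic.
    for tag in sorted({t for _, tags in readings for t in tags}):
        tagged = [r for r in readings if tag in r[1]]
        untagged = [r for r in readings if tag not in r[1]]
        if tag[len("v:"):] in variables and tagged:
            readings = tagged    # models: SELECT (v:x) IF (0 (VAR:x))
        elif untagged:
            readings = untagged  # models: REMOVE (v:x), keeping at least one form
    # Like cg-proc -g, strip the tags from the surviving form.
    return TAG.sub("", readings[0][0])


out = "^byggje<vblex><inf>/byggje/bygg<v:ggj_gg>e$"
print(pick_form(out, set()))
print(pick_form(out, {"ggj_gg"}))
```

Because the function only looks at tags and variable names, it never needs to know about individual lemmas, which mirrors why the generator CG file could in principle be created automatically.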

Preference sets

Each of the stream variables works as an off/on switch (or default/override) for a single feature. But oftentimes you have a set of features that define a language norm (e.g. "the Catalan Valencian Language Academy norm"). In order to "hardcode" the features for a certain norm, you can simply create a CG file that runs between the generator and the variable applier CG:

   … | lt-proc -b nob-nno.autogen.bin | cg-proc news-norm.rlx.bin | cg-proc -g -n nob-nno.genprefs.rlx.bin

This file just unconditionally overrides any features that need overriding for that norm:

   SELECT (v:ggj_gg) ;
   SELECT (v:me_vi) ;

(any other features will be REMOVE'd by the following file as long as no stream variables are set).

Inspecting/debugging bidix preferences

'veps_kvefs' is a bidix-defined preference, so the output from bidix will be ambiguous:

$ echo veps | apertium -f none -d . nob-nno_e-biltrans
^veps<n><m><sg><ind><aa><@subj>/veps<n><m><sg><ind><aa><@subj>/kvefs<n><m><sg><ind><aa><@subj>$

Without setting the variable (or with it set to empty string), the biprefs CG picks veps (first one after slash):

$ echo veps | apertium -f none -d . nob-nno_e-biprefs
^veps<n><m><sg><ind><aa><@subj>/veps<n><m><sg><ind><aa><@subj>/¬kvefs<n><m><sg><ind><aa><@subj><REMOVE:63:veps_veps>$

With it set, the tl reading (first after slash) is kvefs:

$ echo veps | AP_SETVAR='veps_kvefs' apertium -f none -d . nob-nno_e-biprefs
^veps<n><m><sg><ind><aa><@subj>/kvefs<n><m><sg><ind><aa><@subj><SELECT:62:veps_kvefs>/¬veps<n><m><sg><ind><aa><@subj><SELECT:62:veps_kvefs>$

Inspecting/debugging generator preferences

'infa_infe' and 'kj_k' are generator preferences, so the output from the generator will have these tags on the form side:

$ echo 'søke' | apertium -f none -d . nob-nno-dgen
^søke<vblex><inf>/søkja/søk<v:kj_k>a/søkje<v:infa_infe>/søk<v:kj_k>e<v:infa_infe>$

With the variables unset, the genprefs CG picks 'søkja':

$ echo 'søke' | AP_SETVAR='' apertium -d . nob-nno-genprefs
søkja/¬søka<v:kj_k><REMOVE:28>/¬søkje<v:infa_infe><REMOVE:20>/¬søke<v:kj_k><v:infa_infe><REMOVE:20>

With both prefs set, the genprefs CG picks 'søke':

$ echo 'søke' | AP_SETVAR='infa_infe,kj_k' apertium -d . nob-nno-genprefs
søke<v:kj_k><v:infa_infe><SELECT:19><SELECT:27>/¬søkja<SELECT:19>/¬søka<v:kj_k><SELECT:19>/¬søkje<v:infa_infe><SELECT:19><SELECT:27>

(The tags are still visible in output because cg-proc is running in trace mode.)

Enabling preferences in a language pair

To start using preferences in a language pair that wasn't using them before, first add the CGs that read the variables to modes.xml and change the generator to run in bilingual mode (lt-proc -b). For example, to add them to the fra-oci mode:

@@ -21,6 +21,9 @@
       <program name="lt-proc -b">
         <file name="fra-oci.autobil.bin"/>
       </program>
+      <program name="cg-proc" debug-suff="biprefs">
+        <file name="fra-oci.biprefs.rlx.bin"/>
+      </program>
       <program name="lrx-proc -m">
         <file name="fra-oci.autolex.bin"/>
       </program>
@@ -60,9 +63,12 @@
         <file name="fra-oci.autosep2.bin"/>
       </program>
       <program name="apertium-posttransfer" debug-suff="posttransfer"/>
-      <program name="lt-proc $1">
+      <program name="lt-proc $1 -b" debug-suff="dgen">
         <file name="fra-oci.autogen.bin"/>
       </program>
+      <program name="cg-proc -1 -n -g" debug-suff="genprefs">
+        <file name="fra-oci.genprefs.rlx.bin"/>
+      </program>
       <program name="lt-proc -p">
         <file name="fra-oci.autopgen.bin"/>
       </program>

You also need the following files:

  • apertium-oci/oci.preferences.xml
  • apertium-oci/apertium-oci.oci.genprefs.rlx
  • apertium-oci-fra/fra-oci.preferences.xml
  • apertium-oci-fra/apertium-oci-fra.fra-oci.biprefs.rlx

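The Makefile.am also needs some changes, see

  • https://github.com/apertium/apertium-nno-nob/compare/1b49a641279c8daec06f52c40875bd38fadaa535..178d28289e1b1944139c04211f3723eda340073f pair (ignore the $(PREFIX2)_e thing, that's just for backward compatibility with an old mode)
  • https://github.com/apertium/apertium-nno/compare/77d3cf3f90e234ac0e45b17e231399ab364e2f46%5E..0b783a7a40f51bc02e55f30d98261fb5461267c2 generator
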
(or grep -i "prefs" apertium-{nno,nno-nob}/Makefile.am)

Describing preferences to the user

See https://github.com/apertium/apertium/issues/118

See also