Difference between revisions of "Dialectal or standard variation"

From Apertium
(lots of docs)
== Non-overlapping variants ==
Some languages have differences in lexis and grammar, but are still
desirable to be treated as one side of a language pair, as either they
have a largely similar orthography and lexis, or for historical
reasons.


For example:

* Portuguese, Brazilian Portuguese
* Occitan, Aranese
* Serbo-Croatian (Bosnia, Croatia, Serbia)


The languages are so similar that duplicating the work in many
separate systems is wasteful. In cases such as these, where each
"macrolanguage" has a few well-defined norms/standards, the common
method is to define one mode (pipeline) per language norm, each with
its own morphological generator etc. [[lttoolbox]] has built-in support
for marking entries as relevant only to a certain variant, so the same
.dix file can be compiled into several FSTs.

For '''analysers/generators''', we can use the <tt>alt</tt> attribute to say that
an entry should only be included when the dictionary is compiled with a
matching <tt>-a/--alt</tt> option to <tt>lt-comp</tt>. If we want to choose between the
letters "a" and "e" when generating, but allow both when analysing, we
could do

<pardef n="a_vs_e">
<e r="LR"> <p><l>e</l> <r></r></p></e>
<e r="LR"> <p><l>a</l> <r></r></p></e>
<e r="RL" alt="var1"><p><l>e</l> <r></r></p></e>
<e r="RL" alt="var2"><p><l>a</l> <r></r></p></e>
</pardef>

(and call that paradigm from other entries), and compile the files
with

$ lt-comp --alt=var1 rl foo.dix foo_var1.autogen.bin
$ lt-comp --alt=var2 rl foo.dix foo_var2.autogen.bin
$ lt-comp lr foo.dix foo.automorf.bin
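
The effect of the three compilations above can be sketched with a toy model in Python. This is only an illustration of the filtering described in the text, not how <tt>lt-comp</tt> is implemented; the <tt>Entry</tt> class and <tt>compile_entries</tt> function are hypothetical names, and the model assumes that entries without <tt>alt</tt> are always kept.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for one <e> element of the a_vs_e pardef."""
    surface: str                       # contents of <l>
    restriction: Optional[str] = None  # "LR", "RL", or None for both directions
    alt: Optional[str] = None          # value of the alt attribute, if any


# The four entries of the a_vs_e paradigm shown above.
A_VS_E = [
    Entry("e", "LR"),
    Entry("a", "LR"),
    Entry("e", "RL", alt="var1"),
    Entry("a", "RL", alt="var2"),
]


def compile_entries(entries: List[Entry], direction: str,
                    alt: Optional[str] = None) -> List[str]:
    """Return the surface forms kept for a direction ("lr"/"rl") and alt value."""
    kept = []
    for entry in entries:
        # Skip entries restricted to the other direction.
        if entry.restriction is not None and entry.restriction.lower() != direction:
            continue
        # Assumption: an entry carrying alt is kept only when it matches.
        if entry.alt is not None and entry.alt != alt:
            continue
        kept.append(entry.surface)
    return kept


print(compile_entries(A_VS_E, "rl", alt="var1"))  # generator for var1
print(compile_entries(A_VS_E, "rl", alt="var2"))  # generator for var2
print(compile_entries(A_VS_E, "lr"))              # shared analyser
```

In this model each generator keeps exactly one of the two letters, while the analyser keeps both.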

There is also a <tt>v</tt> attribute that treats an entry as left-to-right
if the <tt>-v/--var</tt> option to <tt>lt-comp</tt> is unset.


For '''bidix''', we can use attributes <tt>vr</tt> and <tt>vl</tt> to a similar effect,
with the <tt>lt-comp</tt> options <tt>-r/--var-right</tt> and
<tt>-l/--var-left</tt>, respectively.

The entries below, from swe-nor.dix,

<e vr="nno"><p><l>hamna</l> <r>hamne</r></p><par n="vblex"/></e>
<e vr="nob"><p><l>hamna</l> <r>havne</r></p><par n="vblex"/></e>

mean that both "hamne" and "havne" translate "hamna" when going
right-to-left, while in the left-to-right direction, "hamne" is chosen
in the fst compiled with <tt>-r nno</tt>, while "havne" is chosen in the fst
compiled with <tt>-r nob</tt>.
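
As a rough illustration of that behaviour, the following Python sketch models the two entries as plain dictionaries. The function names are hypothetical and the filtering rule is an assumption read off the description above (<tt>vr</tt> restricts only the left-to-right direction); it is not the <tt>lt-comp</tt> implementation.

```python
# The two swe-nor.dix entries from above, as plain dictionaries.
ENTRIES = [
    {"l": "hamna", "r": "hamne", "vr": "nno"},
    {"l": "hamna", "r": "havne", "vr": "nob"},
]


def translate_lr(entries, word, var_right):
    """Left-to-right: keep entries without vr, or whose vr matches var_right."""
    return [e["r"] for e in entries
            if e["l"] == word and e.get("vr") in (None, var_right)]


def translate_rl(entries, word):
    """Right-to-left: vr places no restriction, so every matching entry applies."""
    return [e["l"] for e in entries if e["r"] == word]


print(translate_lr(ENTRIES, "hamna", "nno"))  # FST compiled with -r nno
print(translate_lr(ENTRIES, "hamna", "nob"))  # FST compiled with -r nob
print(translate_rl(ENTRIES, "hamne"), translate_rl(ENTRIES, "havne"))
```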


The options are also documented at
[[Compiling_dictionaries#Compilation_options_and_attributes]]

= Overlapping variants =
For some languages, we also have a wide variety of overlapping style
preferences, where the above method of one pipeline per "norm" would
lead to an explosion of pipelines. Here we need a different method.

For example, in Norwegian Nynorsk all the below are acceptable
translations of Bokmål "vi kan søke forskjeller"; the differences are
purely stylistic:

* vi kan søke forskjellar
* vi kan søka forskjellar
* vi kan søkje forskjellar
* vi kan søkja forskjellar
* vi kan søke skilnader
* vi kan søka skilnader
* vi kan søkje skilnader
* vi kan søkja skilnader
* me kan søke forskjellar
* me kan søka forskjellar
* me kan søkje forskjellar
* me kan søkja forskjellar
* me kan søke skilnader
* me kan søka skilnader
* me kan søkje skilnader
* me kan søkja skilnader

<small>(actually "skilnadar" is also a possibility …)</small>
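
The sixteen variants listed above are simply the product of three independent choices (pronoun, infinitive ending, noun), which is why the number of combinations grows so quickly. A small Python sketch, using only the word forms from the list:

```python
from itertools import product

# The three independent style choices visible in the list above.
pronouns = ["vi", "me"]
nouns = ["forskjellar", "skilnader"]
infinitives = ["søke", "søka", "søkje", "søkja"]

# Iterate in the same order as the list: pronoun, then noun, then infinitive.
variants = [f"{pronoun} kan {verb} {noun}"
            for pronoun, noun, verb in product(pronouns, nouns, infinitives)]

print(len(variants))  # 2 * 2 * 4 combinations
for sentence in variants:
    print(sentence)
```

Every additional independent preference multiplies the count again, so one pipeline per combination does not scale.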

There may be some correlation of people writing "me" and "søkja" vs
"vi" and "søke", but in practice there are too many possibilities to
create one pipeline per set of preferences. The method we use to solve
this in <tt>apertium-nno-nob</tt> is to

# only compile a single pipeline for the Bokmål to Nynorsk direction,
# generate "ambiguous" output with all possibilities,
# disambiguate between possibilities, picking which to use based on stream variables.

A [https://visl.sdu.dk/cg3/single/#streamcmds stream variable] here is a little preference cookie inserted at the
start of the translation input, understood by <tt>cg-proc</tt>.

== Bidix preferences ==
Some preference choices are defined in bidix, e.g. those where we
choose between synonymous but different lemmas. For translation from
right to left, we remove the <tt>r="LR"</tt> restriction from the entries in question:

<e> <p><l>skilnad<s n="n"/><s n="m"/></l><r>forskjell<s n="n"/><s n="m"/></r></p></e>
<e> <p><l>forskjell<s n="n"/><s n="m"/></l><r>forskjell<s n="n"/><s n="m"/></r></p></e>

We now get ambiguous output from biltrans:

$ echo forskjell|apertium -f none -d . nob-nno_e-biltrans
^forskjell<n><m><sg><ind>/forskjell<n><m><sg><ind>/skilnad<n><m><sg><ind>$

This may look like a lexical selection problem to be handled by
<tt>lrx-proc</tt>, but in this case the difference between the words is
purely stylistic. Before the lexical selection stage we run the CG file
<tt>apertium-nno-nob.nob-nno.biprefs.rlx</tt>, which matches on lemmas and
stream variables. The rules below say that if the variable
<tt>forskjell_skilnad</tt> is set, we choose "skilnad"; otherwise we fall
back to "forskjell":

SELECT ("skilnad"i) IF (0 ("forskjell"i) + (VAR:forskjell_skilnad));
REMOVE ("skilnad"i) IF (0 ("forskjell"i));

The preference variable here is named <tt>forskjell_skilnad</tt> since the
default is "forskjell", but if that option is ticked / variable is set,
we choose "skilnad".
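
The combined effect of the SELECT/REMOVE pair can be sketched in Python as follows. This is a simplified model for illustration only: readings are reduced to bare lemmas, <tt>apply_preference</tt> is a hypothetical name, and real CG-3 processing involves much more than this.

```python
def apply_preference(lemmas, variables,
                     default="forskjell",
                     alternative="skilnad",
                     variable="forskjell_skilnad"):
    """Toy model of the two biprefs rules, applied to the lemmas of one cohort."""
    def is_lemma(reading, lemma):
        # The "i" flag in the CG rules makes the lemma match case-insensitive.
        return reading.lower() == lemma.lower()

    readings = list(lemmas)

    # SELECT (alternative) IF (0 (default) + (VAR:variable)):
    # keep only the alternative when the variable is set and the default is present.
    if variable in variables and any(is_lemma(r, default) for r in readings):
        chosen = [r for r in readings if is_lemma(r, alternative)]
        if chosen:
            readings = chosen

    # REMOVE (alternative) IF (0 (default)):
    # otherwise drop the alternative whenever the default is still available.
    if any(is_lemma(r, default) for r in readings):
        readings = [r for r in readings if not is_lemma(r, alternative)]

    return readings


print(apply_preference(["forskjell", "skilnad"], {"forskjell_skilnad"}))
print(apply_preference(["forskjell", "skilnad"], set()))
```

With the variable set, only "skilnad" survives; without it, only "forskjell" does. A cohort that has no "forskjell" reading is left untouched.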

== Generator preferences ==
There are quite a few style preferences that have to do with different
spellings of the same word, or alternative paradigms. For example, in
Nynorsk words containing -ggj- can also be expressed -gg-, so both
"byggje" and "bygge" are alternative ways of spelling
''build''. Specifying such things in bidix would lead to a lot of
redundancy both within a bidix and across pairs, so we prefer to
define these in the '''generator'''. By running the generator with
<tt>lt-proc -b</tt> (normally used for bidix), we can apply a disambiguator
to the form-ambiguous output. The <tt>-g</tt> option to <tt>cg-proc</tt> will output
the selected form without surrounding <tt>^$</tt> or tags (and turn <tt>@</tt> into
<tt>#</tt>), and <tt>-n</tt> will suppress printing the input analysis (what CG
calls "word form"), so instead of ending the pipeline with

… | lt-proc -g nob-nno.autogen.bin

we do

… | lt-proc -b nob-nno.autogen.bin | cg-proc -n -g nob-nno.genprefs.rlx.bin

Now we could do it as with bidix, getting
<tt>^byggje<vblex><inf>/byggje/bygge$</tt> out of the generator and disambiguating that to
<tt>byggje</tt> (or to <tt>bygge</tt> if the variable is set) with rules like

SELECT ("bygge"i) IF (0 ("byggje"i) + (VAR:byggje_bygge));
REMOVE ("bygge"i) IF (0 ("byggje"i));

or even use some regexes to handle all ggj/gg pairs, but that would
quickly get unwieldy in the CG file. We also wouldn't be taking
advantage of the information that's already in the nno.dix file at the
point of the style ambiguity. So to make this more manageable, we output
a tag when generating the non-default reading, and put it in a pardef
that we can call from all the style-ambiguous spots:

<pardef n="v:ggj_gg">
<e> <p><l>ggj</l> <r>ggj</r></p></e>
<e r="LR"> <p><l>gg</l> <r>ggj</r></p></e>
<e r="RL"> <p><l>gg<s n="v:ggj_gg"/></l> <r>ggj</r></p></e>
</pardef>

This does unfortunately mean we get a tag in the middle of the form:
<tt>/byggje/bygg<v:ggj_gg>e$</tt>, but as mentioned above, <tt>cg-proc -g</tt> will
strip tags, so after disambiguation with the following rules:

SELECT (v:ggj_gg) IF (0 (VAR:ggj_gg)) ;
REMOVE (v:ggj_gg) ;


we end up with either <tt>byggje</tt> or <tt>bygge</tt>. Now all the style
ambiguities are defined in the paradigms of the words in the
dictionary, and the CG file only has to do mechanical selection of
tags based on stream variables (the generator CG file could
technically be created automatically).
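
As a rough sketch of how such automatic creation might look, the Python below collects every <tt>v:</tt>-prefixed tag from a dictionary and prints the corresponding rule pair. It assumes that the stream variable name is always the tag name minus the <tt>v:</tt> prefix, which is a guess based on the single <tt>v:ggj_gg</tt> example; the function names and the sample dictionary wrapper are hypothetical, and a complete CG file would need more than these rule lines.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal dictionary wrapping the v:ggj_gg pardef from above.
SAMPLE_DIX = """<dictionary><pardefs>
<pardef n="v:ggj_gg">
<e>       <p><l>ggj</l> <r>ggj</r></p></e>
<e r="LR"><p><l>gg</l> <r>ggj</r></p></e>
<e r="RL"><p><l>gg<s n="v:ggj_gg"/></l> <r>ggj</r></p></e>
</pardef>
</pardefs></dictionary>"""


def preference_tags(dix_xml):
    """Return the sorted set of <s n="v:..."> tag names used in the dictionary."""
    root = ET.fromstring(dix_xml)
    return sorted({s.get("n") for s in root.iter("s")
                   if s.get("n", "").startswith("v:")})


def genprefs_rules(tags):
    """Emit one SELECT/REMOVE pair per preference tag, mirroring the rules above."""
    lines = []
    for tag in tags:
        variable = tag[len("v:"):]  # assumed naming convention: v:NAME -> VAR:NAME
        lines.append(f"SELECT ({tag}) IF (0 (VAR:{variable})) ;")
        lines.append(f"REMOVE ({tag}) ;")
    return "\n".join(lines)


print(genprefs_rules(preference_tags(SAMPLE_DIX)))
```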


== See also ==
* [[Unification of metadix and parametrized dictionaries]] on ''variants'' in monodix and transfer rules (not quite relevant any longer; the xslt features have been built into lt-comp)

[[Category:Development]]

Revision as of 08:33, 20 March 2021
