Autoconcord

From Apertium
Revision as of 17:53, 25 October 2010 by Unhammer (talk | contribs) (wiki-markup doesn't work in <pre> ;))

Making the bidix concord with the monodices

The apertium-dixtools package contains a tool that automatically makes symbols (gender, number, ...) in the bidix agree with the monodices.

How does it work?

Some preparations are needed.

The tool looks in the monodices for a special autoconcord comment (the c attribute) on the paradigms:

<pardef n="ackord__n" c="autoconcord:nt,sp">
  <e>       <p><l></l>          <r><s n="n"/><s n="nt"/><s n="sp"/><s n="ind"/></r></p></e>
  <e>       <p><l>et</l>        <r><s n="n"/><s n="nt"/><s n="sg"/><s n="def"/></r></p></e>
  <e>       <p><l>en</l>        <r><s n="n"/><s n="nt"/><s n="pl"/><s n="def"/></r></p></e>
</pardef>
...

<e lm="avbrott">         <i>avbrott</i><par n="ackord__n"/></e>

This comment makes all entries using paradigm ackord__n have the autoconcord symbols 'nt' and 'sp'.
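
To make the lookup concrete, here is a minimal Python sketch of how the autoconcord symbols of a lemma could be read from a monodix. It is an illustration only, not the code in apertium-dixtools (which is written in Java); the function name autoconcord_symbols and the trimmed-down MONODIX string are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down monodix built from the fragments above (illustrative only).
MONODIX = """<dictionary>
  <pardefs>
    <pardef n="ackord__n" c="autoconcord:nt,sp">
      <e><p><l></l><r><s n="n"/><s n="nt"/><s n="sp"/><s n="ind"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="avbrott"><i>avbrott</i><par n="ackord__n"/></e>
  </section>
</dictionary>"""

PREFIX = "autoconcord:"


def autoconcord_symbols(monodix_xml):
    """Map each lemma to the symbols listed in its paradigm's c attribute."""
    root = ET.fromstring(monodix_xml)

    # Collect the symbols declared on each paradigm, e.g. ackord__n -> [nt, sp].
    by_pardef = {}
    for pardef in root.iter("pardef"):
        comment = pardef.get("c", "")
        if comment.startswith(PREFIX):
            by_pardef[pardef.get("n")] = comment[len(PREFIX):].split(",")

    # Entries inherit the symbols of the paradigm they reference.
    by_lemma = {}
    for entry in root.iter("e"):
        lemma = entry.get("lm")
        par = entry.find("par")
        if lemma is not None and par is not None and par.get("n") in by_pardef:
            by_lemma[lemma] = by_pardef[par.get("n")]
    return by_lemma


print(autoconcord_symbols(MONODIX))  # {'avbrott': ['nt', 'sp']}
```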

The bidix contains:

 <e><p><l>avbrott<s n="n"/></l><r>afbrydelse<s n="n"/></r></p></e>


The right dix has the autoconcord symbols 'ut' and 'sgpl' for the lemma:

<pardef n="abe__n" c="autoconcord:ut,sgpl">
  <e>       <p><l></l>          <r><s n="n"/><s n="ut"/><s n="sg"/><s n="ind"/></r></p></e>
  <e>       <p><l>n</l>         <r><s n="n"/><s n="ut"/><s n="sg"/><s n="def"/></r></p></e>
  <e>       <p><l>r</l>         <r><s n="n"/><s n="ut"/><s n="pl"/><s n="ind"/></r></p></e>
  <e>       <p><l>rne</l>       <r><s n="n"/><s n="ut"/><s n="pl"/><s n="def"/></r></p></e>
</pardef>
...

<e lm="afbrydelse">      <i>afbrydelse</i><par n="abe__n"/></e>

What does it do?

Autoconcord tries to make the autoconcord symbols of the left dix (nt,sp) concord with those of the right dix (ut,sgpl). It does so by pairing them one by one, in order: nt-ut and sp-sgpl. It then searches the bidix for paradigms with the special autoconcord comments "autoconcord:nt-ut" and "autoconcord:sp-sgpl":

<pardef n="_nt_ut" c="autoconcord:nt-ut">
  <e>       <p><l><s n="nt"/></l><r><s n="ut"/></r></p></e>
</pardef>

<pardef n="_sp_sgpl" c="autoconcord:sp-sgpl">
  <e r="LR"><p><l><s n="sp"/><s n="ind"/></l><r><s n="ND"/><s n="ind"/></r></p></e>
  <e r="RL"><p><l><s n="sp"/><s n="ind"/></l><r><s n="sg"/><s n="ind"/></r></p></e>
  <e r="RL"><p><l><s n="sp"/><s n="ind"/></l><r><s n="pl"/><s n="ind"/></r></p></e>
  <e>       <p><l><s n="sg"/><s n="def"/></l><r><s n="sg"/><s n="def"/></r></p></e>
  <e>       <p><l><s n="pl"/><s n="def"/></l><r><s n="pl"/><s n="def"/></r></p></e>
</pardef>

and then it will change the bidix entry from

 <e><p><l>avbrott<s n="n"/></l><r>afbrydelse<s n="n"/></r></p></e>

to include the autoconcord paradigms in the bidix:

 <e><p><l>avbrott<s n="n"/></l><r>afbrydelse<s n="n"/></r></p><par n="_nt_ut"/><par n="_sp_sgpl"/></e>
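
The pairing step can be sketched in Python as follows. This is a hedged illustration, not the tool's actual implementation: the function concord_entry, the BIDIX_PARDEFS table (which stands in for the bidix paradigms found by their autoconcord comments) and the choice to leave an entry untouched when a pair has no matching paradigm are all assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Symbol pair -> bidix paradigm name, as declared by c="autoconcord:..." in the bidix.
BIDIX_PARDEFS = {"nt-ut": "_nt_ut", "sp-sgpl": "_sp_sgpl"}

ENTRY = '<e><p><l>avbrott<s n="n"/></l><r>afbrydelse<s n="n"/></r></p></e>'


def concord_entry(entry_xml, left_symbols, right_symbols, pardefs):
    """Append one <par> per (left, right) symbol pair to a bidix entry."""
    entry = ET.fromstring(entry_xml)
    # Pair the symbols one by one, in order: nt-ut, then sp-sgpl.
    for left, right in zip(left_symbols, right_symbols):
        name = pardefs.get(f"{left}-{right}")
        if name is None:
            # Assumption: an unresolved pair means the entry is returned unchanged.
            return entry_xml
        ET.SubElement(entry, "par", {"n": name})
    return ET.tostring(entry, encoding="unicode")


print(concord_entry(ENTRY, ["nt", "sp"], ["ut", "sgpl"], BIDIX_PARDEFS))
```

The printed entry carries the two paradigm references _nt_ut and _sp_sgpl after the <p> element, matching the corrected entry above apart from whitespace inside the empty tags.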


Note: The _ prefix in the pardef names has no special meaning; it is only there to make them easy to distinguish. The pardef names used for autoconcord can be anything.

Variations

Some autoconcord paradigms are not really useful to insert. For example, sp-sp and sgpl-sgpl are trivial. You can avoid insertion of these paradigms by appending /omit to their autoconcord comments in the bidix:

<pardef n="_sgpl_sgpl" c="autoconcord:sgpl-sgpl/omit">
  <e>       <i></i></e>
</pardef>

<pardef n="_sp_sp" c="autoconcord:sp-sp/omit">
  <e>       <i></i></e>
</pardef>

If you want to 'inline' a paradigm, that is, have the paradigm's symbols expanded directly in the entry, add /expand to the autoconcord comment:

<pardef n="_nt_ut" c="autoconcord:nt-ut/expand">
  <e>       <p><l><s n="nt"/></l><r><s n="ut"/></r></p></e>
</pardef>

Then the corrected bidix entry will be:

 <e><p><l>avbrott<s n="n"/><s n="nt"/></l><r>afbrydelse<s n="n"/><s n="ut"/></r></p><par n="_sp_sgpl"/></e>

Note that inline/expandable paradigms must have exactly one entry.
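
The two variations can be sketched in Python like this. It is a rough illustration under stated assumptions, not the dixtools code: parse_bidix_comment and expand_into are hypothetical helper names, and "insert" is simply a label chosen here for a comment that carries neither /omit nor /expand.

```python
import copy
import xml.etree.ElementTree as ET


def parse_bidix_comment(comment):
    """Split 'autoconcord:nt-ut/expand' into ('nt', 'ut', 'expand')."""
    prefix = "autoconcord:"
    if not comment.startswith(prefix):
        return None
    pair, _, mode = comment[len(prefix):].partition("/")
    left, _, right = pair.partition("-")
    return left, right, mode or "insert"


def expand_into(entry, pardef):
    """Copy the symbols of a one-entry paradigm directly into a bidix entry."""
    entries = pardef.findall("e")
    if len(entries) != 1:
        raise ValueError("expandable paradigms must have exactly one entry")
    for side in ("l", "r"):
        target = entry.find(f"p/{side}")
        for sym in entries[0].findall(f"p/{side}/s"):
            target.append(copy.deepcopy(sym))


entry = ET.fromstring(
    '<e><p><l>avbrott<s n="n"/></l><r>afbrydelse<s n="n"/></r></p></e>')
pardef = ET.fromstring(
    '<pardef n="_nt_ut" c="autoconcord:nt-ut/expand">'
    '<e><p><l><s n="nt"/></l><r><s n="ut"/></r></p></e></pardef>')

left, right, mode = parse_bidix_comment(pardef.get("c"))
if mode == "expand":
    expand_into(entry, pardef)   # symbols go inline, no <par> is added
elif mode == "insert":
    ET.SubElement(entry, "par", {"n": pardef.get("n")})
# mode == "omit": nothing is added to the entry

print([s.get("n") for s in entry.findall("p/l/s")])  # ['n', 'nt']
print([s.get("n") for s in entry.findall("p/r/s")])  # ['n', 'ut']
```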

The -replace parameter

While processing the bidix entries, autoconcord first deletes all paradigm references and the symbols to be replaced (usually gender symbols like m, f, nt and ut). This is what makes the inlining/expansion of symbols explained above work when the tool is run again on an already processed bidix.

The -replace parameter specifies which symbols are deleted if they appear in an entry. The default value is 'm,f,mf,ut,nt,un'.
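
As a rough Python sketch of this clean-up step (an illustration, not the tool's code; strip_entry is a hypothetical helper name, and the example entry is the expanded one from the previous section):

```python
import xml.etree.ElementTree as ET

DEFAULT_REPLACE = "m,f,mf,ut,nt,un"


def strip_entry(entry, replace=DEFAULT_REPLACE):
    """Remove all <par> references and every symbol named in the replace list."""
    doomed = set(replace.split(","))
    for par in entry.findall("par"):
        entry.remove(par)
    # Visit both sides of the pair (<l> and <r>) and drop the listed symbols.
    for side in entry.findall("p/*"):
        for sym in side.findall("s"):
            if sym.get("n") in doomed:
                side.remove(sym)


entry = ET.fromstring(
    '<e><p><l>avbrott<s n="n"/><s n="nt"/></l>'
    '<r>afbrydelse<s n="n"/><s n="ut"/></r></p><par n="_sp_sgpl"/></e>')
strip_entry(entry)

print([s.get("n") for s in entry.findall("p/l/s")])  # ['n']
print([s.get("n") for s in entry.findall("p/r/s")])  # ['n']
print(entry.findall("par"))                          # []
```

After this step the entry is back to its bare form, so the concord step can add the paradigms or inline symbols again without duplicating them.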

If you are the unlucky owner of a language pair where you must maintain the synthetic adjective tag (<sint>) in the bidix, you could write autoconcord rules to fix that (i.e. to add or remove <sint> in the bidix automatically). In that case you would, among other things, pass -replace sint as a parameter.


Invocation

Usage: apertium-dixtools autoconcord [-prefix symbol(s)] [-replace symbols]  [-leftMon mon1.dix] [-rightMon mon2.dix] bidix.dix [output.dix]
autoconcord -prepare [-leftMon mon1.dix] [-rightMon mon2.dix] bidix.dix

Automatically makes symbols (gender, number, ...) in the bidix agree with the monodices
in the cases where the concordance can be resolved automatically beyond doubt.
 -leftMon and -rightMon specify the monodices file names. If not specified they will be guessed according to default naming schemes
 -prefix Only concord entries starting with this list of comma-separated symbols. Default: -prefix n
 -replace Replace (remove) these symbols during processing. Default: m,f,mf,ut,nt,un
 -prepare attempts to detect and insert autoconcord data into the monodices

There are also a number of generic options.

If you don't provide an output filename, the new bidix is written to the original filename with a '.new' suffix appended.

Before you run it, it's a good idea to format your dictionary first, so that the later diff shows only the autoconcord changes:

$ apertium-dixtools format apertium-sv-da.sv-da.dix apertium-sv-da.sv-da.dix.formatted

Check if format is OK:

$ diff  apertium-sv-da.sv-da.dix apertium-sv-da.sv-da.dix.formatted | less

Then do autoconcord:

$ mv apertium-sv-da.sv-da.dix.formatted apertium-sv-da.sv-da.dix
$ apertium-dixtools autoconcord apertium-sv-da.sv-da.dix

And check if the autoconcord corrections are OK:

$ diff  apertium-sv-da.sv-da.dix apertium-sv-da.sv-da.dix.new | less

Working on other word classes than nouns

The default is to process only noun entries in the bidix (-prefix n). To process, for example, both nouns and adjectives, use -prefix n,adj
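
One way to picture the -prefix filter is the Python sketch below. It is an assumption-laden illustration rather than the tool's logic: matches_prefix is a hypothetical helper, and it assumes that "starting with" refers to the first symbol on the left side of the bidix entry.

```python
import xml.etree.ElementTree as ET


def matches_prefix(entry, prefixes="n"):
    """True if the first left-side symbol is one of the comma-separated prefixes."""
    wanted = set(prefixes.split(","))
    first = entry.find("p/l/s")  # first <s> inside <l>, or None if there is none
    return first is not None and first.get("n") in wanted


noun = ET.fromstring(
    '<e><p><l>avbrott<s n="n"/></l><r>afbrydelse<s n="n"/></r></p></e>')
# Made-up adjective entry, used only to exercise the filter.
adjective = ET.fromstring(
    '<e><p><l>stor<s n="adj"/></l><r>stor<s n="adj"/></r></p></e>')

print(matches_prefix(noun))                # True
print(matches_prefix(adjective))           # False
print(matches_prefix(adjective, "n,adj"))  # True
```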

Preparation of a language pair to use autoconcord

Manually putting autoconcord comments in the paradigms can take some time. If you don't want to do it by hand, dixtools can do some of the work for you.

Here is an example:

$ apertium-dixtools autoconcord -prepare -prefix n -replace m,f,mf,ut,nt,NUMBER:sgpl{sg+pl},NUMBER:sp apertium-sv-da.sv-da.dix

As this command is very seldom used, you may want to check the source code, and perhaps even modify it. The relevant method is prepareBidixAndMonodixes() in the file AutoconcordBidix.java.