Difference between revisions of "Autoconcord"
Revision as of 12:45, 25 October 2010
Making the bidix concord with the monodices
The apertium-dixtools package contains a tool that automatically makes symbols (gender, number, ...) in the bidix agree with the monodices.
How does it work?
Some preparations are needed.
The tool looks in the monodices for a special autoconcord comment (the c attribute) in the paradigms:
<pardef n="ackord__n" c="autoconcord:nt,sp">
  <e> <p><l></l> <r><s n="n"/><s n="nt"/><s n="sp"/><s n="ind"/></r></p></e>
  <e> <p><l>et</l> <r><s n="n"/><s n="nt"/><s n="sg"/><s n="def"/></r></p></e>
  <e> <p><l>en</l> <r><s n="n"/><s n="nt"/><s n="pl"/><s n="def"/></r></p></e>
</pardef>
...
<e lm="avbrott"> <i>avbrott</i><par n="ackord__n"/></e>
This comment makes all entries using paradigm ackord__n have the autoconcord symbols 'nt' and 'sp'.
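The lookup described above can be sketched in Python. This is only an illustration of the idea, not the code used by apertium-dixtools (which is written in Java): the function names paradigm_symbols and lemma_symbols are made up for this example, and the monodix is trimmed down to one paradigm and one entry.

```python
import xml.etree.ElementTree as ET

# Trimmed-down monodix containing only the example paradigm and entry.
MONODIX = """
<dictionary>
  <pardefs>
    <pardef n="ackord__n" c="autoconcord:nt,sp">
      <e><p><l></l><r><s n="n"/><s n="nt"/><s n="sp"/><s n="ind"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="avbrott"><i>avbrott</i><par n="ackord__n"/></e>
  </section>
</dictionary>
"""

PREFIX = "autoconcord:"


def paradigm_symbols(root):
    """Map paradigm name -> autoconcord symbols taken from its c attribute."""
    result = {}
    for pardef in root.iter("pardef"):
        comment = pardef.get("c", "")
        if comment.startswith(PREFIX):
            result[pardef.get("n")] = comment[len(PREFIX):].split(",")
    return result


def lemma_symbols(root):
    """Map lemma -> autoconcord symbols inherited from the entry's paradigm."""
    by_paradigm = paradigm_symbols(root)
    result = {}
    for entry in root.iter("e"):
        lemma = entry.get("lm")  # entries inside pardefs have no lm and are skipped
        par = entry.find("par")
        if lemma and par is not None and par.get("n") in by_paradigm:
            result[lemma] = by_paradigm[par.get("n")]
    return result


print(lemma_symbols(ET.fromstring(MONODIX)))
```

For the example above this yields the symbols nt and sp for the lemma avbrott.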
The bidix contains
<e><p><l>avbrott<s n="n"/></l><r>afbrydelse<s n="n"/></r></p></e>
The right monodix has the autoconcord symbols 'ut' and 'sgpl' for the lemma:
<pardef n="abe__n" c="autoconcord:ut,sgpl">
  <e> <p><l></l> <r><s n="n"/><s n="ut"/><s n="sg"/><s n="ind"/></r></p></e>
  <e> <p><l>n</l> <r><s n="n"/><s n="ut"/><s n="sg"/><s n="def"/></r></p></e>
  <e> <p><l>r</l> <r><s n="n"/><s n="ut"/><s n="pl"/><s n="ind"/></r></p></e>
  <e> <p><l>rne</l> <r><s n="n"/><s n="ut"/><s n="pl"/><s n="def"/></r></p></e>
</pardef>
...
<e lm="afbrydelse"> <i>afbrydelse</i><par n="abe__n"/></e>
What does it do?
Autoconcord will try to make the autoconcord symbols of the left dix (nt,sp) concord with those of the right dix (ut,sgpl). It does so by pairing them one by one: nt-ut and sp-sgpl. It then searches the bidix for paradigms carrying the special autoconcord comments "autoconcord:nt-ut" and "autoconcord:sp-sgpl":
<pardef n="_nt_ut" c="autoconcord:nt-ut">
  <e> <p><l><s n="nt"/></l><r><s n="ut"/></r></p></e>
</pardef>
<pardef n="_sp_sgpl" c="autoconcord:sp-sgpl">
  <e r="LR"><p><l><s n="sp"/><s n="ind"/></l><r><s n="ND"/><s n="ind"/></r></p></e>
  <e r="RL"><p><l><s n="sp"/><s n="ind"/></l><r><s n="sg"/><s n="ind"/></r></p></e>
  <e r="RL"><p><l><s n="sp"/><s n="ind"/></l><r><s n="pl"/><s n="ind"/></r></p></e>
  <e> <p><l><s n="sg"/><s n="def"/></l><r><s n="sg"/><s n="def"/></r></p></e>
  <e> <p><l><s n="pl"/><s n="def"/></l><r><s n="pl"/><s n="def"/></r></p></e>
</pardef>
and then it will change the bidix entry from
<e><p><l>avbrott<s n="n"/></l><r>afbrydelse<s n="n"/></r></p></e>
to include the autoconcord paradigms in the bidix:
<e><p><l>avbrott<s n="n"/></l><r>afbrydelse<s n="n"/></r></p><par n="_nt_ut"/><par n="_sp_sgpl"/></e>
Note: The _ prefix in the pardef names has no special meaning; it just makes them easy to distinguish. The pardef names used for autoconcord can be anything.
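The pairing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the names pair_symbols and paradigms_to_add are hypothetical, the symbol lists are assumed to have the same length, and returning None when a pairing paradigm is missing is a choice made for this sketch only (the page does not say what the tool does in that case).

```python
def pair_symbols(left, right):
    """Pair autoconcord symbols position by position: nt-ut, sp-sgpl, ..."""
    return [f"{l}-{r}" for l, r in zip(left, right)]


def paradigms_to_add(left, right, bidix_paradigms):
    """Return the bidix paradigm name for every pair, or None if one is missing."""
    names = []
    for pair in pair_symbols(left, right):
        if pair not in bidix_paradigms:
            # Assumption of this sketch: give up on entries that cannot be resolved.
            return None
        names.append(bidix_paradigms[pair])
    return names


# Pair (from the autoconcord comment) -> paradigm name, as in the bidix pardefs above.
BIDIX_PARADIGMS = {"nt-ut": "_nt_ut", "sp-sgpl": "_sp_sgpl"}

names = paradigms_to_add(["nt", "sp"], ["ut", "sgpl"], BIDIX_PARADIGMS)
print("".join(f'<par n="{name}"/>' for name in names))
```

The printed paradigm references are the ones appended to the avbrott entry in the example above.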
Variations
Some autoconcord paradigms are not really useful to insert. For example, sp-sp and sgpl-sgpl are trivial. You can avoid insertion of these paradigms by appending /omit to their autoconcord comments in the bidix:
<pardef n="_sgpl_sgpl" c="autoconcord:sgpl-sgpl/omit">
  <e> <i></i></e>
</pardef>
<pardef n="_sp_sp" c="autoconcord:sp-sp/omit">
  <e> <i></i></e>
</pardef>
If you want to 'inline' a paradigm, that is, have the paradigm's symbols expanded directly in the entry, add /expand to the autoconcord comment:
<pardef n="_nt_ut" c="autoconcord:nt-ut/expand">
  <e> <p><l><s n="nt"/></l><r><s n="ut"/></r></p></e>
</pardef>
Then the corrected bidix entry will have the nt and ut symbols inlined instead of a reference to _nt_ut:
<e><p><l>avbrott<s n="n"/><s n="nt"/></l><r>afbrydelse<s n="n"/><s n="ut"/></r></p><par n="_sp_sgpl"/></e>
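How the /omit and /expand suffixes might be read from the bidix comments is sketched below. The functions parse_bidix_comment and plan are hypothetical names for this illustration; the real parsing code in apertium-dixtools may differ.

```python
def parse_bidix_comment(comment):
    """Split 'autoconcord:nt-ut/expand' into ('nt', 'ut', 'expand')."""
    prefix = "autoconcord:"
    if not comment.startswith(prefix):
        return None
    pair, _, option = comment[len(prefix):].partition("/")
    left, _, right = pair.partition("-")
    return left, right, option or None  # None means no suffix was given


def plan(comments):
    """Sort pairing paradigms into those referenced with <par> and those expanded."""
    insert, expand = [], []
    for name, comment in comments.items():
        parsed = parse_bidix_comment(comment)
        if parsed is None or parsed[2] == "omit":
            continue  # not an autoconcord paradigm, or marked as trivial
        (expand if parsed[2] == "expand" else insert).append(name)
    return insert, expand


COMMENTS = {
    "_nt_ut": "autoconcord:nt-ut/expand",
    "_sp_sgpl": "autoconcord:sp-sgpl",
    "_sp_sp": "autoconcord:sp-sp/omit",
}
print(plan(COMMENTS))
```

With these three comments, _sp_sgpl would be inserted as a paradigm reference, _nt_ut would be expanded in place, and _sp_sp would be left out.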
The -replace parameter
During processing of the bidix entries autoconcord will first delete all paradigms and the symbols to be replaced (usually gender symbols like m, f, nt and ut). This is to support inlining/expansions of the symbols as explained above.
The -replace parameter specifies which symbols should be deleted if they appear in an entry. Default value is 'm,f,mf,ut,nt,un'.
If you are the unlucky owner of a language pair where you must maintain the synthetic adjective tag (<sint>) in the bidix, you could write autoconcord rules to fix that (i.e. adding/removing the <sint> in the bidix automatically). In that case you would, among other things, pass -replace sint as a parameter.
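The cleaning step that -replace controls can be sketched like this. It is only an illustration of the described behaviour, assuming that "all paradigms" refers to the <par> references directly inside the bidix entry; strip_entry is a hypothetical name, not a function of apertium-dixtools.

```python
import xml.etree.ElementTree as ET

DEFAULT_REPLACE = "m,f,mf,ut,nt,un"  # default value quoted above


def strip_entry(entry_xml, replace=DEFAULT_REPLACE):
    """Remove <par> references and the listed symbols from one bidix entry."""
    symbols = set(replace.split(","))
    entry = ET.fromstring(entry_xml)
    for par in entry.findall("par"):
        entry.remove(par)
    # Take a snapshot of the tree before removing children from <l> and <r>.
    for side in list(entry.iter()):
        if side.tag in ("l", "r"):
            for s in side.findall("s"):
                if s.get("n") in symbols:
                    side.remove(s)
    return entry


ENTRY = ('<e><p><l>avbrott<s n="n"/><s n="nt"/></l>'
         '<r>afbrydelse<s n="n"/><s n="ut"/></r></p><par n="_sp_sgpl"/></e>')
cleaned = strip_entry(ENTRY)
print(ET.tostring(cleaned, encoding="unicode"))
```

After this step only the n symbol remains on each side, so the entry can be concorded again from scratch, whether the previous run inserted paradigm references or expanded symbols.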
Invocation
Usage: apertium-dixtools autoconcord [-prefix symbol(s)] [-replace symbols] [-leftMon mon1.dix] [-rightMon mon1.dix] bidix.dix [output.dix]
       autoconcord -prepare [-leftMon mon1.dix] [-rightMon mon1.dix] bidix.dix

Automatically makes symbols (gender, number, ...) in the bidix agree with the monodices
in the cases where the concordance beyound doubt can be resolved automatically.

  -leftMon and -rightMon specify the monodices file names. If not specified they will be guessed according to default naming schemes
  -prefix   Only concord entries starting with this list of comma-separated symbols. Default: -prefix n
  -replace  Replace (remove) these symbols during processing. Default: m,f,mf,ut,nt,un
  -prepare  attempts to detect and insert autoconcord data into the monodices,
There are also a number of generic options.
If you don't provide an output filename, the new bidix will be written to the original file name with a '.new' suffix.
Before running autoconcord, it is a good idea to format your dictionary first:
$ apertium-dixtools format apertium-sv-da.sv-da.dix apertium-sv-da.sv-da.dix.formatted
Check if format is OK:
$ diff apertium-sv-da.sv-da.dix apertium-sv-da.sv-da.dix.formatted | less
$ mv apertium-sv-da.sv-da.dix.formatted apertium-sv-da.sv-da.dix
$ apertium-dixtools autoconcord apertium-sv-da.sv-da.dix
Check if the autoconcord corrections are OK:
$ diff apertium-sv-da.sv-da.dix apertium-sv-da.sv-da.dix.new | less
Working on other word classes than nouns
The default is to process only noun entries in the bidix (-prefix n). To process, for example, both nouns and adjectives, use -prefix n,adj
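One way to picture the -prefix filter is sketched below. It assumes that an entry qualifies when the first symbol on its left side is one of the comma-separated values; that reading is an assumption based on the description above, and should_process is a hypothetical name. The adjective entry is invented for the example.

```python
import xml.etree.ElementTree as ET


def should_process(entry_xml, prefix="n"):
    """True if the first <s> on the left side is one of the -prefix symbols."""
    wanted = set(prefix.split(","))
    left = ET.fromstring(entry_xml).find("p/l")
    if left is None:
        return False
    first = left.find("s")
    return first is not None and first.get("n") in wanted


NOUN = '<e><p><l>avbrott<s n="n"/></l><r>afbrydelse<s n="n"/></r></p></e>'
ADJ = '<e><p><l>stor<s n="adj"/></l><r>stor<s n="adj"/></r></p></e>'  # invented entry

print(should_process(NOUN), should_process(ADJ), should_process(ADJ, "n,adj"))
```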
Preparation of a language pair to use autoconcord
Manually putting autoconcord comments in the paradigms can take some time. If you don't want to do it manually, dixtools can do some of the work for you.
Here is an example of how to do it:
$ apertium-dixtools autoconcord -prepare -prefix n -replace m,f,mf,ut,nt,NUMBER:sgpl{sg+pl},NUMBER:sp apertium-sv-da.sv-da.dix
As this command is very seldom used, you may want to check the source code, and perhaps even modify it. The relevant method is prepareBidixAndMonodixes() in the file AutoconcordBidix.java.