Dictionary coverage
Revision as of 17:29, 3 September 2011
Apertium-dixtools includes an experimental tool that gathers frequency statistics on a dictionary, showing among other things which entries are used and which are not. It works on normal entries as well as on all entries in paradigms.
This is useful, for instance, if you have a language pair for which you want to develop the opposite direction. The tool was used successfully on English and Esperanto to make the eo-en direction. Please contact me (--Jacob Nordfalk 18:07, 3 November 2009 (UTC)) for help using it.
Here is how it could be applied to sv-da to analyse how much of the Danish dictionary is actually used (it contains many entries that are not used from sv to da).
First remove duplicates in dixes, e.g. do:
$ apertium-dixtools fix -alignBidix apertium-sv-da.sv-da.dix apertium-sv-da.sv-da.dix
$ apertium-dixtools fix -alignMonodix apertium-sv-da.da.dix apertium-sv-da.da.dix
$ apertium-dixtools fix -alignMonodix apertium-sv-da.sv.dix apertium-sv-da.sv.dix
Then make a copy of the pair in a profiler/ subdirectory:
$ mkdir profiler/
$ cp * profiler/
Then, create a "profiler" version of your dictionaries:
$ apertium-dixtools profilecreate . sv-da
This will create a file dixtools-profilekeys.txt and overwrite the profiler/*.dix dictionaries. Now do
$ cd profiler
$ make
Then edit the modes.xml and add a mode where you replace
<program name="lt-proc $1">
  <file name="sv-da.autogen.bin"/>
</program>
with the profiling version of the dictionary, followed by an apertium-dixtools profilecollect step, which collects the data (saving it to dixtools-profiledata.txt) and filters the output for the following stages in the mode file:
<program name="lt-proc $1">
  <file name="profiler/sv-da.autogen.bin"/>
</program>
<program name="apertium-dixtools profilecollect">
  <file name="dixtools-profiledata.txt"/>
</program>
Use the mode, for example on a corpus and/or with your favorite testvoc script, and then generate the results:
$ apertium-dixtools profileresult
Reading dixtools-profilekeys.txt
Reading dixtools-profiledata.txt
Writing dixtools-profileresult.txt
How it works
The tool adds a key to every entry in the dixes. So
<pardef n="b/urde__vbmod">
  <e> <p><l>urde</l> <r>urde<s n="vbmod"/><s n="inf"/><s n="actv"/></r></p></e>
  <e> <p><l>ør</l> <r>urde<s n="vbmod"/><s n="pres"/><s n="actv"/></r></p></e>
  <e> <p><l>urde</l> <r>urde<s n="vbmod"/><s n="past"/><s n="actv"/></r></p></e>
  <e> <p><l>urdet</l> <r>urde<s n="vbmod"/><s n="pp"/></r></p></e>
</pardef>
<pardef n="må/tte__vaux">
  <e> <p><l>tte</l> <r>tte<s n="vaux"/><s n="inf"/><s n="actv"/></r></p></e>
  <e> <p><l></l> <r>tte<s n="vaux"/><s n="pres"/><s n="actv"/></r></p></e>
  <e> <p><l>tte</l> <r>tte<s n="vaux"/><s n="past"/><s n="actv"/></r></p></e>
  <e> <p><l>ttet</l> <r>tte<s n="vaux"/><s n="pp"/></r></p></e>
</pardef>
becomes
<pardef n="b/urde__vbmod">
  <e> <p><l>urde</l> <r>urde<s n="vbmod"/><s n="inf"/><s n="actv"/></r></p><p><l>%bct%</l><r/></p></e>
  <e> <p><l>ør</l> <r>urde<s n="vbmod"/><s n="pres"/><s n="actv"/></r></p><p><l>%bcu%</l><r/></p></e>
  <e> <p><l>urde</l> <r>urde<s n="vbmod"/><s n="past"/><s n="actv"/></r></p><p><l>%bcv%</l><r/></p></e>
  <e> <p><l>urdet</l> <r>urde<s n="vbmod"/><s n="pp"/></r></p><p><l>%bcw%</l><r/></p></e>
</pardef>
<pardef n="må/tte__vaux">
  <e> <p><l>tte</l> <r>tte<s n="vaux"/><s n="inf"/><s n="actv"/></r></p><p><l>%bcx%</l><r/></p></e>
  <e> <p><l></l> <r>tte<s n="vaux"/><s n="pres"/><s n="actv"/></r></p><p><l>%bcy%</l><r/></p></e>
  <e> <p><l>tte</l> <r>tte<s n="vaux"/><s n="past"/><s n="actv"/></r></p><p><l>%bcz%</l><r/></p></e>
  <e> <p><l>ttet</l> <r>tte<s n="vaux"/><s n="pp"/></r></p><p><l>%bd0%</l><r/></p></e>
</pardef>
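To make the role of these keys concrete, here is a minimal Python sketch of what a collection step could do with them: count every key that shows up in the text stream and strip it again so later stages see clean output. This is not the actual implementation of apertium-dixtools profilecollect. The function name collect_keys, the marker pattern and the sample sentence are all assumptions made for illustration, based only on the %bct% style keys shown above.

```python
import re
from collections import Counter

# Assumption: keys appear in the stream as %key% markers, where the key
# consists of lowercase letters and digits (e.g. %bct%, %bd0%).
KEY_MARKER = re.compile(r"%([0-9a-z]+)%")


def collect_keys(text):
    """Return the text with key markers removed, plus a count per key.

    Hypothetical helper; the real profilecollect may work differently.
    """
    counts = Counter(KEY_MARKER.findall(text))
    cleaned = KEY_MARKER.sub("", text)
    return cleaned, counts


# Made-up sample of output that still carries profiling keys.
sample = "jeg må%bcy% gå nu, du må%bcy% blive"
cleaned, counts = collect_keys(sample)

print(cleaned)        # text without the markers
print(counts["bcy"])  # number of times the %bcy% entry was hit
```

Counting and stripping in one pass keeps the profiling mode usable as an ordinary translation pipeline, since the stages after the collector never see the keys.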
The dixtools-profileresult.txt contains
0 bct <e><l>urde</l><r>urde<vbmod><inf><actv></r></e>
0 bcu <e><l>ør</l><r>urde<vbmod><pres><actv></r></e>
0 bcv <e><l>urde</l><r>urde<vbmod><past><actv></r></e>
0 bcw <e><l>urdet</l><r>urde<vbmod><pp></r></e>
1 bcx <e><l>tte</l><r>tte<vaux><inf><actv></r></e>
41 bcy <e><l></l><r>tte<vaux><pres><actv></r></e>
1 bcz <e><l>tte</l><r>tte<vaux><past><actv></r></e>
1 bd0 <e><l>ttet</l><r>tte<vaux><pp></r></e>
so you can see that the first paradigm isn't used at all, and that in the second paradigm the %bcy% entry was used 41 times and the other entries once each.
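For a large dictionary it can help to summarise the result file instead of reading it by eye. The following Python sketch lists unused entries and computes a simple coverage ratio. It assumes that dixtools-profileresult.txt holds one record per line in the form "count key entry", separated by whitespace, as in the excerpt above; the exact file layout and the helper names parse_result and coverage are assumptions, not part of apertium-dixtools.

```python
def parse_result(lines):
    """Parse 'count key entry' records into (count, key, entry) tuples.

    Assumed layout: one record per line, fields separated by whitespace.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        count, key, entry = line.split(None, 2)
        entries.append((int(count), key, entry))
    return entries


def coverage(entries):
    """Split entries into used and unused, and return the used fraction."""
    used = [e for e in entries if e[0] > 0]
    unused = [e for e in entries if e[0] == 0]
    ratio = len(used) / len(entries) if entries else 0.0
    return used, unused, ratio


# The excerpt from dixtools-profileresult.txt shown above.
sample = """\
0 bct <e><l>urde</l><r>urde<vbmod><inf><actv></r></e>
0 bcu <e><l>ør</l><r>urde<vbmod><pres><actv></r></e>
0 bcv <e><l>urde</l><r>urde<vbmod><past><actv></r></e>
0 bcw <e><l>urdet</l><r>urde<vbmod><pp></r></e>
1 bcx <e><l>tte</l><r>tte<vaux><inf><actv></r></e>
41 bcy <e><l></l><r>tte<vaux><pres><actv></r></e>
1 bcz <e><l>tte</l><r>tte<vaux><past><actv></r></e>
1 bd0 <e><l>ttet</l><r>tte<vaux><pp></r></e>
"""

entries = parse_result(sample.splitlines())
used, unused, ratio = coverage(entries)

print(f"{len(used)} of {len(entries)} entries used ({ratio:.0%})")
for count, key, entry in unused:
    print("unused:", key, entry)
```

To run it on a real result file, pass the open file object instead of the sample, for example parse_result(open("dixtools-profileresult.txt", encoding="utf-8")).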