Dictionary coverage

Revision as of 11:23, 5 December 2019

En français

The experimental tool Apertium-dixtools can compute frequency statistics for a dictionary: it reports which entries have been used and which have not. It works on both normal entries and entries in paradigms.

This is useful, for example, if you have a language pair and want to develop the opposite translation direction. The tool has been used successfully on English and Esperanto to make the eo-en direction. Contact me (--Jacob Nordfalk, 18:07, 3 November 2009 (UTC)) for help using it.

Here is how it could be applied to sv-da to analyse how much of the Danish dictionary (which contains many entries not used from sv to da) is actually useful. The steps are:

  1. Remove duplicates in the dixes.
  2. Make a copy of the pair in the profiler/ subdirectory.
  3. Create a "profiler" version of your dictionaries.
  4. Edit modes.xml and add a mode.
  5. Use the mode.


Each step is explained below with an example:

  • To remove duplicates in dixes, do:
 apertium-dixtools fix -alignBidix apertium-sv-da.sv-da.dix apertium-sv-da.sv-da.dix
 apertium-dixtools fix -alignMonodix apertium-sv-da.da.dix apertium-sv-da.da.dix
 apertium-dixtools fix -alignMonodix apertium-sv-da.sv.dix apertium-sv-da.sv.dix
  • To make a copy of the pair in the profiler/ subdirectory, do:
 $ mkdir profiler/
 $ cp * profiler/


  • To create a "profiler" version of your dictionaries, do:
 $ apertium-dixtools profilecreate . sv-da 

This will create a file dixtools-profilekeys.txt and overwrite the profiler/*.dix dictionaries.

  • Then build the copy in profiler/:
 $ cd profiler
 $ make
  • Edit modes.xml and add a mode that uses the profiling version of the dictionary. In the new mode, replace
     <program name="lt-proc $1">
       <file name="sv-da.autogen.bin"/>
     </program>
with the following, where the apertium-dixtools profilecollect step inserted after lt-proc collects the data (saving it to dixtools-profiledata.txt) and filters the output for the following stages in the mode file:
     <program name="lt-proc $1">
       <file name="profiler/sv-da.autogen.bin"/>
     </program>
     <program name="apertium-dixtools profilecollect">
       <file name="dixtools-profiledata.txt"/>
     </program>
  • Use the mode, for example on a corpus and/or with your favorite test script, then generate the result file:
 $ apertium-dixtools profileresult
 Reading dixtools-profilekeys.txt
 Reading dixtools-profiledata.txt
 Writing dixtools-profileresult.txt
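
As a rough illustration of what the collecting step in the mode does conceptually, the following Python sketch counts key markers of the form %key% in a text stream and strips them from the output. It is not the apertium-dixtools implementation: the function name collect_keys, the sample text, and the assumption that keys appear inline in the lt-proc output as %...% are all made up for illustration.

```python
import re
from collections import Counter

# Assumed marker shape: lowercase letters/digits between percent signs,
# modelled on keys such as %bct% (an assumption, not the tool's specification).
KEY_MARKER = re.compile(r"%([0-9a-z]+)%")


def collect_keys(text):
    """Return (text with key markers removed, Counter of key occurrences)."""
    counts = Counter(KEY_MARKER.findall(text))
    cleaned = KEY_MARKER.sub("", text)
    return cleaned, counts


if __name__ == "__main__":
    # Hypothetical stream in which each generated form carries its key.
    sample = "må%bcy% burde%bct% må%bcy%"
    cleaned, counts = collect_keys(sample)
    print(cleaned)
    print(dict(counts))
```

The cleaned text is what later stages of the pipeline would see, while the counts are the kind of data a profiling run needs to record.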


How it works

The tool adds a unique key to every entry in the dixes. So

<pardef n="b/urde__vbmod">
  <e>       <p><l>urde</l>      <r>urde<s n="vbmod"/><s n="inf"/><s n="actv"/></r></p></e>
  <e>       <p><l>ør</l>        <r>urde<s n="vbmod"/><s n="pres"/><s n="actv"/></r></p></e>
  <e>       <p><l>urde</l>      <r>urde<s n="vbmod"/><s n="past"/><s n="actv"/></r></p></e>
  <e>       <p><l>urdet</l>     <r>urde<s n="vbmod"/><s n="pp"/></r></p></e>
</pardef>

<pardef n="må/tte__vaux">
  <e>       <p><l>tte</l>       <r>tte<s n="vaux"/><s n="inf"/><s n="actv"/></r></p></e>
  <e>       <p><l></l>          <r>tte<s n="vaux"/><s n="pres"/><s n="actv"/></r></p></e>
  <e>       <p><l>tte</l>       <r>tte<s n="vaux"/><s n="past"/><s n="actv"/></r></p></e>
  <e>       <p><l>ttet</l>      <r>tte<s n="vaux"/><s n="pp"/></r></p></e>
</pardef>

becomes

<pardef n="b/urde__vbmod">
  <e>       <p><l>urde</l>      <r>urde<s n="vbmod"/><s n="inf"/><s n="actv"/></r></p><p><l>%bct%</l><r/></p></e>
  <e>       <p><l>ør</l>        <r>urde<s n="vbmod"/><s n="pres"/><s n="actv"/></r></p><p><l>%bcu%</l><r/></p></e>
  <e>       <p><l>urde</l>      <r>urde<s n="vbmod"/><s n="past"/><s n="actv"/></r></p><p><l>%bcv%</l><r/></p></e>
  <e>       <p><l>urdet</l>     <r>urde<s n="vbmod"/><s n="pp"/></r></p><p><l>%bcw%</l><r/></p></e>
</pardef>

<pardef n="må/tte__vaux">
  <e>       <p><l>tte</l>       <r>tte<s n="vaux"/><s n="inf"/><s n="actv"/></r></p><p><l>%bcx%</l><r/></p></e>
  <e>       <p><l></l>          <r>tte<s n="vaux"/><s n="pres"/><s n="actv"/></r></p><p><l>%bcy%</l><r/></p></e>
  <e>       <p><l>tte</l>       <r>tte<s n="vaux"/><s n="past"/><s n="actv"/></r></p><p><l>%bcz%</l><r/></p></e>
  <e>       <p><l>ttet</l>      <r>tte<s n="vaux"/><s n="pp"/></r></p><p><l>%bd0%</l><r/></p></e>
</pardef>
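
The transformation above can be sketched in Python as follows. This is only an illustration, not the code used by apertium-dixtools: the helper names to_base36 and add_keys are hypothetical, and the base-36 key numbering is an assumption inferred from the sequence bct ... bcz, bd0 in the example.

```python
import xml.etree.ElementTree as ET

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n):
    """Render a non-negative integer in base 36 (digits 0-9, then a-z)."""
    if n == 0:
        return "0"
    out = ""
    while n > 0:
        n, rem = divmod(n, 36)
        out = DIGITS[rem] + out
    return out


def add_keys(pardef_xml, first_key):
    """Append <p><l>%key%</l><r/></p> to every <e>, counting keys upward."""
    pardef = ET.fromstring(pardef_xml)
    number = int(first_key, 36)  # assumed base-36 numbering
    for entry in pardef.iter("e"):
        pair = ET.SubElement(entry, "p")
        ET.SubElement(pair, "l").text = "%" + to_base36(number) + "%"
        ET.SubElement(pair, "r")
        number += 1
    return ET.tostring(pardef, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<pardef n="demo">'
        '<e><p><l>tte</l><r>tte<s n="vaux"/></r></p></e>'
        '<e><p><l>ttet</l><r>tte<s n="pp"/></r></p></e>'
        "</pardef>"
    )
    print(add_keys(sample, "bcz"))
```

Because the key sits in its own empty-right-side pair at the end of the entry, the original left and right sides of the entry are left unchanged.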

The resulting dixtools-profileresult.txt lists, for each key, how many times the entry was used:

0 bct <e><l>urde</l><r>urde<vbmod><inf><actv></r></e>
0 bcu <e><l>ør</l><r>urde<vbmod><pres><actv></r></e>
0 bcv <e><l>urde</l><r>urde<vbmod><past><actv></r></e>
0 bcw <e><l>urdet</l><r>urde<vbmod><pp></r></e>
1 bcx <e><l>tte</l><r>tte<vaux><inf><actv></r></e>
41 bcy <e><l></l><r>tte<vaux><pres><actv></r></e>
1 bcz <e><l>tte</l><r>tte<vaux><past><actv></r></e>
1 bd0 <e><l>ttet</l><r>tte<vaux><pp></r></e>

So you can see that the first paradigm isn't used at all, while in the second paradigm the %bcy% entry was used 41 times and the other entries once each.
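
For a large dictionary it may be easier to summarise the result file with a small script. The following Python sketch assumes that every line has the form "count key entry", as in the listing above; the function names parse_profile_result and coverage are hypothetical and not part of apertium-dixtools.

```python
def parse_profile_result(lines):
    """Parse 'count key entry' lines into (count, key, entry) tuples."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        count, key, entry = line.split(" ", 2)
        rows.append((int(count), key, entry))
    return rows


def coverage(rows):
    """Return (number of used entries, total entries, list of unused rows)."""
    unused = [row for row in rows if row[0] == 0]
    return len(rows) - len(unused), len(rows), unused


if __name__ == "__main__":
    # A few lines copied from the listing above; in practice the lines
    # would be read from dixtools-profileresult.txt.
    sample = [
        "0 bct <e><l>urde</l><r>urde<vbmod><inf><actv></r></e>",
        "0 bcw <e><l>urdet</l><r>urde<vbmod><pp></r></e>",
        "41 bcy <e><l></l><r>tte<vaux><pres><actv></r></e>",
        "1 bd0 <e><l>ttet</l><r>tte<vaux><pp></r></e>",
    ]
    used, total, unused = coverage(parse_profile_result(sample))
    print(f"{used} of {total} entries were used")
    for _count, key, entry in unused:
        print("unused:", key, entry)
```

The unused rows are the candidates to review when deciding which entries the sv-da direction never exercises.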