Dictionary coverage
Latest revision as of 13:33, 6 December 2019
Apertium-dixtools includes an experimental tool for computing frequency statistics on a dictionary: it reports which entries are used and which are not. It works on normal entries as well as on entries in paradigms.
The tool can be used to check entries in both directions. It has been used successfully on English and Esperanto to build the eo-en direction. Contact me (--Jacob Nordfalk, 18:07, November 3, 2009 (UTC)) for help on how to use the tool.
The method is explained here using sv-da as an example, to analyse how much of the Danish dictionary (which contains many entries not used when translating from sv to da) is actually useful. The step-by-step procedure is:
- Remove duplicates in the dictionaries.
- Make a copy of the pair in the profiler/ subdirectory.
- Create a "profiler" version of your dictionaries.
- Edit the modes.xml and add a mode.
- Use the mode.
Example:
- To remove duplicates in the dictionaries, do:
 apertium-dixtools fix -alignBidix apertium-sv-da.sv-da.dix apertium-sv-da.sv-da.dix
 apertium-dixtools fix -alignMonodix apertium-sv-da.da.dix apertium-sv-da.da.dix
 apertium-dixtools fix -alignMonodix apertium-sv-da.sv.dix apertium-sv-da.sv.dix
- To make a copy of the pair in the profiler/ subdirectory, do:
 $ mkdir profiler/
 $ cp * profiler/
- To create a "profiler" version of your dictionaries (the profiler versions are kept in the profiler/ subdirectory), do:
$ apertium-dixtools profilecreate . sv-da
This will create a file dixtools-profilekeys.txt and overwrite the profiler/*.dix dictionaries.
- Now do:
 $ cd profiler
 $ make
- To edit modes.xml and add a mode, select the place where the normal dictionary should be replaced by its profiling version. The task apertium-dixtools profilecollect must be inserted directly after that place.
- To collect the data (saving it to dixtools-profiledata.txt) and filter the output for the following stages in the mode file, replace the original program entry:
 <program name="lt-proc $1">
   <file name="sv-da.autogen.bin"/>
 </program>
with the profiling version, followed by the profilecollect task:
 <program name="lt-proc $1">
   <file name="profiler/sv-da.autogen.bin"/>
 </program>
 <program name="apertium-dixtools profilecollect">
   <file name="dixtools-profiledata.txt"/>
 </program>
- Use the mode, for example on a corpus and/or with your favorite test script, then produce the result file with:
 $ apertium-dixtools profileresult
 Reading dixtools-profilekeys.txt
 Reading dixtools-profiledata.txt
 Writing dixtools-profileresult.txt
How it works
When the profiler version is created, a key is added to every entry in the dictionaries. For example, take these paradigms:
 <pardef n="b/urde__vbmod">
   <e><p><l>urde</l><r>urde<s n="vbmod"/><s n="inf"/><s n="actv"/></r></p></e>
   <e><p><l>ør</l><r>urde<s n="vbmod"/><s n="pres"/><s n="actv"/></r></p></e>
   <e><p><l>urde</l><r>urde<s n="vbmod"/><s n="past"/><s n="actv"/></r></p></e>
   <e><p><l>urdet</l><r>urde<s n="vbmod"/><s n="pp"/></r></p></e>
 </pardef>
 <pardef n="må/tte__vaux">
   <e><p><l>tte</l><r>tte<s n="vaux"/><s n="inf"/><s n="actv"/></r></p></e>
   <e><p><l></l><r>tte<s n="vaux"/><s n="pres"/><s n="actv"/></r></p></e>
   <e><p><l>tte</l><r>tte<s n="vaux"/><s n="past"/><s n="actv"/></r></p></e>
   <e><p><l>ttet</l><r>tte<s n="vaux"/><s n="pp"/></r></p></e>
 </pardef>
This results in:
 <pardef n="b/urde__vbmod">
   <e><p><l>urde</l><r>urde<s n="vbmod"/><s n="inf"/><s n="actv"/></r></p><p><l>%bct%</l><r/></p></e>
   <e><p><l>ør</l><r>urde<s n="vbmod"/><s n="pres"/><s n="actv"/></r></p><p><l>%bcu%</l><r/></p></e>
   <e><p><l>urde</l><r>urde<s n="vbmod"/><s n="past"/><s n="actv"/></r></p><p><l>%bcv%</l><r/></p></e>
   <e><p><l>urdet</l><r>urde<s n="vbmod"/><s n="pp"/></r></p><p><l>%bcw%</l><r/></p></e>
 </pardef>
 <pardef n="må/tte__vaux">
   <e><p><l>tte</l><r>tte<s n="vaux"/><s n="inf"/><s n="actv"/></r></p><p><l>%bcx%</l><r/></p></e>
   <e><p><l></l><r>tte<s n="vaux"/><s n="pres"/><s n="actv"/></r></p><p><l>%bcy%</l><r/></p></e>
   <e><p><l>tte</l><r>tte<s n="vaux"/><s n="past"/><s n="actv"/></r></p><p><l>%bcz%</l><r/></p></e>
   <e><p><l>ttet</l><r>tte<s n="vaux"/><s n="pp"/></r></p><p><l>%bd0%</l><r/></p></e>
 </pardef>
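As a rough illustration of why such keys are useful: a key attached to an entry can be counted and then stripped from the text again. The sketch below is not the implementation of profilecollect. It assumes, hypothetically, that the keys surface in the generator output as %key% tokens made of lowercase letters and digits, and the sample stream is invented for illustration.

```shell
# Hypothetical generator output: each surface form carries the key of the
# entry that produced it (invented sample, not real tool output).
stream='burde%bct% må%bcy% må%bcy%'

# Count how often each key occurs (what a collecting step would need to record).
counts=$(printf '%s\n' "$stream" | grep -oE '%[a-z0-9]+%' | sort | uniq -c | awk '{print $1, $2}')

# Strip the keys so that later stages see ordinary text.
clean=$(printf '%s\n' "$stream" | sed -E 's/%[a-z0-9]+%//g')

printf '%s\n' "$counts"
printf '%s\n' "$clean"
```

Counting and filtering are kept as two separate passes here only to make each step visible.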
The dixtools-profileresult.txt now contains:
  0 bct <e><l>urde</l><r>urde<vbmod><inf><actv></r></e>
  0 bcu <e><l>ør</l><r>urde<vbmod><pres><actv></r></e>
  0 bcv <e><l>urde</l><r>urde<vbmod><past><actv></r></e>
  0 bcw <e><l>urdet</l><r>urde<vbmod><pp></r></e>
  1 bcx <e><l>tte</l><r>tte<vaux><inf><actv></r></e>
 41 bcy <e><l></l><r>tte<vaux><pres><actv></r></e>
  1 bcz <e><l>tte</l><r>tte<vaux><past><actv></r></e>
  1 bd0 <e><l>ttet</l><r>tte<vaux><pp></r></e>
As you can see, the first paradigm is not used at all, while in the second paradigm the entry with key %bcy% was used 41 times and the other entries once each.
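On a real dictionary the result file is long, so it may help to filter it with standard tools. The following is a minimal sketch, assuming each line has the whitespace-separated layout shown above (count, key, entry). The scratch directory and the variable names are only for illustration, and the sample lines are copied from the example.

```shell
# Work in a scratch directory; in practice the file is written by
# "apertium-dixtools profileresult".
workdir=$(mktemp -d)
result="$workdir/dixtools-profileresult.txt"

# Sample lines copied from the example above.
cat > "$result" <<'EOF'
0 bct <e><l>urde</l><r>urde<vbmod><inf><actv></r></e>
0 bcu <e><l>ør</l><r>urde<vbmod><pres><actv></r></e>
0 bcv <e><l>urde</l><r>urde<vbmod><past><actv></r></e>
0 bcw <e><l>urdet</l><r>urde<vbmod><pp></r></e>
1 bcx <e><l>tte</l><r>tte<vaux><inf><actv></r></e>
41 bcy <e><l></l><r>tte<vaux><pres><actv></r></e>
1 bcz <e><l>tte</l><r>tte<vaux><past><actv></r></e>
1 bd0 <e><l>ttet</l><r>tte<vaux><pp></r></e>
EOF

# Entries that were never used (count in the first column is 0).
unused=$(awk '$1 == 0' "$result")

# All entries, most frequently used first.
ranked=$(sort -k1,1nr "$result")

printf 'Unused entries:\n%s\n' "$unused"
printf 'Most used entry:\n%s\n' "$(printf '%s\n' "$ranked" | head -n 1)"
```

The unused entries are the candidates to review when deciding how much of the dictionary is actually needed for the direction being tested.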
- See also Monodix basics.
- See also List of symbols.