Weighted transfer rules
Latest revision as of 14:43, 17 April 2018
Related links
Nikita Medyankin's project at GSoC 2016
https://github.com/apertium/apertium-weights-learner/tree/629b48b306116565bc1d748c298bc28b41506f63
https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/
Fixes
Nikita's code should now work. To run it, download apertium-weights-learner from https://github.com/apertium/apertium-weights-learner/tree/experimental, the English-Spanish language pair with ambiguous rules from https://github.com/apertium/apertium-en-es/tree/ambiguous-rules, and the Apertium core with the modified transfer module from https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/apertium/.
Coverages
The number of all possible coverages was calculated 100 times for 100 random sentences for 5 language pairs.
language pair | corpus | mean number of coverages |
English - Spanish | Tatoeba | 3.72 |
English - Spanish | Europarl | 194.35 |
Spanish - Catalan | Tatoeba | 2.94 |
Spanish - Catalan | Europarl | 53.04 |
Basque - Spanish | Tatoeba | 9.19 |
Swedish - Norwegian | Europarl | 488.57 |
Crimean Tatar - Turkish | Crimean Tatar Wikipedia | 3.12 |
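The page does not define "coverage", so the following is only a minimal sketch under an assumption: a coverage is taken to be one way of splitting a tagged sentence into consecutive chunks, where each chunk is either a single token or a span matched by a multi-token rule pattern. The function name `count_coverages` and the toy tag names are hypothetical and are not taken from apertium-weights-learner. Under that assumption the count grows quickly with sentence length, which would be consistent with the Europarl figures being much larger than the Tatoeba ones.

```python
def count_coverages(tags, patterns):
    """Count the ways to split `tags` into consecutive chunks.

    Assumption (not from the original page): each chunk is either a single
    token or a span that exactly matches one multi-token pattern in `patterns`.
    """
    n = len(tags)
    ways = [0] * (n + 1)
    ways[0] = 1  # the empty prefix can be covered in exactly one way
    for end in range(1, n + 1):
        # Option 1: the token ending at `end` stands alone.
        ways[end] += ways[end - 1]
        # Option 2: a multi-token pattern ends exactly at `end`.
        for pat in patterns:
            start = end - len(pat)
            if len(pat) > 1 and start >= 0 and tuple(tags[start:end]) == pat:
                ways[end] += ways[start]
    return ways[n]


# Toy patterns mirroring (adjective, noun) and (determiner, adjective, noun).
toy_patterns = [("adj", "n"), ("det", "adj", "n")]
print(count_coverages(["det", "adj", "n"], toy_patterns))  # 3
```

The three coverages counted here are [det][adj][n], [det][adj n] and [det adj n].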
Experiment
The sample file new-software-sample.txt contains three selected lines with the patterns 'new software' and 'this new software'. Each pattern triggers a pair of ambiguous rules from the apertium-en-es.en-es.t1x file, namely ['adj-nom', 'adj-nom-ns'] and ['det-adj-nom', 'det-adj-nom-ns']. Informally, these rules transfer sequences of (adjective, noun) and (determiner, adjective, noun). The first rule in each ambiguous pair swaps the translations of the adjective and the noun, which is the usual order in Spanish; these rules therefore come before their '-ns' counterparts in the file, which marks them as the defaults. The second rule in each pair leaves the translations of the adjective and the noun unswapped, which is sometimes required, depending on the lexical units involved.
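The difference between the two rules in an ambiguous pair can be sketched as follows. This is only an illustration: the Python functions `adj_nom` and `adj_nom_ns` are hypothetical stand-ins named after the rules, and the real rules are written in the t1x transfer format, not in Python.

```python
def adj_nom(adj, noun):
    """Stand-in for the default rule: swap, so the noun precedes the adjective."""
    return [noun, adj]


def adj_nom_ns(adj, noun):
    """Stand-in for the '-ns' rule: keep the source order (adjective, noun)."""
    return [adj, noun]


# The default rule comes first, as in the t1x file.
AMBIGUOUS_PAIR = [adj_nom, adj_nom_ns]


def candidates(adj, noun):
    """Return the output of every rule in the ambiguous pair for one input."""
    return {rule.__name__: " ".join(rule(adj, noun)) for rule in AMBIGUOUS_PAIR}


print(candidates("nuevo", "software"))
# {'adj_nom': 'software nuevo', 'adj_nom_ns': 'nuevo software'}
```

Both outputs are candidate translations of the same input; the weights decide which rule is applied.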
The contents of new-software-sample.txt look like the following:
Mr Stephen said the council had agreed to consider new software which would make the test more difficult.
What's Next: Simonyi's new software writes its own code
This new software makes it easier to get a movie done quickly, though harder to get it done well.
The contents of the unpruned w1x file without generalizing patterns should look like the following:
<?xml version='1.0' encoding='UTF-8'?>
<transfer-weights>
  <rule-group>
    <rule comment="REGLA: ADJ NOM no-swap-version" id="1" md5="64121bebaee1b179cfc0002db6b06fc3">
      <pattern weight="1.625228556310039">
        <pattern-item tags="adj.sint"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="1.625228556310039">
        <pattern-item tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
      <pattern weight="1.625228556310039">
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="1.625228556310039">
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
    </rule>
    <rule comment="REGLA: ADJ NOM" id="2" md5="8eed4b8aee5567fcfebc0de7698f4bdb">
      <pattern weight="0.3747714436899609">
        <pattern-item tags="adj.sint"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.3747714436899609">
        <pattern-item tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
      <pattern weight="0.3747714436899609">
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.3747714436899609">
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
    </rule>
  </rule-group>
  <rule-group>
    <rule comment="REGLA: DET ADJ NOM no-swap-version" id="3" md5="05d8b437ee595c7d0c992c5ae066a199">
      <pattern weight="0.9844006834162787">
        <pattern-item tags="det.dem.sg"/>
        <pattern-item tags="adj.sint"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.9844006834162787">
        <pattern-item tags="det.dem.sg"/>
        <pattern-item tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
      <pattern weight="0.9844006834162787">
        <pattern-item tags="det.dem.sg"/>
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.9844006834162787">
        <pattern-item tags="det.dem.sg"/>
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
      <pattern weight="0.9376183345269524">
        <pattern-item tags="det.pos.sp"/>
        <pattern-item tags="adj"/>
        <pattern-item lemma="code" tags="n.sg"/>
      </pattern>
      <pattern weight="0.9376183345269524">
        <pattern-item tags="det.pos.sp"/>
        <pattern-item tags="adj"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.9376183345269524">
        <pattern-item tags="det.pos.sp"/>
        <pattern-item lemma="own" tags="adj"/>
        <pattern-item lemma="code" tags="n.sg"/>
      </pattern>
      <pattern weight="0.9376183345269524">
        <pattern-item tags="det.pos.sp"/>
        <pattern-item lemma="own" tags="adj"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.9376183345269524">
        <pattern-item lemma="its" tags="det.pos.sp"/>
        <pattern-item tags="adj"/>
        <pattern-item lemma="code" tags="n.sg"/>
      </pattern>
      <pattern weight="0.9376183345269524">
        <pattern-item lemma="its" tags="det.pos.sp"/>
        <pattern-item tags="adj"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.9376183345269524">
        <pattern-item lemma="its" tags="det.pos.sp"/>
        <pattern-item lemma="own" tags="adj"/>
        <pattern-item lemma="code" tags="n.sg"/>
      </pattern>
      <pattern weight="0.9376183345269524">
        <pattern-item lemma="its" tags="det.pos.sp"/>
        <pattern-item lemma="own" tags="adj"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.9844006834162787">
        <pattern-item lemma="this" tags="det.dem.sg"/>
        <pattern-item tags="adj.sint"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.9844006834162787">
        <pattern-item lemma="this" tags="det.dem.sg"/>
        <pattern-item tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
      <pattern weight="0.9844006834162787">
        <pattern-item lemma="this" tags="det.dem.sg"/>
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.9844006834162787">
        <pattern-item lemma="this" tags="det.dem.sg"/>
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
    </rule>
    <rule comment="REGLA: DET ADJ NOM" id="4" md5="87fb69c4cd8792f06e0b51c6fd79f127">
      <pattern weight="0.0155993165837215">
        <pattern-item tags="det.dem.sg"/>
        <pattern-item tags="adj.sint"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.0155993165837215">
        <pattern-item tags="det.dem.sg"/>
        <pattern-item tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
      <pattern weight="0.0155993165837215">
        <pattern-item tags="det.dem.sg"/>
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.0155993165837215">
        <pattern-item tags="det.dem.sg"/>
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
      <pattern weight="0.06238166547304746">
        <pattern-item tags="det.pos.sp"/>
        <pattern-item tags="adj"/>
        <pattern-item lemma="code" tags="n.sg"/>
      </pattern>
      <pattern weight="0.06238166547304746">
        <pattern-item tags="det.pos.sp"/>
        <pattern-item tags="adj"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.06238166547304746">
        <pattern-item tags="det.pos.sp"/>
        <pattern-item lemma="own" tags="adj"/>
        <pattern-item lemma="code" tags="n.sg"/>
      </pattern>
      <pattern weight="0.06238166547304746">
        <pattern-item tags="det.pos.sp"/>
        <pattern-item lemma="own" tags="adj"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.06238166547304746">
        <pattern-item lemma="its" tags="det.pos.sp"/>
        <pattern-item tags="adj"/>
        <pattern-item lemma="code" tags="n.sg"/>
      </pattern>
      <pattern weight="0.06238166547304746">
        <pattern-item lemma="its" tags="det.pos.sp"/>
        <pattern-item tags="adj"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.06238166547304746">
        <pattern-item lemma="its" tags="det.pos.sp"/>
        <pattern-item lemma="own" tags="adj"/>
        <pattern-item lemma="code" tags="n.sg"/>
      </pattern>
      <pattern weight="0.06238166547304746">
        <pattern-item lemma="its" tags="det.pos.sp"/>
        <pattern-item lemma="own" tags="adj"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.0155993165837215">
        <pattern-item lemma="this" tags="det.dem.sg"/>
        <pattern-item tags="adj.sint"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.0155993165837215">
        <pattern-item lemma="this" tags="det.dem.sg"/>
        <pattern-item tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
      <pattern weight="0.0155993165837215">
        <pattern-item lemma="this" tags="det.dem.sg"/>
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item tags="n.sg"/>
      </pattern>
      <pattern weight="0.0155993165837215">
        <pattern-item lemma="this" tags="det.dem.sg"/>
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
    </rule>
  </rule-group>
</transfer-weights>
These weights mean that the 'no-swap' version of each rule is preferred for every pattern. This tells the transfer module not to swap the translations of 'new' and 'software' (as the '-ns' versions of both rules specify), because the Spanish adjective 'nuevo' is usually placed before the noun, whereas most Spanish adjectives follow the noun.
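Reading such a file and finding the preferred rule for a pattern can be sketched as follows. This is a minimal sketch, not the transfer module's actual lookup: the helper `preferred_rules` is hypothetical, it matches patterns by exact equality only, and it assumes that within a rule group the rule with the highest weight for a pattern wins. The embedded XML is a two-rule excerpt of the file above.

```python
import xml.etree.ElementTree as ET

# Excerpt of the w1x file above: one rule group, one fully lexicalised pattern per rule.
W1X_EXCERPT = """<transfer-weights>
  <rule-group>
    <rule comment="REGLA: ADJ NOM no-swap-version" id="1" md5="64121bebaee1b179cfc0002db6b06fc3">
      <pattern weight="1.625228556310039">
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
    </rule>
    <rule comment="REGLA: ADJ NOM" id="2" md5="8eed4b8aee5567fcfebc0de7698f4bdb">
      <pattern weight="0.3747714436899609">
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
    </rule>
  </rule-group>
</transfer-weights>"""


def preferred_rules(xml_text, pattern):
    """Return one (rule id, weight) per rule group: the heaviest rule whose
    pattern equals `pattern`, a list of (lemma or None, tags) tuples.

    Hypothetical helper; exact-match lookup is an assumption of this sketch.
    """
    root = ET.fromstring(xml_text)
    winners = []
    for group in root.iter("rule-group"):
        best = None
        for rule in group.iter("rule"):
            for pat in rule.iter("pattern"):
                items = [(item.get("lemma"), item.get("tags"))
                         for item in pat.iter("pattern-item")]
                if items != pattern:
                    continue
                weight = float(pat.get("weight"))
                if best is None or weight > best[1]:
                    best = (rule.get("id"), weight)
        if best is not None:
            winners.append(best)
    return winners


print(preferred_rules(W1X_EXCERPT, [("new", "adj.sint"), ("software", "n.sg")]))
# [('1', 1.625228556310039)]
```

Rule 1 is the no-swap version of the (adjective, noun) rule, so for 'new software' the sketch selects the unswapped order, in line with the explanation above.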