Difference between revisions of "User:Francis Tyers/Sandbox"
Revision as of 10:52, 3 June 2011
Lexical selection
Information
- Surface form -- tud etc.
- Lemma -- den etc.
- Category -- n.f etc.
- Syntax -- @SUBJ etc.
Ideas
For some things linguistic knowledge is better, or at least easier to apply, and it is also easier to hack on. For other things statistics are better: they give wider coverage more cheaply. The lexical selection module(s) should therefore allow both rules and statistics: rules for the things we "know", statistics for those we don't.
Inferring rules from collocations
Rules as described below are already used in apertium-cy-en, apertium-br-fr and apertium-sme-smj. This stage would be the first pass of lexical selection.
- The bilingual dictionary has several translations for each ambiguous word.
- Rules are created to select between them based on context.
- For each word in the bilingual dictionary, collocations (n-grams) are extracted from a source language corpus.
+ in, skyldi ég þá á munúð hyggja, þar sem bóndi minn er einnig gamall?`` + ,Drottinn hefir séð raunir mínar. Nú mun bóndi minn elska mig.`` þunguð og ól son. Þá sagði hún: ,,Nú mun bóndi minn loks hænast að mér, því að é ,,Guð hefir gefið mér góða gjöf. Nú mun bóndi minn búa við mig, því að ég hefi af, þá haldi hann bótum uppi, slíkum sem bóndi konunnar kveður á hann, og greiði l niður fyrir húsdyrum mannsins, þar sem bóndi hennar var inni, og lá þar, uns b - 27 En er bóndi hennar reis um morguninn og lauk + kubúinn hafi soltið til þess að franskur bóndi þurfi ekki
- For each ambiguous word, these collocations are run with each of the entries in the bilingual dictionary through the translator.
- not the translator but just the bilingual dictionary? --Mlforcada 10:30, 10 October 2009 (UTC)
- how wide is the window around the problem word? is it symmetrical? --Mlforcada 10:30, 10 October 2009 (UTC)
as my farmer is Now #remember my farmer love Now #remember my farmer the lid Now #remember my farmer live *slíkum as the woman's farmer composes as her farmer was But is her farmer rose to French farmer need not as my husband is Now #remember my husband love Now #remember my husband the lid Now #remember my husband live *slíkum as the woman's husband composes as her husband was But is her husband rose to French husband need not
- Translations are scored on a target language corpus. -- The target language model training corpora would need to be preprocessed in some cases, for example to give the word in its POS or syntactic context.
n _farmer_ prn.pos, n _husband_ prn.pos
etc. The number of target words would be limited to the number of correspondences in the bilingual dictionary.
- What do you mean by the number of target words? --Mlforcada 10:30, 10 October 2009 (UTC)
- Wouldn't it be similar to do this as in Sánchez-Martínez et al. (2008), that is, run all "disambiguations" through the dictionary and score the translations themselves? --Mlforcada 10:30, 10 October 2009 (UTC)
Vector Element0 : -6.13119,as my farmer
Vector Element1 : -1.5997,as my husband
Vector Element0 : -5.93468,Now remember my farmer
Vector Element1 : -3.19992,Now remember my husband
Vector Element0 : -6.13119,slíkum as my farmer
Vector Element1 : -1.5997,slíkum as my husband
Vector Element0 : -5.55918,as her farmer
Vector Element1 : -2.81087,as her husband
Vector Element0 : -5.58205,But is her farmer
Vector Element1 : -2.83373,But is her husband
Vector Element0 : -4.54752,to French farmer
Vector Element1 : -5.27222,to French husband
- Where the difference in score between one translation and another reaches a threshold, a rule is created in the form of:
MAP (husband) ("bóndi") IF (1 ("minn"));
- Morphology or syntax could also be included.
MAP (husband) ("bóndi") IF (1 PrnPos);
MAP (husband) ("bóndi") IF (-1 Genitive);
- It would be interesting to see if rules can be learnt which use different discriminators (e.g. surface form, syntax) etc.
- To select the winner, one could use a maximum-entropy approach in which the absence or presence of particular trigger words in the context would be treated as a feature. Then the winner would be chosen maximizing the probability. There is the work by Márquez et al. and also Armando Suárez's DLSI thesis. However, these fall quite far from being applicable in Apertium, so some engineering would be needed. --10:30, 10 October 2009 (UTC)
- Another interesting question is: instead of rules, could you detect (in some cases) clear multiwords that would go directly into dictionaries? --Mlforcada 10:30, 10 October 2009 (UTC)
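The thresholding step in the list above could be sketched roughly as follows. This is only an illustration, not code that exists in Apertium: the function name infer_rule, the threshold of 2.0 and the decision to build a rule from a single context are all assumptions, and the scores are assumed to be log probabilities (higher is better), taken from the "as my farmer" / "as my husband" example.

```python
def infer_rule(sl_lemma, context_word, scored, threshold=2.0):
    """Return a MAP rule string if the best translation wins clearly, else None.

    scored: dict mapping each candidate translation to its language-model
            score (assumed to be a log probability, so higher is better).
    threshold: hypothetical minimum score difference needed to emit a rule.
    """
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score - second_score < threshold:
        return None  # the difference is too small to justify a rule
    return 'MAP (%s) ("%s") IF (1 ("%s"));' % (best, sl_lemma, context_word)


# Scores copied from the "as my farmer" / "as my husband" pair.
rule = infer_rule("bóndi", "minn", {"farmer": -6.13119, "husband": -1.5997})
print(rule)

# Scores from the "to French farmer" / "to French husband" pair: too close.
weak = infer_rule("bóndi", "þurfi", {"farmer": -4.54752, "husband": -5.27222})
print(weak)
```

A real implementation would presumably aggregate the differences over many occurrences of the same context before emitting a rule, rather than trusting a single concordance line.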
- Advantages
- Fairly straightforward -- the rules can be created automatically in constraint grammar.
- Human readable / editable.
- Doesn't require parallel corpus -- although might work better with one.
- Unsupervised.
- Disadvantages
- Many rules will be slow.
- That is why probably it is a good idea to move as much inferred stuff as possible to the dictionary --Mlforcada 10:30, 10 October 2009 (UTC)
- Might not work very well.
- Relevant prior work
- Jin Yang (1999) "Towards the Automatic Acquisition of Lexical Selection Rules"
- Eckhard Bick (2005) "Dan2eng: Wide-Coverage Danish-English Machine Translation"
- Examples
Pediñ can translate as 'prier' or 'inviter'. Used transitively it means 'inviter'; used intransitively it means 'prier'.
- o huñvreal muioc'h eget o pediñ .
- Leur *huñvreal plus que en train de prier .
- Koulskoude e tiviz Francis pediñ e zaou vreur d'ober ...
- Pourtant il décide Francis prier ses deux frères à faire ...
- O fal a zo pediñ arzourien a bep seurt evel kizellerien
- Leur objectif il est inviter des artistes de toute sorte comme les sculpteurs
- ... bleunioù ha peadra da yac'haat o zreid hag o pediñ evito ...
- ... de fleurs et des moyens à guérir leurs pieds et en train de prier pour eux ...
- ha tu a oa bet d'al labourerien pediñ o familhoù hag o mignoned
- ... et il y avait moyen été aux travailleurs prier leurs familles et leurs amis ...
- Raktresoù all a zo ivez : pediñ skrivagnerien a-benn eskemm ganto
- ... de Projets autres il est aussi : inviter des écrivains pour échanger avec eux ...
- Sharon Stone eo bet an hini gwellañ evit pediñ an embregerien da zisammañ
- *Sharon *Stone il a été les ceux le plus mieux pour prier les entrepreneurs à décharger ...
The current rule says: SUBSTITUTE (vblex) (vblex tv) ("pediñ" vblex) (1C NC); that is, "choose 'inviter' if the next word can only be a common noun". Obviously, this fails in the case of definite NPs, such as o familhoù 'their families'.
To read
- Yuseop Kim, Jeong Ho Chang, Byoung-Tak Zhang (2002) "Target Word Selection Using WordNet and Data-Driven Models in Machine Translation". Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence
- Apidianaki Marianna (2009) "Data-driven semantic analysis for multilingual WSD and lexical selection in translation". Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL), March 30 - April 3, Athens, Greece, p. 77-85.
- Eckhard Bick (2007) "Dan2eng: Wide-Coverage Danish-English Machine Translation". MT Summit XI. pp. 37--43.
- Lucia Specia, Maria das Graças Volpes Nunes and Mark Stevenson (2005) "Exploiting Rules for Word Sense Disambiguation". Procesamiento del Lenguaje Natural (SEPLN-2005), 35. pp. 171--178
- Ido Dagan and Alon Itai (1991) "Word Sense Disambiguation Using a Second Language Corpus". Computational Linguistics.
- Hinrich Schütze "Automatic Word Sense Discrimination"
- E. Crestan "Which length for a Multi-level view of content for WSD"
- Noah Coccaro "Towards better integration of semantic prediction in statistical language modelling"
- Vickrey David "Word-sense disambiguation for machine translation"
- Her, Higinbottom and Pentheroudakis "Lexical and Idiomatic Transfer in Machine Translation: An LFG approach"
- McDonald "Target Word Selection as Proximity in Semantic Space"
- Sánchez-Martínez and Pérez-Ortiz and Forcada "Integrating corpus-based and rule-based approaches in an open-source machine translation system"
- Read/Summarised - Useful
- Jin Yang (1999) "Towards the Automatic Acquisition of Lexical Selection Rules". MT Summit VII
- Maite Melero, Antoni Oliver, Toni Badia and Teresa Suñol (2007) "Dealing with Bilingual Divergences in MT using Target language N-gram Models". Proceedings of the METIS-II Workshop: New Approaches to Machine Translation. CLIN 17 - Computational Linguistics in the Netherlands. pp. 19--26 Leuven, Belgium.
- Read/Summarised - unsure
- Yarowsky, D. (1995) "Unsupervised Word Sense Disambiguation Rivalling Supervised Methods". Proc. 33rd Ann. Meeting. ACL
- Hang Li and Cong Li "Word Translation Disambiguation Using Bilingual Bootstrapping"
- Read/Summarised - not
- Han, Xia, Palmer, Rosenzweig "Capturing language specific constraints on lexical selection with feature-based lexicalised tree-adjoining grammars"
Ref. | Manual | Unsup | Semi-sup | Sup | ParCorp | WordNet | HandEx | TLM | Parser (SynRel) | Eval | |
---|---|---|---|---|---|---|---|---|---|---|---|
Melero et al. (2007) | X | Yes | No | Partial | |||||||
Yang (1999) | X | Yes | Yes | No | |||||||
Han et al. (??) | X | Yes | No | ||||||||
Yarowsky, D. (1995) | X | X | No | Yes |
Pipeline
You need a tagged source language corpus:
^L'/El<det><def><mf><sg>$ ^origen/origen<n><m><sg>$ ^de/de<pr>$ ^l'/el<det><def><mf><sg>$ ^àbac/àbac<n><m><sg>$ ^està/estar<vblex><pri><p3><sg>$ ^literalment/literalment<adv>$ ^perdut/perdre<vblex><pp><m><sg>$ ^en/en<pr>$ ^el/el<det><def><m><sg>$ ^temps/temps<n><m><sp>$
And a list of ambiguities extracted from your bilingual dictionary:
time<n>:<:temps<n><:0>
weather<n>:<:temps<n><:1>
languge<n>:<:llengua<n><:0>
tongue<n>:<:llengua<n><:1>
history<n>:<:història<n><:0>
story<n>:<:història<n><:1>
station<n>:<:estació<n><:0>
season<n>:<:estació<n><:1>
Only the first tag is taken into account.
The script generate_sl_ambig_corpus.py generates the possible paths in the test corpus by replacing the tag of each ambiguous word with the tag plus the number of the translation equivalent (for example, <n> becomes <n><:0>), and it numbers the sentences for later recombination.
[1:0 ].[] ^L'/El<det><def><mf><sg>$ ^origen/origen<n><m><sg>$ ^de/de<pr>$ ^l'/el<det><def><mf><sg>$ ^àbac/àbac<n><m><sg>$ ^està/estar<vblex><pri><p3><sg>$ ^literalment/literalment<adv>$ ^perdut/perdre<vblex><pp><m><sg>$ ^en/en<pr>$ ^el/el<det><def><m><sg>$ ^temps/temps<n><:1><m><sp>$
[1:1 ].[] ^L'/El<det><def><mf><sg>$ ^origen/origen<n><m><sg>$ ^de/de<pr>$ ^l'/el<det><def><mf><sg>$ ^àbac/àbac<n><m><sg>$ ^està/estar<vblex><pri><p3><sg>$ ^literalment/literalment<adv>$ ^perdut/perdre<vblex><pp><m><sg>$ ^en/en<pr>$ ^el/el<det><def><m><sg>$ ^temps/temps<n><:0><m><sp>$
[2:0 || ].[] ^El/El<det><def><m><sg>$ ^territori/territori<n><m><sg>$ ^era/ser<vbser><past><p3><sg>$ ^habitat/habitar<vblex><pp><m><sg>$ ^des de/des de<pr>$ ^temps/temps<n><:1><m><sp>$ ^per/per<pr>$ ^tribus/tribu<n><f><pl>$ ^la/el<det><def><f><sg>$ ^llengua/llengua<n><:0><f><sg>$ ^dels quals/de<pr>+el qual<rel><an><m><pl>$ ^no/no<adv>$ ^entenien/entendre<vblex><pii><p3><pl>$
[2:1 || ].[] ^El/El<det><def><m><sg>$ ^territori/territori<n><m><sg>$ ^era/ser<vbser><past><p3><sg>$ ^habitat/habitar<vblex><pp><m><sg>$ ^des de/des de<pr>$ ^temps/temps<n><:1><m><sp>$ ^per/per<pr>$ ^tribus/tribu<n><f><pl>$ ^la/el<det><def><f><sg>$ ^llengua/llengua<n><:1><f><sg>$ ^dels quals/de<pr>+el qual<rel><an><m><pl>$ ^no/no<adv>$ ^entenien/entendre<vblex><pii><p3><pl>$
[2:2 || ].[] ^El/El<det><def><m><sg>$ ^territori/territori<n><m><sg>$ ^era/ser<vbser><past><p3><sg>$ ^habitat/habitar<vblex><pp><m><sg>$ ^des de/des de<pr>$ ^temps/temps<n><:0><m><sp>$ ^per/per<pr>$ ^tribus/tribu<n><f><pl>$ ^la/el<det><def><f><sg>$ ^llengua/llengua<n><:1><f><sg>$ ^dels quals/de<pr>+el qual<rel><an><m><pl>$ ^no/no<adv>$ ^entenien/entendre<vblex><pii><p3><pl>$
[2:3 || ].[] ^El/El<det><def><m><sg>$ ^territori/territori<n><m><sg>$ ^era/ser<vbser><past><p3><sg>$ ^habitat/habitar<vblex><pp><m><sg>$ ^des de/des de<pr>$ ^temps/temps<n><:0><m><sp>$ ^per/per<pr>$ ^tribus/tribu<n><f><pl>$ ^la/el<det><def><f><sg>$ ^llengua/llengua<n><:0><f><sg>$ ^dels quals/de<pr>+el qual<rel><an><m><pl>$ ^no/no<adv>$ ^entenien/entendre<vblex><pii><p3><pl>$
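The path generation can be sketched as below. This is not the actual generate_sl_ambig_corpus.py, just a minimal reconstruction of the idea: the AMBIGUOUS table, the TOKEN regular expression and the function name expand_paths are assumptions, and the order in which paths come out is not meant to match the listing above.

```python
import itertools
import re

# Hypothetical ambiguity table: (lemma, first tag) -> number of translations.
AMBIGUOUS = {("temps", "n"): 2, ("llengua", "n"): 2}

# One token of the form ^surface/lemma<tag1><tag2>...$
# Groups: surface, lemma, first tag, everything after the first tag.
TOKEN = re.compile(r"\^([^/$]+)/([^<$]+)<([^>]+)>([^$]*)\$")


def expand_paths(sentence):
    """Yield every variant of `sentence` with ambiguous tokens marked <:k>."""
    tokens = TOKEN.findall(sentence)
    # Per token: the translation indices to try, or [None] if unambiguous.
    choices = [
        range(AMBIGUOUS[(lemma, tag)]) if (lemma, tag) in AMBIGUOUS else [None]
        for _, lemma, tag, _ in tokens
    ]
    for combo in itertools.product(*choices):
        out = []
        for (surface, lemma, tag, rest), k in zip(tokens, combo):
            mark = "" if k is None else "<:%d>" % k
            # Only the first tag is taken into account; the mark follows it.
            out.append("^%s/%s<%s>%s%s$" % (surface, lemma, tag, mark, rest))
        yield " ".join(out)


sent = ("^des de/des de<pr>$ ^temps/temps<n><m><sp>$ "
        "^la/el<det><def><f><sg>$ ^llengua/llengua<n><f><sg>$")
paths = list(expand_paths(sent))
for i, path in enumerate(paths):
    print("[1:%d || ].[] %s" % (i, path))
```

With two ambiguous words of two translations each, the sentence fragment expands to four paths, which is the same growth seen for sentence 2 above. The number of paths is the product of the ambiguity counts, so long sentences with many ambiguous words may need a cap.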
These are then translated with the rest of the Apertium pipeline:
[1:0 || ].[] The origin of the abacus is literally lost in the weather
[1:1 || ].[] The origin of the abacus is literally lost in the time
[2:0 || ].[] The territory was inhabited since weather for tribes the language of which did not understand
[2:1 || ].[] The territory was inhabited since weather for tribes the tongue of which did not understand
[2:2 || ].[] The territory was inhabited since time for tribes the tongue of which did not understand
[2:3 || ].[] The territory was inhabited since time for tribes the language of which did not understand
All of the translations are then passed to the irstlm-ranker, which assigns each whole sentence a (log) probability:
-4.44739 || [1:0 || ].[] The origin of the abacus is literally lost in the weather
-2.98177 || [1:1 || ].[] The origin of the abacus is literally lost in the time
-5.05685 || [2:0 || ].[] The territory was inhabited since weather for tribes the language of which did not understand
-6.05685 || [2:1 || ].[] The territory was inhabited since weather for tribes the tongue of which did not understand
-3.05685 || [2:2 || ].[] The territory was inhabited since time for tribes the tongue of which did not understand
-2.05685 || [2:3 || ].[] The territory was inhabited since time for tribes the language of which did not understand
-2.80612 || [3:0 || ].[] When to the futile following year go back the good weather
-3.09621 || [3:1 || ].[] When to the futile following year go back the good time
From here, we extract the entries with a large probability difference using extract_candidate_phrases.py:
$ cat ca.ranked.txt | python extract_candidate_phrases.py -2.63274 || [972:0 || ].[] For the observations in the station of Puerto Rico had of:. -2.47706 || [972:1 || ].[] For the observations in the season of Puerto Rico had of:. -3.40225 || [1223:0 || ].[] The animals prepare for the cold time storing food.. -3.14163 || [1223:1 || ].[] The animals prepare for the cold weather storing food.. -2.99161 || [1225:0 || ].[] The time changes moved by the differences of energy received of the Only. -2.84943 || [1225:1 || ].[] The weather changes moved by the differences of energy received of the Only. -3.39096 || [2421:0 || ].[] However, Sequoyah never learnt other language that his language materna cherokee. -3.27981 || [2421:1 || ].[] However, Sequoyah never learnt other language that his tongue materna cherokee. -3.57633 || [2421:2 || ].[] However, Sequoyah never learnt other tongue that his language materna cherokee. -3.46518 || [2421:3 || ].[] However, Sequoyah never learnt other tongue that his tongue materna cherokee. -2.31143 || [2583:0 || ].[] The only part that opens to the rest of the Peninsula is a narrow language of land south of the city ( entry ) that joins this capital with the municipality of San Fernando. -2.20104 || [2583:1 || ].[] The only part that opens to the rest of the Peninsula is a narrow tongue of land south of the city ( entry ) that joins this capital with the municipality of San Fernando. -2.94579 || [2782:0 || ].[] The climate is tropical with humid station and dry station (savannah). -2.78736 || [2782:1 || ].[] The climate is tropical with humid station and dry season (savannah). -2.85359 || [2782:2 || ].[] The climate is tropical with humid season and dry station (savannah). -2.69516 || [2782:3 || ].[] The climate is tropical with humid season and dry season (savannah). -3.06792 || [4105:0 || ].[] Next, the time cleared and the only dazzled the French. -2.79195 || [4105:1 || ].[] Next, the weather cleared and the only dazzled the French. 
-3.55562 || [4221:0 || ].[] The Mediterranean zone, islander and inner will suffer a station without rain increasingly long and açò will do necessary the restrictions of water. -3.42125 || [4221:1 || ].[] The Mediterranean zone, islander and inner will suffer a season without rain increasingly long and açò will do necessary the restrictions of water. -2.71152 || [5403:0 || ].[] Are pasturadors, emigrate estacionalment, establishing to the lands of grasses in the humid station and in the banks of the rivers to the dry. -2.60424 || [5403:1 || ].[] Are pasturadors, emigrate estacionalment, establishing to the lands of grasses in the humid season and in the banks of the rivers to the dry. -3.29253 || [6150:0 || ].[] This crossed Flowery bringing a strong rain and a severe time for all the state. -3.19169 || [6150:1 || ].[] This crossed Flowery bringing a strong rain and a severe weather for all the state. -3.13067 || [6450:0 || ].[] The expedition finished with tragedy, then the bad time trapped to 16 muntanyencs to the field VII, in addition to 7.700 m, nine of which died. . -3.02519 || [6450:1 || ].[] The expedition finished with tragedy, then the bad weather trapped to 16 muntanyencs to the field VII, in addition to 7.700 m, nine of which died. . -3.35732 || [6622:0 || ].[] Of gelatinous consistency (if it takes to the hand and shakes it, shivers), dries in dry time and recobra the elasticitat in humid time.. -3.35732 || [6622:1 || ].[] Of gelatinous consistency (if it takes to the hand and shakes it, shivers), dries in dry time and recobra the elasticitat in humid weather.. -3.21541 || [6622:2 || ].[] Of gelatinous consistency (if it takes to the hand and shakes it, shivers), dries in dry weather and recobra the elasticitat in humid time.. -3.21541 || [6622:3 || ].[] Of gelatinous consistency (if it takes to the hand and shakes it, shivers), dries in dry weather and recobra the elasticitat in humid weather.. 
-2.50038 || [6715:0 || ].[] Originalment planned because it carried out to the first hours of the 16 of December, the Operation Stösser delayed a day for the bad time and the shortage of fuel. -2.39869 || [6715:1 || ].[] Originalment planned because it carried out to the first hours of the 16 of December, the Operation Stösser delayed a day for the bad weather and the shortage of fuel. -3.58485 || [7350:0 || ].[] Still exist words of use agricultural that descend directly of the celtic language of this tribe. -3.45187 || [7350:1 || ].[] Still exist words of use agricultural that descend directly of the celtic tongue of this tribe. -3.02469 || [7735:0 || ].[] were used in stations of work and servers how the ones of the company Silicon Graphic.. -2.91661 || [7735:1 || ].[] were used in seasons of work and servers how the ones of the company Silicon Graphic.. -3.36252 || [8028:0 || ].[] During the humid station accustom to eat fruits, insects, spiders, lizards, frogs and snakes. -3.21487 || [8028:1 || ].[] During the humid season accustom to eat fruits, insects, spiders, lizards, frogs and snakes. -3.10611 || [8029:0 || ].[] To the dry station also feed of mushrooms, being the only micos tropical that do it. . -2.94958 || [8029:1 || ].[] To the dry season also feed of mushrooms, being the only micos tropical that do it. . -2.61307 || [8636:0 || ].[] This brings a hot and dry time in these areas. -2.41358 || [8636:1 || ].[] This brings a hot and dry weather in these areas. -2.833 || [9490:0 || ].[] In a start, was foreseen that the Day-D was the 5 of June of 1944, but the bad time prevented it. -2.62405 || [9490:1 || ].[] In a start, was foreseen that the Day-D was the 5 of June of 1944, but the bad weather prevented it. -3.36351 || [9513:0 || ].[] His maximum exponent was the IRIS 3130, a complete station of work JOINS using the Motorola 68020 with a mathematical coprocessor Weitek. 
-3.21897 || [9513:1 || ].[] His maximum exponent was the IRIS 3130, a complete season of work JOINS using the Motorola 68020 with a mathematical coprocessor Weitek. -2.18673 || [9613:0 || ].[] The service R is one of the two routes that have two or more stations with the same name. -2.0814 || [9613:1 || ].[] The service R is one of the two routes that have two or more seasons with the same name. -3.07789 || [9872:0 || ].[] In an incredible attack of anger and fury reposa and finally Simba go out victorious of the battle and is appointed king. -3.16539 || [9872:1 || ].[] In an incredible attack of rabies and fury reposa and finally Simba go out victorious of the battle and is appointed king. -2.92372 || [9872:2 || ].[] In an incredible attack of rage and fury reposa and finally Simba go out victorious of the battle and is appointed king. -2.52076 || [9996:0 || ].[] Nevertheless, the Allies achieved to carry out two operations during the dry station of 1942-43. -2.28805 || [9996:1 || ].[] Nevertheless, the Allies achieved to carry out two operations during the dry season of 1942-43. -2.93256 || [10000:0 || ].[] After the battle the bad time forced the Norwegian fleet and manesa to withdraw true the Orcades. -2.66707 || [10000:1 || ].[] After the battle the bad weather forced the Norwegian fleet and manesa to withdraw true the Orcades. -3.31091 || [10428:0 || ].[] characterises for a humid climate, softened by the oceanic influence, with winters templats and with a dry station little stressed. -3.20373 || [10428:1 || ].[] characterises for a humid climate, softened by the oceanic influence, with winters templats and with a dry season little stressed.
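The extraction step could look something like the sketch below. It is not the real extract_candidate_phrases.py: the LINE pattern, the function name extract_candidates and the min_diff value are assumptions, and the scores are assumed to be log probabilities (higher is better). The input lines are copied from the ranker output shown earlier.

```python
import re
from collections import defaultdict

# Matches lines such as: -4.44739 || [1:0 || ].[] The origin of ...
LINE = re.compile(r"^(-?\d+(?:\.\d+)?) \|\| \[(\d+):(\d+) \|\| \]\.\[\] (.*)$")


def extract_candidates(lines, min_diff=1.0):
    """Group ranked translations by sentence number and keep the sentences
    whose best and worst paths differ by at least `min_diff` (hypothetical).

    Returns a dict: sentence id -> (score, path number, text) of the best path.
    """
    groups = defaultdict(list)
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            score, sent_id, path, text = m.groups()
            groups[int(sent_id)].append((float(score), int(path), text))
    keep = {}
    for sent_id, variants in groups.items():
        scores = [score for score, _, _ in variants]
        if len(variants) > 1 and max(scores) - min(scores) >= min_diff:
            keep[sent_id] = max(variants)  # tuple comparison: highest score wins
    return keep


ranked = [
    "-4.44739 || [1:0 || ].[] The origin of the abacus is literally lost in the weather",
    "-2.98177 || [1:1 || ].[] The origin of the abacus is literally lost in the time",
    "-2.80612 || [3:0 || ].[] When to the futile following year go back the good weather",
    "-3.09621 || [3:1 || ].[] When to the futile following year go back the good time",
]
best = extract_candidates(ranked)
for sent_id, (score, path, text) in best.items():
    print(sent_id, path, score, text)
```

With min_diff set to 1.0 only sentence 1 survives, since the two paths of sentence 3 differ by about 0.29. The listing above evidently used a much smaller cut-off, so the value here is purely illustrative.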
Then we generate rules using generate_candidate_rules.py:
SUBSTITUTE:r1 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) ; # c: 12 SUBSTITUTE:r2 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) ; # c: 10 SUBSTITUTE:r3 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) ; # c: 9 SUBSTITUTE:r4 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) ; # c: 9 SUBSTITUTE:r5 (n :0) (n :1) ("estació"ri) (-1 ("el"ri)) (0 ("estació"ri)) ; # c: 5 SUBSTITUTE:r6 (n :0) (n :1) ("estació"ri) (-1 ("<l'>"ri)) (0 ("<estació>"ri)) ; # c: 5 SUBSTITUTE:r7 (n :0) (n :1) ("temps"ri) (-1 ("mal"ri)) (0 ("temps"ri)) ; # c: 4 SUBSTITUTE:r8 (n :0) (n :1) ("temps"ri) (-1 ("<mal>"ri)) (0 ("<temps>"ri)) ; # c: 4 SUBSTITUTE:r9 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("sec"ri)) ; # c: 4 SUBSTITUTE:r10 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<seca>"ri)) ; # c: 4 SUBSTITUTE:r11 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("humit"ri)) ; # c: 3 SUBSTITUTE:r12 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("de"ri)) ; # c: 3 SUBSTITUTE:r13 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<humida>"ri)) ; # c: 3 SUBSTITUTE:r14 (n :0) (n :1) ("temps"ri) (-1 ("un"ri)) (0 ("temps"ri)) ; # c: 2 SUBSTITUTE:r15 (n :0) (n :1) ("temps"ri) (-1 ("<un>"ri)) (0 ("<temps>"ri)) ; # c: 2 SUBSTITUTE:r16 (n :0) (n :1) ("llengua"ri) (0 ("llengua"ri)) ; # c: 2 SUBSTITUTE:r17 (n :0) (n :1) ("llengua"ri) (0 ("<llengua>"ri)) ; # c: 2 SUBSTITUTE:r18 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("humit"ri)) (2 ("i"ri)) ; # c: 2 SUBSTITUTE:r19 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("de"ri)) (2 ("treball"ri)) ; # c: 2 SUBSTITUTE:r20 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<humida>"ri)) (2 ("<i>"ri)) ; # c: 2 SUBSTITUTE:r21 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<de>"ri)) ; # c: 2 SUBSTITUTE:r22 (n :0) (n :1) ("estació"ri) (0 ("<estacions>"ri)) ; # c: 2 SUBSTITUTE:r23 (n :0) (n :1) ("estació"ri) (-1 ("un"ri)) (0 ("estació"ri)) ; # c: 2 SUBSTITUTE:r24 (n :0) (n :1) ("estació"ri) (-1 ("el"ri)) (0 ("estació"ri)) (1 ("sec"ri)) ; # c: 2 
SUBSTITUTE:r25 (n :0) (n :1) ("estació"ri) (-1 ("el"ri)) (0 ("estació"ri)) (1 ("humit"ri)) ; # c: 2 SUBSTITUTE:r26 (n :0) (n :1) ("estació"ri) (-1 ("<una>"ri)) (0 ("<estació>"ri)) ; # c: 2 SUBSTITUTE:r27 (n :0) (n :1) ("estació"ri) (-1 ("<l'>"ri)) (0 ("<estació>"ri)) (1 ("<seca>"ri)) ; # c: 2 SUBSTITUTE:r28 (n :0) (n :1) ("estació"ri) (-1 ("<l'>"ri)) (0 ("<estació>"ri)) (1 ("<humida>"ri)) ; # c: 2 SUBSTITUTE:r29 (n :0) (n :0) ("llengua"ri) (0 ("llengua"ri)) ; # c: 2 SUBSTITUTE:r30 (n :0) (n :0) ("llengua"ri) (0 ("<llengua>"ri)) ; # c: 2 SUBSTITUTE:r31 (n :0) (n :2) ("ràbia"ri) (0 ("ràbia"ri)) (1 ("i"ri)) (2 ("fúria"ri)) ; # c: 1 SUBSTITUTE:r32 (n :0) (n :2) ("ràbia"ri) (0 ("ràbia"ri)) (1 ("i"ri)) ; # c: 1 SUBSTITUTE:r33 (n :0) (n :2) ("ràbia"ri) (0 ("ràbia"ri)) ; # c: 1 SUBSTITUTE:r34 (n :0) (n :2) ("ràbia"ri) (0 ("<ràbia>"ri)) (1 ("<i>"ri)) (2 ("<fúria>"ri)) ; # c: 1 SUBSTITUTE:r35 (n :0) (n :2) ("ràbia"ri) (0 ("<ràbia>"ri)) (1 ("<i>"ri)) ; # c: 1 SUBSTITUTE:r36 (n :0) (n :2) ("ràbia"ri) (0 ("<ràbia>"ri)) ; # c: 1 SUBSTITUTE:r37 (n :0) (n :2) ("ràbia"ri) (-1 ("de"ri)) (0 ("ràbia"ri)) (1 ("i"ri)) ; # c: 1 SUBSTITUTE:r38 (n :0) (n :2) ("ràbia"ri) (-1 ("de"ri)) (0 ("ràbia"ri)) ; # c: 1 SUBSTITUTE:r39 (n :0) (n :2) ("ràbia"ri) (-1 ("<de>"ri)) (0 ("<ràbia>"ri)) (1 ("<i>"ri)) ; # c: 1 SUBSTITUTE:r40 (n :0) (n :2) ("ràbia"ri) (-1 ("<de>"ri)) (0 ("<ràbia>"ri)) ; # c: 1 SUBSTITUTE:r41 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("sever"ri)) (2 ("per"ri)) ; # c: 1 SUBSTITUTE:r42 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("sever"ri)) ; # c: 1 SUBSTITUTE:r43 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("obligar"ri)) (2 ("el"ri)) ; # c: 1 SUBSTITUTE:r44 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("obligar"ri)) ; # c: 1 SUBSTITUTE:r45 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("i"ri)) (2 ("el"ri)) ; # c: 1 SUBSTITUTE:r46 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("i"ri)) ; # c: 1 SUBSTITUTE:r47 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("ho"ri)) ; # c: 1 
SUBSTITUTE:r48 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("fred"ri)) (2 ("emmagatzemar"ri)) ; # c: 1 SUBSTITUTE:r49 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("fred"ri)) ; # c: 1 SUBSTITUTE:r50 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("es"ri)) (2 ("anar"ri)) ; # c: 1 SUBSTITUTE:r51 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("es"ri)) ; # c: 1 SUBSTITUTE:r52 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("canviar"ri)) (2 ("moure"ri)) ; # c: 1 SUBSTITUTE:r53 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("canviar"ri)) ; # c: 1 SUBSTITUTE:r54 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("calent"ri)) (2 ("i"ri)) ; # c: 1 SUBSTITUTE:r55 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("calent"ri)) ; # c: 1 SUBSTITUTE:r56 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("atrapar"ri)) (2 ("a"ri)) ; # c: 1 SUBSTITUTE:r57 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("atrapar"ri)) ; # c: 1 SUBSTITUTE:r58 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<sever>"ri)) (2 ("<per>"ri)) ; # c: 1 SUBSTITUTE:r59 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<sever>"ri)) ; # c: 1 SUBSTITUTE:r60 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<obligà>"ri)) (2 ("<la>"ri)) ; # c: 1 SUBSTITUTE:r61 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<obligà>"ri)) ; # c: 1 SUBSTITUTE:r62 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<i>"ri)) (2 ("<l'>"ri)) ; # c: 1 SUBSTITUTE:r63 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<i>"ri)) ; # c: 1 SUBSTITUTE:r64 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<ho>"ri)) ; # c: 1 SUBSTITUTE:r65 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<fred>"ri)) (2 ("<emmagatzemant>"ri)) ; # c: 1 SUBSTITUTE:r66 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<fred>"ri)) ; # c: 1 SUBSTITUTE:r67 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<es>"ri)) (2 ("<va>"ri)) ; # c: 1 SUBSTITUTE:r68 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<es>"ri)) ; # c: 1 SUBSTITUTE:r69 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<canvia>"ri)) (2 
("<mogut>"ri)) ; # c: 1 SUBSTITUTE:r70 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<canvia>"ri)) ; # c: 1
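To make the shape of these candidate rules easier to follow, here is a rough sketch of how they might be produced. It is a guess at the logic, not the real generate_candidate_rules.py: the WINDOWS list is read off the sample rules, the function names are hypothetical, and the two input phrases are reconstructed from the lemmas and surface forms that appear in the rules (el/l', un/una, estació, sec/seca).

```python
from collections import Counter

# Context windows as offsets from the ambiguous word (an assumption based on
# the sample rules: the word alone, one word left, one or two words right).
WINDOWS = [(0,), (-1, 0), (0, 1), (0, 1, 2), (-1, 0, 1)]


def candidate_patterns(tokens, pos):
    """Yield lemma-based and surface-based contexts around tokens[pos].

    tokens: list of (surface, lemma) pairs.
    """
    for window in WINDOWS:
        idx = [pos + off for off in window]
        if min(idx) < 0 or max(idx) >= len(tokens):
            continue  # window falls outside the sentence
        yield " ".join('(%d ("%s"ri))' % (off, tokens[i][1])
                       for off, i in zip(window, idx))
        yield " ".join('(%d ("<%s>"ri))' % (off, tokens[i][0])
                       for off, i in zip(window, idx))


def generate_rules(examples, tag="n"):
    """examples: iterable of (tokens, pos, winner), where winner is the index
    of the translation preferred by the language model for that occurrence."""
    counts = Counter()
    for tokens, pos, winner in examples:
        lemma = tokens[pos][1]
        for ctx in candidate_patterns(tokens, pos):
            counts[(lemma, winner, ctx)] += 1
    rules = []
    for n, ((lemma, winner, ctx), c) in enumerate(counts.most_common(), start=1):
        rules.append('SUBSTITUTE:r%d (%s :0) (%s :%d) ("%s"ri) %s ; # c: %d'
                     % (n, tag, tag, winner, lemma, ctx, c))
    return rules


examples = [
    ([("l'", "el"), ("estació", "estació"), ("seca", "sec")], 1, 1),
    ([("una", "un"), ("estació", "estació"), ("seca", "sec")], 1, 1),
]
rules = generate_rules(examples)
for rule in rules:
    print(rule)
```

The count after "# c:" is simply how many extracted phrases gave rise to the same context, which is why the most general contexts (the word alone) come out on top.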
We then rank these rules using rank-rules.sh, which takes each rule in turn and runs the whole ambiguous corpus through the pipeline with that rule included. We also run the corpus once more using only the baseline translation, without any rule. We then rank the translations of each sentence produced under each rule.
The final ranked rule list is made with show-rule-ranking.sh by summing the scores and subtracting the baseline. A threshold may be given: for example, we can take the average difference (in this case 0.000013) and select the rules that score above it.
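The arithmetic behind the ranking can be sketched as follows. This is an illustration of "sum the scores and subtract the baseline", not the contents of show-rule-ranking.sh: the function names are hypothetical and the per-sentence scores are toy numbers, not results from the Catalan corpus.

```python
def rank_rules(rule_scores, baseline_scores):
    """Rank rules by their total gain over the baseline.

    rule_scores: dict rule id -> list of per-sentence scores with that rule.
    baseline_scores: list of per-sentence scores with no rule applied.
    Returns (gains, mean_gain), where gains is a list of (gain, rule id)
    pairs sorted best first.
    """
    base = sum(baseline_scores)
    gains = sorted(((sum(scores) - base, rule)
                    for rule, scores in rule_scores.items()), reverse=True)
    mean_gain = sum(gain for gain, _ in gains) / len(gains)
    return gains, mean_gain


def select_rules(gains, threshold):
    """Keep the rules whose gain is strictly above the threshold."""
    return [rule for gain, rule in gains if gain > threshold]


# Toy numbers only.
baseline = [-3.0, -2.5]
scores = {
    "r9": [-2.5, -2.5],   # improves the first sentence
    "r16": [-3.0, -2.5],  # changes nothing
    "r29": [-3.5, -2.5],  # makes the first sentence worse
}
gains, mean_gain = rank_rules(scores, baseline)
selected = select_rules(gains, mean_gain)
print(gains, mean_gain, selected)
```

Using the mean gain as the threshold, as suggested above, keeps only the rules that do better than the average candidate; a stricter cut-off could be passed to select_rules instead.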
$ sh show-rule-ranking.sh ca-en.rules.txt resul/ 0.000093 SUBSTITUTE:r9 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("sec"ri)) ; # c: 4 0.000087 SUBSTITUTE:r10 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<seca>"ri)) ; # c: 4 0.000063 SUBSTITUTE:r123 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("amb"ri)) ; # c: 1 0.000060 SUBSTITUTE:r11 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("humit"ri)) ; # c: 3 0.000059 SUBSTITUTE:r19 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("de"ri)) (2 ("treball"ri)) ; # c: 2 0.000051 SUBSTITUTE:r27 (n :0) (n :1) ("estació"ri) (-1 ("<l'>"ri)) (0 ("<estació>"ri)) (1 ("<seca>"ri)) ; # c: 2 0.000051 SUBSTITUTE:r24 (n :0) (n :1) ("estació"ri) (-1 ("el"ri)) (0 ("estació"ri)) (1 ("sec"ri)) ; # c: 2 0.000050 SUBSTITUTE:r13 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<humida>"ri)) ; # c: 3 0.000045 SUBSTITUTE:r133 (n :0) (n :1) ("estació"ri) (0 ("<estacions>"ri)) (1 ("<de>"ri)) (2 ("<treball>"ri)) ; # c: 1 0.000034 SUBSTITUTE:r28 (n :0) (n :1) ("estació"ri) (-1 ("<l'>"ri)) (0 ("<estació>"ri)) (1 ("<humida>"ri)) ; # c: 2 0.000034 SUBSTITUTE:r25 (n :0) (n :1) ("estació"ri) (-1 ("el"ri)) (0 ("estació"ri)) (1 ("humit"ri)) ; # c: 2 0.000032 SUBSTITUTE:r129 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<seca>"ri)) (2 ("<(>"ri)) ; # c: 1 0.000032 SUBSTITUTE:r128 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<seca>"ri)) (2 ("<de>"ri)) ; # c: 1 0.000032 SUBSTITUTE:r120 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("sec"ri)) (2 ("("ri)) ; # c: 1 0.000032 SUBSTITUTE:r119 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("sec"ri)) (2 ("de"ri)) ; # c: 1 0.000031 SUBSTITUTE:r155 (n :0) (n :1) ("estació"ri) (-1 ("<una>"ri)) (0 ("<estació>"ri)) (1 ("<seca>"ri)) ; # c: 1 0.000031 SUBSTITUTE:r141 (n :0) (n :1) ("estació"ri) (-1 ("un"ri)) (0 ("estació"ri)) (1 ("sec"ri)) ; # c: 1 0.000028 SUBSTITUTE:r20 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<humida>"ri)) (2 ("<i>"ri)) ; # c: 2 0.000028 
SUBSTITUTE:r18 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("humit"ri)) (2 ("i"ri)) ; # c: 2 0.000027 SUBSTITUTE:r99 (n :0) (n :1) ("llengua"ri) (0 ("llengua"ri)) (1 ("de"ri)) (2 ("terra"ri)) ; # c: 1 0.000027 SUBSTITUTE:r116 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("sense"ri)) ; # c: 1 0.000027 SUBSTITUTE:r103 (n :0) (n :1) ("llengua"ri) (0 ("<llengua>"ri)) (1 ("<de>"ri)) (2 ("<terra>"ri)) ; # c: 1 0.000026 SUBSTITUTE:r161 (n :0) (n :1) ("estació"ri) (-1 ("<en>"ri)) (0 ("<estacions>"ri)) (1 ("<de>"ri)) ; # c: 1 0.000026 SUBSTITUTE:r146 (n :0) (n :1) ("estació"ri) (-1 ("en"ri)) (0 ("estació"ri)) (1 ("de"ri)) ; # c: 1 0.000025 SUBSTITUTE:r34 (n :0) (n :2) ("ràbia"ri) (0 ("<ràbia>"ri)) (1 ("<i>"ri)) (2 ("<fúria>"ri)) ; # c: 1 0.000025 SUBSTITUTE:r31 (n :0) (n :2) ("ràbia"ri) (0 ("ràbia"ri)) (1 ("i"ri)) (2 ("fúria"ri)) ; # c: 1 0.000025 SUBSTITUTE:r159 (n :0) (n :1) ("estació"ri) (-1 ("<i>"ri)) (0 ("<estació>"ri)) (1 ("<seca>"ri)) ; # c: 1 0.000025 SUBSTITUTE:r144 (n :0) (n :1) ("estació"ri) (-1 ("i"ri)) (0 ("estació"ri)) (1 ("sec"ri)) ; # c: 1 0.000025 SUBSTITUTE:r126 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<seca>"ri)) (2 ("<també>"ri)) ; # c: 1 0.000025 SUBSTITUTE:r117 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("sec"ri)) (2 ("també"ri)) ; # c: 1 0.000024 SUBSTITUTE:r39 (n :0) (n :2) ("ràbia"ri) (-1 ("<de>"ri)) (0 ("<ràbia>"ri)) (1 ("<i>"ri)) ; # c: 1 0.000024 SUBSTITUTE:r37 (n :0) (n :2) ("ràbia"ri) (-1 ("de"ri)) (0 ("ràbia"ri)) (1 ("i"ri)) ; # c: 1 0.000024 SUBSTITUTE:r35 (n :0) (n :2) ("ràbia"ri) (0 ("<ràbia>"ri)) (1 ("<i>"ri)) ; # c: 1 0.000024 SUBSTITUTE:r32 (n :0) (n :2) ("ràbia"ri) (0 ("ràbia"ri)) (1 ("i"ri)) ; # c: 1 0.000024 SUBSTITUTE:r165 (n :0) (n :1) ("estació"ri) (-1 ("<completa>"ri)) (0 ("<estació>"ri)) ; # c: 1 0.000024 SUBSTITUTE:r164 (n :0) (n :1) ("estació"ri) (-1 ("<completa>"ri)) (0 ("<estació>"ri)) (1 ("<de>"ri)) ; # c: 1 0.000024 SUBSTITUTE:r151 (n :0) (n :1) ("estació"ri) (-1 ("complet"ri)) (0 ("estació"ri)) ; 
# c: 1 0.000024 SUBSTITUTE:r150 (n :0) (n :1) ("estació"ri) (-1 ("complet"ri)) (0 ("estació"ri)) (1 ("de"ri)) ; # c: 1 0.000024 SUBSTITUTE:r131 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<de>"ri)) (2 ("<treball>"ri)) ; # c: 1 0.000024 SUBSTITUTE:r130 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<humida>"ri)) (2 ("<acostumen>"ri)) ; # c: 1 0.000024 SUBSTITUTE:r121 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("humit"ri)) (2 ("acostumar"ri)) ; # c: 1 0.000023 SUBSTITUTE:r154 (n :0) (n :1) ("estació"ri) (-1 ("<una>"ri)) (0 ("<estació>"ri)) (1 ("<sense>"ri)) ; # c: 1 0.000023 SUBSTITUTE:r143 (n :0) (n :1) ("estació"ri) (-1 ("més"ri)) (0 ("estació"ri)) ; # c: 1 0.000023 SUBSTITUTE:r140 (n :0) (n :1) ("estació"ri) (-1 ("un"ri)) (0 ("estació"ri)) (1 ("sense"ri)) ; # c: 1 0.000023 SUBSTITUTE:r125 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<sense>"ri)) ; # c: 1 0.000023 SUBSTITUTE:r124 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<sense>"ri)) (2 ("<pluja>"ri)) ; # c: 1 0.000023 SUBSTITUTE:r115 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("sense"ri)) (2 ("pluja"ri)) ; # c: 1 0.000022 SUBSTITUTE:r167 (n :0) (n :1) ("estació"ri) (-1 ("<amb>"ri)) (0 ("<estació>"ri)) ; # c: 1 0.000022 SUBSTITUTE:r162 (n :0) (n :1) ("estació"ri) (-1 ("<en>"ri)) (0 ("<estacions>"ri)) ; # c: 1 0.000022 SUBSTITUTE:r160 (n :0) (n :1) ("estació"ri) (-1 ("<i>"ri)) (0 ("<estació>"ri)) ; # c: 1 0.000022 SUBSTITUTE:r147 (n :0) (n :1) ("estació"ri) (-1 ("en"ri)) (0 ("estació"ri)) ; # c: 1 0.000021 SUBSTITUTE:r114 (n :0) (n :1) ("llengua"ri) (-1 ("<estreta>"ri)) (0 ("<llengua>"ri)) ; # c: 1 0.000021 SUBSTITUTE:r113 (n :0) (n :1) ("llengua"ri) (-1 ("<estreta>"ri)) (0 ("<llengua>"ri)) (1 ("<de>"ri)) ; # c: 1 0.000021 SUBSTITUTE:r108 (n :0) (n :1) ("llengua"ri) (-1 ("estret"ri)) (0 ("llengua"ri)) ; # c: 1 0.000021 SUBSTITUTE:r107 (n :0) (n :1) ("llengua"ri) (-1 ("estret"ri)) (0 ("llengua"ri)) (1 ("de"ri)) ; # c: 1 0.000020 SUBSTITUTE:r157 (n :0) (n :1) ("estació"ri) 
(-1 ("<més>"ri)) (0 ("<estacions>"ri)) ; # c: 1 0.000020 SUBSTITUTE:r156 (n :0) (n :1) ("estació"ri) (-1 ("<més>"ri)) (0 ("<estacions>"ri)) (1 ("<amb>"ri)) ; # c: 1 0.000020 SUBSTITUTE:r142 (n :0) (n :1) ("estació"ri) (-1 ("més"ri)) (0 ("estació"ri)) (1 ("amb"ri)) ; # c: 1 0.000020 SUBSTITUTE:r127 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<seca>"ri)) (2 ("<poc>"ri)) ; # c: 1 0.000020 SUBSTITUTE:r118 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("sec"ri)) (2 ("poc"ri)) ; # c: 1 0.000019 SUBSTITUTE:r166 (n :0) (n :1) ("estació"ri) (-1 ("<amb>"ri)) (0 ("<estació>"ri)) (1 ("<humida>"ri)) ; # c: 1 0.000019 SUBSTITUTE:r152 (n :0) (n :1) ("estació"ri) (-1 ("amb"ri)) (0 ("estació"ri)) (1 ("humit"ri)) ; # c: 1 0.000017 SUBSTITUTE:r40 (n :0) (n :2) ("ràbia"ri) (-1 ("<de>"ri)) (0 ("<ràbia>"ri)) ; # c: 1 0.000017 SUBSTITUTE:r38 (n :0) (n :2) ("ràbia"ri) (-1 ("de"ri)) (0 ("ràbia"ri)) ; # c: 1 0.000017 SUBSTITUTE:r135 (n :0) (n :1) ("estació"ri) (0 ("<estacions>"ri)) (1 ("<amb>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r98 (n :0) (n :1) ("temps"ri) (-1 ("<El>"ri)) (0 ("<temps>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r97 (n :0) (n :1) ("temps"ri) (-1 ("<El>"ri)) (0 ("<temps>"ri)) (1 ("<canvia>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r96 (n :0) (n :1) ("temps"ri) (-1 ("<el>"ri)) (0 ("<temps>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r95 (n :0) (n :1) ("temps"ri) (-1 ("<el>"ri)) (0 ("<temps>"ri)) (1 ("<es>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r94 (n :0) (n :1) ("temps"ri) (-1 ("<mal>"ri)) (0 ("<temps>"ri)) (1 ("<atrapà>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r93 (n :0) (n :1) ("temps"ri) (-1 ("<mal>"ri)) (0 ("<temps>"ri)) (1 ("<ho>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r92 (n :0) (n :1) ("temps"ri) (-1 ("<mal>"ri)) (0 ("<temps>"ri)) (1 ("<i>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r91 (n :0) (n :1) ("temps"ri) (-1 ("<mal>"ri)) (0 ("<temps>"ri)) (1 ("<obligà>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r90 (n :0) (n :1) ("temps"ri) (-1 ("<preparen>"ri)) (0 ("<temps>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r8 (n :0) (n :1) 
("temps"ri) (-1 ("<mal>"ri)) (0 ("<temps>"ri)) ; # c: 4 0.000010 SUBSTITUTE:r89 (n :0) (n :1) ("temps"ri) (-1 ("<preparen>"ri)) (0 ("<temps>"ri)) (1 ("<fred>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r88 (n :0) (n :1) ("temps"ri) (-1 ("<un>"ri)) (0 ("<temps>"ri)) (1 ("<calent>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r87 (n :0) (n :1) ("temps"ri) (-1 ("<un>"ri)) (0 ("<temps>"ri)) (1 ("<sever>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r86 (n :0) (n :1) ("temps"ri) (-1 ("El"ri)) (0 ("temps"ri)) ; # c: 1 0.000010 SUBSTITUTE:r85 (n :0) (n :1) ("temps"ri) (-1 ("El"ri)) (0 ("temps"ri)) (1 ("canviar"ri)) ; # c: 1 0.000010 SUBSTITUTE:r84 (n :0) (n :1) ("temps"ri) (-1 ("el"ri)) (0 ("temps"ri)) ; # c: 1 0.000010 SUBSTITUTE:r83 (n :0) (n :1) ("temps"ri) (-1 ("el"ri)) (0 ("temps"ri)) (1 ("es"ri)) ; # c: 1 0.000010 SUBSTITUTE:r82 (n :0) (n :1) ("temps"ri) (-1 ("mal"ri)) (0 ("temps"ri)) (1 ("atrapar"ri)) ; # c: 1 0.000010 SUBSTITUTE:r81 (n :0) (n :1) ("temps"ri) (-1 ("mal"ri)) (0 ("temps"ri)) (1 ("ho"ri)) ; # c: 1 0.000010 SUBSTITUTE:r80 (n :0) (n :1) ("temps"ri) (-1 ("mal"ri)) (0 ("temps"ri)) (1 ("i"ri)) ; # c: 1 0.000010 SUBSTITUTE:r7 (n :0) (n :1) ("temps"ri) (-1 ("mal"ri)) (0 ("temps"ri)) ; # c: 4 0.000010 SUBSTITUTE:r79 (n :0) (n :1) ("temps"ri) (-1 ("mal"ri)) (0 ("temps"ri)) (1 ("obligar"ri)) ; # c: 1 0.000010 SUBSTITUTE:r78 (n :0) (n :1) ("temps"ri) (-1 ("preparar"ri)) (0 ("temps"ri)) ; # c: 1 0.000010 SUBSTITUTE:r77 (n :0) (n :1) ("temps"ri) (-1 ("preparar"ri)) (0 ("temps"ri)) (1 ("fred"ri)) ; # c: 1 0.000010 SUBSTITUTE:r76 (n :0) (n :1) ("temps"ri) (-1 ("un"ri)) (0 ("temps"ri)) (1 ("calent"ri)) ; # c: 1 0.000010 SUBSTITUTE:r75 (n :0) (n :1) ("temps"ri) (-1 ("un"ri)) (0 ("temps"ri)) (1 ("sever"ri)) ; # c: 1 0.000010 SUBSTITUTE:r74 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<atrapà>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r73 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<atrapà>"ri)) (2 ("<a>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r72 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 
("<calent>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r71 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<calent>"ri)) (2 ("<i>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r70 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<canvia>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r69 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<canvia>"ri)) (2 ("<mogut>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r68 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<es>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r67 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<es>"ri)) (2 ("<va>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r66 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<fred>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r65 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<fred>"ri)) (2 ("<emmagatzemant>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r64 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<ho>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r63 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<i>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r62 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<i>"ri)) (2 ("<l'>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r61 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<obligà>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r60 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<obligà>"ri)) (2 ("<la>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r59 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<sever>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r58 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) (1 ("<sever>"ri)) (2 ("<per>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r57 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("atrapar"ri)) ; # c: 1 0.000010 SUBSTITUTE:r56 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("atrapar"ri)) (2 ("a"ri)) ; # c: 1 0.000010 SUBSTITUTE:r55 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("calent"ri)) ; # c: 1 0.000010 SUBSTITUTE:r54 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("calent"ri)) (2 ("i"ri)) ; # c: 1 0.000010 SUBSTITUTE:r53 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("canviar"ri)) ; # c: 1 0.000010 SUBSTITUTE:r52 (n :0) (n :1) ("temps"ri) (0 
("temps"ri)) (1 ("canviar"ri)) (2 ("moure"ri)) ; # c: 1 0.000010 SUBSTITUTE:r51 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("es"ri)) ; # c: 1 0.000010 SUBSTITUTE:r50 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("es"ri)) (2 ("anar"ri)) ; # c: 1 0.000010 SUBSTITUTE:r4 (n :0) (n :1) ("temps"ri) (0 ("<temps>"ri)) ; # c: 9 0.000010 SUBSTITUTE:r49 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("fred"ri)) ; # c: 1 0.000010 SUBSTITUTE:r48 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("fred"ri)) (2 ("emmagatzemar"ri)) ; # c: 1 0.000010 SUBSTITUTE:r47 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("ho"ri)) ; # c: 1 0.000010 SUBSTITUTE:r46 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("i"ri)) ; # c: 1 0.000010 SUBSTITUTE:r45 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("i"ri)) (2 ("el"ri)) ; # c: 1 0.000010 SUBSTITUTE:r44 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("obligar"ri)) ; # c: 1 0.000010 SUBSTITUTE:r43 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("obligar"ri)) (2 ("el"ri)) ; # c: 1 0.000010 SUBSTITUTE:r42 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("sever"ri)) ; # c: 1 0.000010 SUBSTITUTE:r41 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) (1 ("sever"ri)) (2 ("per"ri)) ; # c: 1 0.000010 SUBSTITUTE:r3 (n :0) (n :1) ("temps"ri) (0 ("temps"ri)) ; # c: 9 0.000010 SUBSTITUTE:r177 (n :0) (n :0) ("llengua"ri) (-1 ("<que>"ri)) (0 ("<llengua>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r176 (n :0) (n :0) ("llengua"ri) (-1 ("<que>"ri)) (0 ("<llengua>"ri)) (1 ("<materna>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r175 (n :0) (n :0) ("llengua"ri) (-1 ("altre"ri)) (0 ("llengua"ri)) ; # c: 1 0.000010 SUBSTITUTE:r174 (n :0) (n :0) ("llengua"ri) (-1 ("altre"ri)) (0 ("llengua"ri)) (1 ("que"ri)) ; # c: 1 0.000010 SUBSTITUTE:r173 (n :0) (n :0) ("llengua"ri) (-1 ("que"ri)) (0 ("llengua"ri)) ; # c: 1 0.000010 SUBSTITUTE:r172 (n :0) (n :0) ("llengua"ri) (-1 ("que"ri)) (0 ("llengua"ri)) (1 ("*materna"ri)) ; # c: 1 0.000010 SUBSTITUTE:r171 (n :0) (n :0) ("llengua"ri) (0 ("<llengua>"ri)) (1 ("<materna>"ri)) ; 
# c: 1 0.000010 SUBSTITUTE:r170 (n :0) (n :0) ("llengua"ri) (0 ("<llengua>"ri)) (1 ("<que>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r169 (n :0) (n :0) ("llengua"ri) (0 ("llengua"ri)) (1 ("*materna"ri)) ; # c: 1 0.000010 SUBSTITUTE:r163 (n :0) (n :1) ("estació"ri) (-1 ("<de>"ri)) (0 ("<:>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r15 (n :0) (n :1) ("temps"ri) (-1 ("<un>"ri)) (0 ("<temps>"ri)) ; # c: 2 0.000010 SUBSTITUTE:r14 (n :0) (n :1) ("temps"ri) (-1 ("un"ri)) (0 ("temps"ri)) ; # c: 2 0.000010 SUBSTITUTE:r149 (n :0) (n :1) ("estació"ri) (-1 ("de"ri)) (0 (":"ri)) ; # c: 1 0.000010 SUBSTITUTE:r139 (n :0) (n :1) ("estació"ri) (0 (":"ri)) ; # c: 1 0.000010 SUBSTITUTE:r138 (n :0) (n :1) ("estació"ri) (0 (":"ri)) (1 ("."ri)) (2 ("Per a"ri)) ; # c: 1 0.000010 SUBSTITUTE:r137 (n :0) (n :1) ("estació"ri) (0 ("<:>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r136 (n :0) (n :1) ("estació"ri) (0 ("<:>"ri)) (1 ("<.>"ri)) (2 ("<Per a>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r132 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<de>"ri)) (2 ("<es>"ri)) ; # c: 1 0.000010 SUBSTITUTE:r122 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("de"ri)) (2 ("es"ri)) ; # c: 1 0.000009 SUBSTITUTE:r153 (n :0) (n :1) ("estació"ri) (-1 ("amb"ri)) (0 ("estació"ri)) ; # c: 1 0.000006 SUBSTITUTE:r30 (n :0) (n :0) ("llengua"ri) (0 ("<llengua>"ri)) ; # c: 2 0.000006 SUBSTITUTE:r134 (n :0) (n :1) ("estació"ri) (0 ("<estacions>"ri)) (1 ("<de>"ri)) ; # c: 1 0.000005 SUBSTITUTE:r168 (n :0) (n :0) ("llengua"ri) (0 ("llengua"ri)) (1 ("que"ri)) ; # c: 1 0.000000 SUBSTITUTE:r105 (n :0) (n :1) ("llengua"ri) (0 ("<llengua>"ri)) (1 ("<celta>"ri)) (2 ("<d'>"ri)) ; # c: 1 0.000000 SUBSTITUTE:r101 (n :0) (n :1) ("llengua"ri) (0 ("llengua"ri)) (1 ("celta"ri)) (2 ("de"ri)) ; # c: 1 -0.000003 SUBSTITUTE:r111 (n :0) (n :1) ("llengua"ri) (-1 ("<la>"ri)) (0 ("<llengua>"ri)) (1 ("<celta>"ri)) ; # c: 1 -0.000003 SUBSTITUTE:r109 (n :0) (n :1) ("llengua"ri) (-1 ("el"ri)) (0 ("llengua"ri)) (1 ("celta"ri)) ; # c: 1 -0.000010 SUBSTITUTE:r145 (n :0) 
(n :1) ("estació"ri) (-1 ("i"ri)) (0 ("estació"ri)) ; # c: 1 -0.000010 SUBSTITUTE:r106 (n :0) (n :1) ("llengua"ri) (0 ("<llengua>"ri)) (1 ("<celta>"ri)) ; # c: 1 -0.000010 SUBSTITUTE:r102 (n :0) (n :1) ("llengua"ri) (0 ("llengua"ri)) (1 ("celta"ri)) ; # c: 1 -0.000011 SUBSTITUTE:r36 (n :0) (n :2) ("ràbia"ri) (0 ("<ràbia>"ri)) ; # c: 1 -0.000011 SUBSTITUTE:r33 (n :0) (n :2) ("ràbia"ri) (0 ("ràbia"ri)) ; # c: 1 -0.000043 SUBSTITUTE:r158 (n :0) (n :1) ("estació"ri) (-1 ("<l'>"ri)) (0 ("<estació>"ri)) (1 ("<de>"ri)) ; # c: 1 -0.000056 SUBSTITUTE:r29 (n :0) (n :0) ("llengua"ri) (0 ("llengua"ri)) ; # c: 2 -0.000140 SUBSTITUTE:r21 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) (1 ("<de>"ri)) ; # c: 2 -0.000216 SUBSTITUTE:r12 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) (1 ("de"ri)) ; # c: 3 -0.000247 SUBSTITUTE:r148 (n :0) (n :1) ("estació"ri) (-1 ("el"ri)) (0 ("estació"ri)) (1 ("de"ri)) ; # c: 1 -0.000393 SUBSTITUTE:r23 (n :0) (n :1) ("estació"ri) (-1 ("un"ri)) (0 ("estació"ri)) ; # c: 2 -0.000397 SUBSTITUTE:r26 (n :0) (n :1) ("estació"ri) (-1 ("<una>"ri)) (0 ("<estació>"ri)) ; # c: 2 -0.000630 SUBSTITUTE:r104 (n :0) (n :1) ("llengua"ri) (0 ("<llengua>"ri)) (1 ("<de>"ri)) ; # c: 1 -0.000694 SUBSTITUTE:r22 (n :0) (n :1) ("estació"ri) (0 ("<estacions>"ri)) ; # c: 2 -0.001345 SUBSTITUTE:r6 (n :0) (n :1) ("estació"ri) (-1 ("<l'>"ri)) (0 ("<estació>"ri)) ; # c: 5 -0.001903 SUBSTITUTE:r5 (n :0) (n :1) ("estació"ri) (-1 ("el"ri)) (0 ("estació"ri)) ; # c: 5 -0.002080 SUBSTITUTE:r2 (n :0) (n :1) ("estació"ri) (0 ("<estació>"ri)) ; # c: 10 -0.002281 SUBSTITUTE:r100 (n :0) (n :1) ("llengua"ri) (0 ("llengua"ri)) (1 ("de"ri)) ; # c: 1 -0.002785 SUBSTITUTE:r1 (n :0) (n :1) ("estació"ri) (0 ("estació"ri)) ; # c: 12 -0.006686 SUBSTITUTE:r112 (n :0) (n :1) ("llengua"ri) (-1 ("<la>"ri)) (0 ("<llengua>"ri)) ; # c: 1 -0.009206 SUBSTITUTE:r110 (n :0) (n :1) ("llengua"ri) (-1 ("el"ri)) (0 ("llengua"ri)) ; # c: 1 -0.013721 SUBSTITUTE:r17 (n :0) (n :1) ("llengua"ri) (0 ("<llengua>"ri)) ; # c: 
2 -0.020805 SUBSTITUTE:r16 (n :0) (n :1) ("llengua"ri) (0 ("llengua"ri)) ; # c: 2
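The ranked rule list above can be read back into records for filtering or inspection. What follows is a minimal Python sketch, not the script that produced the list. It assumes each record has the shape: score, then SUBSTITUTE:rN (n :0) (n :K) ("lemma"ri), then the context tests, then ; # c: count. Reading K as the index of the selected translation and the leading number as the language-model score difference is an interpretation. The names ScoredRule and parse_rules are hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple


class ScoredRule(NamedTuple):
    score: float                    # number printed before the rule (assumed: LM score difference)
    rule_id: str                    # e.g. "r99"
    lemma: str                      # lemma of the ambiguous word
    target_index: int               # K in "(n :K)" (assumed: index of the chosen translation)
    context: List[Tuple[int, str]]  # (offset, token) pairs; "<...>" marks a surface form
    count: int                      # value after "# c:"


RULE_RE = re.compile(
    r'(-?\d+\.\d+)\s+SUBSTITUTE:(r\d+)\s+\(n :0\)\s+\(n :(\d+)\)\s+'
    r'\("([^"]+)"ri\)\s+(.*?)\s*;\s*#\s*c:\s*(\d+)'
)
CONTEXT_RE = re.compile(r'\((-?\d+) \("([^"]*)"ri\)\)')


def parse_rules(text: str) -> List[ScoredRule]:
    """Parse a (possibly line-wrapped) dump of scored rules."""
    flat = " ".join(text.split())  # undo line wrapping inside records
    rules = []
    for m in RULE_RE.finditer(flat):
        score, rule_id, target, lemma, ctx, count = m.groups()
        context = [(int(off), tok) for off, tok in CONTEXT_RE.findall(ctx)]
        rules.append(
            ScoredRule(float(score), rule_id, lemma, int(target), context, int(count))
        )
    return rules


sample = (
    '0.000027 SUBSTITUTE:r99 (n :0) (n :1) ("llengua"ri) (0 ("llengua"ri)) '
    '(1 ("de"ri)) (2 ("terra"ri)) ; # c: 1 -0.000011 SUBSTITUTE:r33 (n :0) (n :2) ("ràbia"ri)\n'
    '(0 ("ràbia"ri)) ; # c: 1'
)
for rule in parse_rules(sample):
    print(rule.rule_id, rule.score, rule.context)
```

A record whose score has been cut off (for example at the very start of a pasted fragment) is skipped rather than guessed at.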
Bugs
- Gender needs to be taken into account when building the ambiguous corpus (e.g. apply a rule to n.f but not to n.m).
- Deal with :<sent> (in some of the "estació" rules above, e.g. r137 and r139, ":" appears as the word at position 0).
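One possible way to approach the gender bug is to compare the tags of an analysis against a dotted category such as n.f before applying a rule. The Python sketch below assumes analyses of the form ^lemma<tag><tag>$. The helper names parse_lexical_unit and matches_category are hypothetical and not part of any existing module.

```python
import re
from typing import List, Tuple

# One lexical unit: "^" lemma, zero or more "<tag>" groups, "$".
LU_RE = re.compile(r'\^([^<$]*)((?:<[^>]+>)*)\$')


def parse_lexical_unit(lu: str) -> Tuple[str, List[str]]:
    """Split '^estació<n><f><sg>$' into ('estació', ['n', 'f', 'sg'])."""
    m = LU_RE.fullmatch(lu)
    if m is None:
        raise ValueError(f"not a lexical unit: {lu!r}")
    lemma, tags = m.groups()
    return lemma, re.findall(r'<([^>]+)>', tags)


def matches_category(tags: List[str], category: str) -> bool:
    """True if the dotted category (e.g. 'n.f') is a prefix of the tag list."""
    wanted = category.split(".")
    return tags[:len(wanted)] == wanted


lemma, tags = parse_lexical_unit("^estació<n><f><sg>$")
print(lemma, tags, matches_category(tags, "n.f"), matches_category(tags, "n.m"))
```

With a check of this kind, a rule learnt for n.f would not be applied to a homograph analysed as n.m.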
Tests
- Catalan→English
- 128 nouns (abús, acord, adoració, ànsia, anunci, armari, arrel, arribada, aspecte, assaig, audiència, avançament, bany, blau, bomba, camp, cap, cargol, carn, càrrega, carta, casa, cel, cinta, classe, cobertura, company, compra, compromís, conferència, confiança, consell, consulta, cop, cor, criatura, cuina, cura, desig, deure, dipòsit, director, disposició, dit, emissió, ensenyament, entorn, entrada, escàner, escenari, estació, estudi, exposició, feina, fons, font, força, formació, forma, fotografia, fuet, fusió, herència, història, hora, impressió, incapacitat, índex, institut, interpretació, intèrpret, investigació, llaç, llengua, lletra, lloguer, mal, marca, marxa, matrimoni, moneda, nau, ocupació, oferta, participació, partit, passeig, pati, patró, pell, pena, pensió, persecució, perspectiva, pis, pla, política, porc, pressupost, prova, puntuació, ràbia, raça, recanvi, recepta, redacció, règim, relació, rellotge, resposta, ritme, segell, sentit, sonda, sortida, substitució, subvenció, tall, teixit, terra, test, to, vestit, vista, xarxa, xicot, xoc).
- Average 2.19 translations per noun.
- Source example corpus: Catalan Wikipedia, 376,857 lines after stripping, approx. 11,500,999 words.
- Target LM corpus: English Wikipedia, 5,860,665 lines after stripping, approx. 183,341,729 words.
- Total sentences with one or more of the ambiguous words: 86,045
- Total ambiguous sentences: 258,619 (average of 3.00 translations per sentence).
- Total candidate sentences: 3,756 (those with a probability difference of > 0.1).
- Candidate unigram/bigram/trigram rules of surface forms / lemmas: 20,462
- Performance
- Ambiguate 376,857 line corpus (~148 mins)
- Translate ambiguated corpus (~23 mins)
- Rank translations (~7 mins)
- Process 86,045 lines with 1 CG rule (~2 mins)
- Process ~17,740 rules (~14 hours)
- Aggregate ~35G rule output (~24 mins) -- giving ~184,429 lines out of
- Lines of results 37,107,736 (970M)
Breton--French
- Ambiguate 185,462 line corpus (744 mins)
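The candidate unigram/bigram/trigram rules counted in the tests can be illustrated with a short Python sketch. It enumerates the context windows that appear in the ranked list (0; -1 0; 0 1; -1 0 1; 0 1 2) around an ambiguous word and prints them as context tests, with surface forms wrapped in <...> and lemmas left bare. The window set is read off the rule list and may not be complete. The names WINDOWS, candidate_contexts and to_cg_context are hypothetical.

```python
from typing import List, Sequence, Tuple

Context = Tuple[Tuple[int, str], ...]

# Offsets relative to the ambiguous word (assumed from the windows seen in the rule list).
WINDOWS = [(0,), (-1, 0), (0, 1), (-1, 0, 1), (0, 1, 2)]


def candidate_contexts(tokens: Sequence[str], i: int) -> List[Context]:
    """Return every window around tokens[i] that fits inside the sentence."""
    out = []
    for window in WINDOWS:
        positions = [i + off for off in window]
        if min(positions) < 0 or max(positions) >= len(tokens):
            continue  # window would run past the sentence boundary
        out.append(tuple((off, tokens[i + off]) for off in window))
    return out


def to_cg_context(context: Context, surface: bool = False) -> str:
    """Render a context as tests like (0 ("estació"ri)) (1 ("de"ri))."""
    parts = []
    for off, tok in context:
        form = f"<{tok}>" if surface else tok
        parts.append(f'({off} ("{form}"ri))')
    return " ".join(parts)


lemmas = ["el", "estació", "de", "tren"]
for ctx in candidate_contexts(lemmas, 1):
    print(to_cg_context(ctx))
```

Running the same function over the surface forms of the sentence, with surface=True, gives the second family of candidates.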
More notes and comments
- Evaluation
- 10-fold cross validation.
- Differential evaluation.
- Ambiguous words with a frequency of >threshold in the corpus should be avoided (e.g. 'get').
- Can we extract entries from a tagged 'probabilistic' dictionary?
- Are we relying too much on the language model score? Perhaps we can look at the score, the rule-application frequency, and the ratio of correct to incorrect applications together. For example, a rule that increases the score only a little, but always increases it, might be better than a rule that sometimes increases the score a lot and sometimes decreases it a lot.
- Discard rules with conflicting senses (translations) for a given context.
- Remove closed classes from both LM and from the text.
- Try with the 5-gram LM (rules and Bayes)
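The idea of weighing consistency against raw score gain can be sketched in Python as follows. The sketch assumes that for each rule we have the list of language-model score differences, one per application, and that a positive difference counts as a correct application. It orders rules by the share of positive applications first and by mean difference second. The names summarise and rank_rules and the numbers in the example are made up for illustration.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summarise(deltas: List[float]) -> Tuple[float, float, int]:
    """Return (win_ratio, mean_delta, applications) for one rule."""
    wins = sum(1 for d in deltas if d > 0)
    return wins / len(deltas), mean(deltas), len(deltas)


def rank_rules(rule_deltas: Dict[str, List[float]]) -> List[str]:
    """Order rule ids by win ratio, then by mean delta (both descending)."""
    stats = {rid: summarise(d) for rid, d in rule_deltas.items() if d}
    return sorted(stats, key=lambda rid: stats[rid][:2], reverse=True)


# Illustrative numbers only: one rule that always helps a little,
# one that helps a lot half of the time and hurts a lot otherwise.
example = {
    "steady": [0.01, 0.02, 0.01, 0.01],
    "volatile": [0.5, -0.4, 0.6, -0.5],
}
print(rank_rules(example))
```

Here the volatile rule has the higher mean difference, but the steady rule is ranked first because every one of its applications improves the score.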