Difference between revisions of "Generating lexical-selection rules"
:''This module deals with [[lexical selection]]; for more information on the topic, see the [[lexical selection|main page]].''

{{deprecated}}

{{TOCD}}

:''Latest revision as of 21:07, 1 December 2013.''

[[Category:Development]]
[[Category:Lexical selection]]
[[Category:Documentation in English]]
== Preparation ==

=== Wikipedia ===
Wikipedia can be downloaded from downloads.wikimedia.org.
<pre>
$ wget http://download.wikimedia.org/iswiki/20100306/iswiki-20100306-pages-articles.xml.bz2
</pre>
You will need to make a nice corpus from Wikipedia, with around one sentence per line. There are scripts in <code>apertium-lex-learner/wikipedia</code> that do this for different Wikipedias; they will probably need some supervision. You will also need the analyser from your language pair, passed as the second argument to the script.
<pre>
$ sh strip-wiki-markup.sh iswiki-20100306-pages-articles.xml.bz2 is-en.automorf.bin > is.crp.txt
</pre>
Then tag the corpus:
<pre>
$ cat is.crp.txt | apertium -d ~/source/apertium/trunk/apertium-is-en is-en-tagger > is.tagged.txt
</pre>
You may also want to create a tagged frequency list from the corpus to avoid generating rules for very common ambiguous words:
<pre>
cat is.tagged.txt | sed 's/\$\W*\^/$\n^/g' | grep '<' | sed 's/><.*$/>/g' | cut -f2 -d'/' | sed 's/\$//g' | sort -f | \
grep -v '<num>' | grep -v '<np>' | grep '>$' | uniq -c | sort -gr > is.frequency.txt
</pre>
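The counting done by this pipeline can be sketched in Python. This is an illustration only, not part of the Apertium toolchain, and it assumes the usual Apertium stream format <code>^surface/lemma&lt;tags&gt;$</code> with simplified field handling:

```python
import re
from collections import Counter

def frequency_list(tagged_text):
    """Count lemma-plus-first-tag entries of analysed lexical units,
    skipping numerals (<num>) and proper nouns (<np>)."""
    counts = Counter()
    # Split the stream into lexical units of the form ^surface/analysis$
    for unit in re.findall(r'\^[^$]*\$', tagged_text):
        if '<' not in unit:          # unanalysed unit: nothing to count
            continue
        analysis = unit.split('/')[1]               # first analysis, as cut -f2 -d'/'
        analysis = re.sub(r'><.*$', '>', analysis)  # keep only the first tag
        analysis = analysis.rstrip('$')
        if '<num>' in analysis or '<np>' in analysis or not analysis.endswith('>'):
            continue
        counts[analysis] += 1
    return counts.most_common()
```

For example, <code>frequency_list('^orð/orð&lt;n&gt;&lt;f&gt;&lt;sg&gt;$ ^123/123&lt;num&gt;$')</code> counts <code>orð&lt;n&gt;</code> once and skips the numeral.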
=== Language model ===
It is assumed that you have a language model left over, probably from previous NLP experiments, that it is in [[IRSTLM]] binary format, and that it is called <code>en.blm</code>. Instructions on how to make one of these may be added here in future.
=== Files ===
So now we should have four files:
* <code>is.crp.txt</code>: The original corpus of the source language
* <code>is.tagged.txt</code>: The tagged corpus of the source language
* <code>is.frequency.txt</code>: A frequency list of lemmas with part-of-speech, generated from the source language corpus
* <code>en.blm</code>: A binarised language model of surface forms of the target language
== Steps ==

=== Ambiguate source language corpus ===
In the first stage, we take the source language corpus, expand it using the bilingual dictionary to generate all the possible translation paths, translate it, and score it on the target language model.
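The expansion idea can be sketched as follows. This is a hedged illustration of the principle only — the real <code>generate_sl_ambig_corpus.py</code> reads the bilingual dictionary itself; here the per-word alternatives are supplied directly:

```python
from itertools import product

def expand_variants(words):
    """words: a list, one entry per position, each a list of alternative
    translations for that word. Returns every combination (translation path),
    numbered implicitly by list order."""
    return [list(combo) for combo in product(*words)]
```

For instance, a sentence where only one word is ambiguous ("own"/"have") expands into exactly two variants, which are then translated and ranked.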
<pre>
$ cat is.tagged.txt | python generate_sl_ambig_corpus.py apertium-is-en.is-en.dix lr > is.ambig.txt
$ cat is.ambig.txt | sh translator_pipeline.sh > is.translated.txt
$ cat is.translated.txt | irstlm-ranker en.blm > is.ranked.txt
</pre>
This gives us a set of sentences in the target language with attached scores. The number in the first column gives the language model score for the sentence. The second column is split into four parts: the sentence ID, the translation variant, the position(s) in the sentence of the ambiguous word(s), and the number of words (lexical units) in the sentence.
<pre>
-3.15967 || [4:0:4:10 || ].[] Finland and Sweden own near 700 years of joint history.
-2.89170 || [4:1:4:10 || ].[] Finland and Sweden have near 700 years of joint history.
-3.80183 || [15:0:3:13 || ].[] Now when counted that about 270 Saimaa - hringanórar are on life.
-3.81782 || [15:1:3:13 || ].[] Now when reckoned that about 270 Saimaa - hringanórar are on life.
-3.01545 || [39:0:1:7 || ].[] The universal suffrage was come on 1918-1921.
-3.30002 || [39:1:1:7 || ].[] The common suffrage was come on 1918-1921.
-3.26693 || [39:2:1:7 || ].[] The general suffrage was come on 1918-1921.
-2.74306 || [20:0:5:10 || ].[] Swedish is according to laws the only official language on Álandseyjum.
-3.29975 || [20:1:5:10 || ].[] Swedish is according to laws the alone official language on Álandseyjum.
-3.30206 || [60:0:3:9 || ].[] Nominative case is only of four falls in Icelandic.
-3.56592 || [60:1:3:9 || ].[] Nominative case is alone of four falls in Icelandic.
-2.50803 || [147:0:6:9 || ].[] The state, the municipality and the town are called all Ósló.
-2.96652 || [147:1:6:9 || ].[] The state, the municipality and the town promise all Ósló.
-3.56343 || [154:0:1,9:11 || ].[] Been called is seldom used about authors the ones that novels compose.
-3.56343 || [154:1:1,9:11 || ].[] Been called is seldom used about authors the ones that novels negotiate.
-3.56343 || [154:2:1,9:11 || ].[] Been called is seldom used about authors the ones that novels write.
-3.57915 || [154:3:1,9:11 || ].[] Promised is seldom used about authors the ones that novels compose.
-3.57915 || [154:4:1,9:11 || ].[] Promised is seldom used about authors the ones that novels negotiate.
-3.57915 || [154:5:1,9:11 || ].[] Promised is seldom used about authors the ones that novels write.
-3.28529 || [22791:0:3:11 || ].[] He has composed 32 books and on fifth hundred detect.
-3.52811 || [22791:1:3:11 || ].[] He has negotiated 32 books and on fifth hundred detect.
-3.11449 || [22791:2:3:11 || ].[] He has written 32 books and on fifth hundred detect.
</pre>
As can be seen, in most of the cases the default translation (second column <code>0</code>) is the best one, but in some cases, for example in <code>22791</code>, another translation is better.
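A line of this ranker output can be pulled apart programmatically. The sketch below is an assumption-laden helper (not one of the scripts used here) that parses the format exactly as printed above:

```python
import re

def parse_ranked_line(line):
    """Split a ranked line such as
    "-2.89170 || [4:1:4:10 || ].[] Finland and Sweden have ..."
    into (score, sentence_id, variant, positions, n_words, text)."""
    m = re.match(
        r'(-?\d+\.\d+)\s*\|\|\s*\[(\d+):(\d+):([\d,]+):(\d+)\s*\|\|\s*\]\.\[\]\s*(.*)',
        line)
    if m is None:
        return None
    score, sent_id, variant, positions, n_words, text = m.groups()
    return (float(score), int(sent_id), int(variant),
            [int(p) for p in positions.split(',')],  # several ambiguous words: "1,9"
            int(n_words), text)
```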
Doing this step can also be an interesting way to find holes in the dictionaries. For example, the translation of ''af fjórum föllum'', 'of four cases', as 'of four falls' shows the missing translation of "fall" as "case".
=== Extract candidate phrases ===
In this stage, we extract the candidate phrases where a translation with a non-default word in it receives a score higher than the default one by a certain threshold (in this case <code>0.1</code>).
<pre>
$ cat is.ranked.txt | python extract_candidate_phrases.py 0.1 > en.candidates.txt
</pre>
So, for example,
<pre>
-3.15967 || [4:0:4:10 || ].[] Finland and Sweden own near 700 years of joint history.
-2.89170 || [4:1:4:10 || ].[] Finland and Sweden have near 700 years of joint history.
-3.07414 || [330:0:7:15 || ].[] Has sung on records, both only and with another, among other things with Ladda.
-2.92558 || [330:1:7:15 || ].[] Has sung on records, both alone and with another, among other things with Ladda.
-4.27694 || [1315:0:4:8 || ].[] Manual of Wikipedias contains universal instructions about Wikipedia.
-4.02396 || [1315:1:4:8 || ].[] Manual of Wikipedias contains common instructions about Wikipedia.
-3.98526 || [1315:2:4:8 || ].[] Manual of Wikipedias contains general instructions about Wikipedia.
</pre>
This file will be much smaller than the original file as it only has sentences where the default translation gives a lower score than one of the alternatives. We hope that these sentences contain interesting information that can be used to make rules for lexical selection.
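The selection criterion described above can be sketched as follows. This is an illustration of the principle as stated in the text, not the actual logic of <code>extract_candidate_phrases.py</code>; scores are log-probabilities, so higher (less negative) is better:

```python
def select_candidates(variants, threshold=0.1):
    """variants: dict mapping translation-variant number -> language-model score.
    Variant 0 is the default translation. Returns True when some non-default
    variant beats the default by at least `threshold`."""
    default = variants[0]
    return any(v != 0 and score - default >= threshold
               for v, score in variants.items())
```

With the scores above, sentence 4 qualifies (variant 1 beats variant 0 by about 0.27), while sentence 20 from the earlier listing would not.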
=== Generate candidate rules ===
The next step is to generate the candidate lexical selection rules from the candidate phrases. We suppose that there is some information (be it lexical, syntactic or semantic) in the phrases that leads one translation to be a better choice than another. The idea of this stage is to generate all the possibilities so that later we can test them and find out which ones increase our translation score.
<pre>
$ python generate_candidate_rules.py is.ambig.txt en.candidates.txt > is.rules.txt
</pre>
At present the rule generation works only on lemma and surface form, and with a word window of unigrams to trigrams. It is planned to add other user-defined features (such as syntactic function, case, part-of-speech etc.). The rules are currently output in the VISL constraint grammar formalism, but in principle they could work with other systems.
<pre>
DELIMITERS = "<.>"ri "<:>"ri "<!>"ri "<?>"ri sent;
SOFT-DELIMITERS = "<,>" ;

SECTION

SUBSTITUTE:r1 ("einn") ("einn:1") ("einn") (0 ("einn")) (1 ("og")) (2 ("með")) ; #c:1
SUBSTITUTE:r2 ("einn") ("einn:1") ("einn") (0 ("einn")) (1 ("og")) ; #c:1
SUBSTITUTE:r3 ("einn") ("einn:1") ("einn") (0 ("<einn>")) (1 ("<og>")) (2 ("<með>")) ; #c:1
SUBSTITUTE:r4 ("einn") ("einn:1") ("einn") (0 ("<einn>")) (1 ("<og>")) ; #c:1
SUBSTITUTE:r5 ("einn") ("einn:1") ("einn") (0 ("<einn>")) ; #c:1
SUBSTITUTE:r6 ("einn") ("einn:1") ("einn") (-1 ("báðir")) (0 ("einn")) (1 ("og")) ; #c:1
SUBSTITUTE:r7 ("einn") ("einn:1") ("einn") (-1 ("báðir")) (0 ("einn")) ; #c:1
SUBSTITUTE:r8 ("einn") ("einn:1") ("einn") (-1 ("<bæði>")) (0 ("<einn>")) (1 ("<og>")) ; #c:1
SUBSTITUTE:r9 ("einn") ("einn:1") ("einn") (-1 ("<bæði>")) (0 ("<einn>")) ; #c:1
SUBSTITUTE:r10 ("eiga") ("eiga:1") ("eiga") (0 ("eiga")) (1 ("nærri")) (2 ("700")) ; #c:1
SUBSTITUTE:r11 ("eiga") ("eiga:1") ("eiga") (0 ("eiga")) (1 ("nærri")) ; #c:1
SUBSTITUTE:r12 ("eiga") ("eiga:1") ("eiga") (0 ("<eiga>")) (1 ("<nærri>")) (2 ("<700>")) ; #c:1
SUBSTITUTE:r13 ("eiga") ("eiga:1") ("eiga") (0 ("<eiga>")) (1 ("<nærri>")) ; #c:1
SUBSTITUTE:r14 ("eiga") ("eiga:1") ("eiga") (0 ("<eiga>")) ; #c:1
SUBSTITUTE:r15 ("eiga") ("eiga:1") ("eiga") (-1 ("Svíþjóð")) (0 ("eiga")) (1 ("nærri")) ; #c:1
SUBSTITUTE:r16 ("eiga") ("eiga:1") ("eiga") (-1 ("Svíþjóð")) (0 ("eiga")) ; #c:1
SUBSTITUTE:r17 ("eiga") ("eiga:1") ("eiga") (-1 ("<Svíþjóð>")) (0 ("<eiga>")) (1 ("<nærri>")) ; #c:1
SUBSTITUTE:r18 ("eiga") ("eiga:1") ("eiga") (-1 ("<Svíþjóð>")) (0 ("<eiga>")) ; #c:1
SUBSTITUTE:r19 ("almennur") ("almennur:2") ("almennur") (0 ("almennur")) (1 ("leiðbeining")) (2 ("um")) ; #c:1
SUBSTITUTE:r20 ("almennur") ("almennur:2") ("almennur") (0 ("almennur")) (1 ("leiðbeining")) ; #c:1
SUBSTITUTE:r21 ("almennur") ("almennur:2") ("almennur") (0 ("<almennar>")) (1 ("<leiðbeiningar>")) (2 ("<um>")) ; #c:1
SUBSTITUTE:r22 ("almennur") ("almennur:2") ("almennur") (0 ("<almennar>")) (1 ("<leiðbeiningar>")) ; #c:1
SUBSTITUTE:r23 ("almennur") ("almennur:2") ("almennur") (0 ("<almennar>")) ; #c:1
SUBSTITUTE:r24 ("almennur") ("almennur:2") ("almennur") (-1 ("innihalda")) (0 ("almennur")) (1 ("leiðbeining")) ; #c:1
SUBSTITUTE:r25 ("almennur") ("almennur:2") ("almennur") (-1 ("innihalda")) (0 ("almennur")) ; #c:1
SUBSTITUTE:r26 ("almennur") ("almennur:2") ("almennur") (-1 ("<inniheldur>")) (0 ("<almennar>")) (1 ("<leiðbeiningar>")) ; #c:1
SUBSTITUTE:r27 ("almennur") ("almennur:2") ("almennur") (-1 ("<inniheldur>")) (0 ("<almennar>")) ; #c:1
</pre>
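The unigram-to-trigram window scheme behind these rules can be sketched in Python. This is an illustration, not the actual <code>generate_candidate_rules.py</code>: it enumerates every context window of length one to three covering the ambiguous word, which is roughly how the rules above pair the target position (0) with neighbouring words at offsets -1, +1 and +2:

```python
def context_windows(words, i, max_len=3):
    """Yield all (offset, word) windows of length 1..max_len that cover
    position i, with offsets relative to the ambiguous word."""
    n = len(words)
    windows = []
    for length in range(1, max_len + 1):
        # The window may start anywhere that still covers position i.
        for start in range(i - length + 1, i + 1):
            if 0 <= start and start + length <= n:
                windows.append([(j - i, words[j]) for j in range(start, start + length)])
    return windows
```

For "báðir einn og með" with the ambiguous word at position 1, this yields the five patterns seen in rules r1-r9: (0), (-1, 0), (0, +1), (-1, 0, +1) and (0, +1, +2). The same enumeration is done once over lemmas and once over surface forms.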
=== Score candidate rules ===
After we've generated all the possible rules, we want to score them to find out which ones offer an improvement over the baseline translation.
<pre>
$ cg-comp is.rules.txt is.rules.bin
$ cg-comp empty.rlx empty.rlx.bin
$ cat is.ambig.txt | grep -e '^\[[0-9]\+:0:' | sed 's/:0</</g' | cg-proc empty.rlx.bin > is.baseline.txt
$ mkdir ranking
$ python generate_rule_diffs.py is.baseline.txt is.rules.txt is.rules.bin translator_pipeline.sh ranking
$ python rank_candidate_rules.py is.baseline.txt is.rules.txt translator_pipeline.sh ranking
$ python aggregate_rule_ranks.py is.rules.txt ranking
</pre>
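The underlying idea of the comparison can be sketched as follows. This is only an assumed simplification of what <code>rank_candidate_rules.py</code> and <code>aggregate_rule_ranks.py</code> do: a rule is credited with the language-model improvement of the rule-applied translations over the baseline translations, summed over the sentences it affects:

```python
def score_rule(baseline_scores, rule_scores):
    """Both arguments map sentence_id -> language-model score of that
    sentence's translation (log-probability, higher is better).
    rule_scores covers only the sentences the rule changed."""
    return sum(rule_scores[s] - baseline_scores[s] for s in rule_scores)
```

A positive total means the rule tended to improve translations; a large negative total (like r5 below) means it made them worse.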
<pre>
121.237 SUBSTITUTE:r14 ("eiga") ("eiga:1") ("eiga") (0 ("<eiga>")) ; #c:1
4.0657 SUBSTITUTE:r2 ("einn") ("einn:1") ("einn") (0 ("einn")) (1 ("og")) ; #c:1
2.3651 SUBSTITUTE:r23 ("almennur") ("almennur:2") ("almennur") (0 ("<almennar>")) ; #c:1
1.2004 SUBSTITUTE:r4 ("einn") ("einn:1") ("einn") (0 ("<einn>")) (1 ("<og>")) ; #c:1
0.8574 SUBSTITUTE:r20 ("almennur") ("almennur:2") ("almennur") (0 ("almennur")) (1 ("leiðbeining")) ; #c:1
0.6348 SUBSTITUTE:r1 ("einn") ("einn:1") ("einn") (0 ("einn")) (1 ("og")) (2 ("með")) ; #c:1
0.6034 SUBSTITUTE:r9 ("einn") ("einn:1") ("einn") (-1 ("<bæði>")) (0 ("<einn>")) ; #c:1
0.6034 SUBSTITUTE:r8 ("einn") ("einn:1") ("einn") (-1 ("<bæði>")) (0 ("<einn>")) (1 ("<og>")) ; #c:1
0.6034 SUBSTITUTE:r7 ("einn") ("einn:1") ("einn") (-1 ("báðir")) (0 ("einn")) ; #c:1
0.6034 SUBSTITUTE:r6 ("einn") ("einn:1") ("einn") (-1 ("báðir")) (0 ("einn")) (1 ("og")) ; #c:1
0.6034 SUBSTITUTE:r3 ("einn") ("einn:1") ("einn") (0 ("<einn>")) (1 ("<og>")) (2 ("<með>")) ; #c:1
0.5934 SUBSTITUTE:r22 ("almennur") ("almennur:2") ("almennur") (0 ("<almennar>")) (1 ("<leiðbeiningar>")) ; #c:1
0.4772 SUBSTITUTE:r27 ("almennur") ("almennur:2") ("almennur") (-1 ("<inniheldur>")) (0 ("<almennar>")) ; #c:1
0.4772 SUBSTITUTE:r26 ("almennur") ("almennur:2") ("almennur") (-1 ("<inniheldur>")) (0 ("<almennar>")) (1 ("<leiðbeiningar>")) ; #c:1
0.4772 SUBSTITUTE:r25 ("almennur") ("almennur:2") ("almennur") (-1 ("innihalda")) (0 ("almennur")) ; #c:1
0.4772 SUBSTITUTE:r24 ("almennur") ("almennur:2") ("almennur") (-1 ("innihalda")) (0 ("almennur")) (1 ("leiðbeining")) ; #c:1
0.4772 SUBSTITUTE:r21 ("almennur") ("almennur:2") ("almennur") (0 ("<almennar>")) (1 ("<leiðbeiningar>")) (2 ("<um>")) ; #c:1
0.4772 SUBSTITUTE:r19 ("almennur") ("almennur:2") ("almennur") (0 ("almennur")) (1 ("leiðbeining")) (2 ("um")) ; #c:1
0.3314 SUBSTITUTE:r18 ("eiga") ("eiga:1") ("eiga") (-1 ("<Svíþjóð>")) (0 ("<eiga>")) ; #c:1
0.3314 SUBSTITUTE:r17 ("eiga") ("eiga:1") ("eiga") (-1 ("<Svíþjóð>")) (0 ("<eiga>")) (1 ("<nærri>")) ; #c:1
0.3314 SUBSTITUTE:r16 ("eiga") ("eiga:1") ("eiga") (-1 ("Svíþjóð")) (0 ("eiga")) ; #c:1
0.3314 SUBSTITUTE:r15 ("eiga") ("eiga:1") ("eiga") (-1 ("Svíþjóð")) (0 ("eiga")) (1 ("nærri")) ; #c:1
0.3314 SUBSTITUTE:r13 ("eiga") ("eiga:1") ("eiga") (0 ("<eiga>")) (1 ("<nærri>")) ; #c:1
0.3314 SUBSTITUTE:r12 ("eiga") ("eiga:1") ("eiga") (0 ("<eiga>")) (1 ("<nærri>")) (2 ("<700>")) ; #c:1
0.3314 SUBSTITUTE:r11 ("eiga") ("eiga:1") ("eiga") (0 ("eiga")) (1 ("nærri")) ; #c:1
0.3314 SUBSTITUTE:r10 ("eiga") ("eiga:1") ("eiga") (0 ("eiga")) (1 ("nærri")) (2 ("700")) ; #c:1
-75.448 SUBSTITUTE:r5 ("einn") ("einn:1") ("einn") (0 ("<einn>")) ; #c:1
</pre>