Difference between revisions of "Generating lexical-selection rules from monolingual corpora"
Fpetkovski (talk | contribs)
Revision as of 05:42, 23 September 2013
This page describes how to generate lexical selection rules without relying on a parallel corpus.
Prerequisites
- apertium-lex-tools
- IRSTLM
- A language pair (e.g. apertium-br-fr)
- The language pair should have the following two modes:
  - -multi, which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)
  - -pretransfer, which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)
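Before starting, it can help to confirm that both modes are actually defined. The following is a minimal sketch, not part of apertium-lex-tools: the helper name `missing_modes` is hypothetical, and it assumes the usual modes.xml layout in which each mode is a `<mode name="...">` element and the two extra modes are named `<pair>-pretransfer` and `<pair>-multi`.

```python
import xml.etree.ElementTree as ET


def missing_modes(modes_xml, pair):
    """Return the required mode names that are absent from a modes.xml document.

    modes_xml: the text of a modes.xml file (assumed layout: <mode name="...">).
    pair:      the pair prefix, e.g. "en-es" (assumed naming convention).
    """
    root = ET.fromstring(modes_xml)
    present = {mode.get("name") for mode in root.iter("mode")}
    required = [pair + "-pretransfer", pair + "-multi"]
    return [name for name in required if name not in present]


# Hypothetical modes.xml content that defines only one of the two extra modes.
example = """<modes>
  <mode name="en-es" install="yes"><pipeline/></mode>
  <mode name="en-es-pretransfer" install="no"><pipeline/></mode>
</modes>"""

print(missing_modes(example, "en-es"))  # ['en-es-multi']
```

An empty list means both modes are present; otherwise copy the corresponding mode definitions from apertium-mk-en/modes.xml and adapt them to your pair.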
Annotation
Important: If you don't want to go through the whole process step by step, you can use the Makefile provided in the last section of this page.
The example below uses EuroParl and the English to Spanish pair in Apertium.
Assuming everything listed under Prerequisites is installed, the steps are as follows:
Take your corpus and make a tagged version of it:
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged
Make an ambiguous version of your corpus and trim redundant tags:
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig
Next, generate all the possible disambiguation paths while trimming redundant tags:
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed
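To illustrate what "all the possible disambiguation paths" means, here is a rough sketch of the idea. It is not the multitrans implementation: the function name `expand_paths` is hypothetical, and it assumes a simplified Apertium stream format in which each lexical unit looks like `^source/target1/target2$`, ignoring escaped characters. Each path picks exactly one translation per unit, so the paths are the Cartesian product of the per-unit options.

```python
import itertools
import re

# One lexical unit: everything between ^ and $ (simplified, no escape handling).
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def expand_paths(line):
    """Yield every disambiguation path of a line as a list of (source, target) pairs."""
    units = []
    for match in LEXICAL_UNIT.finditer(line):
        source, *targets = match.group(1).split("/")
        # A unit without any translation is kept as-is so it does not erase all paths.
        targets = targets or [source]
        units.append([(source, target) for target in targets])
    for path in itertools.product(*units):
        yield list(path)


# Hypothetical input: "bank" has two translations, "large" has one.
line = "^bank<n>/banco<n>/orilla<n>$ ^large<adj>/grande<adj>$"
paths = list(expand_paths(line))
for path in paths:
    print(path)
```

The number of paths is the product of the number of translations of each unit, which is why very ambiguous sentences can become expensive to process.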
Translate and score all possible disambiguation paths:
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n | apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker ~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated
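One way to think about the scoring step is that the language-model scores of the competing paths for a sentence are turned into weights that sum to one. The sketch below shows that normalisation only; it is an assumption about the general idea, not a description of what irstlm-ranker does internally. The function name `fractional_counts` is hypothetical, and the scores are assumed to be natural-log probabilities.

```python
import math


def fractional_counts(log_probs):
    """Turn per-path log-probabilities into weights that sum to 1.

    Subtracting the maximum before exponentiating avoids underflow
    when the log-probabilities are very negative.
    """
    top = max(log_probs)
    weights = [math.exp(lp - top) for lp in log_probs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical scores: the first path is three times as probable as the second.
scores = [math.log(3.0), math.log(1.0)]
print(fractional_counts(scores))  # approximately [0.75, 0.25]
```

With weights like these, a likely translation contributes most of a count and an unlikely one only a small fraction, instead of every path counting equally.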
Now we have a pseudo-parallel corpus where each possible translation is scored. We start by extracting a frequency lexicon:
python3 ~/source/apertium/apertium-lex-tools-scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq
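Conceptually, a frequency lexicon of this kind accumulates, for every source word, the weighted number of times each translation was chosen. The sketch below illustrates that idea on in-memory data; it is not the code of biltrans-extract-frac-freq.py and does not read its file formats. The names `build_lexicon` and `most_frequent_translation`, and the shape of the input (a list of paths, each with a fractional weight), are assumptions made for illustration.

```python
from collections import defaultdict


def build_lexicon(scored_paths):
    """Accumulate fractional counts per (source, target) pair.

    scored_paths: iterable of (path, weight), where path is a list of
    (source, target) pairs and weight is the fractional count of that path.
    """
    freq = defaultdict(lambda: defaultdict(float))
    for path, weight in scored_paths:
        for source, target in path:
            freq[source][target] += weight
    return freq


def most_frequent_translation(freq, source):
    """Return the translation of `source` with the highest accumulated count."""
    return max(freq[source], key=freq[source].get)


# Hypothetical data: two sentences containing "bank<n>".
scored_paths = [
    ([("bank<n>", "banco<n>")], 0.75),
    ([("bank<n>", "orilla<n>")], 0.25),
    ([("bank<n>", "banco<n>")], 0.5),
    ([("bank<n>", "orilla<n>")], 0.5),
]
freq = build_lexicon(scored_paths)
print(dict(freq["bank<n>"]))                      # {'banco<n>': 1.25, 'orilla<n>': 0.75}
print(most_frequent_translation(freq, "bank<n>"))  # banco<n>
```

The translation with the highest accumulated count is a natural candidate for a word's default translation.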
Rule-extraction
First extract the default translations:
Then the ngram partial counts: