Difference between revisions of "Generating lexical-selection rules from monolingual corpora"
Fpetkovski (talk | contribs) |
Revision as of 05:24, 23 September 2013
This page describes how to generate lexical selection rules without relying on a parallel corpus.
Prerequisites
- apertium-lex-tools
- IRSTLM
- A language pair (e.g. apertium-br-fr)
  - The language pair should have the following two modes:
    - -multi, which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)
    - -pretransfer, which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)
Annotation
Take your corpus and run it through the lexical transfer:
cat $(CORPUS).$(DIR).txt | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-pretransfer | lt-proc -b $(DATA)/$(AUTOBIL) > $@
Then keep only the lines that have more than one and fewer than 10,000 translations, that contain an ambiguous noun, verb or adjective, and that have at least 90% morphological coverage.
cat $< | python3 $(SCRIPTS)/trim-fertile-lines.py | python3 $(SCRIPTS)/biltrans-line-only-pos-ambig.py | python3 $(SCRIPTS)/biltrans-trim-uncovered.py > $@
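The first of these filters can be illustrated with a minimal sketch. It is not the code of trim-fertile-lines.py; it assumes that the number of translations of a line is the product of the number of target alternatives of each lexical unit, and that the input is in the Apertium stream format (^source/target1/target2$) without escaped slashes. The function names and the sample line are hypothetical.

```python
import re
from math import prod

# One lexical unit in Apertium stream format: ^ ... $
# (assumption: no escaped "/", "^" or "$" inside a unit)
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def count_paths(line):
    """Return the number of translation combinations for one biltrans line."""
    counts = []
    for unit in LEXICAL_UNIT.findall(line):
        # The first field is the source form; the rest are translations.
        translations = unit.split("/")[1:]
        # A unit without any translation still contributes one path.
        counts.append(max(len(translations), 1))
    return prod(counts)


def keep_line(line, upper=10000):
    """Keep lines with more than one and fewer than `upper` combinations."""
    return 1 < count_paths(line) < upper


if __name__ == "__main__":
    # Hypothetical sample line, not taken from a real corpus.
    sample = "^dog<n>/chien<n>$ ^bank<n>/banque<n>/rive<n>$"
    print(count_paths(sample), keep_line(sample))
```

A line whose units each have a single translation yields a count of 1 and is dropped, since it offers nothing to disambiguate.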
Generate all the possible disambiguation paths:
cat $< | python $(SCRIPTS)/biltrans-to-multitrans-line-recursive.py > $@
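To show what "all the possible disambiguation paths" means, here is a small sketch that expands one line into every combination of translation choices. It uses itertools.product instead of the recursion used by biltrans-to-multitrans-line-recursive.py, and the output format (one path per string, units joined by a single space) is an assumption made for illustration, not the script's actual format. The function name is hypothetical.

```python
import re
from itertools import product

# One lexical unit in Apertium stream format: ^ ... $
# (assumption: no escaped "/", "^" or "$" inside a unit)
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def expand_paths(line):
    """Yield one string per disambiguation path of a biltrans line."""
    options = []
    for unit in LEXICAL_UNIT.findall(line):
        source, *targets = unit.split("/")
        # Pair the source form with each translation in turn; a unit
        # without translations is passed through unchanged.
        candidates = [f"^{source}/{t}$" for t in targets] or [f"^{unit}$"]
        options.append(candidates)
    # The Cartesian product picks exactly one candidate per unit.
    for combination in product(*options):
        yield " ".join(combination)


if __name__ == "__main__":
    # Hypothetical sample line, not taken from a real corpus.
    for path in expand_paths("^a<n>/x<n>/y<n>$ ^b<n>/z<n>$"):
        print(path)
```

The number of paths grows multiplicatively with each ambiguous unit, which is why very fertile lines are trimmed in the previous step.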
Translate all possible disambiguation paths:
cat $< | apertium -f none -d $(DATA) $(DIR)-multi > $@
Score all the possible disambiguation paths with IRSTLM.
Rule-extraction
First extract the default translations:
Then extract the n-gram partial counts: