Difference between revisions of "Generating lexical-selection rules from a parallel corpus"
Revision as of 11:59, 17 January 2012
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.
You will need
Here is a list of software that you will need installed:
- Giza++ (or some other word aligner)
- Moses (for making Giza++ less human hostile)
- All the Moses scripts
- lttoolbox
- Apertium
- apertium-lex-tools
Furthermore you'll need:
- an Apertium language pair
- a parallel corpus
Getting started
We're going to do the example with Europarl and the English to Spanish pair in Apertium.
Assuming you have everything installed, the work will be as follows:
Prepare corpus
To generate the rules, we need three files:
- The tagged source corpus
- The tagged target corpus
- The output of the lexical transfer module in the source→target direction
These three files should be sentence-aligned: line N of each file should correspond to the same sentence.
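A quick sanity check for the alignment requirement above is to confirm that the files have the same number of lines. The following is a minimal sketch, assuming a POSIX shell; `same_line_count` is a hypothetical helper, not part of Apertium or Moses. Equal line counts are necessary for sentence alignment but do not prove it.

```shell
# Hypothetical helper (not part of Apertium or Moses): succeed only if
# every file given as an argument has the same number of lines.
same_line_count() {
    # $(( ... )) strips any padding that some wc implementations print
    expected=$(( $(wc -l < "$1") ))
    for f in "$@"; do
        n=$(( $(wc -l < "$f") ))
        if [ "$n" -ne "$expected" ]; then
            echo "$f has $n lines, expected $expected" >&2
            return 1
        fi
    done
    return 0
}
```

You would call it with the three files as arguments, for example `same_line_count source.tagged target.tagged source.biltrans` (these file names are placeholders).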
The first thing you need to do is clean the corpus, to remove empty and overly long sentences (here, keeping sentence pairs of 1 to 40 tokens).
$ perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl europarl-v6.es-en es en europarl.clean 1 40
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40
.....
(Replace the path /home/fran/local/bin/scripts-20120109-1229/training/ with the path to where you put the Moses scripts)
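To make the length cutoff concrete, here is a simplified sketch of that kind of filter, assuming a POSIX shell with `paste` and `awk`. It is not the Moses script: `filter_by_length` and its argument order are hypothetical, it assumes the sentences contain no tab characters, and the real script may apply further checks.

```shell
# Hypothetical sketch (not the Moses script): keep only the sentence pairs
# in which both sides have between MIN and MAX whitespace-separated tokens.
# Usage: filter_by_length SRC_FILE TRG_FILE OUT_STEM MIN MAX
# Writes OUT_STEM.src and OUT_STEM.trg; assumes no tabs inside sentences.
filter_by_length() {
    paste "$1" "$2" | awk -F'\t' -v out="$3" -v min="$4" -v max="$5" '
        {
            # split() on " " counts tokens, ignoring repeated spaces
            ns = split($1, s, " ")
            nt = split($2, t, " ")
            if (ns >= min && ns <= max && nt >= min && nt <= max) {
                # a pair is kept or dropped as a unit, so the two
                # output files stay sentence-aligned
                print $1 > (out ".src")
                print $2 > (out ".trg")
            }
        }'
}
```

Filtering both sides in one pass matters: dropping a long sentence from only one file would shift every later line and break the alignment.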