Building a pseudo-parallel corpus

From Apertium
Latest revision as of 23:25, 23 August 2012

Acquiring parallel corpora can be a difficult process and for some language pairs such resources might not exist. However, we can use a language model for the target language in order to create pseudo-parallel corpora, and use them in the same way as parallel ones.

== IRSTLM ==

IRSTLM is a tool for building n-gram language models from corpora. It supports several smoothing and interpolation methods, including Witten-Bell and Kneser-Ney smoothing. The full documentation can be viewed [http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=Main_Page here], and the whole toolkit can be downloaded [http://sourceforge.net/projects/irstlm/ here].
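As a toy illustration of what an n-gram language model stores, the sketch below builds an unsmoothed bigram model. IRSTLM itself is a C++ toolkit with proper smoothing; this is only the underlying idea, not its implementation.

```python
# Toy bigram language model: count word pairs, then estimate
# P(word | previous word) by maximum likelihood. No smoothing,
# unlike the Witten-Bell / Kneser-Ney methods IRSTLM provides.
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def prob(unigrams, bigrams, prev, word):
    # Maximum-likelihood estimate of P(word | prev)
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

unigrams, bigrams = train_bigram(["he lives in Skopje", "he works in Skopje"])
print(prob(unigrams, bigrams, "in", "Skopje"))  # 1.0
```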

== Building a pseudo-parallel corpus ==

The main idea is to take a source-language corpus and run it through the Apertium pipeline, but let the language model, rather than Apertium, choose the translation of each preposition. The main algorithm is as follows (example for mk-en):

* Run the corpus through mk-en-biltrans.
* Run through apertium-ldx-proc to select the defaults for words with a POS other than <pr>. This step is necessary to avoid an explosion in the number of possible TL sentences.
* Run through apertium-lex-tools/scripts/biltrans-to-multitrans.py to expand each biltrans sentence to cover all the possible lexical transfers. You can also use apertium-lex-tools/scripts/biltrans-to-multitrans-line-recursive.py, a slower version that uses less memory.
* Run through the rest of the pipeline, from apertium-transfer -b onwards, to get target-language sentences.
* Run through apertium-lex-learner/irstlm-ranker-max.
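The expansion step above can be sketched as a Cartesian product over the translation choices of each word. The (word, translations) pair notation here is a simplification of Apertium's ^sl/tl1/tl2$ biltrans stream format, not the actual script's input handling.

```python
# Sketch of the expansion step: each source word carries one or more
# target-language translations; expansion generates every combination
# of choices as a separate candidate sentence.
from itertools import product

def expand(biltrans_words):
    """biltrans_words: list of (source_word, [translations]) pairs."""
    choices = [translations for _, translations in biltrans_words]
    return [" ".join(combo) for combo in product(*choices)]

line = [("тој", ["he"]), ("живее", ["lives"]),
        ("во", ["in", "at", "on"]), ("Скопје", ["Skopje"])]
for sentence in expand(line):
    print(sentence)
# he lives in Skopje / he lives at Skopje / he lives on Skopje
```

Since the number of variants multiplies with every ambiguous word, fixing defaults for all non-preposition words first (the apertium-ldx-proc step) keeps the expansion tractable.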


This way, for each expanded translation you get a source-language to target-language probability for each SL:TL pair. The most probable translation is marked with |@|.
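The ranking step can be approximated as follows: score every expanded variant with the language model and mark the winner with |@|. The score function here is a stand-in for a real LM query, and the exact output format of irstlm-ranker-max may differ.

```python
# Sketch of the ranking step: pick the variant the language model
# scores highest and prefix it with the |@| marker.
def rank(variants, score):
    best = max(variants, key=score)
    return [("|@|" + v) if v == best else v for v in variants]

# Hypothetical log-probabilities for the expanded variants
toy_scores = {"he lives in Skopje": -4.2,
              "he lives at Skopje": -7.9,
              "he lives on Skopje": -8.5}
ranked = rank(list(toy_scores), toy_scores.get)
print(ranked[0])  # |@|he lives in Skopje
```

Filtering the output for |@| (as the example script below does with grep) then keeps exactly one translation per source sentence.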

== Example ==

An example script for generating an mk-en pseudo-parallel corpus:


# extract defaults
 ~/Apertium/apertium-lex-tools/scripts/extract-default-ldx.py ~/Apertium/apertium-mk-en/apertium-mk-en.mk-en.dix lr | grep -v '"pr"' > mken-defaults.dix
lt-comp lr mken-defaults.dix mken-defaults.bin

# cat testing-set | generate biltrans | select default 
cat testing-set | apertium -d ~/Apertium/apertium-mk-en mk-en-biltrans | apertium-ldx-proc mken-defaults.bin > testing-set-biltrans;

# expand
cat testing-set-biltrans | ~/Apertium/apertium-lex-tools/scripts/biltrans-to-multitrans-line-recursive.py > testing-set-biltrans-expanded

# generate tl-side
cat testing-set-biltrans-expanded | bash biltrans-to-end.sh > testing-set-expanded

#train a language model
build-lm.sh -i training-set -o lm.lm.gz -n 5 -b
compile-lm lm.lm.gz lmodel.bin

cat testing-set-expanded | ~/Apertium/apertium-lex-learner/irstlm_ranker_max lmodel.bin | grep "|@|" > testing-set-final