Difference between revisions of "Building a pseudo-parallel corpus"
Fpetkovski (talk | contribs)
Revision as of 21:50, 23 August 2012
Acquiring parallel corpora can be a difficult process and for some language pairs such resources might not exist. However, we can use a language model for the target language in order to create pseudo-parallel corpora, and use them in the same way as parallel ones.
IRSTLM
IRSTLM is a toolkit for building n-gram language models from corpora. It supports several smoothing and interpolation methods, including Witten-Bell smoothing and Kneser-Ney smoothing.
The full documentation and the complete toolkit are available from the IRSTLM project pages.
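To make the smoothing idea concrete, here is a minimal sketch of an interpolated Witten-Bell bigram estimate in plain Python. It does not use IRSTLM or reproduce its implementation; the function names (train_counts, witten_bell_prob) and the toy corpus are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def train_counts(sentences):
    """Collect the counts needed for a bigram model from tokenised sentences."""
    unigrams = Counter()          # counts of predicted tokens (words and </s>)
    bigrams = Counter()           # counts of (previous, word) pairs
    histories = Counter()         # how often each token occurs as a history
    followers = defaultdict(set)  # distinct word types seen after each history
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[1:])
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            histories[prev] += 1
            followers[prev].add(word)
    return unigrams, bigrams, histories, followers


def witten_bell_prob(word, prev, unigrams, bigrams, histories, followers):
    """Interpolated Witten-Bell estimate of P(word | prev).

    P(w | h) = (c(h, w) + T(h) * P_uni(w)) / (c(h) + T(h)),
    where T(h) is the number of distinct word types seen after h.
    Words never seen in training get probability 0 in this toy version.
    """
    p_uni = unigrams[word] / sum(unigrams.values())
    c_h = histories[prev]
    if c_h == 0:
        # Unseen history: fall back to the unigram estimate.
        return p_uni
    t_h = len(followers[prev])
    return (bigrams[(prev, word)] + t_h * p_uni) / (c_h + t_h)


counts = train_counts(["the cat sat", "the dog sat", "a cat ran"])
seen = witten_bell_prob("cat", "the", *counts)    # bigram seen in training
unseen = witten_bell_prob("sat", "the", *counts)  # bigram never seen
print(seen, unseen)
```

The unseen bigram still receives some probability mass, taken from the seen ones in proportion to how many different words have followed the history. That is the property that lets a language model compare candidate translations containing word sequences it has never observed.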
Building a pseudo-parallel corpus
The main idea is to take a source-language corpus and run it through the Apertium pipeline, but to let the target-language model, rather than Apertium, choose the preposition. The main steps are as follows (example for mk-en):
- Run the corpus through mk-en-biltrans
- Run through apertium-lex-tools/scripts/biltrans-to-multitrans.py
- Run through the rest of the pipeline from apertium-transfer -b onwards
- Run through apertium-lex-learner/irstlm-ranker
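The final ranking step can be illustrated with a small self-contained sketch: for each source sentence, every candidate translation is scored with a target-language model and the best-scoring one is kept as the pseudo-reference. This is not the irstlm-ranker script and does not call IRSTLM; it uses a toy add-one-smoothed bigram model, and all names (train_bigram_model, log_prob, pick_best, build_pseudo_parallel) and the example sentences are hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and histories in a (toy) target-language corpus."""
    bigrams = Counter()
    histories = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            histories[prev] += 1
    # One extra vocabulary slot stands in for unknown words.
    return bigrams, histories, len(vocab) + 1


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one-smoothed bigram model."""
    bigrams, histories, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size))
    return total


def pick_best(candidates, model):
    """Return the candidate translation the language model scores highest."""
    return max(candidates, key=lambda candidate: log_prob(candidate, model))


def build_pseudo_parallel(source_sentences, candidate_lists, model):
    """Pair each source sentence with its best-scoring candidate translation."""
    return [(src, pick_best(cands, model))
            for src, cands in zip(source_sentences, candidate_lists)]


# Toy target-language (English) corpus used to train the model.
model = train_bigram_model([
    "the book is on the table",
    "the cat is on the mat",
    "she lives in the city",
])

# Candidates differ only in the preposition, as in the multitrans output.
candidates = ["the cup is in the table", "the cup is on the table"]
corpus = build_pseudo_parallel(["<source sentence>"], [candidates], model)
print(corpus)
```

Because the candidates for one source sentence differ only in the chosen preposition, they have the same length and their raw log-probabilities can be compared directly. If candidates could differ in length, some form of length normalisation would probably be needed.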