Corpus based preposition selection - HOWTO

From Apertium
Revision as of 21:13, 20 August 2012 by Fpetkovski (talk | contribs)

The general algorithm for corpus-based preposition selection is as follows:

  • Download a parallel corpus
  • Extract patterns which contain prepositions from the source-language corpus
  • Align the patterns to their translations in the target-language corpus
  • Extract the features and corresponding labels (the correct preposition from the target-language corpus) for classification.
  • Train a model
  • Use the trained model in the translation pipeline

The general toolkit for performing these tasks can be found here.

The toolkit

Preposition Extraction

The preposition extraction tool reads a stream of lexical units in the format ^lemma<tags>$ ^lemma<tags>$ from standard input and prints a list of extracted patterns, which are later used in the alignment process.

Example:

echo "^This year<adv>$ ^'s<gen>$ ^process<n><sg>$ ^of<pr>$ ^privatisation<n><sg>$^.<sent>$" | ./preposition-extraction.bin 
output:
^process<n><sg>$.. ^of<pr>$.. ^privatisation<n><sg>$
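
The extraction step can be approximated with standard Unix tools. The following is a minimal sketch, not the actual preposition-extraction tool: it assumes the pattern shape (n|vblex) (pr) (adj|det)* (n|vblex) described in the training phase below, treats the first tag of each unit as its category, and uses a hypothetical function name, extract_patterns.

```shell
# Hypothetical stand-in for the extraction tool, for illustration only.
# Reads a ^lemma<tags>$ stream on standard input and prints one
# "head.. preposition.. tail" pattern per line.
extract_patterns() {
  # grep -o puts each ^...$ lexical unit on its own line.
  grep -o '\^[^$]*\$' | awk '
    # First tag of a unit, e.g. "pr" for ^of<pr>$ (empty if untagged).
    function first_tag(u,   s, e) {
      s = index(u, "<"); e = index(u, ">")
      return (s > 0 && e > s) ? substr(u, s + 1, e - s - 1) : ""
    }
    function is_head(t) { return t == "n" || t == "vblex" }
    {
      t = first_tag($0)
      # state 0: nothing matched, 1: head seen, 2: head and preposition seen
      if (state == 2 && (t == "adj" || t == "det")) next
      if (state == 2 && is_head(t)) print head ".. " prep ".. " $0
      if (state == 1 && t == "pr") { prep = $0; state = 2; next }
      if (is_head(t)) { head = $0; state = 1 } else state = 0
    }'
}

printf '%s\n' '^process<n><sg>$ ^of<pr>$ ^privatisation<n><sg>$^.<sent>$' | extract_patterns
```

Adjectives and determiners between the preposition and the second noun or verb are skipped, so only the three units of interest are printed. The real tool may handle more cases than this sketch does.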


Training phase

The training phase is done in two steps:

  • Extract patterns of the form (n|vblex) (pr) (adj|det)* (n|vblex) from the source-language corpus and translate them into the target language using Apertium.
  • Go through the source-language file again, matching the same patterns and looking for their translations in the target-language corpus. If a translation is found, extract the features and the correct preposition as a training-set example. In principle any combination of features could be used; however, the tools provided so far support only three combinations:
    • 1-feature model -- extract an example in the following format: sl_nv1-sl_pr-sl_nv2<delimiter>tl_pr
    • 2-feature model -- extract an example in the following format: sl_nv1--sl_pr<delimiter>sl_pr--sl_nv2<delimiter>tl_pr
    • 3-feature model -- extract an example in the following format: sl_nv1<delimiter>sl_pr<delimiter>sl_nv2<delimiter>tl_pr

sl_nv1 and sl_nv2 stand for the first and second source-language noun or verb, and sl_pr for the source-language preposition. tl_pr stands for the target-language preposition, which is the label used in classification.

Example

This is an example script that uses these two tools to create a training set:

cat setimes.mk | head -n 150000 | apertium -d ~/Apertium/apertium-mk-en mk-en-pretransfer > training-patterns-mk
cat training-patterns-mk | ~/Apertium/fpetkovski/morph-parser/preposition-extraction \
| lt-proc -g ~/Apertium/apertium-mk-en/en-mk.autogen.bin \
| apertium -d ~/Apertium/apertium-mk-en/ mk-en-postchunk > extracted-patterns-train

# In Macedonian, the definiteness of the noun is encoded in the noun itself, 
# while in English it is denoted by the article before the noun. 
# As a result, the extracted patterns after translation can have up to 5 tokens instead of the desired three. 
# That's why we want to remove the articles from the patterns.

# remove articles
cat extracted-patterns-train | sed 's/[ ]*\^[Tt]he<[^\$]*\$[ ]*//g' > extracted-patterns-nodef-train

# tag the tl-set
cat setimes.en | head -n 150000 | apertium -d ~/Apertium/apertium-en-es en-es-tagger > training-patterns-en

# alignment
preposition-aligner -sl training-patterns-mk -tl training-patterns-en -tr extracted-patterns-nodef-train -n 2 > training-set

And the output:


head -n 10 training-set

пара--од$од--продажба$from
полза--од$од--приватизација$from
префрли--во$во--банка$to
резултат--на$на--приватизација$of
земја--во$во--регион$in
план--за$за--развој$for
злоупотреба--на$на--положба$of
сметка--со$со--содржина$with
вработи--во$во--медиум$in
слобода--на$на--говор$of

Here the '$' character serves as the delimiter between the features and the label.
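
One quick way to inspect a training set in this format is to count how often each source-language preposition is paired with each target-language label. The sketch below assumes the 2-feature layout shown in the sample output above ('$' as the delimiter, '--' inside each feature). The function name pair_counts is hypothetical and not part of the toolkit.

```shell
# Hypothetical helper: counts (source preposition, target preposition) pairs
# in a 2-feature training set read from standard input.
pair_counts() {
  awk -F'[$]' '
    NF == 3 {
      split($1, feat, "--")        # feat[1] = sl_nv1, feat[2] = sl_pr
      count[feat[2] " " $3]++      # $3 is the label, tl_pr
    }
    END { for (pair in count) print count[pair], pair }
  ' | sort -k1,1nr -k2
}

# Three lines from the sample output above.
printf '%s\n' \
  'префрли--во$во--банка$to' \
  'земја--во$во--регион$in' \
  'вработи--во$во--медиум$in' | pair_counts
```

In the sample, the source preposition во is labelled both "to" and "in", which is the kind of ambiguity the trained model is meant to resolve from the surrounding noun or verb.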