Corpus-based preposition selection - HOWTO
The general algorithm for performing corpus-based preposition selection is as follows:
- Download a parallel corpus
- Extract patterns which contain prepositions from the source-language corpus
- Align the patterns to their translations in the target-language corpus
- Extract the features and corresponding labels (the correct preposition from the target-language corpus) for classification.
- Train a model
- Use the trained model in the translation pipeline
The general toolkit for performing these tasks can be found here.
The toolkit
Preposition extraction
The tool for preposition extraction takes a stream in the format ^lemma<tags>$ ^lemma<tags>$ on standard input and outputs a list of extracted patterns, which are later used in the alignment process. The program matches patterns of the form: (n|vblex) (pr) (adj | det)* (n|vblex)
Example:
echo "^Косово<np><top><nt><sg><nom>$ ^испитува<vblex><imperf><tv><pres><p3><sg>$ ^процес<n><m><sg><nom><def>$ ^на<pr>$ ^приватизација<n><f><sg><nom><ind><@P←>$^.<sent>$ " | ./preposition-extraction.bin output: ^процес<n><m><sg><nom><def>$.. ^на<pr>$.. ^приватизација<n><f><sg><nom><ind><@P←>$
These patterns will later be translated using apertium and matched in the target-language corpus.
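The extractor itself is a compiled binary, but the matching it performs is easy to illustrate. The following minimal Python sketch shows the same idea purely for illustration; the token regex, the use of the first tag as the part of speech, and the exact output format are assumptions here, not the tool's actual implementation:

#!/usr/bin/env python3
# Rough sketch (not the real tool): match (n|vblex) (pr) (adj|det)* (n|vblex)
# in an Apertium stream read from standard input.
import re
import sys

TOKEN = re.compile(r'\^(.*?)\$')       # one ^lemma<tags>$ unit
FIRST_TAG = re.compile(r'<([^>]+)>')   # the first tag is taken as the part of speech

def first_tag(token):
    m = FIRST_TAG.search(token)
    return m.group(1) if m else ''

for line in sys.stdin:
    tokens = TOKEN.findall(line)
    tags = [first_tag(t) for t in tokens]
    for i, tag in enumerate(tags):
        if tag not in ('n', 'vblex') or i + 2 >= len(tags) or tags[i + 1] != 'pr':
            continue
        j = i + 2
        while j < len(tags) and tags[j] in ('adj', 'det'):   # skip (adj | det)*
            j += 1
        if j < len(tags) and tags[j] in ('n', 'vblex'):
            # emit the noun/verb - preposition - noun/verb pattern
            print(' '.join('^%s$' % tokens[k] for k in (i, i + 1, j)))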
Preposition alignment
The patterns extracted in the previous step need to be aligned to their translations in the target language so that the correct preposition can be extracted as a label. In this way, a training set can be created.
Usage: ./preposition-aligner.bin -sl source-file -tl target-file -tr translations-file -n number-of-features [-asrc allow-source] [-atrg allow-target]

Options:
-sl, --source                a file with the source-language sentences
-tl, --target                a file with the target-language sentences
-tr, --translations          a file with the translations of the extracted patterns
-n                           number of features
-asrc, --allow-only-source   path to a file containing the allowed source-language prepositions
-atrg, --allow-only-target   path to a file containing the allowed target-language prepositions
Example:
preposition-aligner.bin -sl training-patterns-mk -tl training-patterns-en -tr extracted-patterns-nodef-train -n 2 | head -n 10

output:
пара--од$од--продажба$from
полза--од$од--приватизација$from
ефикасност--како$како--концепт$as a
префрли--во$во--банка$to
резултат--на$на--приватизација$of
земја--во$во--регион$in
план--за$за--развој$for
злоупотреба--на$на--положба$of
стави--под$под--контрола$under
сметка--со$со--содржина$with
The complete training phase
The training phase is done in two steps:
- Extract patterns in the form of (n|vblex) (pr) (adj | det)* (n|vblex) from the source-language corpus and translate them using apertium into the target language.
- Go through the source-language file again, matching those same patterns and trying to find their translations in the target language. If a translation is found, extract the features and the correct preposition as a training-set example. You could theoretically choose any combination of features; however, the tools provided so far support only three different combinations:
- 1-feature model -- extract an example in the following format: sl_nv1-sl_pr-sl_nv2<delimiter>tl_pr
- 2-feature model -- extract an example in the following format: sl_nv1--sl_pr<delimiter>sl_pr--sl_nv2<delimiter>tl_pr
- 3-feature model -- extract an example in the following format: sl_nv1<delimiter>sl_pr<delimiter>sl_nv2<delimiter>tl_pr
sl_nv1, sl_nv2 and sl_pr stand for the first and second source-language noun or verb and for the source-language preposition. tl_pr stands for the target-language preposition, which is the actual label used in classification.
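To make the format concrete: whatever the number of features, the label is always the last field of a line. A tiny Python illustration of splitting one aligner output line, assuming '$' as the delimiter as in the aligner example above, could look like this:

# Sketch: split one training-set line into its features and its label,
# assuming '$' is the delimiter (as in the aligner output above).
line = "пара--од$од--продажба$from"          # a 2-feature example
*features, label = line.rstrip('\n').split('$')
print(features)   # ['пара--од', 'од--продажба']
print(label)      # 'from'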
Example
This is an example script that uses these two tools to create a training set:
cat setimes.mk | head -n 150000 | apertium -d ~/Apertium/apertium-mk-en mk-en-pretransfer > training-patterns-mk

cat training-patterns-mk | ~/Apertium/fpetkovski/morph-parser/preposition-extraction \
  | lt-proc -g ~/Apertium/apertium-mk-en/en-mk.autogen.bin \
  | apertium -d ~/Apertium/apertium-mk-en/ mk-en-postchunk > extracted-patterns-train

# In Macedonian, the definiteness of the noun is encoded in the noun itself,
# while in English it is denoted by the article before the noun.
# As a result, the extracted patterns after translation can have up to 5 tokens instead of the desired three.
# That's why we want to remove the articles from the patterns.

# remove articles
cat extracted-patterns-train | sed 's/[ ]*\^[Tt]he<[^\$]*\$[ ]*//g' > extracted-patterns-nodef-train;

# tag the tl-set
cat setimes.en | head -n 150000 | apertium -d ~/Apertium/apertium-en-es en-es-tagger > training-patterns-en

# alignment
preposition-aligner -sl training-patterns-mk -tl training-patterns-en -tr extracted-patterns-nodef-train -n 2 > training-set
And the output:
head -n 10 training-set

пара--од$од--продажба$from
полза--од$од--приватизација$from
префрли--во$во--банка$to
резултат--на$на--приватизација$of
земја--во$во--регион$in
план--за$за--развој$for
злоупотреба--на$на--положба$of
сметка--со$со--содржина$with
вработи--во$во--медиум$in
слобода--на$на--говор$of
where the '$' character here serves as a delimiter.
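This page does not prescribe a particular classifier for the "Train a model" step. As one possible illustration only, a naive baseline can be built by remembering, for every source-side feature tuple in training-set, the most frequent target-language preposition. The file name training-set and the '$' delimiter come from the example above; everything else in this sketch, including the fallback preposition, is an assumption.

#!/usr/bin/env python3
# Illustrative baseline only, not part of the toolkit: map every source-side
# feature tuple seen in training-set to its most frequent target preposition.
from collections import Counter, defaultdict

counts = defaultdict(Counter)
with open('training-set', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue
        *features, label = line.rstrip('\n').split('$')
        counts[tuple(features)][label] += 1

model = {feats: c.most_common(1)[0][0] for feats, c in counts.items()}

# Look up the preposition for a (hypothetical) unseen pattern;
# the fallback 'of' is an arbitrary default, not something the toolkit defines.
print(model.get(('пара--од', 'од--продажба'), 'of'))

In practice this lookup would be replaced by a proper multi-class classifier trained on the same features, and the resulting model would then be consulted in the translation pipeline to choose the target-language preposition.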