Corpus based preposition selection - HOWTO
Latest revision as of 13:15, 23 August 2012
The general algorithm for performing corpus-based preposition selection is as follows:
- Download a parallel corpus
- Extract patterns which contain prepositions from the source-language corpus
- Align the patterns to their translations in the target-language corpus
- Extract the features and corresponding labels (the correct preposition from the target-language corpus) for classification.
- Train a model
- Use the trained model in the translation pipeline
The general toolkit for performing these tasks can be found at https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2012/fpetkovski/morph-parser/.
The toolkit
Preposition Extraction
The tool for preposition extraction takes a stream in the format ^lemma<tags>$ ^lemma<tags>$ on standard input and outputs a list of extracted patterns which are later used in the alignment process. The program will match patterns in the form of: (n|vblex) (pr) (adj | det)* (n|vblex)
Example:
<pre>
echo "^Косово<np><top><nt><sg><nom>$ ^испитува<vblex><imperf><tv><pres><p3><sg>$ ^процес<n><m><sg><nom><def>$ ^на<pr>$ ^приватизација<n><f><sg><nom><ind><@P←>$^.<sent>$
" | ./preposition-extraction.bin

output:
^процес<n><m><sg><nom><def>$.. ^на<pr>$.. ^приватизација<n><f><sg><nom><ind><@P←>$
</pre>
These patterns will later be translated using apertium and matched in the target-language corpus.
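The pattern (n|vblex) (pr) (adj | det)* (n|vblex) can be approximated with an extended regular expression over the stream format. The following is only a sketch, not the implementation of preposition-extraction.bin. The function name extract_patterns and the variable names are made up for illustration. It assumes that the first tag of each lexical unit is its part of speech, it prints the matched units separated by spaces rather than in the tool's own output format, and it does not report overlapping matches.

```shell
# Regular expressions for single lexical units (^lemma<tag1><tag2>...$).
# Assumption: the first tag after the lemma is the part of speech.
NV_RE='\^[^<$^]+<(n|vblex)>(<[^>$]*>)*\$'
PR_RE='\^[^<$^]+<pr>(<[^>$]*>)*\$'
MOD_RE='\^[^<$^]+<(adj|det)>(<[^>$]*>)*\$'

# Reads a tagged stream on standard input and prints every
# (n|vblex) (pr) (adj|det)* (n|vblex) match on a line of its own.
extract_patterns() {
  grep -oE "${NV_RE} *${PR_RE} *(${MOD_RE} *)*${NV_RE}"
}
```

For the example sentence above, this prints the three units процес, на and приватизација with their tags, separated by single spaces.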
Preposition alignment
The patterns that were extracted in the previous process need to be aligned to their translations in the target language so the correct preposition can be extracted as a label. This way a training set can be created.
<pre>
Usage: ./preposition-aligner.bin -sl source-file -tl target-file -tr translations-file -n number-of-features [-asrc allow-source] [-atrg allow-target]
Options:
  -sl, --source               a file with the source-language sentences
  -tl, --target               a file with the target-language sentences
  -tr, --translations         a file with the translations of the extracted patterns
  -n                          number of features
  -asrc, --allow-only-source  path to a file containing the allowed source-language prepositions
  -atrg, --allow-only-target  path to a file containing the allowed target-language prepositions
</pre>
Example:
<pre>
preposition-aligner.bin -sl training-patterns-mk -tl training-patterns-en -tr extracted-patterns-nodef-train -n 2 | head -n 10

output:
пара--од$од--продажба$from
полза--од$од--приватизација$from
ефикасност--како$како--концепт$as a
префрли--во$во--банка$to
резултат--на$на--приватизација$of
земја--во$во--регион$in
план--за$за--развој$for
злоупотреба--на$на--положба$of
стави--под$под--контрола$under
сметка--со$со--содржина$with
</pre>
The complete training phase
The training phase is done in two steps:
- Extract patterns in the form of (n|vblex) (pr) (adj | det)* (n|vblex) from the source-language corpus and translate them to the target language using apertium.
- Go through the source-language file again, matching those same patterns and trying to find their translations in the target language. If a translation is found, extract the features and the correct preposition as a training-set example. In principle any combination of features could be used, but the tools provided so far support only three:
  - 1-feature model -- extract an example in the following format: sl_nv1-sl_pr-sl_nv2<delimiter>tl_pr
  - 2-feature model -- extract an example in the following format: sl_nv1-sl_pr<delimiter>-sl_nv2<delimiter>tl_pr
  - 3-feature model -- extract an example in the following format: sl_nv1<delimiter>sl_pr<delimiter>sl_nv2<delimiter>tl_pr
sl_nv1 and sl_nv2 stand for the first and second source-language noun or verb, and sl_pr for the source-language preposition. tl_pr stands for the target-language preposition, which is the label used in classification. Note that in the aligner output shown in this document (produced with -n 2), the two features appear as sl_nv1--sl_pr and sl_pr--sl_nv2.
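To make the relation between the layouts concrete, the sketch below rewrites 3-feature examples into the 2-feature layout as it appears in the aligner output in this document (for example пара--од$од--продажба$from). It is not part of the toolkit: the function name to_two_features is hypothetical, the delimiter is assumed to be '$', and the "--" joiner is inferred from that sample output rather than from the tool's source.

```shell
# Input:  sl_nv1$sl_pr$sl_nv2$tl_pr   (3-feature layout, '$' as delimiter)
# Output: sl_nv1--sl_pr$sl_pr--sl_nv2$tl_pr
# The "--" joiner is an assumption based on the sample aligner output.
to_two_features() {
  awk -F'[$]' '{ print $1 "--" $2 "$" $2 "--" $3 "$" $4 }'
}
```

Usage: printf '%s\n' 'пара$од$продажба$from' | to_two_features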
Example
This is an example script that uses these two tools to create a training set:
<pre>
cat setimes.mk | head -n 150000 | apertium -d ~/Apertium/apertium-mk-en mk-en-pretransfer > training-patterns-mk
cat training-patterns-mk | ~/Apertium/fpetkovski/morph-parser/preposition-extraction \
| lt-proc -g ~/Apertium/apertium-mk-en/en-mk.autogen.bin \
| apertium -d ~/Apertium/apertium-mk-en/ mk-en-postchunk > extracted-patterns-train

# In Macedonian, the definiteness of the noun is encoded in the noun itself,
# while in English it is denoted by the article before the noun.
# As a result, the extracted patterns after translation can have up to 5 tokens instead of the desired three.
# That's why we want to remove the articles from the translated patterns.

# remove articles
cat extracted-patterns-train | sed 's/[ ]*\^[Tt]he<[^\$]*\$[ ]*//g' > extracted-patterns-nodef-train;

# tag the tl set
cat setimes.en | head -n 150000 | apertium -d ~/Apertium/apertium-en-es en-es-tagger > training-patterns-en

# alignment
preposition-aligner -sl training-patterns-mk -tl training-patterns-en -tr extracted-patterns-nodef-train -n 2 > training-set
</pre>
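The article-removal step is easiest to understand on a single line. The sketch below wraps the same sed expression in a function (the name remove_articles is made up here) and applies it to an invented translated pattern. The English tags in that sample are illustrative only and may not match what apertium-mk-en actually produces.

```shell
# Same sed expression as in the script above: delete every lexical unit
# whose lemma is "the"/"The", together with the spaces around it.
remove_articles() {
  sed 's/[ ]*\^[Tt]he<[^\$]*\$[ ]*//g'
}

# Hypothetical sample pattern, for illustration only.
printf '%s\n' '^result<n><sg>$ ^of<pr>$ ^the<det><def><sp>$ ^privatization<n><sg>$' | remove_articles
# prints: ^result<n><sg>$ ^of<pr>$^privatization<n><sg>$
```

Because the expression removes the spaces on both sides of the article, the preposition and the following noun end up directly adjacent in the output.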
And the output:
<pre>
head -n 10 training-set

пара--од$од--продажба$from
полза--од$од--приватизација$from
префрли--во$во--банка$to
резултат--на$на--приватизација$of
земја--во$во--регион$in
план--за$за--развој$for
злоупотреба--на$на--положба$of
сметка--со$со--содржина$with
вработи--во$во--медиум$in
слобода--на$на--говор$of
</pre>
The '$' character serves as the delimiter here.
Now you have a training set which you can use to train a classifier.
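Before training, it can be useful to see how often each target-language preposition occurs as a label. The following sketch is not part of the toolkit, and the function name label_counts is hypothetical. It assumes the '$' delimiter and takes the last field of every line as the label.

```shell
# Print "count<TAB>label" for every label, most frequent first.
# Reads the files given as arguments, or standard input if there are none.
label_counts() {
  awk -F'[$]' '{ count[$NF]++ }
       END { for (label in count) print count[label] "\t" label }' "$@" |
    sort -k1,1nr -k2,2
}
```

Usage: label_counts training-set | head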
You can specify lists of the source-language and target-language prepositions that you want to allow in your training set (the -asrc and -atrg options of the aligner). If a list is specified for source-language prepositions, patterns that do not contain one of those prepositions are not extracted for the training set.
If a list is specified for target-language prepositions, all prepositions that are not in the list are assigned to a new class, 'other'. When the classifier labels an example as 'other', it is left to apertium to decide how to translate the source-language preposition.
Using such a white-list for target-language prepositions is recommended. Put the most common prepositions in it, since the less common ones will not occur often enough in the training set for their classes to be learned.
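The effect of a target-language white-list can be reproduced on an existing training set. This is a sketch under assumptions, not the aligner's implementation: the function name apply_whitelist is hypothetical, the white-list is assumed to contain one preposition per line (the file format expected by the aligner is not described here), and the delimiter is assumed to be '$'.

```shell
# $1 = white-list file, one allowed target-language preposition per line
#      (assumed format; the file must not be empty).
# Training examples are read from standard input; labels that are not
# in the white-list are replaced by the class "other".
apply_whitelist() {
  awk -F'[$]' -v OFS='$' '
    NR == FNR { allowed[$0] = 1; next }   # first file: the white-list
    { if (!($NF in allowed)) $NF = "other"; print }
  ' "$1" -
}
```

Usage: apply_whitelist allowed-en < training-set > training-set-whitelisted, where allowed-en is a hypothetical white-list file.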
Applying the model
To avoid depending on an external library, a Naive Bayes classifier was implemented by hand, since that was the classifier used in the experiments. It can be found in the morph-parser directory and can be used for training a model.
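For readers unfamiliar with the technique, the following is a minimal Naive Bayes sketch over '$'-delimited training lines. It is a generic textbook version with add-one smoothing, not the classifier from the morph-parser directory, whose details may differ. The function name nb_classify is hypothetical.

```shell
# $1 = training file, $2 = features of the example to classify,
#      delimited like the training lines but without the label.
# Prints the most probable label.
nb_classify() {
  awk -F'[$]' -v query="$2" '
    # Training lines: every field but the last is a feature, the last is the label.
    {
      label = $NF
      class_count[label]++
      total++
      for (i = 1; i < NF; i++) {
        feat_count[label, i, $i]++
        vocab[i, $i] = 1
      }
    }
    END {
      n = split(query, q, "[$]")
      # Number of distinct values seen for each feature position.
      for (key in vocab) { split(key, parts, SUBSEP); vsize[parts[1]]++ }
      best = ""
      for (label in class_count) {
        # log P(label) + sum of log P(feature_i | label), add-one smoothed
        score = log(class_count[label] / total)
        for (i = 1; i <= n; i++)
          score += log((feat_count[label, i, q[i]] + 1) / (class_count[label] + vsize[i]))
        if (best == "" || score > best_score) { best = label; best_score = score }
      }
      print best
    }' "$1"
}
```

Usage: nb_classify training-set 'план--за$за--развој'. The single quotes keep the shell from interpreting the '$' delimiter.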
Once you have trained a model, you can insert it in the pipeline so that it is applied during translation. For applying a Naive Bayes model, the preposition-selection tool was created; it takes biltrans output as a stream on standard input.
<pre>
Usage: ./preposition-selection.bin [ -t | -l ] data_file -d delimiter
Options:
  -t, --train      use the data_file to train a model
  -l, --load       load a trained model from the data_file
  -d, --delimiter  sets the delimiter
</pre>
Example
<pre>
cat ~/Desktop/setimes-en-mk-nikola/setimes.mk.fixed | tail -n 50000 | apertium -d ~/Apertium/apertium-mk-en mk-en-biltrans |\
./preposition-selection --train "training-set" -d "$" | ./biltrans-to-end
</pre>
or
<pre>
cat ~/Desktop/setimes-en-mk-nikola/setimes.mk.fixed | tail -n 50000 | apertium -d ~/Apertium/apertium-mk-en mk-en-biltrans |\
./preposition-selection --load "model" -d "$" | ./biltrans-to-end
</pre>
The biltrans-to-end script should go through the rest of the pipeline, executing the transfer and generation processes.
For mk-en:
<pre>
/usr/local/bin/apertium-transfer -b /home/philip/Apertium/apertium-mk-en/apertium-mk-en.mk-en.t1x /home/philip/Apertium/apertium-mk-en/mk-en.t1x.bin \
|/usr/local/bin/apertium-interchunk /home/philip/Apertium/apertium-mk-en/apertium-mk-en.mk-en.t2x /home/philip/Apertium/apertium-mk-en/mk-en.t2x.bin \
|/usr/local/bin/apertium-postchunk /home/philip/Apertium/apertium-mk-en/apertium-mk-en.mk-en.t3x /home/philip/Apertium/apertium-mk-en/mk-en.t3x.bin \
| sed 's/\[/\\[/g' | lt-proc -g ~/Apertium/apertium-mk-en/mk-en.autogen.bin \
| lt-proc -p ~/Apertium/apertium-mk-en/mk-en.autopgen.bin | apertium-retxt
</pre>