Difference between revisions of "Corpus based preposition selection - HOWTO"

Latest revision as of 13:15, 23 August 2012

The general algorithm for performing corpus based preposition selection is as follows:
Download a parallel corpus
Extract patterns which contain prepositions from the source-language corpus
Align the patterns to their translations in the target-language corpus
Extract the features and corresponding labels (the correct preposition from the target-language corpus) for classification
Train a model
Use the trained model in the translation pipeline

The general toolkit for performing these tasks can be found at https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2012/fpetkovski/morph-parser/

The toolkit

Preposition Extraction

The preposition extraction tool reads a stream in the format ^lemma<tags>$ ^lemma<tags>$ on standard input and outputs a list of extracted patterns, which are later used in the alignment process. It matches patterns of the form:

(n|vblex) (pr) (adj | det)* (n|vblex)

that is, a noun or verb, a preposition, any number of adjectives or determiners, and another noun or verb.

Example:

echo "^Косово<np><top><nt><sg><nom>$ ^испитува<vblex><imperf><tv><pres><p3><sg>$ ^процес<n><m><sg><nom><def>$ ^на<pr>$ ^приватизација<n><f><sg><nom><ind><@P←>$^.<sent>$
" | ./preposition-extraction.bin 
output:
^процес<n><m><sg><nom><def>$.. ^на<pr>$.. ^приватизација<n><f><sg><nom><ind><@P←>$

These patterns will later be translated using apertium and matched in the target-language corpus.
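
The matching step can be sketched in a few lines of Python. This is only an illustration of the pattern (n|vblex) (pr) (adj | det)* (n|vblex) over the stream format, not the code of the preposition-extraction tool; the function names and the choice to return lemma triples are assumptions made for the example.

```python
import re

# One lexical unit in the stream format: ^lemma<tag1><tag2>...$
LEXICAL_UNIT = re.compile(r"\^([^<$^]*)((?:<[^>]*>)+)\$")

CONTENT_POS = {"n", "vblex"}   # first and last slot of the pattern
SKIPPED_POS = {"adj", "det"}   # optional units between the preposition and the last slot


def parse_units(line):
    """Return a list of (lemma, [tags]) pairs found in one stream line."""
    units = []
    for lemma, tags in LEXICAL_UNIT.findall(line):
        units.append((lemma, re.findall(r"<([^>]*)>", tags)))
    return units


def extract_patterns(line):
    """Find (n|vblex) (pr) (adj|det)* (n|vblex) and return lemma triples."""
    units = parse_units(line)
    patterns = []
    for i in range(len(units) - 2):
        # The first tag of a unit is treated as its part of speech.
        if units[i][1][0] not in CONTENT_POS or units[i + 1][1][0] != "pr":
            continue
        j = i + 2
        while j < len(units) and units[j][1][0] in SKIPPED_POS:
            j += 1
        if j < len(units) and units[j][1][0] in CONTENT_POS:
            patterns.append((units[i][0], units[i + 1][0], units[j][0]))
    return patterns


sample = ("^Косово<np><top><nt><sg><nom>$ "
          "^испитува<vblex><imperf><tv><pres><p3><sg>$ "
          "^процес<n><m><sg><nom><def>$ ^на<pr>$ "
          "^приватизација<n><f><sg><nom><ind><@P←>$^.<sent>$")
print(extract_patterns(sample))
```

For the sample line this yields the single triple (процес, на, приватизација), the same pattern the tool reports in the example above. The real tool prints the full lexical units with their tags rather than bare lemmas.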

Preposition alignment

The patterns that were extracted in the previous process need to be aligned to their translations in the target language so the correct preposition can be extracted as a label. This way a training set can be created.

Usage: ./preposition-aligner.bin -sl source-file -tl target-file -tr translations-file -n number-of-features [-asrc allow-source] [-atrg allow-target]

Options:
-sl, --source               a file with the source-language sentences
-tl, --target               a file with the target-language sentences
-tr, --translations         a file with the translated patterns
-n                          number of features
-asrc, --allow-only-source  path to a file containing the allowed source-language prepositions
-atrg, --allow-only-target  path to a file containing the allowed target-language prepositions

Example:

preposition-aligner.bin -sl training-patterns-mk -tl training-patterns-en -tr extracted-patterns-nodef-train -n 2 | head -n 10

output:
пара--од$од--продажба$from
полза--од$од--приватизација$from
ефикасност--како$како--концепт$as a
префрли--во$во--банка$to
резултат--на$на--приватизација$of
земја--во$во--регион$in
план--за$за--развој$for
злоупотреба--на$на--положба$of
стави--под$под--контрола$under
сметка--со$со--содржина$with

The complete training phase

The training phase is done in two steps:

Extract patterns in the form of (n|vblex) (pr) (adj | det)* (n|vblex) from the source-language corpus and translate them to the target language using Apertium.
Go through the source-language file again, matching those same patterns and trying to find their translations in the target-language corpus. If a translation is found, extract the features and the correct preposition as a training-set example. In theory any combination of features could be used, but the tools provided so far support only three:
- 1-feature model -- extract an example in the following format: sl_nv1-sl_pr-sl_nv2<delimiter>tl_pr
- 2-feature model -- extract an example in the following format: sl_nv1--sl_pr<delimiter>sl_pr--sl_nv2<delimiter>tl_pr (this is the format of the aligner output shown on this page, produced with -n 2)
- 3-feature model -- extract an example in the following format: sl_nv1<delimiter>sl_pr<delimiter>sl_nv2<delimiter>tl_pr

sl_nv1, sl_nv2 and sl_pr stand for the first and second source-language noun or verb and for the source-language preposition. tl_pr stands for the target-language preposition, which is the actual label used in classification.
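
Whatever the number of features, the last field of an example is the label and everything before it is a feature. A minimal Python sketch of reading such a line follows; the function name parse_example is made up for this illustration and is not part of the toolkit.

```python
def parse_example(line, delimiter="$"):
    """Split one training-set line into (features, label).

    The label (tl_pr) is always the last field; the number of
    feature fields depends on the -n value given to the aligner.
    """
    *features, label = line.rstrip("\n").split(delimiter)
    return features, label


# Two lines taken from the aligner output shown on this page (-n 2).
lines = [
    "пара--од$од--продажба$from",
    "ефикасност--како$како--концепт$as a",
]
for line in lines:
    features, label = parse_example(line)
    print(len(features), features, label)
```

Note that a label may contain a space, as in "as a", so lines should be split on the delimiter only and never on whitespace.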

Example

This is an example script that uses these two tools to create a training set:

cat setimes.mk | head -n 150000 | apertium -d ~/Apertium/apertium-mk-en mk-en-pretransfer > training-patterns-mk
cat training-patterns-mk | ~/Apertium/fpetkovski/morph-parser/preposition-extraction \
| lt-proc -g ~/Apertium/apertium-mk-en/en-mk.autogen.bin \
| apertium -d ~/Apertium/apertium-mk-en/ mk-en-postchunk > extracted-patterns-train

# In Macedonian, the definiteness of the noun is encoded in the noun itself, 
# while in English it is denoted by the article before the noun. 
# As a result, the extracted patterns after translation can have up to 5 tokens instead of the desired three. 
# That's why we want to remove the articles from the translated patterns.

# remove articles
cat extracted-patterns-train | sed 's/[ ]*\^[Tt]he<[^\$]*\$[ ]*//g' > extracted-patterns-nodef-train;

# tag the tl set
cat setimes.en | head -n 150000 | apertium -d ~/Apertium/apertium-en-es en-es-tagger > training-patterns-en

# alignment
preposition-aligner -sl training-patterns-mk -tl training-patterns-en -tr extracted-patterns-nodef-train -n 2 > training-set

And the output:


head -n 10 training-set

пара--од$од--продажба$from
полза--од$од--приватизација$from
префрли--во$во--банка$to
резултат--на$на--приватизација$of
земја--во$во--регион$in
план--за$за--развој$for
злоупотреба--на$на--положба$of
сметка--со$со--содржина$with
вработи--во$во--медиум$in
слобода--на$на--говор$of

Here the '$' character serves as the delimiter.
You now have a training set which you can use to train a classifier.
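
The article-removal step of the example script can be tried out in isolation. The sketch below rewrites the same expression for Python's re module; the sample line is invented for the illustration and is not real Apertium output.

```python
import re

# Same expression as in the sed command of the example script:
# optional spaces, a lexical unit whose lemma is "the" or "The", optional spaces.
ARTICLE = re.compile(r" *\^[Tt]he<[^$]*\$ *")


def strip_articles(line):
    """Remove every ^the<...>$ unit together with the spaces around it."""
    return ARTICLE.sub("", line)


# Hypothetical translated pattern containing an article.
sample = "^result<n><sg>$ ^of<pr>$ ^the<det><def><sp>$ ^privatisation<n><sg>$"
print(strip_articles(sample))
```

Because the expression consumes the spaces on both sides of the article, the two neighbouring units end up directly adjacent in the result. Whether that matters depends on how the aligner tokenises its input, which this page does not specify.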

You can specify a list of allowed prepositions for the source language, for the target language, or for both. If a list is given for source-language prepositions, patterns that do not contain one of those prepositions are not extracted for the training set.

If a list is given for target-language prepositions, all prepositions not in the list are assigned to a new class, 'other'. If the classifier labels an example as a member of the class 'other', it is left to Apertium to decide how to translate the source-language preposition.

Using such a white-list for target-language prepositions is recommended. Put the most common prepositions in it, since the less common ones will not have enough examples for their classes to be learned.
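
The effect of the two lists can be illustrated as follows. This is a sketch of the behaviour described above, not the aligner's code; the function names and the tuple layout (sl_nv1, sl_pr, sl_nv2, tl_pr) are assumptions made for the example.

```python
from collections import Counter


def most_common_targets(examples, k):
    """Return the k most frequent target-language prepositions."""
    counts = Counter(tl_pr for _, _, _, tl_pr in examples)
    return {prep for prep, _ in counts.most_common(k)}


def apply_lists(examples, allowed_source=None, allowed_target=None, other="other"):
    """Drop examples with a disallowed source preposition and
    relabel disallowed target prepositions as the class 'other'."""
    kept = []
    for nv1, sl_pr, nv2, tl_pr in examples:
        if allowed_source is not None and sl_pr not in allowed_source:
            continue  # pattern is not extracted at all
        if allowed_target is not None and tl_pr not in allowed_target:
            tl_pr = other  # translation is left to Apertium
        kept.append((nv1, sl_pr, nv2, tl_pr))
    return kept


examples = [
    ("резултат", "на", "приватизација", "of"),
    ("злоупотреба", "на", "положба", "of"),
    ("земја", "во", "регион", "in"),
    ("стави", "под", "контрола", "under"),
]
allowed_target = most_common_targets(examples, 1)
print(apply_lists(examples, allowed_source={"на", "во"}, allowed_target=allowed_target))
```

In this toy data the example with "под" is dropped by the source list, and the example translated as "in" is kept but relabelled 'other' because only the single most frequent target preposition is allowed.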

Applying the model

To avoid depending on an external library, a naive Bayes classifier was implemented by hand, since that was the classifier used in the experiments. It can be found in the morph-parser directory and can be used for training a model.

Once you have trained a model, you can insert it in the pipeline so that it is applied during translation. For this purpose the preposition-selection tool was created; it applies a naive Bayes model and takes biltrans output as a stream on standard input.

Usage: ./preposition-selection.bin [ -t | -l ] data_file -d delimiter

Options:
-t, --train      use the data_file to train a model
-l, --load       load a trained model from the data_file
-d, --delimiter  sets the delimiter
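
For readers unfamiliar with the technique, the following is a minimal naive Bayes classifier over examples of the (features, label) kind used in the training set. It is a generic sketch and not the implementation in the morph-parser directory; the class name and the add-one smoothing are assumptions made for the illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.class_counts = Counter()
        # (feature position, label) -> counts of the values seen there
        self.feature_counts = defaultdict(Counter)
        # feature position -> set of all values seen at that position
        self.vocab = defaultdict(set)

    def train(self, examples):
        """examples: iterable of (list_of_feature_values, label)."""
        for features, label in examples:
            self.class_counts[label] += 1
            for pos, value in enumerate(features):
                self.feature_counts[(pos, label)][value] += 1
                self.vocab[pos].add(value)

    def predict(self, features):
        """Return the label with the highest log-probability."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # class prior
            for pos, value in enumerate(features):
                seen = self.feature_counts[(pos, label)][value]
                # Add-one smoothing; the extra +1 leaves room for unseen values.
                score += math.log((seen + 1) / (count + len(self.vocab[pos]) + 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Training examples taken from the aligner output shown on this page.
model = NaiveBayes()
model.train([
    (["пара--од", "од--продажба"], "from"),
    (["полза--од", "од--приватизација"], "from"),
    (["резултат--на", "на--приватизација"], "of"),
    (["злоупотреба--на", "на--положба"], "of"),
])
print(model.predict(["резултат--на", "на--положба"]))
```

With these four examples the unseen combination in the last line is labelled "of", because both of its feature values were only ever seen with that label.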

Example

cat ~/Desktop/setimes-en-mk-nikola/setimes.mk.fixed | tail -n 50000 | apertium -d ~/Apertium/apertium-mk-en mk-en-biltrans |\
./preposition-selection --train "training-set" -d "$" | ./biltrans-to-end

or

cat ~/Desktop/setimes-en-mk-nikola/setimes.mk.fixed | tail -n 50000 | apertium -d ~/Apertium/apertium-mk-en mk-en-biltrans |\
./preposition-selection --load "model" -d "$" | ./biltrans-to-end

The biltrans-to-end script should go through the rest of the pipeline, executing the transfer and generation processes.

For mk-en:

/usr/local/bin/apertium-transfer -b /home/philip/Apertium/apertium-mk-en/apertium-mk-en.mk-en.t1x /home/philip/Apertium/apertium-mk-en/mk-en.t1x.bin \
|/usr/local/bin/apertium-interchunk /home/philip/Apertium/apertium-mk-en/apertium-mk-en.mk-en.t2x  /home/philip/Apertium/apertium-mk-en/mk-en.t2x.bin \
|/usr/local/bin/apertium-postchunk /home/philip/Apertium/apertium-mk-en/apertium-mk-en.mk-en.t3x /home/philip/Apertium/apertium-mk-en/mk-en.t3x.bin \
| sed 's/\[/\\[/g' | lt-proc -g ~/Apertium/apertium-mk-en/mk-en.autogen.bin \
| lt-proc -p ~/Apertium/apertium-mk-en/mk-en.autopgen.bin | apertium-retxt
