Difference between revisions of "Corpus based preposition selection - HOWTO"

Revision as of 16:58, 20 August 2012

The general algorithm for performing corpus based preposition selection is as follows:

Download a parallel corpus
Extract patterns which contain prepositions from the source-language corpus
Align the patterns to their translations in the target-language corpus
Extract the features and the label (the correct preposition from the target-language corpus) for classification
Train a model
Use the trained model in the pipeline

The general toolkit for performing these tasks can be found here.

Training phase

The training phase is done in two steps:

Extract patterns in the form of (n|vblex) (pr) (adj | det)* (n|vblex) from the source language corpus and translate them to the target language using Apertium.
Go through the source language file again, matching those same patterns and trying to find their translations in the target language. If a translation is found, extract the features and the correct preposition as a training-set example. In theory you could choose any combination of features; however, the tools provided so far support only three combinations:
- 1-feature model -- extract an example in the following format: sl_nv1 sl_pr sl_nv2<delimiter>tl_pr
- 2-feature model -- extract an example in the following format: sl_nv1 sl_pr<delimiter> sl_nv2<delimiter>tl_pr
- 3-feature model -- extract an example in the following format: sl_nv1<delimiter>sl_pr<delimiter>sl_nv2<delimiter>tl_pr

sl_nv1, sl_nv2 and sl_pr stand for the first source language noun or verb, the second source language noun or verb, and the source language preposition, respectively. tl_pr stands for the target language preposition, which is the actual label used in classification.
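
The two training steps can be illustrated with a minimal Python sketch. This is not the code from the toolkit. It assumes that a sentence is already available as a list of (lemma, tag) pairs, that the tag is the first part-of-speech tag of each word, and that the delimiter is a tab; the function names find_patterns and format_example are hypothetical. The lookup of the target language preposition is left out, so tl_pr is passed in by hand.

```python
import re

# Assumption: each token is a (lemma, tag) pair, where tag is the first
# part-of-speech tag of the word (n, vblex, pr, adj, det, ...).
# Every tag is mapped to one character so the pattern becomes a plain regex.
TAG_CODES = {"n": "N", "vblex": "N", "pr": "P", "adj": "M", "det": "M"}

# (n|vblex) (pr) (adj | det)* (n|vblex), wrapped in a lookahead so that
# overlapping matches (e.g. "go to school in town") are all found.
PATTERN = re.compile(r"(?=(NPM*N))")


def find_patterns(tokens):
    """Return (sl_nv1, sl_pr, sl_nv2) lemma triples for every match."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tokens)
    triples = []
    for match in PATTERN.finditer(codes):
        start, end = match.start(1), match.end(1)
        triples.append(
            (tokens[start][0], tokens[start + 1][0], tokens[end - 1][0])
        )
    return triples


def format_example(triple, tl_pr, n_features, delimiter="\t"):
    """Format one training example for the 1-, 2- or 3-feature model.

    Assumption: the delimiter is a tab, and the space after the first
    delimiter in the 2-feature listing is not significant.
    """
    sl_nv1, sl_pr, sl_nv2 = triple
    if n_features == 1:
        features = [f"{sl_nv1} {sl_pr} {sl_nv2}"]
    elif n_features == 2:
        features = [f"{sl_nv1} {sl_pr}", sl_nv2]
    elif n_features == 3:
        features = [sl_nv1, sl_pr, sl_nv2]
    else:
        raise ValueError("n_features must be 1, 2 or 3")
    return delimiter.join(features + [tl_pr])


if __name__ == "__main__":
    sentence = [("house", "n"), ("of", "pr"), ("the", "det"),
                ("big", "adj"), ("dog", "n")]
    for triple in find_patterns(sentence):
        # "de" stands in for the preposition found in the translation.
        print(format_example(triple, "de", 3))
```

Encoding each tag as a single character keeps the pattern readable as an ordinary regular expression; a real implementation would also have to read the analysed corpus and align each match with its translation.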

@@ Line 13: / Line 13: @@
 * Extract patterns in the form of (n|vblex) (pr) (adj | det)* (n|vblex) from the source language corpus and translate the using apertium to the target language.
 * Go through the source language file again, matching those same patterns and trying to find their translations in the target language. If a translation is found, extract the features and correct preposition as a training-set example. You could theoretically choose any combination of features, however, the tools provided so far support only 3 different combinations:
-** 1-feature model -- extract an example in the following format sl_nv1 sl_pr sl_nv2<delimiter>label
+** 1-feature model -- extract an example in the following format: sl_nv1 sl_pr sl_nv2<delimiter>tl_pr
-** 2-feature model -- extract an example in the following format sl_nv1 sl_pr<delimiter> sl_nv2<delimiter>label
+** 2-feature model -- extract an example in the following format: sl_nv1 sl_pr<delimiter> sl_nv2<delimiter>tl_pr
-** 3-feature model -- extract an example in the following format sl_n1<delimiter>sl_pr<delimiter>sl_n1<delimiter>label
+** 3-feature model -- extract an example in the following format: sl_nv1<delimiter>sl_pr<delimiter>sl_nv2<delimiter>tl_pr
+sl_nv1, sl_nv1 and sl_pr stand for the first and second source language noun or verb, and the source language preposition. tl_prep stands for the target language preposition, and that is the actual label used in classification
