Difference between revisions of "Corpus based preposition selection - HOWTO"
Fpetkovski (talk | contribs) |
Revision as of 16:58, 20 August 2012
The general algorithm for performing corpus-based preposition selection is as follows:
- Download a parallel corpus
- Extract patterns which contain prepositions from the source-language corpus
- Align the patterns to their translations in the target-language corpus
- Extract the features and label (the correct preposition from the target-language corpus) for classification.
- Train a model
- Use the trained model in the pipeline
The general toolkit for performing these tasks can be found here.
Training phase
The training phase is done in two steps:
- Extract patterns in the form of (n|vblex) (pr) (adj | det)* (n|vblex) from the source-language corpus and translate them to the target language using Apertium.
- Go through the source-language file again, matching those same patterns and trying to find their translations in the target-language corpus. If a translation is found, extract the features and the correct preposition as a training-set example. In theory you could choose any combination of features, but the tools provided so far support only 3 combinations:
- 1-feature model -- extract an example in the following format: sl_nv1 sl_pr sl_nv2<delimiter>tl_pr
- 2-feature model -- extract an example in the following format: sl_nv1 sl_pr<delimiter>sl_nv2<delimiter>tl_pr
- 3-feature model -- extract an example in the following format: sl_nv1<delimiter>sl_pr<delimiter>sl_nv2<delimiter>tl_pr
sl_nv1, sl_nv2 and sl_pr stand for the first and second source-language noun or verb, and the source-language preposition. tl_pr stands for the target-language preposition, which is the actual label used in classification.
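The pattern matching and the three example formats can be sketched in Python as follows. This is only an illustration, not the code from the toolkit: the `(lemma, tag)` token representation, the function names `find_patterns` and `format_example`, and the tab as default delimiter are all assumptions made for the example. Real Apertium output would first have to be parsed into such pairs.

```python
from typing import Iterator, List, Tuple

# Hypothetical token representation: (lemma, part-of-speech tag).
Token = Tuple[str, str]

HEAD_TAGS = {"n", "vblex"}      # nouns and verbs
MODIFIER_TAGS = {"adj", "det"}  # optional modifiers skipped between pr and the second head


def find_patterns(tokens: List[Token]) -> Iterator[Tuple[str, str, str]]:
    """Yield (sl_nv1, sl_pr, sl_nv2) lemmas for (n|vblex) (pr) (adj|det)* (n|vblex)."""
    for i, (lemma1, tag1) in enumerate(tokens):
        if tag1 not in HEAD_TAGS:
            continue
        # The token right after the first head must be a preposition.
        if i + 1 >= len(tokens) or tokens[i + 1][1] != "pr":
            continue
        # Skip any run of adjectives and determiners.
        j = i + 2
        while j < len(tokens) and tokens[j][1] in MODIFIER_TAGS:
            j += 1
        # The pattern only matches if a second noun or verb follows.
        if j < len(tokens) and tokens[j][1] in HEAD_TAGS:
            yield lemma1, tokens[i + 1][0], tokens[j][0]


def format_example(sl_nv1: str, sl_pr: str, sl_nv2: str, tl_pr: str,
                   n_features: int, delimiter: str = "\t") -> str:
    """Build one training line; the delimiter value is an assumption."""
    if n_features == 1:
        features = [f"{sl_nv1} {sl_pr} {sl_nv2}"]
    elif n_features == 2:
        features = [f"{sl_nv1} {sl_pr}", sl_nv2]
    elif n_features == 3:
        features = [sl_nv1, sl_pr, sl_nv2]
    else:
        raise ValueError("n_features must be 1, 2 or 3")
    # The target-language preposition is always the last field (the label).
    return delimiter.join(features + [tl_pr])


if __name__ == "__main__":
    sentence = [("go", "vblex"), ("to", "pr"), ("the", "det"),
                ("big", "adj"), ("house", "n")]
    for nv1, pr, nv2 in find_patterns(sentence):
        for k in (1, 2, 3):
            print(format_example(nv1, pr, nv2, "a", k, delimiter="|"))
```

The three models differ only in how the same three source-language items are grouped into features, so a single formatting function with an `n_features` switch is enough for this sketch.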