Difference between revisions of "Corpus based preposition selection - HOWTO"

From Apertium

Revision as of 16:54, 20 August 2012

The general algorithm for performing corpus based preposition selection is as follows:

  • Download a parallel corpus
  • Extract patterns which contain prepositions from the source-language corpus
  • Align the patterns to their translations in the target-language corpus
  • Extract the features and label (the correct preposition from the target-language corpus) for classification.
  • Train a model
  • Use the trained model in the pipeline

The general toolkit for performing these tasks can be found at https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2012/fpetkovski/morph-parser/.
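
To make the last three steps of the algorithm concrete, here is a minimal sketch of training and applying a model. It is not the toolkit's classifier: it swaps in a simple most-frequent-label lookup, and the function names, the fallback behaviour, and the toy examples are all made up for illustration.

```python
from collections import Counter, defaultdict


def train(examples):
    """Build a lookup model from (features, label) pairs.

    `features` is a tuple of strings and `label` is the target-language
    preposition. This is a stand-in baseline, not the toolkit's classifier.
    """
    counts = defaultdict(Counter)
    for features, label in examples:
        counts[features][label] += 1
    # Keep only the most frequent label seen for each feature tuple.
    return {feats: c.most_common(1)[0][0] for feats, c in counts.items()}


def select_preposition(model, features, fallback):
    """Return the learned preposition, or `fallback` for unseen features."""
    return model.get(features, fallback)


# Toy, hand-written examples (not real corpus data).
examples = [
    (("go", "to", "school"), "a"),
    (("go", "to", "school"), "a"),
    (("go", "to", "school"), "en"),
    (("think", "of", "idea"), "en"),
]
model = train(examples)
```

In a pipeline, `select_preposition` would be called with the features of each matched pattern, and `fallback` would be whatever preposition the existing translation already produced.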

Training phase

The training phase is done in two steps:

  • Extract patterns of the form (n|vblex) (pr) (adj | det)* (n|vblex) from the source-language corpus and translate them into the target language using Apertium.
  • Go through the source-language file again, matching the same patterns and trying to find their translations in the target language. If a translation is found, extract the features and the correct preposition as a training-set example. In theory any combination of features could be used, but the tools provided so far support only three:
      • 1-feature model: each example has the format sl_nv1--sl_pr--sl_nv2<delimiter>label
      • 2-feature model: each example has the format sl_nv1--sl_pr<delimiter>--sl_nv2<delimiter>label
      • 3-feature model: each example has the format sl_n1<delimiter>sl_pr<delimiter>sl_n2<delimiter>label
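
The two steps can be sketched roughly as follows, assuming the tagged corpus has already been reduced to (lemma, tag) pairs. The function names, the tab delimiter, and the sample sentence are illustrative and not taken from the toolkit. Two readings of the formats are also assumptions: the 2-feature line is written without the `--` in front of the second word, and the third field of the 3-feature line is taken to be the second noun or verb.

```python
HEAD_TAGS = {"n", "vblex"}   # tags allowed at both ends of the pattern
SKIP_TAGS = {"adj", "det"}   # tags that may appear between pr and the second head


def find_patterns(tokens):
    """Return (nv1, pr, nv2) lemma triples matching
    (n|vblex) (pr) (adj | det)* (n|vblex) in a list of (lemma, tag) pairs."""
    matches = []
    for i in range(len(tokens) - 2):
        lemma1, tag1 = tokens[i]
        lemma_pr, tag_pr = tokens[i + 1]
        if tag1 not in HEAD_TAGS or tag_pr != "pr":
            continue
        # Skip any run of adjectives and determiners after the preposition.
        j = i + 2
        while j < len(tokens) and tokens[j][1] in SKIP_TAGS:
            j += 1
        if j < len(tokens) and tokens[j][1] in HEAD_TAGS:
            matches.append((lemma1, lemma_pr, tokens[j][0]))
    return matches


def format_example(nv1, pr, nv2, label, n_features, delimiter="\t"):
    """Serialise one training example for the 1-, 2- or 3-feature model."""
    if n_features == 1:
        fields = [f"{nv1}--{pr}--{nv2}"]
    elif n_features == 2:
        fields = [f"{nv1}--{pr}", nv2]
    elif n_features == 3:
        fields = [nv1, pr, nv2]
    else:
        raise ValueError("n_features must be 1, 2 or 3")
    return delimiter.join(fields + [label])


# Hand-written sample sentence: "live in the big house".
tokens = [("live", "vblex"), ("in", "pr"), ("the", "det"),
          ("big", "adj"), ("house", "n")]
patterns = find_patterns(tokens)
```

Each triple in `patterns` would then be paired with the preposition found in the aligned target-language text and written out with `format_example`.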