Difference between revisions of "Lexical feature transfer - Second report"

From Apertium

Revision as of 15:12, 26 July 2012

Review

In the first attempt at corpus-based preposition selection, both a Naive Bayes and an SVM classifier were tried. The lemmas and some of the tags of the surrounding words were extracted as features for the classifier. The source-language corpus was used to extract training examples from <n1> <pr> <n2> -> <n1> <pr> <n2> patterns, and the target-language corpus was used to label the extracted training examples.
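As an illustration of the feature extraction described above, here is a minimal sketch. The token representation (a lemma paired with a list of tags), the function name extract_features, and the choice of which tags to keep are all assumptions made for this example; the report only says that lemmas and "some of the tags" were used.

```python
from typing import Dict, List, Tuple

# Hypothetical token representation: (lemma, [tags]), e.g. ("book", ["n", "sg"]).
Token = Tuple[str, List[str]]


def extract_features(n1: Token, pr: Token, n2: Token) -> Dict[str, str]:
    """Build a feature dictionary from one <n1> <pr> <n2> match."""
    features = {
        "n1_lemma": n1[0],
        "pr_lemma": pr[0],
        "n2_lemma": n2[0],
    }
    # Assumption: keep only the first tag of each noun as a feature.
    if n1[1]:
        features["n1_tag"] = n1[1][0]
    if n2[1]:
        features["n2_tag"] = n2[1][0]
    return features
```

A dictionary of this shape could then be vectorised and passed to either classifier; the label would come from the aligned target-language preposition.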

Around 12,000 of the extracted examples were aligned to their target-language translations and labeled. Translation quality improved somewhat; however, there were many wrong predictions because the training set was small and contained formatting errors.

Corpora, sets and alignment

The parallel corpora for the Macedonian-English pair, a total of 207,778 parallel sentences, can be downloaded from: http://www.nljubesic.net/resources/corpora/setimes/

The first 150,000 parallel sentences were used for extracting and aligning training examples, while the last 50,000 sentences were used for testing the model(s).
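The split described above can be sketched as follows. The function name split_corpus and the in-memory list of sentence pairs are assumptions for illustration; the report does not say how the corpus was actually loaded or split.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_corpus(
    pairs: Sequence[T], n_train: int = 150_000, n_test: int = 50_000
) -> Tuple[List[T], List[T]]:
    """Take the first n_train items for training and the last n_test for testing."""
    if n_train + n_test > len(pairs):
        # Refuse to build overlapping training and test sets.
        raise ValueError("corpus too small for the requested split")
    train = list(pairs[:n_train])
    test = list(pairs[len(pairs) - n_test:])
    return train, test
```

With 207,778 sentence pairs, this leaves the sentences between the two sets unused, which keeps the training and test portions disjoint.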

This time, the aligner was extended to match the following pattern:

<n | v> <pr> <adj | det>* <n | v>
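One way to match this pattern over a tagged sentence is to encode each coarse tag as a single character and use a regular expression. This is only a sketch: the tag names are taken from the pattern above, while the input format (a list of one coarse tag per word) and the function name find_matches are assumptions, not the aligner's actual implementation.

```python
import re
from typing import List, Tuple

# One character per coarse tag so the pattern can be written as a regex.
_CODES = {"n": "N", "v": "V", "pr": "P", "adj": "A", "det": "D"}

# <n | v> <pr> <adj | det>* <n | v>, wrapped in a lookahead so that
# overlapping matches (e.g. n pr n pr n) are all reported.
_PATTERN = re.compile(r"(?=([NV]P[AD]*[NV]))")


def find_matches(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) word-index spans that match the pattern."""
    # Tags outside the pattern's vocabulary are encoded as "x" and never match.
    encoded = "".join(_CODES.get(tag, "x") for tag in tags)
    return [(m.start(1), m.end(1)) for m in _PATTERN.finditer(encoded)]
```

For example, the tag sequence n pr det adj n yields a single span covering all five words, and a sequence without a preposition yields no spans.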

First Model