Difference between revisions of "Lexical feature transfer - Second report"

Revision as of 16:28, 26 July 2012

Review

In the first attempt at corpus-based preposition selection, both a Naive Bayes and an SVM classifier were tried. The lemmas and some of the tags of the surrounding words were extracted as features for the classifier. The source-language corpus was used to extract training examples from <n1> <pr> <n2> -> <n1> <pr> <n2> patterns, and the target-language corpus was used to label the extracted training examples.

Around 12.000 of the extracted examples were aligned to their target-language translations and labeled. Translation quality improved somewhat, but there were many wrong predictions because the training set was small and contained formatting errors.

Corpora, sets and alignment

The parallel corpora for the Macedonian - English pair, a total of 207.778 parallel sentences, can be downloaded here:
http://www.nljubesic.net/resources/corpora/setimes/

The first 150.000 parallel sentences were used for extracting and aligning training examples, while the last 50.000 sentences were used for testing the model(s).

This time, the aligner was extended to match the pattern:

  <n | v> <pr> <adj | adv>* <n | v>  ->    <n | v> <pr> <adj | adv | det>* <n | v> 
                                         | <n> <n>
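
The source side of this pattern can be read as a small scanner over part-of-speech tags. The following is a minimal sketch, not the actual aligner: it assumes each token is a hypothetical (lemma, tag) pair, that the tag names n, v, pr, adj and adv are used as in the pattern, and it covers only the left-hand (source) side.

```python
HEAD_TAGS = {"n", "v"}          # <n | v>
MODIFIER_TAGS = {"adj", "adv"}  # <adj | adv>*

def find_source_matches(tokens):
    """Return (head1, preposition, head2) lemma triples found in a sentence.

    tokens: list of (lemma, tag) pairs; this input format is an assumption.
    """
    matches = []
    for i in range(len(tokens) - 2):
        # The match must start with a noun/verb followed by a preposition.
        if tokens[i][1] not in HEAD_TAGS or tokens[i + 1][1] != "pr":
            continue
        # Skip any number of adjectives/adverbs after the preposition.
        j = i + 2
        while j < len(tokens) and tokens[j][1] in MODIFIER_TAGS:
            j += 1
        # The match must end with a noun/verb.
        if j < len(tokens) and tokens[j][1] in HEAD_TAGS:
            matches.append((tokens[i][0], tokens[i + 1][0], tokens[j][0]))
    return matches
```

Intervening modifiers are skipped rather than kept, so only the two heads and the preposition are returned for each match.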

First Model

In the first model, the lemmas of the extracted nouns / verbs and the preposition were combined into a single feature, and a Naive Bayes (NB) classifier was used.

feature1                  | label
----------------------------------
положба--на--пазар        | of
извор--од--влада          | from
кандидат--во--процес      | in
процес--на--приватизација | of
власт--за--нерегуларност  | for
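
To illustrate how a single trigram feature behaves, here is a minimal sketch under stated assumptions. The function names are hypothetical, and the classifier is reduced to a count lookup: with one feature and unsmoothed estimates, the Naive Bayes score P(label) * P(feature | label) is proportional to the count of (feature, label) pairs, so picking the most frequent label for a trigram gives the same decision.

```python
from collections import Counter, defaultdict

def trigram_feature(head1, prep, head2):
    """Join the three lemmas into one feature string, e.g. положба--на--пазар."""
    return "--".join((head1, prep, head2))

def train(examples):
    """Count labels per trigram feature.

    examples: iterable of ((head1, prep, head2), label) pairs (assumed format).
    """
    counts = defaultdict(Counter)
    for (head1, prep, head2), label in examples:
        counts[trigram_feature(head1, prep, head2)][label] += 1
    return counts

def predict(counts, head1, prep, head2):
    """Return the most frequent label for the trigram, or None if it is unseen."""
    seen = counts.get(trigram_feature(head1, prep, head2))
    if not seen:
        return None  # unseen trigram: the example is discarded
    return seen.most_common(1)[0][0]
```

Any trigram absent from the training data yields no prediction at all, which is the coverage limitation this kind of model runs into.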

This made the model quite complex, and every trigram from the test set that was not seen in the training set was discarded, since the model did not know what to do with it. Precision was high and there were improvements, as expected, but only 1.800 lines out of 50.000 in the test set were actually affected, a sign of an overfit model.
A model with smoothing was also of no use: there were no other features for the model to back off to except the prior probability of the classes, so for a missing trigram the most common label was used.
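
The effect of smoothing on an unseen trigram can be checked with a small calculation. This is a sketch assuming add-alpha (Laplace) smoothing, which the report does not name explicitly. The label counts in the usage note are only the five rows shown in the table above, used for illustration, not the real training distribution.

```python
def smoothed_scores_for_unseen(label_counts, vocab_size, alpha=1.0):
    """Score each label for a feature value never seen in training.

    With one feature per example, the smoothed likelihood of an unseen value
    is alpha / (n_label + alpha * vocab_size), so the score is
    prior * likelihood = (n_label / total) * alpha / (n_label + alpha * vocab_size).
    """
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        prior = n_label / total
        likelihood = alpha / (n_label + alpha * vocab_size)
        scores[label] = prior * likelihood
    return scores
```

For example, `smoothed_scores_for_unseen({"of": 2, "from": 1, "in": 1, "for": 1}, vocab_size=5)` ranks "of" highest. The score grows with the label count, so the most common label always wins for an unseen trigram, regardless of the words involved.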

Second model

The second model is a simpler one, where instead of one trigram, two bigrams are used.

feature1      | feature2           | label
на--влијание  | на--криза          | of
на--капацитет | на--фабрика        | of
за--договор   | за--воспоставување | on
на--удел      | на--профит         | of
за--данок     | за--поединец       | for
за--профит    | за--буџет          | to
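
A sketch of how two bigram features could be combined in a Naive Bayes classifier follows. It is not the report's implementation: the class name, the add-alpha smoothing, and the reading of feature1 as preposition plus first head and feature2 as preposition plus second head are assumptions inferred from the table.

```python
import math
from collections import Counter, defaultdict

def bigram_features(head1, prep, head2):
    """Build the two bigram features, e.g. ("на--влијание", "на--криза")."""
    return (f"{prep}--{head1}", f"{prep}--{head2}")

class BigramNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # add-alpha smoothing constant (assumed)
        self.label_counts = Counter()
        # One table per feature slot: label -> feature value -> count.
        self.feature_counts = [defaultdict(Counter), defaultdict(Counter)]
        self.vocab = [set(), set()]

    def fit(self, examples):
        """examples: iterable of ((feature1, feature2), label) pairs."""
        for features, label in examples:
            self.label_counts[label] += 1
            for slot, value in enumerate(features):
                self.feature_counts[slot][label][value] += 1
                self.vocab[slot].add(value)
        return self

    def predict(self, features):
        """Return the label with the highest smoothed log-probability."""
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.label_counts.items():
            score = math.log(n_label / total)
            for slot, value in enumerate(features):
                count = self.feature_counts[slot][label][value]
                denom = n_label + self.alpha * len(self.vocab[slot])
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because the two bigrams are scored independently, a pair that never occurred together in training can still be classified as long as its individual bigrams were seen with some label.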

@@ Line 5: / Line 5: @@
 == Corpora, sets and alignment ==
-The parallel corpora for the Macedonian - English pair, a total of 207.778 parallel sentences, can be downloaded from here: <br />
+The parallel corpora for the Macedonian - English pair, a total of 207.778 parallel sentences, can be downloaded here: <br />
 http://www.nljubesic.net/resources/corpora/setimes/ <br />
@@ Line 12: / Line 12: @@
 This time, the aligner was extended to match the pattern: <br />
 <pre>
-  <n | v> <pr> <adj | det>* <n | v>
+  <n | v> <pr> <adj | adv>* <n | v>  ->    <n | v> <pr> <adj | adv | det>* <n | v>
+                                         | <n> <n>
 </pre>
 == First Model ==
@@ Line 29: / Line 33: @@
 </pre>
-This made the model quite complex, and every trigram from the testing which was not seen in the training set was discarded since and the model did not know what to do with it. Precision was high as expected, but only 1.800 lines out of 50.000 from the testing set were actually affected.
+This made the model quite complex, and every trigram from the testing which was not seen in the training set was discarded since and the model did not know what to do with it. Precision was high and there were improvements, as expected, but only 1.800 lines out of 50.000 from the testing set were actually affected, a sign of an overfit model. <br />
+A model which includes smoothing was also of no use since there weren't other features for the model to back-off to, except for the prior probability of the classes, and in case of a missing trigram, the most common label was used.
+== Second model ==
+The second model is a simpler one, where instead of one trigram, two bigrams are used.
+<pre>
+feature1      | feature2           | label
+на--влијание  | на--криза          | of
+на--капацитет | на--фабрика        | of
+за--договор   | за--воспоставување | on
+на--удел      | на--профит         | of
+за--данок     | за--поединец       | for
+за--профит | за--буџет | to
+</pre>

