Difference between revisions of "Lexical feature transfer - Second report"

From Apertium
Revision as of 16:59, 26 July 2012

Review

In the first attempt at corpus-based preposition selection, both a Naive Bayes and an SVM classifier were tried. The lemmas and some of the tags of the surrounding words were extracted as features for the classifier. The source-language corpus was used to extract training examples from <n1> <pr> <n2> -> <n1> <pr> <n2> patterns, and the target-language corpus was used to label the extracted training examples.

Around 12.000 of the extracted examples were aligned to their target-language translations and labeled. Translation quality improved somewhat, but there were many wrong predictions because the training set was small and contained formatting errors.

Corpora, sets and alignment

The parallel corpora for the Macedonian - English pair, a total of 207.778 parallel sentences, can be downloaded here:
http://www.nljubesic.net/resources/corpora/setimes/

The first 150.000 parallel sentences were used for extracting and aligning training examples, while the last 50.000 sentences were used for testing the model(s).

This time, the aligner was extended to match the pattern:

  <n | v> <pr> <adj | adv>* <n | v>  ->    <n | v> <pr> <adj | adv | det>* <n | v> 
                                         | <n>     <n> 
                                       

Examples of such alignments would be:

договор<n> за<pr> воспоставување<n> -> agreement<n> on<pr> the<det> establishment<n>

but also

Совет<n> за<pr> Безбедност<n> -> Security<n> Council<n>
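
The source-language side of the pattern can be sketched as a regular expression over tag sequences. This is only an illustration, assuming the tokens are already available as (lemma, tag) pairs; the function name `find_sl_patterns` and the one-letter tag encoding are hypothetical and not taken from the aligner used for this report.

```python
import re

# Hypothetical encoding: each tag becomes one letter, so the pattern
# <n | v> <pr> <adj | adv>* <n | v> can be written as a regular expression.
TAG_CODES = {"n": "N", "v": "V", "pr": "P", "adj": "J", "adv": "R", "det": "D"}
SL_PATTERN = re.compile(r"[NV]P[JR]*[NV]")


def find_sl_patterns(tokens):
    """Return the token spans matching <n | v> <pr> <adj | adv>* <n | v>.

    `tokens` is a list of (lemma, tag) pairs; unknown tags are encoded as "x"
    so that string offsets line up with token offsets.
    """
    encoded = "".join(TAG_CODES.get(tag, "x") for _, tag in tokens)
    return [tokens[m.start():m.end()] for m in SL_PATTERN.finditer(encoded)]


sentence = [("договор", "n"), ("за", "pr"), ("воспоставување", "n")]
matches = find_sl_patterns(sentence)
```

Note that `finditer` returns non-overlapping matches, so in a chain such as noun, preposition, noun, preposition, noun only the first three tokens are reported; a real aligner may want to handle that case differently.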

First model

In the first model, the lemmas of the extracted nouns / verbs and the preposition were joined into a single feature, and a Naive Bayes (NB) classifier was used.

feature1                  | label
----------------------------------
положба--на--пазар        | of
извор--од--влада          | from
кандидат--во--процес      | in
процес--на--приватизација | of
власт--за--нерегуларност  | for

This made the model quite complex: every trigram in the testing set that had not been seen in the training set was discarded, since the model did not know what to do with it. Precision was high and there were improvements, as expected, but only 1.800 lines out of 50.000 (3.6 %) from the testing set were actually affected, a sign of an overfit model.
A model with smoothing was of no use either, since there were no other features for the model to back off to except the prior probability of the classes, so for a missing trigram the most common label was used.
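
The back-off problem described above can be illustrated with a few lines of Python. This is a simplified sketch, not the classifier used in the experiments: with a single feature, unsmoothed Naive Bayes reduces to counting (trigram, label) pairs, and an unseen trigram leaves only the class prior to decide. The function names `train` and `predict` are illustrative, and the training rows are the five examples from the table above.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count labels overall and per trigram from (trigram, label) pairs."""
    label_counts = Counter()
    by_trigram = defaultdict(Counter)
    for trigram, label in examples:
        label_counts[label] += 1
        by_trigram[trigram][label] += 1
    return label_counts, by_trigram


def predict(trigram, label_counts, by_trigram):
    """Return (label, seen).

    For a seen trigram, P(label) * P(trigram | label) is proportional to
    count(trigram, label), so the most frequent label for that trigram wins.
    For an unseen trigram only the prior is left: the most common label.
    """
    if trigram in by_trigram:
        return by_trigram[trigram].most_common(1)[0][0], True
    return label_counts.most_common(1)[0][0], False


training = [
    ("положба--на--пазар", "of"),
    ("извор--од--влада", "from"),
    ("кандидат--во--процес", "in"),
    ("процес--на--приватизација", "of"),
    ("власт--за--нерегуларност", "for"),
]
label_counts, by_trigram = train(training)
seen = predict("извор--од--влада", label_counts, by_trigram)
unseen = predict("договор--за--воспоставување", label_counts, by_trigram)
```

Here the unseen trigram falls back to "of" simply because it is the most frequent label in the training data, regardless of the preposition involved.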

Second model

The second model is simpler: instead of one trigram, two bigrams are used as features. Each source-language noun / verb is paired with the preposition, giving features in the following format:

feature1      | feature2           | label
------------------------------------------
на--влијание  | на--криза          | of
на--капацитет | на--фабрика        | of
за--договор   | за--воспоставување | on
на--удел      | на--профит         | of
за--данок     | за--поединец       | for
за--профит    | за--буџет          | to
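
A minimal sketch of a Naive Bayes classifier over these two bigram features is given below, to show why the split helps: a prediction is still informed when only one of the two bigrams was seen in training. The class name `BigramNB`, the add-one smoothing and the extra "unseen" slot in the vocabulary size are assumptions made for the illustration; the report does not state which smoothing, if any, was used.

```python
import math
from collections import Counter, defaultdict


class BigramNB:
    """Naive Bayes over two categorical features with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        # feature_counts[slot][label][value] -> count
        self.feature_counts = [defaultdict(Counter), defaultdict(Counter)]
        self.vocab = [set(), set()]

    def fit(self, rows):
        for f1, f2, label in rows:
            self.label_counts[label] += 1
            for slot, value in enumerate((f1, f2)):
                self.feature_counts[slot][label][value] += 1
                self.vocab[slot].add(value)
        return self

    def predict(self, f1, f2):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            for slot, value in enumerate((f1, f2)):
                count = self.feature_counts[slot][label][value]
                size = len(self.vocab[slot]) + 1  # +1 slot for unseen values
                score += math.log((count + self.alpha) / (n + self.alpha * size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


rows = [
    ("на--влијание", "на--криза", "of"),
    ("на--капацитет", "на--фабрика", "of"),
    ("за--договор", "за--воспоставување", "on"),
    ("на--удел", "на--профит", "of"),
    ("за--данок", "за--поединец", "for"),
    ("за--профит", "за--буџет", "to"),
]
model = BigramNB().fit(rows)
# The second bigram below never occurs in `rows`; the first one still counts.
half_seen = model.predict("на--удел", "на--нешто")
```

With the trigram feature of the first model, the last example would have been discarded; here the seen bigram still contributes evidence.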

This model affected 11.020 / 50.000 lines (about 22 % of the testing set, compared with 3.6 % for the first model), which is a significant improvement in coverage. The following statistics were measured:


WER / PER:

lines affected: 11020
====================================================
Tested on 50.000 lines
----------------------
before:
Edit distance: 842588
Word error rate (WER): 68.33 %
Number of position-independent correct words: 692408
Position-independent word error rate (PER): 54.52 %

after:
Edit distance: 842418
Word error rate (WER): 68.32 %
Number of position-independent correct words: 692811
Position-independent word error rate (PER): 54.49 %
===================================================
Tested on the affected lines only:
----------------------------------
before:
Edit distance: 241251
Word error rate (WER): 72.12 %
Number of position-independent correct words: 191790
Position-independent word error rate (PER): 56.96 %

after:
Edit distance: 233708
Word error rate (WER): 69.86 %
Number of position-independent correct words: 192187
Position-independent word error rate (PER): 54.21 %
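
For reference, WER is conventionally the word-level edit distance between the translation and the reference, divided by the number of reference words. The sketch below follows that standard definition; it is not the evaluation script that produced the figures above, and the function names are illustrative.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            current.append(min(
                previous[j] + 1,         # drop a hypothesis word
                current[j - 1] + 1,      # add a missing reference word
                previous[j - 1] + cost,  # keep or substitute a word
            ))
        previous = current
    return previous[-1]


def wer(hypotheses, references):
    """Corpus WER in percent: total edit distance / total reference words."""
    distance = sum(
        edit_distance(h.split(), r.split())
        for h, r in zip(hypotheses, references)
    )
    words = sum(len(r.split()) for r in references)
    return 100.0 * distance / words


hyp = ["the agreement for establishment"]
ref = ["the agreement on the establishment"]
score = wer(hyp, ref)
```

In this toy example a wrong preposition and a missing determiner cost two edits against five reference words.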

Bootstrap resampling:

before:
--- Confidence: 0.95 ---
0.700752754256641 in [ 0.698265306122449 , 0.703814977318361 ]
Score: 0.701040141720405 +/- 0.00277483559795599

after:
--- Confidence: 0.95 ---
0.678194886402937 in [ 0.67577731092437 , 0.681481371309586 ]
Score: 0.678629341116978 +/- 0.00285203019260799
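
The general idea behind such intervals can be sketched as a percentile bootstrap over sentences. This is a generic illustration under stated assumptions, not the script that produced the numbers above: the function name `bootstrap_interval`, the number of resamples and the per-sentence (edit distance, reference length) input are all hypothetical.

```python
import random


def bootstrap_interval(stats, samples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for a corpus-level error rate.

    `stats` holds one (edit_distance, reference_length) pair per sentence;
    reference lengths are assumed to be positive.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(samples):
        # Draw as many sentences as the corpus has, with replacement.
        resample = rng.choices(stats, k=len(stats))
        errors = sum(d for d, _ in resample)
        words = sum(n for _, n in resample)
        scores.append(errors / words)
    scores.sort()
    tail = (1.0 - confidence) / 2.0
    low = scores[int(tail * samples)]
    high = scores[min(samples - 1, int((1.0 - tail) * samples))]
    return low, high


low, high = bootstrap_interval([(2, 5), (1, 4), (0, 3), (3, 6)])
```

If the intervals computed for two systems do not overlap, as is the case for the "before" and "after" intervals above, the difference is unlikely to be due to the choice of test sentences alone.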

BLEU:

before: 0.1802
after : 0.1909