Lexical feature transfer - Second report
== Review ==
In the first attempt at solving the problem of corpus-based preposition selection, both a Naive Bayes and an SVM classifier were tried out. The lemmas and some of the tags of the surrounding words were extracted as features for the classifier. The source-language corpus was used to extract training examples from (n1) (pr) (n2) -> (n1) (pr) (n2) patterns, and the target-language corpus was used to label the extracted training examples.
Around 12.000 of the extracted examples were aligned to their target-language translations and labeled. There was some improvement in the translation quality; however, there were many wrong predictions as a result of the small training set and formatting errors in it.
== Corpora, sets and alignment ==
The parallel corpora for the Macedonian - English pair, a total of 207.778 parallel sentences, can be downloaded here:
http://www.nljubesic.net/resources/corpora/setimes/
The first 150.000 parallel sentences were used for extracting and aligning training examples, while the last 50.000 sentences were used for testing the model(s).
This time, the aligner was extended to match the pattern:
<pre>
(n | v) (pr) (adj | adv)* (n | v) -> (n | v) (pr) (adj | adv | det)* (n | v)
                                   | (n) (n)
</pre>
Examples of such alignments would be:
<pre>
договор<n> за<pr> воспоставување<n> -> agreement<n> on<pr> the<det> establishment<n>
</pre>
but also
<pre>
Совет<n> за<pr> Безбедност<n> -> Security<n> Council<n>
</pre>
A total of 41.631 training and 62.676 testing examples were extracted. The reason more testing examples were extracted, even though the testing set is smaller, is that the Apertium translation is much closer to the original text in terms of synonym use, grammatical structure, etc.
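To make the alignment step concrete, here is a minimal sketch of how the source-side pattern could be matched over a tagged Macedonian sentence, assuming tokens arrive as lemma<tag> pairs; the function names and the tag inventory are illustrative and not taken from the actual aligner code:

<pre>
import re

# Assumed token format: "lemma<tag>", e.g. "договор<n>" or "за<pr>".
TOKEN_RE = re.compile(r"(?P<lemma>[^<\s]+)<(?P<tag>[a-z]+)>")

def tokens(tagged_sentence):
    """Yield (lemma, tag) pairs from a tagged sentence string."""
    for m in TOKEN_RE.finditer(tagged_sentence):
        yield m.group("lemma"), m.group("tag")

def match_source_pattern(tagged_sentence):
    """Find (n | v) (pr) (adj | adv)* (n | v) chunks on the source side.

    Returns (head_lemma, preposition, dependent_lemma) triples; any
    adjectives or adverbs between the preposition and the second
    noun/verb are skipped.
    """
    toks = list(tokens(tagged_sentence))
    triples = []
    for i, (lemma, tag) in enumerate(toks):
        if tag not in ("n", "v"):
            continue
        if i + 1 >= len(toks) or toks[i + 1][1] != "pr":
            continue
        preposition = toks[i + 1][0]
        j = i + 2
        while j < len(toks) and toks[j][1] in ("adj", "adv"):
            j += 1  # skip optional modifiers
        if j < len(toks) and toks[j][1] in ("n", "v"):
            triples.append((lemma, preposition, toks[j][0]))
    return triples

print(match_source_pattern("договор<n> за<pr> воспоставување<n>"))
# [('договор', 'за', 'воспоставување')]
</pre>

The target side would be matched analogously (additionally allowing determiners, or the bare (n) (n) alternative), and the matched chunks from the two sides would then be paired up to label each training example.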
== First Model ==
In the first model, the lemmas of the extracted nouns / verbs and the preposition were used as one feature, and an NB (Naive Bayes) classifier was used.
<pre>
          feature1          |  label
----------------------------------
положба--на--пазар          |  of
извор--од--влада            |  from
кандидат--во--процес        |  in
процес--на--приватизација   |  of
власт--за--нерегуларност    |  for
</pre>
This made the model quite complex, and every trigram from the testing set which was not seen in the training set was discarded, since the model did not know what to do with it. Precision was high and there were improvements, as expected, but only 1.800 lines out of 50.000 from the testing set were actually affected, a sign of an overfit model.
A model which includes smoothing was also of no use, since there were no other features for the model to back off to except the prior probability of the classes, so in the case of a missing trigram the most common label was used.
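Since the first model has only a single feature, the Naive Bayes decision rule reduces to looking up the most frequent label for a given trigram, with the corpus-wide majority label as the fall-back just described. A count-based sketch under that assumption (class, variable and example names are illustrative, not the actual implementation):

<pre>
from collections import Counter, defaultdict

class TrigramPrepositionModel:
    """With a single feature, Naive Bayes degenerates to picking the most
    frequent label for each trigram; unseen trigrams fall back to the
    overall most common label (the prior alone)."""

    def fit(self, examples):
        # examples: iterable of (trigram, label), e.g. ("положба--на--пазар", "of")
        per_trigram = defaultdict(Counter)
        overall = Counter()
        for trigram, label in examples:
            per_trigram[trigram][label] += 1
            overall[label] += 1
        self.best = {t: c.most_common(1)[0][0] for t, c in per_trigram.items()}
        self.default = overall.most_common(1)[0][0]
        return self

    def predict(self, trigram):
        return self.best.get(trigram, self.default)

model = TrigramPrepositionModel().fit([
    ("положба--на--пазар", "of"),
    ("процес--на--приватизација", "of"),
    ("извор--од--влада", "from"),
    ("кандидат--во--процес", "in"),
])
print(model.predict("положба--на--пазар"))  # of
print(model.predict("фонд--за--развој"))    # of (unseen, majority fall-back)
</pre>

This mirrors the smoothed variant described above; in the unsmoothed variant, unseen trigrams were simply discarded, which is why so few test lines were affected.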
== Second model ==
The second model is a simpler one: instead of one trigram, two bigrams are used as features. Each source-language noun / verb is merged with the preposition to form a feature, in the following format (a sketch of such a classifier follows the table below):
<pre>
    feature1      |      feature2        |  label
------------------------------------------
на--влијание      |  на--криза           |  of
на--капацитет     |  на--фабрика         |  of
за--договор       |  за--воспоставување  |  on
на--удел          |  на--профит          |  of
за--данок         |  за--поединец        |  for
за--профит        |  за--буџет           |  to
</pre>
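A sketch of how the two bigram features might be combined in a Naive Bayes fashion, with add-one smoothing so that a single unseen bigram does not force the example to be discarded; the class name, the smoothing choice and the toy data are assumptions, not the actual implementation:

<pre>
import math
from collections import Counter, defaultdict

class BigramPrepositionNB:
    """Naive Bayes over two bigram features per example.

    Add-one smoothing lets the model still make a prediction when only
    one of the two bigrams was seen during training."""

    def fit(self, examples):
        # examples: iterable of ((feature1, feature2), label),
        # e.g. (("на--влијание", "на--криза"), "of")
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)  # label -> bigram counts
        self.vocab = set()
        for (f1, f2), label in examples:
            self.label_counts[label] += 1
            self.feat_counts[label][f1] += 1
            self.feat_counts[label][f2] += 1
            self.vocab.update((f1, f2))
        self.total = sum(self.label_counts.values())
        return self

    def predict(self, f1, f2):
        def log_prob(label):
            prior = math.log(self.label_counts[label] / self.total)
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            likelihood = sum(math.log((self.feat_counts[label][f] + 1) / denom)
                             for f in (f1, f2))
            return prior + likelihood
        return max(self.label_counts, key=log_prob)

model = BigramPrepositionNB().fit([
    (("на--влијание", "на--криза"), "of"),
    (("на--капацитет", "на--фабрика"), "of"),
    (("за--договор", "за--воспоставување"), "on"),
])
print(model.predict("на--влијание", "на--фабрика"))  # of
</pre>

Splitting the trigram into two bigrams means a test example only needs each bigram, rather than the full trigram, to have been seen in training, which is where the gain in coverage comes from.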
This model affected 11.020 / 50.000 lines, which is a significant improvement in coverage over the first model. The following statistics were measured:
WER / PER:
<pre>
lines affected: 11020

====================================================
Tested on 50.000 lines
----------------------
before:
Edit distance: 804749
Word error rate (WER): 66.18 %
Number of position-independent correct words: 720714
Position-independent word error rate (PER): 51.67 %

after:
Edit distance: 795336
Word error rate (WER): 65.40 %
Number of position-independent correct words: 720913
Position-independent word error rate (PER): 50.71 %

===================================================
Tested on the affected lines only:
----------------------------------
before:
Edit distance: 312138
Word error rate (WER): 69.77 %
Number of position-independent correct words: 266954
Position-independent word error rate (PER): 54.26 %

after:
Edit distance: 303829
Word error rate (WER): 67.92 %
Number of position-independent correct words: 266111
Position-independent word error rate (PER): 51.50 %
</pre>
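For reference, a minimal sketch of how WER and PER can be computed for a hypothesis/reference pair (word-level edit distance over the reference length, and a bag-of-words error rate, respectively). The numbers above come from an external evaluation script, so this only illustrates the definitions:

<pre>
from collections import Counter

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            cur = min(d[j] + 1,            # drop a hypothesis word
                      d[j - 1] + 1,        # add a reference word
                      prev + (h != r))     # substitution (0 if words match)
            prev, d[j] = d[j], cur
    return d[len(ref)]

def wer(hyp, ref):
    """Word error rate: edit distance normalised by the reference length."""
    return edit_distance(hyp, ref) / len(ref)

def per(hyp, ref):
    """Position-independent error rate, based on bags of words only."""
    correct = sum((Counter(hyp) & Counter(ref)).values())
    return 1 - correct / len(ref)

hyp = "the agreement on establishment of peace".split()
ref = "the agreement on the establishment of peace".split()
print(round(wer(hyp, ref), 2), round(per(hyp, ref), 2))  # 0.14 0.14
</pre>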
Bootstrap resampling:
<pre>
--------------------
before:

--- Confidence: 0.95 ---
0.700752754256641 in [ 0.698265306122449 , 0.703814977318361 ]
Score: 0.701040141720405 +/- 0.00277483559795599

after:

--- Confidence: 0.95 ---
0.678194886402937 in [ 0.67577731092437 , 0.681481371309586 ]
Score: 0.678629341116978 +/- 0.00285203019260799
</pre>
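A sketch of the bootstrap-resampling idea behind the confidence intervals above: per-sentence scores are resampled with replacement many times, and the spread of the resampled means gives the interval. The score function, the number of resamples and the toy data here are assumptions; the numbers above come from an external script:

<pre>
import random

def bootstrap_interval(scores, resamples=1000, confidence=0.95, seed=0):
    """Resample per-sentence scores with replacement and report the
    overall mean together with a confidence interval over the
    resampled corpus-level means."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(resamples)
    )
    lo = means[int((1 - confidence) / 2 * resamples)]
    hi = means[int((1 + confidence) / 2 * resamples) - 1]
    return sum(scores) / n, (lo, hi)

# e.g. per-sentence error rates over the test lines
scores = [0.70, 0.65, 0.72, 0.68, 0.66, 0.71, 0.69, 0.67]
mean, (lo, hi) = bootstrap_interval(scores)
print(f"{mean:.3f} in [{lo:.3f}, {hi:.3f}]")
</pre>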
BLEU:
<pre>
--------------------
before: 0.1802
after : 0.1909
</pre>
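The BLEU scores were produced with an external scorer; a rough equivalent can be computed with NLTK's corpus_bleu, assuming one tokenised reference per hypothesis line (the file names below are placeholders):

<pre>
from nltk.translate.bleu_score import corpus_bleu

def bleu_from_files(reference_path, hypothesis_path):
    """Corpus-level BLEU with a single reference per hypothesis sentence."""
    with open(reference_path, encoding="utf-8") as ref_f:
        references = [[line.split()] for line in ref_f]  # one reference list per line
    with open(hypothesis_path, encoding="utf-8") as hyp_f:
        hypotheses = [line.split() for line in hyp_f]
    return corpus_bleu(references, hypotheses)

# print(round(bleu_from_files("reference.en", "translation.en"), 4))
</pre>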