Difference between revisions of "Lexical feature transfer - First report"

From Apertium
Revision as of 12:48, 9 June 2012

Patterns of the form <n> <pr> <n> were aligned in the following way:

for every sentence in the source language corpus:
  for every pattern in the sentence
      extract the pattern
      translate the nouns to the target language
      look for a <n1-translated> <pr> <n2-translated> pattern in the target language
      make an example <n1-translated> <pr-source language> <n2-translated> <label>, 
        where label is the preposition from the target language
  end for
end for
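
The alignment loop above could be sketched in Python roughly as follows. This is only an illustration: the token representation (a list of (lemma, tag) pairs per sentence), the `dictionary` mapping and the function names are assumptions and are not taken from the code used in the experiment. The sketch only looks for the <n1> <pr> <n2> target pattern; the <n2> <n1> variant used for ES-EN is not covered.

```python
from typing import Dict, Iterator, List, Tuple

Token = Tuple[str, str]  # (lemma, tag), e.g. ("juicio", "n") -- assumed representation


def find_patterns(sentence: List[Token]) -> Iterator[Tuple[str, str, str]]:
    """Yield (n1, pr, n2) lemma triples for every <n> <pr> <n> window."""
    for (l1, t1), (l2, t2), (l3, t3) in zip(sentence, sentence[1:], sentence[2:]):
        if (t1, t2, t3) == ("n", "pr", "n"):
            yield l1, l2, l3


def align(src: List[Token], tgt: List[Token],
          dictionary: Dict[str, str]) -> List[Tuple[str, str, str, str]]:
    """Build (n1-translated, pr-source, n2-translated, label) examples."""
    examples = []
    tgt_patterns = list(find_patterns(tgt))
    for n1, pr_src, n2 in find_patterns(src):
        n1_t, n2_t = dictionary.get(n1), dictionary.get(n2)
        if n1_t is None or n2_t is None:
            continue  # noun missing from the (hypothetical) bilingual dictionary
        for t1, pr_tgt, t2 in tgt_patterns:
            if (t1, t2) == (n1_t, n2_t):
                # The label is the preposition found on the target side.
                examples.append((n1_t, pr_src, n2_t, pr_tgt))
                break
    return examples
```

For a sentence pair containing "juicio por asesinato" and "trial for murder", with both nouns in the dictionary, this yields the single example ("trial", "por", "murder", "for").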

ES-EN
Analysed pattern: <n1> <pr> <n2> -> <n1> <pr> <n2> | <n2> <n1>

Training corpus size: 1,000,000 lines taken from the start of the EUROPARL-es-en corpus, yielding 151,000 aligned examples.

Prepositions that appear in the <n1> <pr> <n2> pattern, sorted by frequency:

136650 ^de<pr>$
  5160 ^a<pr>$
  2963 ^en<pr>$
  1614 ^por<pr>$
  1495 ^sobre<pr>$
  1290 ^entre<pr>$
  ...

Only cases where ^de<pr>$ appears in the source language (es) were considered, since it accounts for roughly 90% of the aligned examples (136650 of about 151,000) and every other preposition is far less frequent.

The token ^de<pr>$ in a <n1> <pr> <n2> pattern translates to:

68134 times ^_$ (<n2> <n1>) 
57424 times ^of<pr>$
4945  times ^for<pr>$
2864  times ^in<pr>$
798   times ^from<pr>$
739   times ^to<pr>$

Three classes were constructed: '^_$', '^of<pr>$' and 'other'. Class ^_$ means that the preposition is omitted and the nouns swap places, for example: Trial for murder -> Murder trial. Class ^of<pr>$ means that the preposition "of" is inserted between the two nouns. Class 'other' means that the output from Apertium is kept unchanged.
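
As a minimal sketch of how a predicted class could be turned into output, assuming the two nouns are already translated and the existing Apertium output for the phrase is available as a string (the function name and its parameters are hypothetical):

```python
def apply_class(n1: str, n2: str, predicted: str, apertium_output: str) -> str:
    """Render the target phrase for one <n1> <pr> <n2> pattern.

    n1 and n2 are the already-translated nouns; apertium_output is the
    phrase produced by the existing pipeline (used for the 'other' class).
    """
    if predicted == "^_$":
        return f"{n2} {n1}"      # omit the preposition and swap the nouns
    if predicted == "^of<pr>$":
        return f"{n1} of {n2}"   # insert "of" between the two nouns
    return apertium_output       # 'other': keep Apertium's translation
```

With n1 = "trial" and n2 = "murder", the class ^_$ gives "murder trial" and the class ^of<pr>$ gives "trial of murder".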

Three features were chosen for classification: the two nouns and the source language preposition. A linear SVM was trained, and 91% accuracy was achieved using 5-fold cross-validation. The model was additionally tested on 5256 examples taken from the back of the EUROPARL-es-en corpus, where it achieved 90% accuracy.
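
The report does not say which toolkit or feature encoding was used. One common way to feed three categorical features to a linear classifier is a sparse one-hot encoding, and 5-fold cross-validation needs five disjoint test splits. The sketch below illustrates both in plain Python. All names are assumptions, and no SVM is trained here.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str, str]  # (n1, pr_source, n2)
SLOTS = ("n1", "pr", "n2")


def build_index(examples: List[Example]) -> Dict[Tuple[str, str], int]:
    """Assign one column to every distinct (slot, value) pair."""
    index: Dict[Tuple[str, str], int] = {}
    for ex in examples:
        for key in zip(SLOTS, ex):
            index.setdefault(key, len(index))
    return index


def encode(ex: Example, index: Dict[Tuple[str, str], int]) -> List[int]:
    """Return the active column indices of one example (sparse one-hot)."""
    cols = (index.get(key) for key in zip(SLOTS, ex))
    return sorted(c for c in cols if c is not None)  # unseen values are dropped


def k_fold(n: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split range(n) into k (train, test) index pairs with disjoint test sets."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
            for i in range(k)]
```

Each example activates at most three columns, one per slot, so the resulting vectors stay very sparse even when the noun vocabulary is large.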

The WER/PER have not been calculated yet. However, Apertium translates almost every de<pr> token to of<pr>, so there is room for improvement.
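
To put the accuracy figures in context, the translation counts listed above give two simple reference points: a predictor that always chooses ^_$ and one that always chooses ^of<pr>$. The calculation below uses only the six listed outcomes as the denominator, which is an assumption, since the listed counts do not add up to the full 136650 occurrences of ^de<pr>$.

```python
# Counts of how ^de<pr>$ translates, copied from the list above.
counts = {
    "^_$": 68134,
    "^of<pr>$": 57424,
    "^for<pr>$": 4945,
    "^in<pr>$": 2864,
    "^from<pr>$": 798,
    "^to<pr>$": 739,
}
total = sum(counts.values())

# Share of the listed cases that a constant predictor would get right.
always_swap = counts["^_$"] / total     # always omit the preposition and swap
always_of = counts["^of<pr>$"] / total  # always emit "of"

print(f"always '^_$': {always_swap:.1%}")
print(f"always 'of':  {always_of:.1%}")
```

Under this assumption, always swapping is right in roughly half of the listed cases and always emitting "of" in somewhat fewer.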


MK-EN
Analysed pattern: <n1> <pr> <n2> -> <n1> <pr> <n2>

Training corpus size: 100,000 lines taken from the start of the SETIMES-mk-en corpus, yielding 12,016 aligned examples. The most common prepositions that appear in the <n1> <pr> <n2> pattern, sorted by frequency:

5003 ^на<pr>$
3084 ^за<pr>$
2452 ^во<pr>$
 928 ^од<pr>$
 429 ^со<pr>$
 120 ^до<pr>$
 ...

Only these six prepositions from the source language were considered.

The most common prepositions that appear in the <n1> <pr> <n2> pattern in the target language (en), sorted by frequency:

5244 ^of<pr>$
2530 ^in<pr>$
1925 ^for<pr>$
 780 ^to<pr>$
 407 ^with<pr>$
 373 ^on<pr>$
 251 ^from<pr>$
 ...

Eight classes were constructed: one for each English preposition in the list above, plus an additional 'other' class for all prepositions not in the list. Three features were chosen for classification: the two nouns and the source language preposition. A linear SVM was trained, and 84.5% accuracy was achieved using 5-fold cross-validation.
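
The mapping from an aligned English preposition to one of the eight class labels can be written as a short helper. The names `LISTED` and `to_class` are hypothetical, and the labels are shown as bare lemmas rather than in the ^...<pr>$ notation.

```python
# The seven most frequent English prepositions from the list above.
LISTED = ("of", "in", "for", "to", "with", "on", "from")


def to_class(target_preposition: str) -> str:
    """Map an aligned English preposition to one of the eight class labels."""
    return target_preposition if target_preposition in LISTED else "other"
```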

The WER dropped from 7.9% to 5.2% after the model was applied to unseen data.
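
The report does not describe how the WER was computed. For reference, word error rate is usually defined as the word-level edit distance between the system output and a reference translation, divided by the number of words in the reference. A minimal sketch under that assumption (function names are illustrative, tokenisation is plain whitespace splitting):

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)
```

Going from 7.9% to 5.2% is a drop of 2.7 percentage points, or about a third of the original error rate.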