Lexical feature transfer - First report
Patterns of the form <n> <pr> <n> were aligned in the following way:
for every sentence in the source language corpus:
    for every pattern in the sentence:
        extract the pattern
        translate the nouns to the target language
        look for a <n1-translated> <pr> <n2-translated> pattern in the target language sentence
        make an example <n1-translated> <pr-source language> <n2-translated> <label>, where label is the preposition from the target language
    end for
end for
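The alignment loop above can be sketched in Python as follows. This is only an illustration under stated assumptions: sentences are taken to be strings of Apertium-style lexical units such as ^de<pr>$, and `bidix` is a hypothetical dictionary mapping source nouns to target nouns. The function names are made up for this sketch, and it covers only the <n1> <pr> <n2> target pattern, not the <n2> <n1> case.

```python
import re

# Matches one lexical unit such as ^de<pr>$ or ^casa<n><f><sg>$ and
# captures the lemma and its first tag.
TOKEN_RE = re.compile(r"\^([^<$]+)<([^>]+)>[^$]*\$")


def parse(sentence):
    """Turn '^casa<n>$ ^de<pr>$' into [('casa', 'n'), ('de', 'pr')]."""
    return TOKEN_RE.findall(sentence)


def find_patterns(tokens):
    """Yield every (n1, pr, n2) lemma triple matching <n> <pr> <n>."""
    for (l1, t1), (l2, t2), (l3, t3) in zip(tokens, tokens[1:], tokens[2:]):
        if (t1, t2, t3) == ("n", "pr", "n"):
            yield l1, l2, l3


def align(source_sentence, target_sentence, bidix):
    """Build (n1-translated, source preposition, n2-translated, label) examples.

    `bidix` is a hypothetical dict from source nouns to target nouns.
    """
    target_patterns = list(find_patterns(parse(target_sentence)))
    examples = []
    for n1, pr, n2 in find_patterns(parse(source_sentence)):
        t1, t2 = bidix.get(n1), bidix.get(n2)
        if t1 is None or t2 is None:
            continue  # a noun without a translation cannot be aligned
        for tn1, tpr, tn2 in target_patterns:
            if (tn1, tn2) == (t1, t2):
                # The label is the preposition found on the target side.
                examples.append((t1, pr, t2, tpr))
                break
    return examples


if __name__ == "__main__":
    bidix = {"miembro": "member", "comisión": "commission"}
    print(align("^miembro<n>$ ^de<pr>$ ^comisión<n>$",
                "^member<n>$ ^of<pr>$ ^commission<n>$",
                bidix))
```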
ES-EN analysed pattern: <n1> <pr> <n2> -> <n1> <pr> <n2> | <n2> <n1>
Training corpus size: 1.000.000 lines chosen from the front of the EUROPARL-es-en corpus, 151.000 aligned examples.
Prepositions that appear in the <n1> <pr> <n2> pattern, sorted by frequency:
136650 ^de<pr>$
5160 ^a<pr>$
2963 ^en<pr>$
1614 ^por<pr>$
1495 ^sobre<pr>$
1290 ^entre<pr>$
...
Only those cases where ^de<pr>$ appears in the source language (es) were considered, since the other prepositions represent a significantly smaller fraction.
The token ^de<pr>$ in a <n1> <pr> <n2> pattern translates to:
68134 times ^_$ (<n2> <n1>)
57424 times ^of<pr>$
4945 times ^for<pr>$
2864 times ^in<pr>$
798 times ^from<pr>$
739 times ^to<pr>$
Three classes were constructed: class '^_$', class '^of<pr>$' and class 'other'. Class ^_$ denotes that the preposition is omitted and the nouns swap places. Example: Trial for murder -> Murder trial. Class ^of<pr>$ denotes that the preposition "of" is inserted between the two nouns. Class 'other' means that the output from Apertium is kept.
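A minimal sketch of how the three classes could be derived from the aligned labels and then applied at translation time. The function names `to_class` and `render` are hypothetical, and the nouns are assumed to be already translated surface forms.

```python
def to_class(target_label):
    """Collapse a target-side label into one of the three classes."""
    if target_label in ("^_$", "^of<pr>$"):
        return target_label
    return "other"


def render(n1, n2, predicted_class, apertium_output):
    """Produce the output for one <n1> <pr> <n2> pattern."""
    if predicted_class == "^_$":
        return f"{n2} {n1}"       # drop the preposition and swap the nouns
    if predicted_class == "^of<pr>$":
        return f"{n1} of {n2}"    # insert "of" between the two nouns
    return apertium_output        # class 'other': keep Apertium's output


if __name__ == "__main__":
    print(render("trial", "murder", "^_$", "trial of murder"))
```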
Three features were chosen for classification: the two nouns and the source language preposition. A linear SVM was trained, and 91% accuracy was achieved using 5-fold cross-validation. The model was additionally tested on 5256 examples taken from the back of the EUROPARL-es-en corpus, and 90% accuracy was achieved.
The WER/PER have not been calculated yet. However, Apertium translates almost every de<pr> token to of<pr>, so there is room for improvement.
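The report does not say how the three categorical features were turned into SVM input. One common option, assumed here purely for illustration, is one-hot encoding, where every distinct (slot, value) pair gets its own column. The helper names below are hypothetical.

```python
def build_vocabulary(examples):
    """Assign one column index to every distinct (slot, value) pair."""
    vocab = {}
    for n1, pr, n2, _label in examples:
        for feature in (("n1", n1), ("pr", pr), ("n2", n2)):
            vocab.setdefault(feature, len(vocab))
    return vocab


def encode(example, vocab):
    """Return a one-hot vector over the vocabulary for one example."""
    n1, pr, n2 = example[:3]
    vector = [0] * len(vocab)
    for feature in (("n1", n1), ("pr", pr), ("n2", n2)):
        index = vocab.get(feature)
        if index is not None:  # values unseen in training stay at zero
            vector[index] = 1
    return vector


if __name__ == "__main__":
    training = [
        ("trial", "de", "murder", "^_$"),
        ("member", "de", "commission", "^of<pr>$"),
    ]
    vocab = build_vocabulary(training)
    for example in training:
        print(encode(example, vocab), example[3])
```

The resulting vectors, together with the class labels, are the kind of input a linear SVM implementation would typically accept.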
MK-EN analysed pattern: <n1> <pr> <n2> -> <n1> <pr> <n2>
Training corpus size: 100.000 lines chosen from the front of the SETIMES-mk-en corpus, 12.016 aligned examples.
The most common prepositions that appear in the <n1> <pr> <n2> pattern, sorted by frequency:
5003 ^на<pr>$
3084 ^за<pr>$
2452 ^во<pr>$
928 ^од<pr>$
429 ^со<pr>$
120 ^до<pr>$
...
Only these six prepositions from the source language were considered.
The most common prepositions that appear in the <n1> <pr> <n2> pattern in the target language (en), sorted by frequency:
5244 ^of<pr>$
2530 ^in<pr>$
1925 ^for<pr>$
780 ^to<pr>$
407 ^with<pr>$
373 ^on<pr>$
251 ^from<pr>$
...
Eight classes were constructed: one for each of the seven English prepositions listed above, plus an additional 'other' class for all prepositions not in the list. Three features were chosen for classification: the two nouns and the source language preposition. A linear SVM was trained, and 84.5% accuracy was achieved using 5-fold cross-validation.
The WER dropped from 7.9% to 5.2% after the model was applied to unseen data.
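For reference, WER is conventionally defined as the word-level edit distance between a reference translation and the system output, divided by the number of reference words. The sketch below follows that standard definition; it is not necessarily the tool or the exact procedure used to obtain the figures above, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words
    # processed so far (excluding the current one) and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from output
                current[j - 1] + 1,      # extra word in the output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a x c d"))
```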