Lexical feature transfer - First report

Tools used:
* Manually written parser for the output of apertium-tagger (Java, C++).
* Manually written pattern extractor (C++).
* Manually written aligner for <n> <pr> <n> patterns (Java; the C++ version has some bugs).
* Manually written nominal-to-numerical feature converter, needed for SVM training (Java).
* RapidMiner
* liblinear / libsvm


A typical pipeline for extracting patterns would look like this:

<pre>
cat file | apertium-destxt | lt-proc xx-yy.automorf.bin | apertium-tagger -g xx-yy.prob \
| apertium-retxt | some-pattern-extractor
</pre>

Patterns in the form of <n1> <pr> <the>? <n2> were extracted in the following way:

<pre>
for every sentence s in the source language corpus:
  for every pattern in the form of <n1> <pr> <the>? <n2> in s:
    extract the lemma and the second tag (grammatical number) of the two nouns
    extract the lemma of the preposition
    extract the lemma and the first tag of the first word different from "the" before <n1>
    extract the lemma and the first tag of the first word different from "the" after <n2>
    extract the label (the / none)
  end for
end for
</pre>
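
A minimal Java sketch of such an extractor is given below. It assumes the ^lemma<tag1><tag2>…$ stream produced by apertium-tagger as input; the Unit record, the CSV output and the exact feature layout are illustrative assumptions, not the actual parser/extractor listed under tools:

<pre>
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternExtractor {

    // One lexical unit from the tagger stream: ^lemma<tag1><tag2>...$
    record Unit(String lemma, List<String> tags) {
        String firstTag()  { return tags.isEmpty() ? "" : tags.get(0); }
        String secondTag() { return tags.size() > 1 ? tags.get(1) : ""; }
    }

    static final Pattern LU  = Pattern.compile("\\^([^<$^]+)((?:<[^>]+>)+)\\$");
    static final Pattern TAG = Pattern.compile("<([^>]+)>");

    static List<Unit> parse(String taggedSentence) {
        List<Unit> units = new ArrayList<>();
        Matcher m = LU.matcher(taggedSentence);
        while (m.find()) {
            List<String> tags = new ArrayList<>();
            Matcher t = TAG.matcher(m.group(2));
            while (t.find()) tags.add(t.group(1));
            units.add(new Unit(m.group(1), tags));
        }
        return units;
    }

    // First unit before (step = -1) or after (step = +1) position i
    // whose lemma is not "the", or null at the sentence boundary.
    static Unit context(List<Unit> s, int i, int step) {
        for (int k = i + step; k >= 0 && k < s.size(); k += step)
            if (!s.get(k).lemma().equals("the")) return s.get(k);
        return null;
    }

    // Emit one comma-separated example per <n1> <pr> <the>? <n2> occurrence.
    static void extract(List<Unit> s) {
        for (int i = 0; i + 2 < s.size(); i++) {
            if (!s.get(i).firstTag().equals("n")) continue;       // <n1>
            if (!s.get(i + 1).firstTag().equals("pr")) continue;  // <pr>
            boolean hasThe = s.get(i + 2).lemma().equals("the");  // <the>?
            int j = hasThe ? i + 3 : i + 2;                       // <n2>
            if (j >= s.size() || !s.get(j).firstTag().equals("n")) continue;
            Unit before = context(s, i, -1);
            Unit after  = context(s, j, +1);
            System.out.println(String.join(",",
                s.get(i).lemma() + "/" + s.get(i).secondTag(),
                s.get(i + 1).lemma(),
                s.get(j).lemma() + "/" + s.get(j).secondTag(),
                before == null ? "NULL" : before.lemma() + "/" + before.firstTag(),
                after  == null ? "NULL" : after.lemma() + "/" + after.firstTag(),
                hasThe ? "the" : "none"));                        // the label
        }
    }
}
</pre>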

ES-EN
Analysed pattern: <n> <pr> <the>? <n> -> <n> <pr> <the>? <n>

Training corpus size: 1,000,000 lines chosen from the front of the EUROPARL-es-en.en corpus, 613,786 training examples. Testing corpus size: 50,000 lines, 51,472 examples.

A linear SVM was trained. Cross validation on the training set showed an accuracy of 91%.
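
With liblinear, the cross-validation step might look roughly as follows. This is a sketch against the liblinear-java port; the solver type, C and epsilon values, file names, and the prior conversion of the examples to numerical format are all assumptions:

<pre>
import java.io.File;
import de.bwaldvogel.liblinear.*;

public class TrainLinearSvm {
    public static void main(String[] args) throws Exception {
        // Examples already converted to numerical (libsvm) format.
        Problem prob = Problem.readFromFile(new File("train.svm"), 1.0);
        Parameter param = new Parameter(SolverType.L2R_L2LOSS_SVC, 1.0, 0.01);

        // 5-fold cross-validation on the training set.
        double[] predicted = new double[prob.l];
        Linear.crossValidation(prob, param, 5, predicted);
        int correct = 0;
        for (int i = 0; i < prob.l; i++)
            if (predicted[i] == prob.y[i]) correct++;
        System.out.printf("CV accuracy: %.2f%%%n", 100.0 * correct / prob.l);

        // Train on the full set and save the model for later use.
        Model model = Linear.train(prob, param);
        Linear.saveModel(new File("model.bin"), model);
    }
}
</pre>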

The PER dropped from 83.39% to 82.97%, and the WER from 119.25% to 118.97%. However, the original testing corpus has 49,929 non-blank lines, while the translated one has 49,919, 10 lines fewer, so from the first mismatch onwards every line is scored against the wrong reference line; this is why the error rates are so high. Re-evaluation will be done as soon as the two corpora are manually re-aligned.
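
Concretely, WER is word-level edit distance divided by reference length, which is how a value above 100% arises: insertions count as errors, so a misaligned hypothesis can contain more errors than the reference has words. A minimal sketch of the metric (not the actual evaluation script used here):

<pre>
public class Wer {
    // WER = (substitutions + deletions + insertions) / reference length.
    static double wer(String[] ref, String[] hyp) {
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;   // deletions
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;   // insertions
        for (int i = 1; i <= ref.length; i++)
            for (int j = 1; j <= hyp.length; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        return (double) d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        // A hypothesis longer than the reference pushes WER above 1.0.
        System.out.println(wer("a b".split(" "), "x y z w".split(" ")));  // 2.0
    }
}
</pre>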

This method takes into account only the original (manually written) target-side corpus (English in this case). Another possibility is to try to align patterns from the manually written and the translated target-side corpora. This way, the features will be extracted from the translated corpus, and the label from the manually written one. This method might provide better results, since the model will be trained on the actual output from Apertium.


Patterns in the form of <n> <pr> <n> were aligned in the following way:

<pre>
for every sentence in the source language corpus:
  for every pattern in the sentence:
    extract the pattern
    translate the nouns to the target language
    look for a <n1-translated> <pr> <n2-translated> pattern in the target language sentence
    make an example <n1-translated> <pr-source language> <n2-translated> <label>,
      where label is the preposition from the target language
  end for
end for
</pre>
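
A sketch of the inner lookup is given below, reusing the Unit record from the extractor sketch above. The bidix map is a hypothetical stand-in for a bilingual-dictionary lookup (e.g. via lt-proc -b), not the actual aligner's interface:

<pre>
import java.util.List;
import java.util.Map;

public class PatternAligner {
    // Returns "n1-translated,pr-source,n2-translated,label" or null.
    static String align(List<PatternExtractor.Unit> src, int n1, int pr, int n2,
                        List<PatternExtractor.Unit> trg,
                        Map<String, String> bidix) {
        String t1 = bidix.get(src.get(n1).lemma());
        String t2 = bidix.get(src.get(n2).lemma());
        if (t1 == null || t2 == null) return null;          // untranslatable noun
        // Look for <t1> <pr> <t2> in the target-language sentence; the
        // target preposition found between the nouns becomes the label.
        for (int i = 0; i + 2 < trg.size(); i++) {
            if (trg.get(i).lemma().equals(t1)
                    && trg.get(i + 1).firstTag().equals("pr")
                    && trg.get(i + 2).lemma().equals(t2))
                return String.join(",", t1, src.get(pr).lemma(), t2,
                                   trg.get(i + 1).lemma());
        }
        return null;                                        // no aligned pattern
    }
}
</pre>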

After pattern extraction is done, the patterns containing a particular preposition can simply be grepped out of the extractor's output.


ES-EN
Analysed pattern: <n1> <pr> <n2> -> <n1> <pr> <n2> | <n2> <n1>

Training corpus size: 1,000,000 lines chosen from the front of the EUROPARL-es-en corpus, 151,000 aligned examples.

Prepositions that appear in the <n1> <pr> <n2> pattern, sorted by frequency:

<pre>
136650 ^de<pr>$
  5160 ^a<pr>$
  2963 ^en<pr>$
  1614 ^por<pr>$
  1495 ^sobre<pr>$
  1290 ^entre<pr>$
  ...
</pre>

Only those cases where ^de<pr>$ appears in the source language (es) were considered, since the other prepositions represent a significantly smaller fraction.

The token ^de<pr>$ in a <n1> <pr> <n2> pattern translates to:

<pre>
68134 times ^_$ (<n2> <n1>)
57424 times ^of<pr>$
 4945 times ^for<pr>$
 2864 times ^in<pr>$
  798 times ^from<pr>$
  739 times ^to<pr>$
</pre>

Three classes were constructed: class '^_$', class '^of<pr>$' and class 'other'. Class ^_$ denotes that the preposition will be omitted and the nouns will swap places (example: trial for murder -> murder trial). Class ^of<pr>$ denotes the preposition that will be inserted between the two nouns. Class 'other' means that Apertium's own output will be kept.
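
The three-way labelling itself is a direct mapping; a sketch (the class names are illustrative):

<pre>
public class LabelMapper {
    // Map the target token aligned with es "de" to one of the three classes.
    static String label(String targetToken) {
        switch (targetToken) {
            case "^_$":      return "SWAP";  // omit preposition, swap the nouns
            case "^of<pr>$": return "OF";    // insert "of" between the nouns
            default:         return "OTHER"; // keep Apertium's own output
        }
    }
}
</pre>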

Three features were chosen for classification: the two nouns and the source-language preposition. A linear SVM was trained, and 91% accuracy was achieved using 5-fold cross-validation. The model was additionally tested on 5,256 examples taken from the back of the EUROPARL-es-en corpus, where 90% accuracy was achieved.
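
The nominal-to-numerical feature converter listed under tools might work along these lines: every distinct (feature, value) pair gets its own dimension, producing one-hot features in the sparse libsvm/liblinear input format. The index scheme and feature prefixes below are assumptions:

<pre>
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class NominalToNumerical {
    private final Map<String, Integer> index = new HashMap<>();

    // Assign a fresh 1-based index the first time a value is seen.
    int indexOf(String featureValue) {
        return index.computeIfAbsent(featureValue, k -> index.size() + 1);
    }

    // (n1, pr, n2) -> "label i:1 j:1 k:1" with ascending indices,
    // as liblinear expects.
    String convert(int label, String n1, String pr, String n2) {
        int[] idx = { indexOf("n1=" + n1), indexOf("pr=" + pr), indexOf("n2=" + n2) };
        Arrays.sort(idx);
        return label + " " + idx[0] + ":1 " + idx[1] + ":1 " + idx[2] + ":1";
    }

    public static void main(String[] args) {
        NominalToNumerical conv = new NominalToNumerical();
        // e.g. (trial, de, murder) labelled as class 1
        System.out.println(conv.convert(1, "trial", "de", "murder"));
        // prints: 1 1:1 2:1 3:1
    }
}
</pre>

With three nominal features per example, every vector has exactly three non-zero components, which keeps the sparse representation compact even when the vocabulary pushes the dimensionality into the hundreds of thousands.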

The WER/PER have not been calculated yet. However, Apertium translates almost every de<pr> token to of<pr>, so there is room for improvement.


MK-EN
Analysed pattern: <n1> <pr> <n2> -> <n1> <pr> <n2>

Training corpus size: 100,000 lines chosen from the front of the SETIMES-mk-en corpus, 12,016 aligned examples.
The most common prepositions that appear in the <n1> <pr> <n2> pattern, sorted by frequency:

<pre>
5003 ^на<pr>$
3084 ^за<pr>$
2452 ^во<pr>$
 928 ^од<pr>$
 429 ^со<pr>$
 120 ^до<pr>$
 ...
</pre>

Only these six prepositions from the source language were considered.

The most common prepositions that appear in the <n1> <pr> <n2> pattern in the target language (en), sorted by frequency:

<pre>
5244 ^of<pr>$
2530 ^in<pr>$
1925 ^for<pr>$
 780 ^to<pr>$
 407 ^with<pr>$
 373 ^on<pr>$
 251 ^from<pr>$
 ...
</pre>

Eight classes were constructed: one for each of the seven English prepositions listed above, plus an additional 'other' class for all prepositions not in the list. Three features were chosen for classification: the two nouns and the source-language preposition. A linear SVM was trained, and 84.5% accuracy was achieved using 5-fold cross-validation.

The WER dropped from 7.9% to 5.2% after the model was applied on unseen data.

[[Category:Development]]