Difference between revisions of "User:Francis Tyers/Sandbox2"
Line 1: | Line 1: | ||
Constraint-based lexical selection for rule-based machine translation |
|||
<pre> |
<pre> |
||
Corpus: cawiki-20110616-pages-articles.xml.bz2 |
Corpus: cawiki-20110616-pages-articles.xml.bz2 |
||
Line 17: | Line 19: | ||
Test corpus: |
Test corpus: |
||
* 150 test words |
|||
* |
* 1,500 sentences |
||
* 10 per test word |
* 10 per test word |
||
* Randomly selected from the subset of sentences which were found in the corpus. |
* Randomly selected from the subset of sentences which were found in the corpus. |
||
* Only words with >100 example sentences included |
* Only words with >100 example sentences included |
||
Training corpus: |
|||
Baseline: |
|||
Baselines: |
|||
*: This would require a parallel corpus. |
|||
* TL Frequency-best |
|||
Rationale: |
|||
* TLM-best |
|||
* Linguist set |
|||
* Rules from phrase table |
|||
==Testing== |
|||
;Input: Les Carmelites el veneren com a sant patró seu. |
|||
<pre> |
|||
^El<det><def><f><pl>/The<det><def><f><pl>$ |
|||
^*Carmelites/*Carmelites$ |
|||
^prpers<prn><pro><p3><m><sg>/prpers<prn><obj><p3><nt><sg>$ |
|||
^venerar<vblex><pri><p3><pl>/venerate<vblex><pri><p3><pl>$ |
|||
^com a<pr>/as a<pr>$ ^sant<adj><m><sg>/saint<adj><m><sg>$ |
|||
^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/head<n><sg>/pattern<n><sg>/employer<n><sg>$ |
|||
^seu<adj><pos><m><sg>/his<adj><pos><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$ |
|||
</pre> |
|||
;Reference: |
|||
<pre> |
|||
235626 ]^El<det><def><f><pl>/The<det><def><f><pl>$ |
|||
^*Carmelites/*Carmelites$ |
|||
^prpers<prn><pro><p3><m><sg>/prpers<prn><obj><p3><nt><sg>$ |
|||
^venerar<vblex><pri><p3><pl>/venerate<vblex><pri><p3><pl>$ |
|||
^com a<pr>/as a<pr>$ |
|||
^sant<adj><m><sg>/saint<adj><m><sg>$ |
|||
^patró<n><m><sg>/patron<n><sg>$ |
|||
^seu<adj><pos><m><sg>/his<adj><pos><m><sg>$^.<sent>/.<sent>$[ |
|||
</pre> |
|||
;Test 1 (1/6) |
|||
<pre> |
|||
^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/head<n><sg>/pattern<n><sg>/employer<n><sg>$ |
|||
</pre> |
|||
;Test 2 (1/1) |
|||
<pre> |
|||
^patró<n><m><sg>/patron<n><sg>$ |
|||
</pre> |
|||
;Test 3 (1/4) |
|||
<pre> |
|||
^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/employer<n><sg>$ |
|||
</pre> |
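The (1/6), (1/1) and (1/4) annotations on the tests above can be read as "reference translation found, out of N translations left in the lexical unit". The sketch below is a minimal illustration of that reading, not the scoring script actually used. The names `parse_units` and `unit_score` are hypothetical, and the parser assumes no escaped `/`, `^` or `$` characters inside a unit.

```python
import re
from fractions import Fraction

# One lexical unit looks like ^source<tags>/translation1<tags>/translation2<tags>$
UNIT_RE = re.compile(r"\^(.*?)\$")


def parse_units(stream):
    """Return a list of (source, [translations]) pairs, one per ^...$ unit."""
    units = []
    for body in UNIT_RE.findall(stream):
        source, *translations = body.split("/")
        units.append((source, translations))
    return units


def unit_score(test_unit, reference_unit):
    """Score one test unit against its reference unit.

    Assumed scoring: 1/N if the reference translation is among the N
    translations left in the test unit, otherwise 0.
    """
    source, candidates = parse_units(test_unit)[0]
    ref_source, ref_translations = parse_units(reference_unit)[0]
    if source != ref_source:
        raise ValueError("test and reference units have different source sides")
    if ref_translations[0] not in candidates:
        return Fraction(0)
    return Fraction(1, len(candidates))


REFERENCE = "^patró<n><m><sg>/patron<n><sg>$"
TEST_1 = ("^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>"
          "/head<n><sg>/pattern<n><sg>/employer<n><sg>$")
TEST_3 = "^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/employer<n><sg>$"

for name, unit in [("Test 1", TEST_1), ("Test 2", REFERENCE), ("Test 3", TEST_3)]:
    print(name, unit_score(unit, REFERENCE))
```

Under this reading a system that leaves every translation in place is penalised for ambiguity, while one that narrows the unit down to the reference translation alone gets full credit.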
Revision as of 11:05, 7 September 2011
Constraint-based lexical selection for rule-based machine translation
Corpus: cawiki-20110616-pages-articles.xml.bz2, cleaned with `aq-wikicrp'
- 1758582 lines
- 531983 unique analyses
- 531436 lines with >1 translation (30%)
- 2740 analyses with >1 translation
- 287 words (lemma+pos) with >1 translation in corpus
- 712 words in dictionary with >1 translation
- 1.03 fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
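The fertility figure above is defined as the total number of word:word translations divided by the total number of words. A minimal sketch of that ratio follows, assuming the dictionary is available as a mapping from each source word to its list of translations. The function name `fertility` and the toy entries are illustrative only, and whether the original figure counts word types or corpus tokens is not stated here; this version counts one entry per word type.

```python
def fertility(translations_by_word):
    """Average number of translations per source word.

    translations_by_word maps a source word (lemma+pos) to the list of
    target-language translations the dictionary offers for it.
    """
    total_words = len(translations_by_word)
    if total_words == 0:
        raise ValueError("cannot compute fertility of an empty dictionary")
    total_pairs = sum(len(targets) for targets in translations_by_word.values())
    return total_pairs / total_words


# Toy dictionary (not the real one): one ambiguous word, two unambiguous ones.
toy_dictionary = {
    "patró<n>": ["patron", "owner", "master", "head", "pattern", "employer"],
    "sant<adj>": ["saint"],
    "venerar<vblex>": ["venerate"],
}
print(round(fertility(toy_dictionary), 2))
```

A value close to 1, such as the 1.03 reported, means that most words have a single translation, so lexical selection only matters for a small part of the vocabulary.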
Test corpus:
- 150 test words
- 1,500 sentences
- 10 per test word
- Randomly selected from the subset of sentences which were found in the corpus.
- Only words with >100 example sentences included
- Rationale: Dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.
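The selection criteria in this list can be sketched as a small sampling routine. This is only an illustration under stated assumptions: the name `sample_test_corpus` and its parameters are hypothetical, the input is assumed to be a mapping from each ambiguous word to its example sentences, and choosing the test words at random is an assumption, since the list does not say how the 150 words were picked.

```python
import random


def sample_test_corpus(sentences_by_word, n_words=150, per_word=10,
                       min_examples=100, seed=0):
    """Pick test words and a fixed number of random sentences for each.

    Only words with more than min_examples sentences are eligible.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    eligible = sorted(
        word for word, sentences in sentences_by_word.items()
        if len(sentences) > min_examples
    )
    # Assumption: test words are drawn at random from the eligible ones.
    chosen = rng.sample(eligible, min(n_words, len(eligible)))
    return {word: rng.sample(sentences_by_word[word], per_word) for word in chosen}
```

With the defaults, 150 words at 10 sentences each gives the 1,500 test sentences listed above.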
Training corpus:
Baselines:
- TL Frequency-best
- TLM-best
- Linguist set
- Full analysis: Full analysis dic from Giza++
- Rules from phrase table
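As an illustration of the simplest baseline in this list, the sketch below assumes that "TL Frequency-best" means choosing, for each ambiguous word, the translation that is most frequent in a target-language corpus. That reading, the name `frequency_best`, and the toy counts are all assumptions, not taken from the experiments.

```python
from collections import Counter


def frequency_best(candidates, tl_counts):
    """Return the candidate translation with the highest target-language count.

    Unseen candidates count as 0; ties go to the candidate listed first,
    i.e. they fall back to dictionary order.
    """
    if not candidates:
        raise ValueError("no candidate translations to choose from")
    return max(candidates, key=lambda candidate: tl_counts[candidate])


# Toy target-language counts, made up for the example.
tl_counts = Counter({"owner": 40, "pattern": 35, "patron": 12, "head": 90})
candidates = ["patron", "owner", "master", "head", "pattern", "employer"]
print(frequency_best(candidates, tl_counts))
```

A baseline of this kind ignores the source context entirely, which is what makes it a useful lower bound for context-sensitive selection rules.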