Difference between revisions of "Uighur and Turkish/GSoC2018 report"

From Apertium
 
(9 intermediate revisions by the same user not shown)
Line 4: Line 4:
   
 
== Commits ==
 
== Commits ==
My commits can be found [https://apertium.projectjj.com/gsoc2018/oguz/oguz.html here].
+
My commits can be found [https://apertium.projectjj.com/gsoc2018/oguz/oguz.html here]. You can also download my work as a [https://apertium.projectjj.com/gsoc2018/oguz.zip zip file].
 
 
   
 
== Corpora and Coverage ==
 
== Corpora and Coverage ==
Our main corpora consisted of [https://www.rfa.org/uyghur/ RFA], [http://uy.ts.cn/ Tanritor], [http://www.trt.net.tr/uyghur/ TRT Uyghurche] and Uyghur Wikipedia, but we also worked on some Uyghur blogs, an collection of Uyghur stories and the Uyghur translation of the Bible to be able to cover different domains. Wikipedia and blog coverages were relatively lower due to nonstandard forms, Arabic and Farsi texts and alphabets.
+
Our main corpora consisted of [https://www.rfa.org/uyghur/ RFA], [http://uy.ts.cn/ Tanritor], [http://www.trt.net.tr/uyghur/ TRT Uyghurche] and Uyghur Wikipedia, but we also worked on some Uyghur blogs, a collection of Uyghur stories and the Uyghur translation of the Bible to be able to cover different domains. Wikipedia and blog coverages were relatively low due to nonstandard forms, Arabic and Farsi texts and alphabets.
   
   
Line 15: Line 13:
 
|-
 
|-
 
! Corpus
 
! Corpus
  +
! Words
 
! Coverage
 
! Coverage
 
|-
 
|-
 
| News
 
| News
| xx.x%
+
| 3447048
  +
| 94.0%
 
|-
 
|-
 
| Bible
 
| Bible
  +
| 1527061
 
| 94.1%
 
| 94.1%
 
|-
 
|-
 
| Wikipedia
 
| Wikipedia
  +
| 1589113
 
| 88.2%
 
| 88.2%
 
|-
 
|-
 
| Blogs
 
| Blogs
| xx.x%
+
| 4055981
  +
| 87.0%
 
|}
 
|}
   
Line 33: Line 36:
 
There are about 50 transfer rules, mostly needed to cover Uyghur's relatively richer tense inventory. We also needed transfer rules for expression that are -optionally-
 
There are about 50 transfer rules, mostly needed to cover Uyghur's relatively richer tense inventory. We also needed transfer rules for expression that are -optionally-
 
expressed synthetically in Uyghur but was analytic in Turkish. To give some examples:
 
expressed synthetically in Uyghur but was analytic in Turkish. To give some examples:
  +
 
'''''bolidighan''''' and '''''bolmaqchi''''' in Uyghur are both equivalents of Turkish '''''olacak'''''.
 
'''''bolidighan''''' and '''''bolmaqchi''''' in Uyghur are both equivalents of Turkish '''''olacak'''''.
  +
 
Uyghur can use the '''-''rAK''''' for comparisons, which is expressed by the preposition '''daha''' in Turkish.
 
Uyghur can use the '''-''rAK''''' for comparisons, which is expressed by the preposition '''daha''' in Turkish.
   
===Disambiguation===
+
==Disambiguation==
Many different analyses are generated for many word forms. To correctly discern the lemma and the morphology so as to be translated correctly into the target language, MT systems have '''disambiguation''' components. The disambiguation in this system is currently carried out using Constraint Grammar (CG). 68 rules remove the wrong analyses and select the the correct ones with the use of contextual morphological information. Ideally this would be either in conjunction with or replaced by a machine-learned POS tagger, which requires a tagged corpus. The tagged corpus will be developed in the near future.
+
To correctly discern the lemma and the morphology so as to be translated correctly into the target language, Apertium uses Constraint Grammar (CG). Currently Uyghur has about 45 CG rules for disambiguation.
   
===Lexical Selection===
 
Lexical selection is used when the system needs to choose among multiple possible translations. The lexical selection component uses rules to choose which translation to prefer based on contextual information.
 
   
 
==Lexical Selection==
  +
To determine in which context which translation of a given lemma would be selected, lexical selection is employed. Currently uig-tur has 35 lexsel rules.
   
   
==Sources==
+
== WER results ==
  +
Here is the WER result before I added the unknown words/wrote some CG rules for the text:
I used the online [http://dict.yulghun.com/[Yulghun]] dictionary and ''Uyghurche-Türkche Lughet'' of E. N. Necip the vocabulary. For grammar reference, I used Rıdvan Öztürk's ''Yeni Uygur Türkçesi Grameri''.
 
   
  +
<pre> Test file: 'wikiwertr0.txt'
  +
Reference file 'wikiwertur.txt'
  +
  +
Statistics about input files
  +
-------------------------------------------------------
  +
Number of words in reference: 1068
  +
Number of words in test: 1105
  +
Number of unknown words (marked with a star) in test:
  +
Percentage of unknown words: 0.00 %
  +
  +
Results when removing unknown-word marks (stars)
  +
-------------------------------------------------------
  +
Edit distance: 394
  +
Word error rate (WER): 36.89 %
  +
Number of position-independent correct words: 731
  +
Position-independent word error rate (PER): 35.02 %
  +
  +
Results when unknown-word marks (stars) are not removed
  +
-------------------------------------------------------
  +
Edit distance: 394
  +
Word Error Rate (WER): 36.89 %
  +
Number of position-independent correct words: 731
  +
Position-independent word error rate (PER): 35.02 %
  +
  +
Statistics about the translation of unknown words
  +
-------------------------------------------------------
  +
Number of unknown words which were free rides: 0
  +
Percentage of unknown words that were free rides: 0%
  +
</pre>
  +
  +
And after:
  +
  +
<pre>
  +
Test file: 'wikiwertrlast.txt'
  +
Reference file 'wikiwertur.txt'
  +
  +
Statistics about input files
  +
-------------------------------------------------------
  +
Number of words in reference: 1068
  +
Number of words in test: 1065
  +
Number of unknown words (marked with a star) in test:
  +
Percentage of unknown words: 0.00 %
  +
  +
Results when removing unknown-word marks (stars)
  +
-------------------------------------------------------
  +
Edit distance: 179
  +
Word error rate (WER): 16.76 %
  +
Number of position-independent correct words: 907
  +
Position-independent word error rate (PER): 15.07 %
  +
  +
Results when unknown-word marks (stars) are not removed
  +
-------------------------------------------------------
  +
Edit distance: 179
  +
Word Error Rate (WER): 16.76 %
  +
Number of position-independent correct words: 907
  +
Position-independent word error rate (PER): 15.07 %
  +
  +
Statistics about the translation of unknown words
  +
-------------------------------------------------------
  +
Number of unknown words which were free rides: 0
  +
Percentage of unknown words that were free rides: 0%
  +
</pre>
  +
==Sources==
 
I used the online [http://dict.yulghun.com/ Yulghun] dictionary and ''Uyghurche-Türkche Lughet'' of E. N. Necip for the vocabulary. For grammar reference, I used Rıdvan Öztürk's ''Yeni Uygur Türkçesi Grameri''.
   
 
==Future Plans==
 
==Future Plans==

Latest revision as of 13:25, 12 August 2018

This project used Apertium to develop a machine translation (MT) system between Uyghur and Turkish, two Turkic languages. The work consisted mainly of building a bilingual dictionary (bidix), writing transfer and disambiguation rules, and enriching the Uyghur morphological analyzer.


Commits

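My commits can be found at https://apertium.projectjj.com/gsoc2018/oguz/oguz.html. You can also download my work as a zip file from https://apertium.projectjj.com/gsoc2018/oguz.zip.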

Corpora and Coverage

Our main corpora consisted of RFA, Tanritor, TRT Uyghurche and Uyghur Wikipedia, but we also worked on some Uyghur blogs, a collection of Uyghur stories and the Uyghur translation of the Bible in order to cover different domains. Coverage on Wikipedia and the blogs was relatively low because of nonstandard forms and Arabic and Farsi texts and alphabets.


{|
|-
! Corpus
! Words
! Coverage
|-
| News
| 3447048
| 94.0%
|-
| Bible
| 1527061
| 94.1%
|-
| Wikipedia
| 1589113
| 88.2%
|-
| Blogs
| 4055981
| 87.0%
|}
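
The coverage figures above are the share of corpus tokens that receive at least one analysis from the morphological analyzer. A minimal sketch of that calculation follows, assuming the analyzer output is in the Apertium stream format (`^surface/analysis$`, with unknown words marked by a leading `*` in the analysis). The function name `naive_coverage` and the sample string are made up for illustration, and escaped characters in the stream are ignored for brevity; this is not the script that was used to produce the table.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return the fraction of lexical units that received at least one analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # The analyzer marks an unknown word by prefixing its analysis with '*'.
        if not match.group(2).startswith("*"):
            known += 1
    return known / total if total else 0.0


# Made-up sample: two analysed units and one unknown word.
sample = "^kitab/kitab<n><nom>$ ^xyz/*xyz$ ^./.<sent>$"
print(f"{naive_coverage(sample):.1%}")
```

Counting tokens rather than distinct word forms means that frequent words weigh more, which is why a small number of missing high-frequency forms can lower the figure noticeably.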

Transfer

There are about 50 transfer rules, mostly needed to cover Uyghur's relatively richer tense inventory. We also needed transfer rules for expressions that are (optionally) expressed synthetically in Uyghur but analytically in Turkish. To give some examples:

bolidighan and bolmaqchi in Uyghur are both equivalents of Turkish olacak.

Uyghur can use the suffix -rAK for comparisons, which Turkish expresses with the adverb daha.
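
The two examples can be pictured with a toy sketch. It is not the pair's actual transfer rules: the tag names `adj`, `adv` and `comp`, the helper `transfer_comparative` and the one-entry dictionary are assumptions chosen for illustration. Only the mapping of bolidighan and bolmaqchi to olacak and the use of daha come from the text above.

```python
# Many-to-one mapping taken from the example above.
FORM_MAP = {"bolidighan": "olacak", "bolmaqchi": "olacak"}

# Illustrative one-entry bilingual dictionary (an assumption, not the real bidix).
BIDIX = {"yaxshi": "iyi"}


def transfer_comparative(lemma, tags):
    """Turn a synthetic comparative (lemma + 'comp' tag) into 'daha' + adjective."""
    target = BIDIX.get(lemma, "*" + lemma)  # unknown lemmas are starred
    if "comp" in tags:
        remaining = [tag for tag in tags if tag != "comp"]
        # One source word becomes two target words: the adverb, then the adjective.
        return [("daha", ["adv"]), (target, remaining)]
    return [(target, list(tags))]


print(transfer_comparative("yaxshi", ["adj", "comp"]))
print(FORM_MAP["bolidighan"] == FORM_MAP["bolmaqchi"])
```

The point of the sketch is the change in word count: a single suffixed Uyghur word yields two Turkish words, which is something a dictionary lookup alone cannot do and a transfer rule can.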

Disambiguation

To pick the correct lemma and morphological analysis of each word form, so that it is translated correctly into the target language, Apertium uses Constraint Grammar (CG). Uyghur currently has about 45 CG rules for disambiguation.
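
As a rough picture of what one such rule does, the sketch below removes a reading when the preceding word has a given tag. It is written in Python rather than in CG syntax, and the function name, the tags `det`, `n`, `v` and the placeholder word forms are all assumptions for illustration, not rules from the Uyghur grammar.

```python
def remove_if_previous(cohorts, target_tag, previous_tag):
    """Apply one REMOVE-style rule to a sentence.

    cohorts: list of (surface form, readings); a reading is a tuple whose first
    item is the lemma and whose remaining items are tags, e.g. ("b", "n").
    A reading tagged target_tag is dropped when some reading of the preceding
    cohort carries previous_tag. The last remaining reading of a cohort is
    never removed.
    """
    result = []
    for form, readings in cohorts:
        kept = readings
        if result and any(previous_tag in r[1:] for r in result[-1][1]):
            filtered = [r for r in readings if target_tag not in r[1:]]
            if filtered:  # keep at least one reading
                kept = filtered
        result.append((form, list(kept)))
    return result


# Placeholder sentence: w2 is ambiguous between a noun and a verb reading.
sentence = [
    ("w1", [("a", "det")]),
    ("w2", [("b", "n"), ("b", "v")]),
    ("w3", [("c", "v")]),
]
print(remove_if_previous(sentence, "v", "det"))
```

Never deleting the last reading keeps every word translatable even when a rule is too aggressive.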


Lexical Selection

Lexical selection rules determine which translation of a given lemma is chosen in which context. The uig-tur pair currently has 35 lexsel rules.
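
The idea can be sketched as a lookup with a context test and a default. Everything in the sketch is a placeholder: `src`, `ctx`, `tr_a` and `tr_b` stand in for real lemmas, and the rule format is an assumption made for illustration, not the format of the pair's lexsel file.

```python
# Hypothetical rule table:
# (source lemma, lemma that must occur in the sentence, translation to pick)
RULES = [("src", "ctx", "tr_b")]

# Default translation used when no rule matches.
DEFAULTS = {"src": "tr_a"}


def select_translation(lemma, sentence_lemmas, rules=RULES, defaults=DEFAULTS):
    """Pick a translation for lemma given the other lemmas in the sentence."""
    for source, context, choice in rules:
        if lemma == source and context in sentence_lemmas:
            return choice
    # No rule fired: fall back to the default, or star an unknown lemma.
    return defaults.get(lemma, "*" + lemma)


print(select_translation("src", ["ctx", "other"]))
print(select_translation("src", ["other"]))
```

A default translation plus a small set of context-triggered exceptions is what makes a few dozen rules enough to be useful.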


WER results

Here is the WER result before I added the unknown words and wrote some CG rules for the text:

Test file: 'wikiwertr0.txt'
Reference file 'wikiwertur.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 1068
Number of words in test: 1105
Number of unknown words (marked with a star) in test: 
Percentage of unknown words: 0.00 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 394
Word error rate (WER): 36.89 %
Number of position-independent correct words: 731
Position-independent word error rate (PER): 35.02 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 394
Word Error Rate (WER): 36.89 %
Number of position-independent correct words: 731
Position-independent word error rate (PER): 35.02 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0%
 

And after:

Test file: 'wikiwertrlast.txt'
Reference file 'wikiwertur.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 1068
Number of words in test: 1065
Number of unknown words (marked with a star) in test: 
Percentage of unknown words: 0.00 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 179
Word error rate (WER): 16.76 %
Number of position-independent correct words: 907
Position-independent word error rate (PER): 15.07 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 179
Word Error Rate (WER): 16.76 %
Number of position-independent correct words: 907
Position-independent word error rate (PER): 15.07 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0%
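
The WER figures in both reports equal the word-level edit distance divided by the number of words in the reference: 394 / 1068 and 179 / 1068. The sketch below shows that calculation and rechecks the reported percentages. It is not the evaluation script that produced the reports. The PER formula used in the last two lines, (max(reference words, test words) - position-independent correct words) / reference words, is inferred from the numbers in the reports and is an assumption.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance over word lists (each edit operation costs 1)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * word_edit_distance(reference, hypothesis) / len(reference)


# Rechecking the reported figures from the edit distances and word counts.
wer_before = round(100 * 394 / 1068, 2)
wer_after = round(100 * 179 / 1068, 2)
per_before = round(100 * (max(1068, 1105) - 731) / 1068, 2)
per_after = round(100 * (max(1068, 1065) - 907) / 1068, 2)
print(wer_before, wer_after, per_before, per_after)
```

Between the two runs the edit distance on the same 1068-word reference fell from 394 to 179, so the WER dropped by about 20 percentage points.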

Sources

I used the online Yulghun dictionary and Uyghurche-Türkche Lughet of E. N. Necip for the vocabulary. For grammar reference, I used Rıdvan Öztürk's Yeni Uygur Türkçesi Grameri.

Future Plans

For more satisfactory translation and analysis, more disambiguation and lexsel rules must be added. Morphological analysis can be extended to cover the wide range of nonstandard forms of modern Uyghur. With some work on coverage and transfer rules, Tur->Uig translation can be made possible.