Difference between revisions of "Turkish and Azerbaijani/Final report"


Revision as of 15:50, 24 August 2011

placeholder

Description

Statistics

Dictionaries
  • azmorph lexicon: 7375
  • trmorph lexicon: 7970
  • tr-az bidix: 8208
Coverage
  • Turkish Wikipedia ( , std. dev.: )
  • Turkish SETimes: 84.84591576 +/- 0.568940593619
  • Turkish ... ( , std. dev.: )
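
A minimal sketch of how a coverage figure with a spread like the one above could be computed. It assumes the analyser output is in Apertium stream format (`^surface/analysis$`, with unknown words marked by a leading `*`) and that the spread is a population standard deviation over several corpus samples; the report does not say how its figures were actually obtained, and the function names `coverage` and `mean_and_stddev` are hypothetical.

```python
import re
import statistics

# Assumption: each analysed token looks like ^surface/analysis1/analysis2$
# and an unknown word is marked with a leading asterisk, e.g. ^foo/*foo$.
TOKEN = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the percentage of tokens that received at least one analysis."""
    tokens = TOKEN.findall(analysed_text)
    if not tokens:
        return 0.0
    known = sum(1 for _, analyses in tokens if not analyses.startswith("*"))
    return 100.0 * known / len(tokens)


def mean_and_stddev(samples):
    """Return (mean, population standard deviation) of per-sample coverage."""
    return statistics.mean(samples), statistics.pstdev(samples)
```

For example, `coverage("^ev/ev<n><nom>$ ^foo/*foo$")` counts one known token out of two.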
Rules
Error rate
File                                 Num. Words   % OOV   WER (Sur)   PER (Sur)   WER (Lem)   PER (Lem)
setimes.kosova_plate.tr.txt          243          13.83   21.74       20.95       -           -
setimes.kosova.tr.txt                424          -       -           -           -           -
setimes.bulgar.tr.txt                395          -       -           -           -           -
wikipedia.kadinlar_askerler.tr.txt   1165         -       -           -           -           -
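
For readers unfamiliar with the two metrics in the table, here is a small sketch of word error rate (WER) and position-independent error rate (PER) under their usual definitions. It is not the evaluation script used for this report; the function names and the whitespace tokenisation are assumptions made for illustration.

```python
from collections import Counter


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: like WER, but ignoring word order."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Words shared by both sentences, counted as a multiset intersection.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return 100.0 * errors / len(ref)
```

Because PER ignores word order, a translation that has the right words in the wrong order scores better on PER than on WER, which is why the two columns can differ for the same file.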

How did GSOC go

GSoC has been very stimulating and challenging. Our plan at the beginning was to settle somewhere around 85% coverage (and we did), with a translation accuracy between 75% and 85%. I'd say I'm quite satisfied with the result: the translation is easy to post-edit, so the software is definitely usable. Could it be better? Sure it could, but we will talk about that in the "Future work" section.


The software

The software consists of the Apertium framework (which we didn't touch at all), a morphological analyzer for Turkish, a generator for Azerbaijani, a bidix (bilingual dictionary) and a set of CG (Constraint Grammar) rules.

What I am happy with

1. Bidix: the bidix is quite large. 8,000 words are actually a lot for agglutinative languages. We set our goal at 10,000 words, which was too much for a language like Turkish. So, yes, I'm satisfied with it.

2. The team: we managed to put together a team of knowledgeable people who could fix most of the issues. The main advantage of the team: nobody knows Turkish well, but we are all able to reason about it and find better solutions and categorizations than the people with PhDs in Turkish linguistics. (Yes, I'm talking about the people who wrote the infamous Routledge Comprehensive Grammar of Turkish.)

3. CG: admittedly, the CG is not that good. The lexicon contained in TRmorph is horrible, which makes disambiguation hard, if not impossible. Our CG rule set is pretty basic and messy; nonetheless, it serves the purpose quite well.

What I am not happy with

1. Lexicons: the TRmorph lexicon is really bad. The Azmorph lexicon, which is derived from the TRmorph one, is probably worse.

2. TRmorph and Azmorph: I think it is safe to say that dealing with SFST has been pretty frustrating for everybody.

3. Testing platform: for some reason I couldn't run the tests properly, so I had to take up a lot of my mentor's time and patience, which ended up making him as frustrated as I am when I eat pasta abroad.

4. Cooperation between students: I know I'm not the most sociable person in the universe. That said, I think cooperation could be improved. Sadly, I think it will probably require a whip.

5. JNW failing to understand what "jussive" means.


Future work

For the future, we are planning to rewrite TRmorph using another formalism (probably lexc + twol). I'm already working on the lexicon, which is available in the Apertium SVN repository.