Difference between revisions of "Turkish and Azerbaijani/Final report"
Revision as of 15:50, 24 August 2011
Description
Statistics
- Dictionaries
- azmorph lexicon: 7375
- trmorph lexicon: 7970
- tr-az bidix: 8208
- Coverage
- Turkish Wikipedia ( , std. dev.: )
- Turkish SETimes ( , std. dev.: ) 84.84591576 +/- 0.568940593619
- Turkish ... ( , std. dev.: )
- Rules
- Error rate
| File | Num. Words | % OOV | WER (Sur) | PER (Sur) | WER (Lem) | PER (Lem) | 
|---|---|---|---|---|---|---|
| setimes.kosova_plate.tr.txt | 243 | 13.83 | 21.74 | 20.95 | - | - | 
| setimes.kosova.tr.txt | 424 | - | - | - | - | - | 
| setimes.bulgar.tr.txt | 395 | - | - | - | - | - | 
| wikipedia.kadinlar_askerler.tr.txt | 1165 | - | - | - | - | - | 
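For readers unfamiliar with the error-rate columns, here is a minimal sketch of how WER (word error rate) and PER (position-independent error rate) are commonly computed between a machine translation and its post-edited reference. This is an illustration under assumptions, not the script that produced the table: the function names are made up, the PER formula is one common definition, and "Sur" and "Lem" are taken here to mean that the comparison runs over surface forms or over lemmas respectively.

```python
from collections import Counter


def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference and whitespace tokenization.
    """
    hyp, ref = hypothesis.split(), reference.split()
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing in hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def position_independent_error_rate(hypothesis: str, reference: str) -> float:
    """Like WER, but word order is ignored (bag-of-words comparison).

    One common definition (an assumption here):
    (max(len(hyp), len(ref)) - matching words) / len(ref).
    """
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)


# Placeholder tokens, not real data: two words swapped.
print(word_error_rate("a b c d", "a c b d"))                  # 0.5
print(position_independent_error_rate("a b c d", "a c b d"))  # 0.0
```

Because PER ignores word order, it can never exceed WER when the two texts have the same length; a large gap between the two points to reordering problems rather than wrong word choices.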
How did GSoC go
GSoC has been very stimulating and challenging. Our plan at the beginning was to settle at around 85% coverage (and we did), with a translation accuracy between 75% and 85%. I'd say I'm quite satisfied with the result: the translation is easy to post-edit, so the software is definitely usable. Could it be better? Sure it could, but we'll talk about that in the "Future work" section.
The software
The software consists of the Apertium framework (which we didn't touch at all), a morphological analyser for Turkish, a generator for Azerbaijani, a bidix (bilingual dictionary) and a set of CG (Constraint Grammar) rules.
What I am happy with
1. Bidix: the bidix is quite vast. 8000 words is actually a lot for agglutinative languages. We set our goal at 10,000 words, which was too much for a language like Turkish. So yes, I'm satisfied with it.
2. The team: we managed to get a team of knowledgeable people together to fix most of the issues. The upside of the team: nobody knows Turkish well, but we are all able to reason about it and find better solutions and categorizations than the people with a PhD in Turkish linguistics. (Yes, I'm talking about the people who wrote the infamous Routledge Comprehensive Grammar of Turkish.)
3. CG: Yes, the CG is not that good. The point is that the lexicon in TRmorph is horrible, which makes disambiguation hard, if not impossible. Our CG rule set is pretty basic and messy; nonetheless, it serves the purpose quite well.
What I am not happy with
1. Lexicons: the lexicon of TRmorph is really bad. The lexicon of Azmorph, which is derived from the TRmorph one, is probably worse.
2. TRmorph and Azmorph: I think it is safe to say that dealing with SFST has been pretty frustrating for everybody.
3. Testing platform: for some reason I couldn't run the tests properly, so I had to abuse my mentor's time and patience, which ended up making him as frustrated as I am when I eat pasta abroad.
4. Cooperation between students: I know I'm not the most sociable person in the universe. That said, I think cooperation could be improved. Sadly, I think it will probably require a whip.
5. JNW failing to understand what "jussive" means.
Future work
For the future, we are planning to rewrite TRmorph using another formalism (probably lexc + twol). I'm already working on the lexicon, which is available in the Apertium SVN repository.

