Turkish and Azerbaijani/Final report

From Apertium


Description

Statistics

Dictionaries
  • azmorph lexicon: 7375
  • trmorph lexicon: 7970
  • tr-az bidix: 8208
Coverage
  • Turkish Wikipedia ( , std. dev.: )
  • Turkish SETimes (84.85%, std. dev.: 0.57)
  • Turkish ... ( , std. dev.: )
Rules
Error rate
File                                Num. Words  % OOV  WER (Sur)  PER (Sur)  WER (Lem)  PER (Lem)
setimes.kosova_plate.tr.txt         243         13.83  21.74      20.95      -          -
setimes.kosova.tr.txt               424         -      -          -          -          -
setimes.bulgar.tr.txt               395         -      -          -          -          -
wikipedia.kadinlar_askerler.tr.txt  1165        -      -          -          -          -
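
For readers unfamiliar with the WER and PER columns, here is a minimal Python sketch of how such figures are commonly computed. It is not the evaluation script used for this report. It assumes whitespace tokenisation, takes WER as the word-level edit distance divided by the reference length, and takes PER as the same ratio computed on bags of words (word order ignored). The function names are made up for illustration.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent (assumed definition, see lead-in)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate in percent (assumed definition)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # Words shared by both sides, regardless of where they occur.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)
```

Under these definitions PER can never exceed WER, since reordering errors count for WER but not for PER. That fits the one row of the table that has figures.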

How did GSOC go

The GSOC has been very stimulating and challenging. Our plan at the beginning was to settle somewhere around 85% coverage (and we did), with a translation accuracy between 75% and 85%. I'd say I'm quite satisfied with the result: the translation is easy to post-edit, so the software is definitely usable. Could it be better? Sure it could, but we are going to talk about that in the "Future work" section.
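
To make the coverage figure concrete, here is a rough Python sketch of a naive coverage computation. It is not the script we actually used. It assumes the analyser output is in Apertium stream format, where each token is wrapped as `^surface/analysis$` and an unknown token appears as `^surface/*surface$`. The function names are hypothetical, and escaped characters in the stream are not handled.

```python
import re
import statistics

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def coverage(analysed):
    """Percentage of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    # Assumption: an unknown token carries an analysis starting with '*'.
    known = sum(1 for unit in units if "/*" not in unit)
    return 100.0 * known / len(units)


def coverage_stats(chunks):
    """Mean and (population) standard deviation of coverage over chunks."""
    values = [coverage(chunk) for chunk in chunks]
    return statistics.mean(values), statistics.pstdev(values)
```

Splitting a corpus into chunks and calling `coverage_stats` on them gives a "mean +/- std. dev." pair of the same shape as the one in the Statistics section. Whether that figure used a population or a sample standard deviation is not something this sketch knows.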


The software

The software consists of the Apertium framework (which we didn't touch at all), a morphological analyzer for Turkish, a generator for Azerbaijani, a bidix and a set of CG rules.

What I am happy with

1. Bidix: the bidix is quite large. 8,000 words is actually a lot for agglutinative languages. We set our goal at 10,000 words, which was too much for a language like Turkish. So yes, I'm satisfied with it.

2. The team: we managed to put together a team of knowledgeable people who could fix most of the issues. One good thing about the team: nobody knows Turkish well, but we are all able to reason about it and come up with better solutions and categorizations than the OHOHIHAVEAPHDINTURKISHLINGUISTICS people. (Yes, I'm talking about the people who wrote the infamous Routledge Comprehensive Grammar of Turkish.) Another point that should probably be investigated is how beneficial variety in the group is to the project. Our group was quite mixed (an engineer, a linguist and a nothing, i.e. me), and almost daily we had bits of the eternal struggle between sein and sollen, with the linguist playing the part of the guy who always wanted to do things "the right way" and the software developer playing the "the way it will work sooner" part. I was in the middle of the struggle, and my initial position, which was more on the linguistics side, shifted towards the necessity of doing something that works. When somebody changes their point of view, I think we can safely say that the discussion has been productive.

3. CG: Yes, the CG is not that good. The point is that the lexicon in TRmorph is horrible, which makes disambiguation hard, if not impossible. Our CG rule set is pretty basic and messy; nonetheless it serves the purpose quite well. There are known issues with disambiguating Turkish using a CG, and we are aware of them. We plan to add a more sophisticated CG in the coming months. Having ~80% of words translated well, with most of the remaining 20% of mistakes coming from wrong morphology in azmorph, makes me think that the CG works quite well. In some cases a wrong disambiguation is not that detrimental to the translation: we usually get many free rides, since Turkish and Azeri are very closely related languages.

4. Apertium and HFST: We didn't have to touch the code at all (I honestly wouldn't be able to), but the software proved its usefulness anyway. I have no idea whether the tools it provides would be enough to translate between two less closely related languages. With Turkish-Azeri it worked pretty well. I think we can say that Apertium can be used in the future to develop more pairs with Turkish as the source language.

5. Turkish language: I definitely think that working on Turkish-Azerbaijani made us face the complexity and uniqueness of the Turkish language. Dealing with Turkic languages when all of us are native speakers of IE languages is sometimes frustrating and generally hard. I'm pretty sure that even people who had no clue about Turkish before the project (like my mentor spectie) are really happy to have had the chance to approach it in a scientific and hacker-like manner.

What I am not happy with

1. Lexicons: the lexicon of TRmorph is really bad. The lexicon of Azmorph, which is derived from the TRmorph one, is probably worse.

2. Trmorph and Azmorph: I think it is safe to say that dealing with SFST has been pretty frustrating for everybody.

3. Testing platform: For some reason I couldn't run the tests properly, so I had to abuse my mentor's time and patience, which ended up making him as frustrated as I am when I eat pasta abroad.

4. Cooperation between students: I know I'm not the most sociable person in the universe. That said, I think cooperation could be improved. Sadly, I think it will probably require a whip.

5. JNW failing at understanding what "jussive" means.


Future work

For the future we are planning to rewrite TRmorph using another formalism (probably lexc + twol). I'm already working on the lexicon, which is available in the Apertium SVN repository.