Turkish and Azerbaijani/Final report


Description

GSoC has been very stimulating and challenging. Our plan at the beginning was to settle somewhere around 85% coverage (and we did), with a translation accuracy between 75% and 85%. I'd say I'm quite satisfied with the result: the translation is easy to post-edit, so the software is definitely usable. Could it be better? Sure it could, but we will get to that in the "Future work" section.

The software consists of the Apertium framework (which we didn't touch at all), a morphological analyser for Turkish, a generator for Azerbaijani, a bidix (bilingual dictionary) and a set of CG (constraint grammar) rules.

What I am happy with

1. Bidix: the bidix is quite large. 8,000 words is actually a lot for agglutinative languages. We set our goal at 10,000 words, which was too much for a language like Turkish. So yes, I'm satisfied with it.

2. The team: we managed to put together a team of knowledgeable people who could fix most of the issues. A point in the team's favour: nobody knows Turkish well, but we are all able to reason about it and find better solutions and categorisations than the OHOHIHAVEAPHDINTURKISHLINGUISTICS people. (Yes, I'm talking about the people who wrote the infamous Routledge Comprehensive Grammar of Turkish.) Another point that should probably be investigated is how beneficial variety within the group is to the project. Our group was quite mixed (a developer, a linguist and a nothing, i.e. me), and almost daily we had bits of the eternal struggle between sein and sollen, with the linguist playing the part of the guy who always wanted to do things "the right way" and the software developer playing the "whatever works sooner" part. I was in the middle of the struggle, and my initial position, which was more on the linguistics side, shifted towards the necessity of doing something that works. When somebody changes somebody else's point of view, I think we can safely say that the discussion has been productive.

3. CG: Yes, the CG is not that good. The point is that the lexicon in TRmorph is horrible, which makes disambiguation hard, if not impossible. Our CG rule set is pretty basic and messy; nonetheless it serves its purpose quite well. There are known issues with disambiguating Turkish using a CG, and we are aware of them. We plan to add a more sophisticated CG in the coming months. Having ~80% of words translated well, with most of the remaining 20% of mistakes caused by wrong morphology in Azmorph, makes me think that the CG works quite well. In some cases a wrong disambiguation is not that detrimental to the translation, since we usually get many free rides: Turkish and Azeri are really closely related languages.

4. Apertium and HFST: We didn't have to touch the code at all (I honestly wouldn't be able to), but the software proved its usefulness anyway. I have no idea whether the tools it provides would be enough to translate between two less closely related languages. With Turkish-Azeri it worked pretty well. I think we can say that Apertium can be used in future to develop more pairs with Turkish as the source language.

5. Turkish language: Working on Turkish-Azerbaijani definitely made us face the complexity and uniqueness of the Turkish language. Dealing with Turkic languages while all being native speakers of IE languages is sometimes frustrating and generally hard. I'm pretty sure that even people who had no clue about Turkish before the project (like my mentor spectie) are really happy to have had the chance to approach it in a scientific, hacker-like manner.

What I am not happy with

1. Lexicons: the lexicon of TRmorph is really bad. The lexicon of Azmorph, which is derived from the TRmorph one, is probably worse. I'd say that on average ~25% of words are misplaced (i.e. assigned to the wrong grammatical category). Working in such conditions is pretty disturbing.

2. TRmorph and Azmorph: I think it is safe to say that dealing with SFST has been pretty frustrating for everybody. Adding a new feature to both of them required, every time, the intervention of the developer of TRmorph, Çağrı. The phonology is really complex, and not because Turkish morphology is complex in itself; I'm rather talking about the way it has been implemented. Since what I just said could seem like an attack on Çağrı's work, I'd like to take advantage of this space to thank him personally. Developing a morphological analyser for Turkish is anything but trivial. Also, in fairness, it is worth reminding everybody that TRmorph was originally developed as a stand-alone morphological analyser, with no particular focus on generation or MT. Thanks, Çağrı, for letting us (and actually every person willing to agree to a GPL licence) use your software.

3. Testing platform: For some reason I couldn't run the tests properly, so I had to abuse my mentor's time and patience, which ended up making him as frustrated as I am when I eat pasta abroad. At some point we were supposed to use part of a piece of software another student was developing, but we had no luck.

4. Cooperation between students: I know I'm not the most sociable person in the universe. That said, I think cooperation could be improved. Sadly, I think it will probably require a whip. Sometimes the constant trolling (especially because I'm not part of the "we know how to code in 25 different languages" club) is challenging and makes you want to quit. As I said before, though, having a heterogeneous group helped a lot.

5. JNW failing to understand what "jussive" means.

Statistics

Dictionaries
  • Azmorph lexicon: 7,375
  • Trmorph lexicon: 7,970
  • tr-az bidix: 8,208
Coverage
  • Turkish Wikipedia: 70.9 +/- 4.6352598044670
  • Turkish SETimes: 84.8 +/- 0.568940593619
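
As a rough illustration of how coverage figures like these can be obtained, here is a minimal sketch that counts how many tokens in morphologically analysed text received at least one analysis. It assumes the usual Apertium stream format, in which each token appears as `^surface/analysis$` and unknown words carry a leading `*` in the analysis field. The sample sentence, its tags, and the function names `coverage` and `summarise` are made up for the example, and reading the "+/-" value as a standard deviation over several samples is only a guess; this is not necessarily the script used for the numbers above.

```python
import re
from statistics import mean, stdev

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (escaped characters inside a unit are not handled in this sketch).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the percentage of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # Unknown words are marked with a leading '*' in the analysis field.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


def summarise(samples: list[str]) -> tuple[float, float]:
    """Mean and sample standard deviation of coverage over at least two samples."""
    scores = [coverage(sample) for sample in samples]
    return mean(scores), stdev(scores)


# Hypothetical analyser output: three known tokens and one unknown token.
sample = "^ev/ev<n><nom>$ ^güzel/güzel<adj>$ ^xyzzy/*xyzzy$ ^./.<sent>$"
print(coverage(sample))  # 75.0
```

Counting tokens rather than distinct word forms means frequent words weigh more, which is normally what one wants when estimating how much of a running text can be translated.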
Rules
  • Constraint grammar (disambiguation) rules: 114
  • Transfer rules: 4
Error rate
File                                Num. Words  % OOV  WER (Sur)  PER (Sur)  WER (Lem)  PER (Lem)
setimes.kosova_plate.tr.txt         243         13.83  21.74      20.95      -          -
setimes.kosova.tr.txt               424         -      -          -          -          -
setimes.bulgar.tr.txt               395         -      -          -          -          -
wikipedia.kadinlar_askerler.tr.txt  1165        10.79  22.89      22.31      -          -
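
For readers unfamiliar with the two error measures in the table, the following is a minimal sketch of one common way to compute them; it is not necessarily what the evaluation script behind the table does. WER (word error rate) is taken here as the word-level edit distance between the machine translation and a post-edited reference, divided by the reference length. PER (position-independent error rate) ignores word order and only compares the two texts as bags of words. The function names and the placeholder tokens are assumptions made for illustration.

```python
from collections import Counter


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, relative to the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_edit_distance(ref, hyp) / len(ref)


def per(reference: str, hypothesis: str) -> float:
    """Position-independent error rate in percent (word order is ignored)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection: words present in both texts, counted with repeats.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


# Two swapped words count as two errors for WER and none for PER.
print(wer("a b c d", "b a c d"))  # 50.0
print(per("a b c d", "b a c d"))  # 0.0
```

With this definition PER can never exceed WER, which matches the table, where PER is slightly lower than WER for both evaluated files.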

Future work

For the future we are planning to rewrite TRmorph using another formalism (probably lexc + twol). I'm already working on the lexicon, which is available in the Apertium SVN repository.

See also