Difference between revisions of "Kazakh and Tatar"

From Apertium

Revision as of 07:46, 18 August 2012

This is a language pair translating between Kazakh and Tatar.

General TODO

See /Work_plan.

  1. Declension of Tatar nouns ending in -и.
  2. Set up the bidix-with-context.sh script (see apertium-kaz-tat/dev/bidix; it seems very useful, but requires another script from spectie).
  3. Add some of the short Wikipedia-article-like texts I have for evaluation into texts (should be ~200 words).
  4. Implement a cont. class for compound/multiword nouns which already have a possessive ending (<px3sp>), e.g. Қытай Халық Республикасы.
    1. This continuation class should link only to CASE (but note that some of these nouns can take a plural form, e.g. ишегаллары).
  5. Add "ярты", "ярым" and "чирек" as numerals, but don't link them to the common numerals cont. class.
  6. (Lexical selection rule): сондай-ақ > шулай-ук
  7. Fix roman numerals:
    1. add them to tat.lexc too;
    2. change LEXICON NUM-ROMAN to something like this: %<num%>%<ord%>: # ; .
  8. Add transfer rule(s) to handle the instrumental case of all parts of speech which are subject to substantivation, not only of nouns (this is one of the things that make testvoc results look bad).
  9. A separate cont.class for verbs which have causative forms ending with -дыр/-дер
  10. A "location-cases" cont. class for some of the postpositions and location adverbs (e.g. "бире")
  11. 'Natinfl cont. class in tat.lexc
  12. Fix the "дыр<mod_ind>" issue (it doesn't pass through the bidix right now)
  13. Pronouns
    1. check cont. classes (note: if something looks like overgeneration and I am not sure about it, overgenerate in both lexc's)
    2. translate pronouns from kaz.lexc, add them to bidix and add equivalents into tat.lexc
    3. ^нигез/ни<prn><itg><px2pl><nom>
  14. Determiners
    1. "unify" cont. classes and tags
    2. add stems
  15. Adjectives
    1. personal clitics after adjectives are not implemented yet

Twol related stuff

  1. Current: ^миллион<num><subst><dat>$ --> миллионге Should be: ^миллион<num><subst><dat>$ --> миллионға
  2. Current: ^сөйле<v><tv><coop><ger_past><loc>$ --> сөйлесгенде Should be: ^сөйле<v><tv><coop><ger_past><loc>$ --> сөйлескенде
  3. Deletion of soft sign "ь" before vowels in Tatar (see comments at the end of the apertium-tat/apertium-tat.tat.twol file)
  4. Kazakh: ^ойна<v><tv><ifi><p1><pl>$ --> ойнадык Should be: ойнадыҚ
  5. (tat) *аенда
  6. *жатқандығын
  7. безнекенеме (accusative case before clitics); безнекенгәме
  8. *журналистерді - *журналистеріне - *журналистерді
    • something like т:0 <=> :с/:0 _ %{L%}:/:0
  9. *Назарбаевтың
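
The "Current / Should be" entries above could be turned into small regression cases. The following is only a sketch: the function name `parse_entry` and the regular expression are illustrative assumptions, not part of any Apertium tool, and only entries written exactly in the "Current: ... --> ... Should be: ... --> ..." shape are recognised.

```python
import re

# Hypothetical pattern for entries of the form
#   Current: ^analysis$ --> wrong_form Should be: ^analysis$ --> right_form
# The same analysis string must appear on both sides.
ENTRY = re.compile(
    r"Current:\s*(?P<analysis>\^\S+\$)\s*-->\s*(?P<current>\S+)\s+"
    r"Should be:\s*(?P=analysis)\s*-->\s*(?P<expected>\S+)"
)


def parse_entry(line):
    """Return (analysis, current_form, expected_form), or None if the
    line is not written in the 'Current / Should be' shape."""
    match = ENTRY.search(line)
    if match is None:
        return None
    return match.group("analysis"), match.group("current"), match.group("expected")


if __name__ == "__main__":
    note = ("Current: ^миллион<num><subst><dat>$ --> миллионге "
            "Should be: ^миллион<num><subst><dat>$ --> миллионға")
    print(parse_entry(note))
```

The expected form from each parsed entry could then be compared with whatever the generator currently produces; entries in a different shape (such as the bare starred forms) return None and would need to be handled by hand.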

International vocabulary

*операция  *терроризмге *прокуратураның(phon) *массивіндегі  *террорлық *Факті  
*кодекстің *терроризмге  *Полицейлер *журналистерді  «*АНТИТЕРРОРЛЫҚ *ОПЕРАЦИЯ» 
*полицейлер  *антитеррорлық *операция  *режим   *полицейлер   *журналистерді   
*автоматты  *автобустар   *полицейлер   *журналистеріне  *сайттың   

*технологиялар *компьютер  *мобильді *техникаларға *интернет *объектілерін 
*радиациялық *экология  *сантехник *проблемасы *веб-*сайттар *позитивті
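
Lists like the ones above can be collected from translator output with a short script. This is a minimal sketch, assuming that unknown words are marked with a leading "*" as in the lists above and that a word is a run of letters or digits; the name `count_unknown` is made up for illustration.

```python
import re
from collections import Counter

# Assumption: an unknown word is "*" followed by letters/digits.
# In Python 3, \w also matches Cyrillic letters such as қ and ң.
UNKNOWN = re.compile(r"\*(\w+)")


def count_unknown(text):
    """Return a Counter of lowercased unknown words found in text."""
    return Counter(word.lower() for word in UNKNOWN.findall(text))


if __name__ == "__main__":
    sample = ("*полицейлер *антитеррорлық *операция *режим *полицейлер "
              "«*АНТИТЕРРОРЛЫҚ *ОПЕРАЦИЯ»")
    # Print the most frequent unknown words first.
    for word, n in count_unknown(sample).most_common():
        print(n, word)
```

Sorting by frequency gives a rough priority order for which international stems to add first. Inflected forms are counted separately here, since the sketch does no stemming.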

Discuss first

  1. There is only one formal form (<frm>) in Tatar, which can be both singular and plural, but in Kazakh there are two forms. Should I treat Tatar as if it had the same distinction and duplicate the form with a different tag, or should I handle it in transfer?
  2. Consider турындагы: should it still be tagged as a postposition?
  3. How to handle verbs with inner inflection? (Sometimes a Kazakh verb is translated with a multiword and vice versa, e.g. әуреле > башын әйләндер.)

Part-of-speech related TODO's and DONE's can be found here:

To run tests, use the aq-regtest utility from the Apertium-quality tools. E.g.

aq-regtest -d . kaz-tat http://wiki.apertium.org/wiki/Special:Export/Kazakh_and_Tatar/Postadvebs

Done

But keep an eye on this
  • Numerals
    • kaz <num><subst>(<px3>) in fractions[1] = tat <num><subst>(<px3>)
    • kaz <num><coll><advl> = tat <num><coll>
    • kaz <num><coll><subst> = tat <num><subst>
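
The three correspondences above can be written down as a lookup table. This is an illustration only: in the language pair itself this belongs in the transfer rules, the helper name `map_numeral_tags` is hypothetical, the optional <px3> is simply carried over, and the fraction context is not modelled. The lemma екі is used purely as an example.

```python
import re

# Tag correspondences copied from the list above (kaz -> tat).
KAZ_TO_TAT = {
    ("num", "subst"): ("num", "subst"),
    ("num", "coll", "advl"): ("num", "coll"),
    ("num", "coll", "subst"): ("num", "subst"),
}

TAG = re.compile(r"<([^<>]+)>")


def map_numeral_tags(analysis):
    """Rewrite the leading numeral tags of a 'lemma<tag><tag>...' string.

    Tags after the matched prefix (possessive, case, ...) are kept as-is.
    Analyses that match no entry are returned unchanged.
    """
    lemma_end = analysis.find("<")
    if lemma_end == -1:
        return analysis
    lemma = analysis[:lemma_end]
    tags = tuple(TAG.findall(analysis[lemma_end:]))
    # Try the longest tag prefix first so that <num><coll><subst>
    # is matched as a whole rather than partially.
    for n in range(len(tags), 0, -1):
        replacement = KAZ_TO_TAT.get(tags[:n])
        if replacement is not None:
            new_tags = replacement + tags[n:]
            return lemma + "".join(f"<{t}>" for t in new_tags)
    return analysis


if __name__ == "__main__":
    for form in ("екі<num><coll><advl>", "екі<num><coll><subst><dat>"):
        print(form, "->", map_numeral_tags(form))
```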

Notes

  1. Currently, whether the numeral occurs in a fraction or not is not taken into account.

See also