Difference between revisions of "Kazakh and Tatar"

From Apertium

Revision as of 13:06, 21 June 2012

This is a language pair translating between Kazakh and Tatar.

General TODO

See /Work_plan.

  1. Declension of Tatar nouns ending in -и.
  2. Set up the bidix-with-context.sh script (see apertium-kaz-tat/dev/bidix; it seems very useful, but requires another script from spectie).
  3. Add some of the short Wikipedia-article-like texts I have for evaluation into texts (should be ~200 words).
  4. Implement a cont. class for compound/multiword nouns which already have a possessive ending (<px3sp>), e.g. Қытай Халық Республикасы.
    1. This continuation class should link only to CASE (but consider that some of them can have plural form: ишегаллары).
  5. Add "ярты", "ярым" and "чирек" as numerals, but don't link them to common numerals cont. class.
  6. (Lexical selection rule): сондай-ақ > шулай-ук
  7. Fix roman numerals:
    1. add them to tat.lexc too;
    2. change LEXICON NUM-ROMAN to something like this: %<num%>%<ord%>: # ; .
  8. Add transfer rule(s) to handle the instrumental case of all parts of speech that are subject to substantivation, not only nouns (this is one of the things that make testvoc results look bad).
  9. Pronouns
    1. check cont. classes (note: if something looks like overgeneration and I am not sure about it, overgenerate in both lexc's)
    2. translate pronouns from kaz.lexc, add them to bidix and add equivalents into tat.lexc
  10. Determiners
    1. "unify" cont. classes and tags
    2. add stems

Twol-related stuff

  1. Current: ^миллион<num><subst><dat>$ --> миллионге Should be: ^миллион<num><subst><dat>$ --> миллионға
  2. Current: ^сөйле<v><tv><coop><ger_past><loc>$ --> сөйлесгенде Should be: ^сөйле<v><tv><coop><ger_past><loc>$ --> сөйлескенде
  3. Deletion of soft sign "ь" before vowels in Tatar (see comments at the end of the apertium-tat/apertium-tat.tat.twol file)
  4. Kazakh: ^ойна<v><tv><ifi><p1><pl>$ --> ойнадык Should be: ойнадыҚ
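
The items above that give both a current and an expected surface form can be kept as a small regression table. The following is only a sketch in Python: `CASES`, `first_difference` and `report` are hypothetical names that are not part of Apertium, and it assumes that the capital "Қ" in the last item merely highlights the letter that should change.

```python
# Regression table transcribed from the list above: (analysis, current, expected).
# Assumption: the capital "Қ" in the last item only highlights the changed
# letter, so the expected form is written in lowercase here.
CASES = [
    ("^миллион<num><subst><dat>$", "миллионге", "миллионға"),
    ("^сөйле<v><tv><coop><ger_past><loc>$", "сөйлесгенде", "сөйлескенде"),
    ("^ойна<v><tv><ifi><p1><pl>$", "ойнадык", "ойнадық"),
]


def first_difference(current, expected):
    """Return (index, current_char, expected_char) at the first mismatch.

    Returns None if the two forms agree over their shared length.
    """
    for index, (cur, exp) in enumerate(zip(current, expected)):
        if cur != exp:
            return index, cur, exp
    return None


def report(cases):
    """Build one human-readable line per case, naming the first mismatch."""
    lines = []
    for analysis, current, expected in cases:
        diff = first_difference(current, expected)
        lines.append(f"{analysis}: {current} -> {expected} (first mismatch: {diff})")
    return lines


if __name__ == "__main__":
    for line in report(CASES):
        print(line)
```

Pointing at the first mismatching character makes it easier to see which twol rule is likely involved: the first item fails on the suffix-initial consonant and the following vowel, the other two on a single consonant.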

Discuss first

  1. There is only one formal form (<frm>) in Tatar, which can be both singular and plural, whereas Kazakh has two forms. Should I pretend that Tatar also has two and duplicate the same form with a different tag, or should I handle it in transfer?

Part-of-speech related TODO's and DONE's can be found here:

To run tests, use the aq-regtest utility from the Apertium-quality tools, e.g.:

aq-regtest -d . kaz-tat http://wiki.apertium.org/wiki/Special:Export/Kazakh_and_Tatar/Postadvebs

Done

But keep an eye on this
  • Numerals
    • kaz <num><subst>(<px3>) in fractions[1] = tat <num><subst>(<px3>)
    • kaz <num><coll><advl> = tat <num><coll>
    • kaz <num><coll><subst> = tat <num><subst>
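
The numeral correspondences above amount to a rewrite of the leading tag sequence. The sketch below illustrates this in Python; it is not how the language pair implements it (that would be done in the bilingual dictionary or transfer rules). `TAG_MAP` and `kaz_to_tat_tags` are hypothetical names, and passing any trailing tags (such as case) through unchanged is an assumption.

```python
import re

# Kazakh tag prefix -> Tatar tag prefix, transcribed from the list above.
# <num><subst>(<px3>) maps to itself, so it needs no entry.
TAG_MAP = {
    ("num", "coll", "advl"): ("num", "coll"),
    ("num", "coll", "subst"): ("num", "subst"),
}


def kaz_to_tat_tags(tags):
    """Rewrite a Kazakh numeral tag string into its Tatar equivalent.

    Assumption: tags after the matched prefix are kept as they are.
    """
    parsed = tuple(re.findall(r"<([^<>]+)>", tags))
    for source, target in TAG_MAP.items():
        if parsed[:len(source)] == source:
            parsed = target + parsed[len(source):]
            break
    return "".join(f"<{tag}>" for tag in parsed)


if __name__ == "__main__":
    for example in ("<num><coll><advl>", "<num><coll><subst>", "<num><subst><px3>"):
        print(example, "=", kaz_to_tat_tags(example))
```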

Notes

  1. Currently, whether the numeral occurs in a fraction or not is not taken into account.

See also