Kazakh and Tatar/TODO


Goals

In both directions, apertium-kaz-tat has at least 15,000 top stems, 95% coverage on all the corpora we have, and a word error rate (WER) of no more than 15% on any randomly selected text. It has full coverage of Абай жолы. Бірінші кітап and disambiguates it fully with at least 90% precision. There is a unified, documented testing framework covering morphotactics (specs for each type II LEXICON), morphophonology, coverage, CG performance, and regression and pending tests targeted mainly at transfer rules, plus a “gold standard” parallel corpus to measure the WER and to always have something to work on. Tests are fast (slow parts are decoupled). Testvoc is clean.
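
As a rough illustration of how the coverage and WER targets could be checked, here is a minimal Python sketch. It assumes that coverage is the share of lexical units in the analyser's stream-format output that are not marked unknown with `*`, and that WER is the word-level edit distance divided by the number of reference words. The function names `naive_coverage` and `word_error_rate` are made up for this example and are not part of any Apertium tool; escaped `$` characters inside tokens are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/...$
UNIT = re.compile(r"\^([^$]*)\$")


def naive_coverage(analysed: str) -> float:
    """Share of lexical units that received at least one analysis.

    Unknown words are assumed to appear as ^word/*word$.
    """
    units = UNIT.findall(analysed)
    if not units:
        raise ValueError("no lexical units found")
    known = sum(1 for unit in units if "/*" not in unit)
    return known / len(units)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed part of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(naive_coverage("^мен/мен<prn>$ ^qwerty/*qwerty$"))
    print(word_error_rate("a b c d", "a x c"))
```

With a gold-standard parallel corpus, the reference would be the post-edited translation and the hypothesis the raw system output, averaged or pooled over sentences.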

Road map

General TODO

  • s/fut3/vol/
  • 0 itself and numbers containing it aren't analyzed (in both directions)
    • This is only true for the transducers in apertium-kaz-tat; the apertium-kaz and apertium-tat ones work fine.
  • A number with a following . is analyzed incorrectly and therefore not generated:
    • When apertium (not hfst-proc) is used, this is the case for any number at the end of a line, because the deformatter automatically puts a "." at the end of the sentence.
/apertium-kaz$ echo "21." | hfst-proc kaz.automorf.hfst 
^21./21.<num>$
  • Make the instrumental case a clitic postposition, leaving only 6 cases, which are the same in both Tatar and Kazakh (see [[1]] and the log from 12.03.2013 for reference)
    • update the t1x files accordingly (i.e. get rid of the rules for handling instrumental case)
  • Revise continuations of gerunds
  • жігіт% %{М%}ен
  • Declension of Tatar nouns ending with -и.
  • A separate cont.class for verbs which have causative forms ending with -дыр/-дер
    • Isn't this the default for <v><iv> ?
  • A "location-cases" cont. class for some of the postpositions and location adverbs (e.g. "бире")
    • What do you mean? —Firespeaker 16:20, 6 February 2013 (UTC)
  • Better disambiguation
  • көр%<v%>%<tv%>%<imp%>%<p2%>%<sg%>:гөр # ; ! "" Dir/LR gets trimmed
  • ма не - мыни thing
  • handle gna_cond + DA<postadv> issue in lexc, not in CG
  • Handle the sentences from the paper in transfer, not in CG
  • Some nouns in the Tatar (and Kazakh) lexc seem to be in both NLEX and NLEX-RUS. This is fine for analysis, but which form is generated? There should be some ! Dir/.. filtering somewhere in there.
  • Consider турындагы - should it still be tagged as postposition?
  • How to handle verbs with inner inflection? Sometimes a Kazakh verb is translated with a multiword expression and vice versa, e.g. әуреле > башын әйләндер.
  • a better default translation for Kazakh past.evid

Algorithm for checking dictionaries (as part of the testvocing)

  • Go through entries in bidix
    • Get rid of duplicates, FIXME's, alternatively spelled variants (handling them in lexc instead)
  • Look up bidix stems in lexc's and make sure every one of them is in there (and we don't lose any bit of coverage)
  • Try to get rid of FIXME's for stems in lexc's
  • Have a quick look at continuation lexicons and make sure they follow standards (some of the lexicon names in tat.lexc in particular)
  • Run testvoc (either for each different class separately -- commenting out all other root lexicons -- or without modifying anything if it doesn't take forever)
  • If a Tatar noun marked with 'Use/MT' is not used in kaz-tat.dix, get rid of it in tat.lexc
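
The "look up bidix stems in lexc" step above could be sketched roughly as follows. This is a simplified illustration, not an existing Apertium script: it assumes bidix entries of the form `<e><p><l>…</l><r>…</r></p></e>` and lexc entries of the form `upper:lower CONT ; ! comment`, it compares lemmas only (ignoring tags), and it does not handle `<g>` groups or escaped colons. The sample data, the continuation class names `N1` and `V-IV`, and the function names are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the real bidix and kaz.lexc
SAMPLE_BIDIX = """
<dictionary>
  <section id="main" type="standard">
    <e><p><l>алма<s n="n"/></l><r>алма<s n="n"/></r></p></e>
    <e><p><l>кітап<s n="n"/></l><r>китап<s n="n"/></r></p></e>
    <e><p><l>бас<b/>тарт<s n="v"/><s n="iv"/></l><r>баш<b/>тарт<s n="v"/><s n="iv"/></r></p></e>
  </section>
</dictionary>
"""

SAMPLE_LEXC = """
LEXICON Nouns
алма:алма N1 ; ! "apple"
LEXICON Verbs
бас% тарт:бас% тарт V-IV ; ! "refuse"
"""


def side_lemma(side):
    """Lemma of an <l> or <r> element: text up to the first <s> tag."""
    parts = [side.text or ""]
    for child in side:
        if child.tag == "s":
            break
        if child.tag == "b":  # <b/> stands for a space in multiwords
            parts.append(" ")
        parts.append(child.tail or "")
    return "".join(parts).strip()


def bidix_left_lemmas(bidix_xml):
    """Set of left-side lemmas of all <p> pairs in the bidix."""
    root = ET.fromstring(bidix_xml)
    return {side_lemma(p.find("l")) for p in root.iter("p")
            if p.find("l") is not None}


def lexc_lemmas(lexc_text):
    """Set of upper-side stems of all lexc entries with a continuation."""
    lemmas = set()
    for line in lexc_text.splitlines():
        # Drop the comment, which starts at the first unescaped "!"
        line = re.split(r"(?<!%)!", line, maxsplit=1)[0].strip()
        if not line.endswith(";") or line.startswith("LEXICON"):
            continue
        # Protect escaped spaces so multiword stems stay in one field
        fields = line.replace("% ", "\x00").split()
        if len(fields) < 3:  # need: entry, continuation class, ";"
            continue
        upper = fields[0].split(":", 1)[0].replace("\x00", " ")
        lemmas.add(re.sub(r"%(.)", r"\1", upper))  # unescape %x
    return lemmas


def missing_stems(bidix_xml, lexc_text):
    """Bidix left-side lemmas that have no entry in the lexc file."""
    return sorted(bidix_left_lemmas(bidix_xml) - lexc_lemmas(lexc_text))


if __name__ == "__main__":
    print(missing_stems(SAMPLE_BIDIX, SAMPLE_LEXC))
```

For the right-hand side of the bidix, the same check would be run against tat.lexc using the `<r>` elements instead.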

Notes


See also