Difference between revisions of "Apertium-kaz-kir/TODO"

From Apertium
Line 2:

  * primary goals:
  ** total 6500 stems in dix
- ** 500-word evaluation, WER ~10% (azattyq_24455849, ideally <8%)
+ ** azattyq_24455849 WER ≤8%
- ** trimmed coverage 72%
+ ** trimmed coverage ≥72%
  ** clean testvoc for the following categories:
  *** {{tag|postadv}} {{tag|ij}}

Revision as of 04:02, 23 July 2013

By midterm

  • primary goals:
    • total 6500 stems in dix
    • azattyq_24455849 WER ≤8%
    • trimmed coverage ≥72%
    • clean testvoc for the following categories:
      • <postadv> <ij>
      • <num> <post>
      • <cnjcoo> <cnjadv> <cnjsub>
      • <adv>
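
Progress toward the stem goal can be checked with a short script. The following is only a rough sketch: it assumes that a "stem" corresponds to one <e> entry inside a <section> of the dix, and that entries marked i="yes" should not count yet. The sample dictionary and the function name count_entries are made up for illustration; the project's own stem-counting method may differ.

```python
import xml.etree.ElementTree as ET

# Tiny made-up dictionary with placeholder lemmas, not real kaz-kir entries.
SAMPLE_DIX = """<dictionary>
  <section id="main" type="standard">
    <e><p><l>a<s n="n"/></l><r>b<s n="n"/></r></p></e>
    <e i="yes"><p><l>c<s n="n"/></l><r>d<s n="n"/></r></p></e>
    <e><p><l>e<s n="v"/></l><r>f<s n="v"/></r></p></e>
  </section>
</dictionary>"""


def count_entries(dix_xml: str) -> tuple[int, int]:
    """Return (active, ignored) counts of <e> entries in all sections.

    Assumption: entries carrying i="yes" are the ones still to be sorted
    and fixed, so they are reported separately.
    """
    root = ET.fromstring(dix_xml)
    entries = root.findall("./section/e")
    ignored = sum(1 for e in entries if e.get("i") == "yes")
    return len(entries) - ignored, ignored


if __name__ == "__main__":
    active, ignored = count_entries(SAMPLE_DIX)
    print(f"active entries: {active}, still marked i=yes: {ignored}")
```

To run it on a real file, read the dix into a string and pass it to count_entries; the ignored count then shows how many i="yes" words are left to process.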

By 22 July

  • Add another 1000 words
  • Finish WER process for texts/azattyq_24455849.txt
  • Work with JNW on testvoc for closed categories.

By 14 July

  • add 800 stems
    mostly nouns, verbs, adjectives (i.e., simple categories)
    • 100 top stems from wikipedia corpus
    • 100 top stems from rferl/azattyq corpus
    • 100 top stems from bible corpus
    • 100 top stems from quran corpus
    • any 400 words marked i="yes" in dix
      • sort these into their appropriate sections
      • fix the Kyrgyz translation when needed (many will need to be fixed)
      • remove the i="yes" attribute
  • Start work on WER for texts/azattyq_24455849.txt
    • Use kaz-kir and output to texts/azattyq_24455849.kaz-kir.txt
    • Add words etc. to the transducer as needed until no */#/@ marks remain in the output
    • Copy to texts/azattyq_24455849.kaz-kir-postedited.txt
    • Post-edit until the postedited Kyrgyz is clean
    • Add lexical selection rules and transfer rules as needed
    • Goal: get WER down to around 10%
  • Fix the following minor problems:
    • words should not be entered with different capitalisation:
      • құран=куран / Құран=Куран (remove one of them)
      • пайғамбар=пайгамбар / Пайғамбар=Пайгамбар (remove one of them)
    • "шәксіз" is not a Kyrgyz word
    • there's an issue with -ақ; I think we'll need to work on it together
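
For the WER steps above, the following sketch shows one way the number could be computed. It assumes WER means the word-level edit distance between the raw kaz-kir output and the post-edited text, divided by the number of words in the post-edited text, with plain whitespace tokenisation. The function name word_error_rate is hypothetical, and the evaluation tool actually used in the project may tokenise or count differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    reference:  the post-edited (clean) text
    hypothesis: the raw translator output
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the ref words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens stand in for real text.
    print(word_error_rate("w1 w2 w3 w4", "w1 x w3 w4"))
```

With the files named above, the reference would be the contents of texts/azattyq_24455849.kaz-kir-postedited.txt and the hypothesis the contents of texts/azattyq_24455849.kaz-kir.txt; a result of 0.10 corresponds to the 10% goal.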