Difference between revisions of "Apertium-kaz-kir/TODO"

(Created page with '== By 13 July == * '''add 800 stems''' *: ''mostly nouns, verbs, adjectives (i.e., simple categories)'' ** '''100''' top stems from wikipedia corpus ** '''100''' top stems from r…')
 
 
(10 intermediate revisions by the same user not shown)
Line 1: Line 1:
== By 13 July ==
+
== By midterm ==
  +
* total 6500 stems in dix
  +
* azattyq_24455849 WER ≤7.5%
  +
* trimmed coverage ≥72%
  +
* clean testvoc for the following categories:
  +
** {{tag|postadv}} {{tag|ij}}
  +
** {{tag|num}} {{tag|post}}
  +
** {{tag|cnjcoo}} {{tag|cnjadv}} {{tag|cnjsub}}
  +
** {{tag|adv}}
  +
* commit at least once every 24 hours!
  +
  +
== By 22 July ==
  +
* Add another '''1000''' words
  +
* Finish WER process for <tt>texts/azattyq_24455849.txt</tt>
  +
* Work with JNW on testvoc for closed categories.
  +
  +
== By 14 July ==
 
* '''add 800 stems'''
 
 
*: ''mostly nouns, verbs, adjectives (i.e., simple categories)''
 
Line 6: Line 22:
 
** '''100''' top stems from bible corpus
 
 
** '''100''' top stems from quran corpus
 
** any '''400''' words marked i="yes" in dix
+
** any '''400''' words marked <tt>i="yes"</tt> in dix
 
*** sort these into their appropriate sections
 
*** fix the Kyrgyz translation if needed
+
*** fix the Kyrgyz translation when needed (many will need to be fixed)
  +
*** remove <tt>i="yes"</tt> part
   
* Get WER on <tt>texts/azattyq_24455849.txt</tt> down to around 10%
+
* Start work on WER for <tt>texts/azattyq_24455849.txt</tt>
  +
** Use kaz-kir and output to <tt>texts/azattyq_24455849.kaz-kir.txt</tt>
  +
** Add words etc. to the transducer as needed until no */#/@ marks remain in the output
  +
** Copy to <tt>texts/azattyq_24455849.kaz-kir-postedited.txt</tt>
  +
** Post-edit until the postedited Kyrgyz is clean
  +
** Add lexical selection rules and transfer rules as needed
  +
** Goal: get WER down to around 10%
   
 
* Fix the following minor problems:
 
Line 18: Line 41:
 
** "шәксіз" is not a Kyrgyz word
 
 
** there's an issue with -ақ; I think we'll need to work on it together
 
  +
  +
[[Category:TODO lists]]

Latest revision as of 21:22, 19 August 2015

By midterm

  • total 6500 stems in dix
  • azattyq_24455849 WER ≤7.5%
  • trimmed coverage ≥72%
  • clean testvoc for the following categories:
    • <postadv> <ij>
    • <num> <post>
    • <cnjcoo> <cnjadv> <cnjsub>
    • <adv>
  • commit at least once every 24 hours!
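
A rough way to check the coverage target is sketched below. It assumes that coverage is the share of whitespace-separated tokens the analyser recognises, and that unrecognised tokens carry the usual `*` prefix in the output. The function name `coverage` is made up for this sketch, and the project's own coverage scripts may tokenise differently.

```python
def coverage(analysed_text):
    """Return the fraction of tokens that are not marked unknown with '*'.

    Assumption: tokens are separated by whitespace and unknown words
    are prefixed with '*', as in Apertium debug output.
    """
    tokens = analysed_text.split()
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token.startswith("*"))
    return 1.0 - unknown / len(tokens)


def meets_target(analysed_text, target=0.72):
    """Compare the measured coverage against the midterm target (72%)."""
    return coverage(analysed_text) >= target
```

For example, a four-token text with one `*`-marked token gives a coverage of 0.75, which would meet the 72% target.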

By 22 July

  • Add another 1000 words
  • Finish WER process for texts/azattyq_24455849.txt
  • Work with JNW on testvoc for closed categories.

By 14 July

  • add 800 stems
    mostly nouns, verbs, adjectives (i.e., simple categories)
    • 100 top stems from wikipedia corpus
    • 100 top stems from rferl/azattyq corpus
    • 100 top stems from bible corpus
    • 100 top stems from quran corpus
    • any 400 words marked i="yes" in dix
      • sort these into their appropriate sections
      • fix the Kyrgyz translation when needed (many will need to be fixed)
      • remove i="yes" part
  • Start work on WER for texts/azattyq_24455849.txt
    • Use kaz-kir and output to texts/azattyq_24455849.kaz-kir.txt
    • Add words etc. to the transducer as needed until no */#/@ marks remain in the output
    • Copy to texts/azattyq_24455849.kaz-kir-postedited.txt
    • Post-edit until the postedited Kyrgyz is clean
    • Add lexical selection rules and transfer rules as needed
    • Goal: get WER down to around 10%
  • Fix the following minor problems:
    • words should not be entered with different capitalisation:
      • құран=куран / Құран=Куран (remove one of them)
      • пайғамбар=пайгамбар / Пайғамбар=Пайгамбар (remove one of them)
    • "шәксіз" is not a Kyrgyz word
    • there's an issue with -ақ; I think we'll need to work on it together
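
The WER steps above can be sketched in a few lines of Python. This is only an illustration, not the project's evaluation tool. It assumes WER is the word-level edit distance between the raw kaz-kir output and the post-edited text, divided by the number of words in the post-edited text. The helpers `count_marks` and `wer` are hypothetical names, and the marks are read in the usual Apertium sense: `*` unknown word, `@` missing bilingual-dictionary entry, `#` generation failure.

```python
import re

# A mark counts only at the start of a whitespace-separated token.
MARK_RE = re.compile(r"(?<!\S)([*@#])\S+")


def count_marks(text):
    """Count tokens prefixed with '*', '@' or '#' in translator output."""
    counts = {"*": 0, "@": 0, "#": 0}
    for match in MARK_RE.finditer(text):
        counts[match.group(1)] += 1
    return counts


def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length.

    reference  -- the post-edited text
    hypothesis -- the raw machine translation output
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference words processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

In use, one would read `texts/azattyq_24455849.kaz-kir.txt` as the hypothesis and `texts/azattyq_24455849.kaz-kir-postedited.txt` as the reference, first checking that `count_marks` reports zero for all three marks and then comparing `wer(...)` against the 10% goal.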