Difference between revisions of "Apertium-kaz-kir/TODO"
Firespeaker (talk | contribs) (Created page with '== By 13 July == * '''add 800 stems''' *: ''mostly nouns, verbs, adjectives (i.e., simple categories)'' ** '''100''' top stems from wikipedia corpus ** '''100''' top stems from r…') |
Firespeaker (talk | contribs) |
Latest revision as of 21:22, 19 August 2015
By midterm
- total 6500 stems in dix
- azattyq_24455849 WER ≤7.5%
- trimmed coverage ≥72%
- clean testvoc for the following categories:
  - <postadv> <ij>
  - <num> <post>
  - <cnjcoo> <cnjadv> <cnjsub>
  - <adv>
- commit at least once every 24 hours!
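A rough way to track the coverage target is to count how many tokens the analyser recognises. The sketch below is only an illustration: it assumes the corpus has already been run through the analyser and saved in Apertium stream format, where an unknown token appears as `^word/*word$`. The function name, the sample tags and the sample words are made up for the example, escaped characters in the stream are not handled, and punctuation is counted like any other token.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed: str) -> float:
    """Return the fraction of lexical units that received an analysis.

    A unit whose analysis part starts with '*' is treated as unknown.
    """
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        raise ValueError("no lexical units found in the input")
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Illustrative input: three known units and one unknown unit.
sample = "^бұл/бұл<det><dem>$ ^кітап/кітап<n><nom>$ ^фубар/*фубар$^./.<sent>$"
coverage = naive_coverage(sample)
print(f"coverage: {coverage:.1%}")
```

Comparing the printed figure against the 72% target only makes sense if the analyser used to produce the stream is the trimmed one.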
By 22 July
- Add another 1000 words
- Finish WER process for texts/azattyq_24455849.txt
- Work with JNW on testvoc for closed categories.
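For reference, word error rate is the word-level edit distance between the machine output and the post-edited text, divided by the number of words in the post-edited text. The following is a minimal sketch of that calculation, not the tool actually used for this pair; the function name is hypothetical and tokenisation is plain whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
example = word_error_rate("a b c d", "a x c d")
print(f"WER: {example:.1%}")

# Usage with the files named in this list (paths taken from the TODO):
# with open("texts/azattyq_24455849.kaz-kir-postedited.txt", encoding="utf-8") as f:
#     reference = f.read()
# with open("texts/azattyq_24455849.kaz-kir.txt", encoding="utf-8") as f:
#     hypothesis = f.read()
# print(word_error_rate(reference, hypothesis))
```

The post-edited file is taken as the reference here, so the result reads as the share of words that had to be changed by hand.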
By 14 July
- add 800 stems
  (mostly nouns, verbs, adjectives, i.e., simple categories)
  - 100 top stems from Wikipedia corpus
  - 100 top stems from rferl/azattyq corpus
  - 100 top stems from Bible corpus
  - 100 top stems from Quran corpus
  - any 400 words marked i="yes" in dix
    - sort these into their appropriate sections
    - fix the Kyrgyz translation when needed (many will need to be fixed)
    - remove the i="yes" part
- Start work on WER for texts/azattyq_24455849.txt
  - Run kaz-kir and write the output to texts/azattyq_24455849.kaz-kir.txt
  - Add words etc. to the transducer as needed until the output contains no */#/@ marks
  - Copy the output to texts/azattyq_24455849.kaz-kir-postedited.txt
  - Post-edit the copy until the Kyrgyz is clean
  - Add lexical selection rules and transfer rules as needed
  - Goal: get WER down to around 10%
- Fix the following minor problems:
  - words should not be entered with different capitalisation:
    - құран=куран / Құран=Куран (remove one of them)
    - пайғамбар=пайгамбар / Пайғамбар=Пайгамбар (remove one of them)
  - "шәксіз" is not a Kyrgyz word
  - there's an issue with -ақ; I think we'll need to work on it together
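The capitalisation duplicates and the remaining i="yes" entries can be found mechanically. The sketch below is an assumption-laden illustration: it supposes the dix is a bilingual dictionary with entries of the form `<e><p><l>…</l><r>…</r></p></e>`, reads only the text before the first tag in `<l>` (so multiword entries are truncated), and uses a small inline sample in place of the real file. The function names and the sample entries are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny stand-in for the real dictionary; the layout is an assumption.
SAMPLE_DIX = """<dictionary><section id="main" type="standard">
<e><p><l>құран<s n="n"/></l><r>куран<s n="n"/></r></p></e>
<e><p><l>Құран<s n="n"/></l><r>Куран<s n="n"/></r></p></e>
<e i="yes"><p><l>кітап<s n="n"/></l><r>китеп<s n="n"/></r></p></e>
</section></dictionary>"""


def case_duplicates(dix_xml: str) -> dict:
    """Group left-side lemmas that differ only in capitalisation."""
    root = ET.fromstring(dix_xml)
    groups = defaultdict(set)
    for entry in root.iter("e"):
        left = entry.find("p/l")
        if left is None:
            continue
        lemma = (left.text or "").strip()
        if lemma:
            groups[lemma.casefold()].add(lemma)
    # Keep only the groups with more than one spelling.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


def count_ignored(dix_xml: str) -> int:
    """Count entries still carrying i="yes"."""
    root = ET.fromstring(dix_xml)
    return sum(1 for entry in root.iter("e") if entry.get("i") == "yes")


duplicates = case_duplicates(SAMPLE_DIX)
ignored = count_ignored(SAMPLE_DIX)
print(duplicates, ignored)
```

Which spelling of each pair to keep is still a manual decision; the script only lists the candidates.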