Difference between revisions of "Kazakh and Tatar/TODO"

From Apertium
(17 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== General TODO ==
== Goals ==


In both directions, apertium-kaz-tat has at least 15,000 top stems, 95% coverage on all the corpora we have, and a Word Error Rate (WER) of no more than 15% on any randomly selected text. It has full coverage of '''Абай жолы. Бірінші кітап''' and disambiguates it fully with at least 90% precision. There is a unified and documented testing framework covering morphotactics (specs for each [[Turkic_lexicon#Morphotactics|type II LEXICON]]), morphophonology, coverage, CG performance, and regression and pending tests targeted mainly at transfer rules, plus a “gold standard” parallel corpus for measuring the WER and for always having something to work on. Tests are fast (slow parts are decoupled). Testvoc is clean.
See [[Kazakh and Tatar/Work_plan]] and [[Kazakh and Tatar/Remaining unanalysed forms]]


== Road map ==
# '''0 itself and numbers containing it aren't analyzed (in both directions)'''
## This is only true for the transducers in apertium-kaz-tat; the apertium-kaz and apertium-tat ones work fine.
# Turn the instrumental case into a clitic postposition, leaving only the 6 cases that are the same in both Tatar and Kazakh
## update the t1x files accordingly (i.e. get rid of the rules for handling instrumental case)
# Declension of Tatar nouns ending in -и.
# A separate cont.class for verbs which have causative forms ending with -дыр/-дер
#* Isn't this the default for {{tag|v}}{{tag|iv}} ?
# A "location-cases" cont. class for some of the postpositions and location adverbs (e.g. "бире")
#* What do you mean? —[[User:Firespeaker|Firespeaker]] 16:20, 6 February 2013 (UTC)
# [[Apertium-kaz-tat/Ideas_for_Disambiguation_Rules|Better disambiguation]]


* coverage (more stems and better morphology)
== Phonology-related stuff ==
** <s>add "internationalisms" spectie has put in /dev</s>
** see also: [[Kazakh and Tatar/Remaining unanalysed forms]]
* constraint grammar
* transfer
* lexical selection


== General TODO ==
Might be twol, might not be, but JNW needs to go through this stuff and figure out the issues.

=== Kazakh ===

{| class="wikitable" border="1"
|-
! Currently generated incorrect form(s)
! Unanalyzed correct form(s)
! Comments
|-
| ^жатқандықын/жат<v><iv><ger_perf><px3sp><acc>/жат<vaux><ger_perf><px3sp><acc>$
| 200 *жатқандығын
| <pre>%<ger_perf%>:%>%{G%}%{A%}н%>%{L%}%{I%}%{K%} GER-INFL ;</pre>
|-
| colspan="2" |
<code>
/apertium-kaz$ echo "^біз<prn><pers><p1><pl><px><nom>+ма<qst>$" | hfst-proc -g kaz.autogen.hfst
<br />біздікін бе

/apertium-kaz$ echo "біздікі бе" | hfst-proc kaz.automorf.hfst
<br />^біздікі/біз<prn><pers><p1><pl><px><nom>/біз<prn><pers><p1><pl><px><nom>+е<cop><p3><pl>/біз<prn><pers><p1><pl><px><nom>+е<cop><p3><sg>$ ^бе/ма<qst>$

/apertium-kaz$ echo "біздікін бе?" | hfst-proc kaz.automorf.hfst
<br />^біздікін бе/біз<prn><pers><p1><pl><px><nom>+е<cop><p3><pl>+ма<qst>/біз<prn><pers><p1><pl><px><nom>+е<cop><p3><sg>+ма<qst>/біз<prn><pers><p1><pl><px><nom>+ма<qst>$^?/?<sent>$

/apertium-kaz$ echo "біздікін" | hfst-proc kaz.automorf.hfst
<br />^біздікін/біз<prn><pers><p1><pl><px><acc>$

/apertium-kaz$ echo "^біз<prn><pers><p1><pl><px><acc>+ма<qst>$" | hfst-proc -g kaz.autogen.hfst
<br />біздікіні ме
</code>
| Has something to do with %{n%} archiphoneme realisation before clitics
|}

==== done (but keep an eye on) ====
* <s>Current: <code>^миллион<num><subst><dat>$ --> миллионге</code> Should be: <code>^миллион<num><subst><dat>$ --> миллионға</code></s>
* <s>Current: <code>^сөйле<v><tv><coop><ger_past><loc>$ --> сөйлесгенде</code> Should be: <code>^сөйле<v><tv><coop><ger_past><loc>$ --> сөйлескенде</code></s>
* <s>Kazakh: <code>^ойна<v><tv><ifi><p1><pl>$ --> ойнадык</code> Should be: ''ойнадыҚ''</s>
* <s>*журналистерді - *журналистеріне - *журналистерді
** something like <tt>т:0 <=> :с/:0 _ %{L%}:/:0</tt></s> (<tt>r40597</tt>)
* <s>(kaz) *Назарбаевтың</s> (<tt>r40594</tt>)
* <s>АКШ-*тың НАТО-*ның
:: This is a problem with lexc, not twol —[[User:Firespeaker|Firespeaker]] 06:15, 20 August 2012 (UTC)</s>
* <s>words with "[back vowel]...и[(cons)]" (i.e., borrowings)</s> '''dealt with via <tt>%{☭%}</tt>'''
** <s>(kaz) *организмдер / организм<n><pl><nom> = организмдар</s> (<tt>r40597</tt>)
** <s>Currently: <code>^Исраил<np><ant><m><gen>/Исраилдың$</code> and <code>^Исраил<np><ant><m><dat>/Исраилға$</code> Correct forms are ''Исраилдің'' and ''Исраилге'' respectively</s>
** <s>Currently: <code>^Иерусалим<np><top><dat>/Иерусалимға$</code>, ''Иерусалимдағы'' and ''Иерусалимның''. Correct forms are ''Иерусалимге'', ''Иерусалимдегі'' and ''Иерусалимнең'' respectively. '''In short, make them take front vowel affixes!'''</s>
* (kaz) процесс, процесі/процессі, процесінің/процессінің
* <s>(kaz) автомобиль<n><attr>, *автомобильдер // автомобиль<n><pl><nom> = автомобильлер
:: As far as I can tell, автомобильдер is the most common form. The form автомобилдер also seems to be used, but doesn't look super formal, and автомобильлер seems to only be attested in "Kazakh" because Nissan seems to like to write in Noğay for its Kazakh-speaking audience. —[[User:Firespeaker|Firespeaker]] 06:10, 20 August 2012 (UTC)
::: The thing is that the form we are generating is автомобильлер. - [[User:Francis Tyers|Francis Tyers]] 07:08, 20 August 2012 (UTC)</s> (<tt>r40704</tt>)
* <s>^организмге/*организмге$ *организмнің *организмнен</s> (<tt>r40705</tt>)
* <s>*тарихынан *тарихы ...</s> (≤<tt>r40705</tt>)

{| class="wikitable" border="1"
|-
! Currently generated incorrect form(s)
! Unanalyzed correct form(s)
! Comments
|-
| ^қаубы/қауіп<n><px3sp><nom>$
| 294 *қаупі
|
<pre>қауіп:қау%{y%}п N1 ; ! "danger"
қауіп:қауіп N1 ; ! "danger" Dir/LR</pre>

<pre>.gc қаупі=10,500 .gc қауіпі=1,560 .gc қаупы=27 .gc қауыпы=10 .gc қәуіп=11</pre>
|-
| ^құғы/құқ<n><px3sp><nom>$
| 284 *құқы
| The final consonant remains voiceless in intervocalic position.
|-
| ^жойу/жой<v><tv><ger><nom>$
| 215 *жою
|
|-
| align="center" colspan="3" | '''и phonology'''
|-
| ^жиіліп/жи<v><tv><pass><gna_perf>/жи<v><tv><pass><prc_perf>$
| 35 *жиылып
|rowspan="3"| Added in lexc as <code>жи:жи V-TV ; ! ""</code>. Tried to change it to <code>жи:жи%{й%} V-TV ; ! ""</code> — makes ''жиып'' work, but doesn't affect the gerund form. Not quite the right thing.
|-
| ^жиіп/жи<v><tv><gna_perf>/жи<v><tv><prc_perf>$
| 58 *жиып
|-
| ^жиу/жи<v><tv><ger><nom>$
| жию
|} —r42636 and previous

=== Tatar ===

# (tat) generates ''укыу''
# <s>(tat) generates ''айендә''</s>
# Deletion of soft sign "ь" before vowels in Tatar (see comments at the end of the <code>apertium-tat/apertium-tat.tat.twol</code> file)
# (tat) <s>''*аенда''</s>, generating ''музейе'' instead of correct ''музее''
# <code>apertium-tat$ echo "^йөр<v><iv><gpr_impf>$" | hfst-proc -g tat.autogen.hfst</code> >> <code>йөрә торган</code>
# ^хокукын/*хокукын$ <-- ^хокукын<n><px3sp><acc>$ See ^құқын/*құқын$ issue above.

{| class="wikitable" border="1"
|-
! Currently generated incorrect form(s)
! Unanalyzed correct form(s)
! Comments
|-
| ^безнекенеме/без<prn><pers><p1><pl><px><acc>+мы<qst>$
| ^безнекенме/*безнекенме$
| See "біздікі" above
|}

== Other ==
=== International vocabulary ===


* s/fut3/vol/
* <s>'''0 itself and numbers containing it aren't analyzed (in both directions)'''</s>
** <s>This is only true for the transducers in apertium-kaz-tat; the apertium-kaz and apertium-tat ones work fine.</s>
* A number with a following . is analyzed incorrectly and therefore not generated:
** When apertium (not hfst-proc) is used, this is the case for any number at the end of the line, because the deformatter automatically puts a "." at the end of the sentence.
<pre>
<pre>
/apertium-kaz$ echo "21." | hfst-proc kaz.automorf.hfst
*терроризмге *массивіндегі *террорлық *Факті
^21./21.<num>$
*кодекстің *терроризмге *Полицейлер *журналистерді «*АНТИТЕРРОРЛЫҚ »
*полицейлер *антитеррорлық *режим *полицейлер *журналистерді
*автоматты *автобустар *полицейлер *журналистеріне *сайттың

*технологиялар *компьютер *мобильді *техникаларға *интернет *объектілерін
*радиациялық *сантехник *проблемасы *веб-*сайттар *позитивті *алгебра

*коалициялық

*иммиграциялық *дипломатиялық *стратегиялық

*станциясында

</pre>
</pre>
* Turn the instrumental case into a clitic postposition, leaving only the 6 cases that are the same in both Tatar and Kazakh (see [http://wiki.apertium.org/wiki/Morphology_of_Kyrgyz_language#Cases_by_syntactic_function] and the log from 12.03.2013 for reference)

** update the t1x files accordingly (i.e. get rid of the rules for handling instrumental case)
===Proper nouns===
* Revise continuations of gerunds

* жігіт% %{М%}ен
<pre>
* Declension of Tatar nouns ending in -и.

* A separate cont.class for verbs which have causative forms ending with -дыр/-дер
*Гонконгтан
** Isn't this the default for {{tag|v}}{{tag|iv}} ?

* A "location-cases" cont. class for some of the postpositions and location adverbs (e.g. "бире")
</pre>
** What do you mean? —[[User:Firespeaker|Firespeaker]] 16:20, 6 February 2013 (UTC)

* [[Apertium-kaz-tat/Ideas_for_Disambiguation_Rules|Better disambiguation]]
=== Other ===
* <code>көр%<v%>%<tv%>%<imp%>%<p2%>%<sg%>:гөр # ; ! "" Dir/LR</code> gets trimmed
* ма не - мыни thing
* handle gna_cond + DA<postadv> issue in lexc, not in CG
* Handle the sentences from the paper in transfer, not in CG
* Some nouns in Tatar (and Kazakh) lexc seem to be in NLEX and NLEX-RUS. This is fine for analysis, but which form is generated? There should be some <tt>! Dir/..</tt> filtering somewhere in there.
* Consider ''турындагы'' - should it still be tagged as postposition?
* How should verbs with inner inflection be handled? Sometimes a Kazakh verb is translated with a multiword expression and vice versa, e.g. ''әуреле > башын әйләндер''.
* a better default translation for Kazakh past.evid


=== Algorithm for checking dictionaries (as part of the testvocing) ===
=== Discuss first ===


* Go through entries in bidix
# There is only one formal form (<frm>) in Tatar, which can be both sg and plural. But in Kazakh there are two forms. Should I pretend as if in Tatar it *were* the same and duplicate the same form with a different tag or should I handle it in transfer?
** Get rid of duplications, FIXME's, alternatively spelled variants (handling them in lexc instead)
# Consider ''турындагы'' - should it still be tagged as postposition?
* Look up the bidix stems in the lexc files and make sure every one of them is there (so that we don't lose any coverage)
# How should verbs with inner inflection be handled? Sometimes a Kazakh verb is translated with a multiword expression and vice versa, e.g. ''әуреле > башын әйләндер''.
* Try to get rid of FIXME's for stems in lexc's
----
* Have a quick look at continuation lexicons and make sure they follow standards (some of the lexicon names in tat.lexc in particular)

* Run testvoc (either for each different class separately -- commenting out all other root lexicons -- or without modifying anything if it doesn't take forever)
Part-of-speech related TODO's and DONE's can be found here:
* If a Tatar noun marked with 'Use/MT' is not used in kaz-tat.dix, get rid of it in tat.lexc

* [[Kazakh and Tatar/Postadvebs|/Postadverbs]]
* [[Kazakh and Tatar/Postpositions|/Postpositions]]

To run tests, use <code>aq-regtest</code> utility from [[Apertium-quality]] tools. E.g. <pre>aq-regtest -d . kaz-tat http://wiki.apertium.org/wiki/Special:Export/Kazakh_and_Tatar/Postadvebs</pre>

== Done ==

; But keep an eye on this

* Numerals
** kaz <num><subst>(<px3>) in fractions<ref>Currently whether it is in fractions or not is not taken into account</ref> = tat <num><subst>(<px3>)
** kaz <num><coll><advl> = tat <num><coll>
** kaz <num><coll><subst> = tat <num><subst>


== Notes ==
Line 187: Line 60:
* [[Kazakh and Tatar/Pending tests]]
* [[Kazakh and Tatar/Regression tests]]

[[Category:Kazakh and Tatar|*]]
[[Category:TODO lists]]

Latest revision as of 21:20, 31 August 2015

Goals

In both directions, apertium-kaz-tat has at least 15,000 top stems, 95% coverage on all the corpora we have, and a Word Error Rate (WER) of no more than 15% on any randomly selected text. It has full coverage of Абай жолы. Бірінші кітап and disambiguates it fully with at least 90% precision. There is a unified and documented testing framework covering morphotactics (specs for each type II LEXICON), morphophonology, coverage, CG performance, and regression and pending tests targeted mainly at transfer rules, plus a “gold standard” parallel corpus for measuring the WER and for always having something to work on. Tests are fast (slow parts are decoupled). Testvoc is clean.
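
The coverage and WER targets above could be checked with something like the following. This is only a rough sketch, not the project's actual tooling: the function names are made up for illustration, coverage is computed naively as the share of lexical units in analyser output (Apertium stream format, <code>^surface/analysis$</code>) whose first analysis does not start with <code>*</code>, and WER is taken to be the word-level edit distance divided by the length of the reference translation.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def naive_coverage(stream):
    """Share of lexical units that received at least one real analysis.

    Unknown words are assumed to look like ^word/*word$ in the stream.
    """
    units = LEXICAL_UNIT.findall(stream)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        analyses = unit.split("/")[1:]
        if analyses and not analyses[0].startswith("*"):
            known += 1
    return known / len(units)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed part of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

The 95% and 15% targets would then correspond to <code>naive_coverage(...) >= 0.95</code> on analyser output for a corpus and <code>word_error_rate(...) <= 0.15</code> against a post-edited reference.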

Road map

General TODO

  • s/fut3/vol/
  • 0 itself and numbers containing it aren't analyzed (in both directions)
    • This is only true for the transducers in apertium-kaz-tat; the apertium-kaz and apertium-tat ones work fine.
  • A number with a following . is analyzed incorrectly and therefore not generated:
    • When apertium (not hfst-proc) is used, this is the case for any number at the end of the line, because the deformatter automatically puts a "." at the end of the sentence.
/apertium-kaz$ echo "21." | hfst-proc kaz.automorf.hfst 
^21./21.<num>$
  • Turn the instrumental case into a clitic postposition, leaving only the 6 cases that are the same in both Tatar and Kazakh (see http://wiki.apertium.org/wiki/Morphology_of_Kyrgyz_language#Cases_by_syntactic_function and the log from 12.03.2013 for reference)
    • update the t1x files accordingly (i.e. get rid of the rules for handling instrumental case)
  • Revise continuations of gerunds
  • жігіт% %{М%}ен
  • Declension of Tatar nouns ending in -и.
  • A separate cont.class for verbs which have causative forms ending with -дыр/-дер
    • Isn't this the default for <v><iv> ?
  • A "location-cases" cont. class for some of the postpositions and location adverbs (e.g. "бире")
    • What do you mean? —Firespeaker 16:20, 6 February 2013 (UTC)
  • Better disambiguation
  • көр%<v%>%<tv%>%<imp%>%<p2%>%<sg%>:гөр # ; ! "" Dir/LR gets trimmed
  • ма не - мыни thing
  • handle gna_cond + DA<postadv> issue in lexc, not in CG
  • Handle the sentences from the paper in transfer, not in CG
  • Some nouns in Tatar (and Kazakh) lexc seem to be in NLEX and NLEX-RUS. This is fine for analysis, but which form is generated? There should be some ! Dir/.. filtering somewhere in there.
  • Consider турындагы - should it still be tagged as postposition?
  • How should verbs with inner inflection be handled? Sometimes a Kazakh verb is translated with a multiword expression and vice versa, e.g. әуреле > башын әйләндер.
  • a better default translation for Kazakh past.evid
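
For the "number with a following ." item in the list above, a small check along these lines could be added to the regression tests. It is a sketch only: the function name is hypothetical, and it assumes that the desired output keeps the number and the full stop as separate lexical units, which the list does not state explicitly.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")
NUMBER_WITH_DOT = re.compile(r"\d+\.")


def fused_number_dots(stream):
    """Return surface forms where digits plus a trailing '.' form one <num> unit."""
    fused = []
    for unit in LEXICAL_UNIT.findall(stream):
        surface, *analyses = unit.split("/")
        if not NUMBER_WITH_DOT.fullmatch(surface):
            continue
        for analysis in analyses:
            # The problem case is a lemma that still carries the dot, e.g. 21.<num>
            if analysis.endswith("<num>") and analysis[:-len("<num>")].endswith("."):
                fused.append(surface)
                break
    return fused
```

Run on the analyser output shown in the list, this would report <code>21.</code>; an empty result would mean the issue no longer reproduces for the tested input.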

Algorithm for checking dictionaries (as part of the testvocing)

  • Go through entries in bidix
    • Get rid of duplications, FIXME's, alternatively spelled variants (handling them in lexc instead)
  • Look up the bidix stems in the lexc files and make sure every one of them is there (so that we don't lose any coverage)
  • Try to get rid of FIXME's for stems in lexc's
  • Have a quick look at continuation lexicons and make sure they follow standards (some of the lexicon names in tat.lexc in particular)
  • Run testvoc (either for each different class separately -- commenting out all other root lexicons -- or without modifying anything if it doesn't take forever)
  • If a Tatar noun marked with 'Use/MT' is not used in kaz-tat.dix, get rid of it in tat.lexc
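
The "look up bidix stems in lexc" step above could be partly automated. The sketch below makes several simplifying assumptions and all function names are hypothetical: bidix entries are read from <code>l</code> (or <code>r</code>) elements, with <code>b</code> treated as a space and anything inside other child elements ignored, and lexc stems are taken from the part before the first colon of lines shaped like <code>stem:stem CONT ; ! "gloss"</code>. Entries without a colon and entries that need their continuation class checked are not handled.

```python
import xml.etree.ElementTree as ET


def side_lemma(side):
    """Lemma text of an <l> or <r> element, ignoring <s n="..."/> tags."""
    parts = [side.text or ""]
    for child in side:
        if child.tag == "b":  # <b/> marks a space inside a multiword
            parts.append(" ")
        parts.append(child.tail or "")
    return "".join(parts).strip()


def bidix_lemmas(bidix_xml, side="l"):
    """Set of lemmas found on one side ('l' or 'r') of a bidix."""
    root = ET.fromstring(bidix_xml)
    lemmas = set()
    for element in root.iter(side):
        lemma = side_lemma(element)
        if lemma:
            lemmas.add(lemma)
    return lemmas


def lexc_lemmas(lexc_text):
    """Set of upper-side strings of 'upper:lower CONT ;' lines in a lexc file."""
    lemmas = set()
    for line in lexc_text.splitlines():
        line = line.strip()
        if not line or line.startswith("!") or line.startswith("LEXICON"):
            continue
        if ":" in line:
            upper = line.split(":", 1)[0]
            # Drop lexc escape characters such as the % in "% "
            lemmas.add(upper.replace("%", "").strip())
    return lemmas


def missing_from_lexc(bidix_xml, lexc_text, side="l"):
    """Bidix lemmas on the given side that have no matching lexc entry."""
    return sorted(bidix_lemmas(bidix_xml, side) - lexc_lemmas(lexc_text))
```

Calling <code>missing_from_lexc</code> with the contents of the bidix and of the Kazakh lexc file (with <code>side="r"</code> for the Tatar one) would give a first list of stems to look at by hand before running testvoc.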

Notes


See also