Difference between revisions of "Kazakh and Tatar/TODO"
Revision as of 12:18, 2 February 2014
Goals
In both directions: apertium-kaz-tat has at least 15000 top stems, 95% coverage on all the corpora we have, and no more than 15% Word-Error-Rate on any randomly selected text. It has full coverage on Абай жолы. Бірінші кітап and disambiguates it fully with at least 90% precision. There is a unified and documented testing framework covering morphotactics (specs for each type II LEXICON), morphophonology, coverage, CG performance, and regression and pending tests targeted mainly at transfer rules, plus a “gold standard” parallel corpus for measuring the WER, so that there is always something to work on. Testvoc is clean.
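The 15% Word-Error-Rate goal can be checked with a word-level edit distance between the system output and a gold-standard translation. A minimal Python sketch, assuming plain whitespace tokenisation; the function name `word_error_rate` is made up for illustration and is not part of any Apertium tool:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substituted word out of four.
print(word_error_rate("a b c d", "a x c d"))
```

Because the distance is normalised by the reference length, the value can exceed 1.0 when the hypothesis is much longer than the reference.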
Road map
- add "internationalisms" spectie has put in /dev
General TODO
See Kazakh and Tatar/Work_plan and Kazakh and Tatar/Remaining unanalysed forms
- Tatar coverage 90%
- s/fut3/vol/
- 0 itself and numbers containing it aren't analyzed (in both directions)
- This is only true for the transducers in apertium-kaz-tat; the apertium-kaz and apertium-tat ones work fine.
 
- A number with a following . is analyzed incorrectly and therefore not generated:
- When apertium (not hfst-proc) is used, this is the case for any number at the end of the line, because deformatter puts a "." at the end of the sentence automatically.
 
/apertium-kaz$ echo "21." | hfst-proc kaz.automorf.hfst
^21./21.<num>$
- Turn the instrumental case into a clitic postposition, leaving only 6 cases, which are the same in both Tatar and Kazakh (see [[1]] and the log from 12.03.2013 for reference)
- update the t1x files accordingly (i.e. get rid of the rules for handling instrumental case)
 
- Revise continuations of gerunds
- жігіт% %{М%}ен
- Declension of Tatar nouns ending in -и.
- A separate cont.class for verbs which have causative forms ending with -дыр/-дер
- Isn't this the default for <v><iv>?
 
- A "location-cases" cont. classes for some of the postpositions and location adverbs (e.g. "бире")
- What do you mean? —Firespeaker 16:20, 6 February 2013 (UTC)
 
- Better disambiguation
- көр%<v%>%<tv%>%<imp%>%<p2%>%<sg%>:гөр # ; ! "" Dir/LR gets trimmed
- ма не - мыни thing
- handle gna_cond + DA<postadv> issue in lexc, not in CG
- Handle the sentences from the paper in transfer, not in CG
Might be twol, might not be, but JNW needs to go through this stuff and figure out the issues.
Kazakh
| Currently generated incorrect form(s) | Unanalyzed correct form(s) | Comments | 
|---|---|---|
| ^жатқандықын/жат<v><iv><ger_perf><px3sp><acc>/жат<vaux><ger_perf><px3sp><acc>$ | 200 *жатқандығын | %<ger_perf%>:%>%{G%}%{A%}н%>%{L%}%{I%}%{K%} GER-INFL ; | 
| | Has something to do with %{n%} archiphoneme realisation before clitics | |
done (but keep an eye on)
- Current: ^миллион<num><subst><dat>$ --> миллионге
- Should be: ^миллион<num><subst><dat>$ --> миллионға
- Current: ^сөйле<v><tv><coop><ger_past><loc>$ --> сөйлесгенде
- Should be: ^сөйле<v><tv><coop><ger_past><loc>$ --> сөйлескенде
- Kazakh: ^ойна<v><tv><ifi><p1><pl>$ --> ойнадык
- Should be: ойнадыҚ
- *журналистерді, *журналистеріне, *журналистерді
- something like т:0 <=> :с/:0 _ %{L%}:/:0 (r40597)
 
- (kaz) *Назарбаевтың (r40594)
- АКШ-*тың НАТО-*ның
- This is a problem with lexc, not twol —Firespeaker 06:15, 20 August 2012 (UTC)
 
- words with "[back vowel]...и[(cons)]" (i.e., borrowings), dealt with via %{☭%}
- (kaz) *организмдер / организм<n><pl><nom> = организмдар (r40597)
- Currently: ^Исраил<np><ant><m><gen>/Исраилдың$ and ^Исраил<np><ant><m><dat>/Исраилға$. Correct forms are Исраилдің and Исраилге respectively.
- Currently: ^Иерусалим<np><top><dat>/Иерусалимға$, Иерусалимдағы and Иерусалимның. Correct forms are Иерусалимге, Иерусалимдегі and Иерусалимнең respectively. In short, make them take front vowel affixes!
 
- (kaz) процесс, процесі/процессі, процесінің/процессінің
- (kaz) автомобиль<n><attr>, *автомобильдер // автомобиль<n><pl><nom> = автомобильлер
- As far as I can tell, автомобильдер is the most common form. The form автомобилдер also seems to be used, but doesn't look super formal, and автомобильлер seems to only be attested in "Kazakh" because Nissan seems to like to write in Noğay for its Kazakh-speaking audience. —Firespeaker 06:10, 20 August 2012 (UTC)
- The thing is that the form we are generating is автомобильлер. - Francis Tyers 07:08, 20 August 2012 (UTC) (r40704)
 
 
- ^организмге/*организмге$ *организмнің *организмнен (r40705)
- *тарихынан *тарихы ... (≤r40705)
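Several of the fixes in the list above come down to choosing the back or front (and voiced or voiceless) variant of the dative suffix. The toy Python sketch below reproduces the corrected forms quoted above (миллионға, Исраилге, Иерусалимге, организмге). It is not how the twol rules work: letting the last vowel decide, treating и as front, ignoring у, э, ю, я, and the voiceless-consonant set are all simplifying assumptions made here for illustration, and the function name `dative` is hypothetical.

```python
BACK_VOWELS = set("аоұы")
# Assumption: и counts as front, which matches the borrowings listed above
# but is wrong for native stems such as жи.
FRONT_VOWELS = set("әеөүіи")
# Assumption: a simplified set of voiceless stem-final consonants.
VOICELESS = set("пфкқтсшщхцчһ")


def dative(stem: str) -> str:
    """Attach a dative suffix (-ға/-ге/-қа/-ке) using toy harmony rules."""
    lowered = stem.lower()
    vowels = [ch for ch in lowered if ch in BACK_VOWELS or ch in FRONT_VOWELS]
    if not vowels:
        raise ValueError("stem has no vowel this sketch knows about")
    back = vowels[-1] in BACK_VOWELS          # the last vowel decides harmony
    voiceless = lowered[-1] in VOICELESS      # the final letter decides voicing
    if back:
        return stem + ("қ" if voiceless else "ғ") + "а"
    return stem + ("к" if voiceless else "г") + "е"


for stem in ["миллион", "Исраил", "Иерусалим", "организм"]:
    print(stem, "->", dative(stem))
```

A lookup like this is only useful for spot-checking expected surface forms; exceptions still have to be handled in lexc or twol, as the items above describe.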
| Currently generated incorrect form(s) | Unanalyzed correct form(s) | Comments | 
|---|---|---|
| ^қаубы/қауіп<n><px3sp><nom> | 294 *қаупі | қауіп:қау%{y%}п N1 ; ! "danger" қауіп:қауіп N1 ; ! "danger" Dir/LR .gc қаупі=10,500 .gc қауіпі=1,560 .gc қаупы=27 .gc қауыпы=10 .gc қәуіп=11 |
| ^құғы/құқ<n><px3sp><nom>$ | 284 *құқы | Final consonant remains voiceless in intervocalic position. |
| ^жойу/жой<v><tv><ger><nom>$ | 215 *жою | |
| и phonology | | |
| ^жиіліп/жи<v><tv><pass><gna_perf>/жи<v><tv><pass><prc_perf>$ | 35 *жиылып | Added in lexc as жи:жи V-TV ; ! "". Tried changing it to жи:жи%{й%} V-TV ; ! "", which makes жиып work but doesn't affect the gerund form. Not quite the right thing. |
| жиіп/жи<v><tv><gna_perf>/жи<v><tv><prc_perf>$ | 58 *жиып | |
| ^жиу/жи<v><tv><ger><nom>$ | жию | |
—r42636 and previous
Tatar
- (tat) generates укыу
- (tat) generates айендә
- Deletion of soft sign "ь" before vowels in Tatar (see comments at the end of the apertium-tat/apertium-tat.tat.twol file)
- (tat) *аенда, generating музейе instead of correct музее
- apertium-tat$ echo "^йөр<v><iv><gpr_impf>$" | hfst-proc -g tat.autogen.hfst
- >> йөрә торган
- ^хокукын/*хокукын$ <-- ^хокукын<n><px3sp><acc>$ See ^құқын/*құқын$ issue above.
| Currently generated incorrect form(s) | Unanalyzed correct form(s) | Comments | 
|---|---|---|
| ^безнекенеме/без<prn><pers><p1><pl><px><acc>+мы<qst>$ | ^безнекенме/*безнекенме$ | See "біздікі" above | 
Other
International vocabulary
*терроризмге *массивіндегі *террорлық *Факті *кодекстің *терроризмге *Полицейлер *журналистерді «*АНТИТЕРРОРЛЫҚ » *полицейлер *антитеррорлық *режим *полицейлер *журналистерді *автоматты *автобустар *полицейлер *журналистеріне *сайттың *технологиялар *компьютер *мобильді *техникаларға *интернет *объектілерін *радиациялық *сантехник *проблемасы *веб-*сайттар *позитивті *алгебра *коалициялық *иммиграциялық *дипломатиялық *стратегиялық *станциясында
Proper nouns
*Гонконгтан
Other
- Some nouns in Tatar (and Kazakh) lexc seem to be in NLEX and NLEX-RUS. This is fine for analysis, but which form is generated? There should be some ! Dir/.. filtering somewhere in there.
Discuss first
- There is only one formal form (<frm>) in Tatar, which can be both sg and plural. But in Kazakh there are two forms. Should I pretend as if in Tatar it *were* the same and duplicate the same form with a different tag or should I handle it in transfer?
- Consider турындагы - should it still be tagged as postposition?
- How to handle verbs with inner inflection (sometimes a Kazakh verb is translated with a multiword and vice versa. E.g. әуреле > башын әйләндер)
Part-of-speech related TODO's and DONE's can be found here:
To run tests, use the aq-regtest utility from the Apertium-quality tools. E.g.
aq-regtest -d . kaz-tat http://wiki.apertium.org/wiki/Special:Export/Kazakh_and_Tatar/Postadvebs
Done
- But keep an eye on this
- Numerals
- kaz <num><subst>(<px3>) in fractions[1] = tat <num><subst>(<px3>)
- kaz <num><coll><advl> = tat <num><coll>
- kaz <num><coll><subst> = tat <num><subst>
 
Release TODO
- Clean up Tatar issues mentioned in paper
- Finish getting kaz-tat running on bytemark
- Migrate apertium-kaz and kaz-tat to hfst>ltt stuff
- Set up e.g. turkic@apertium.org as a distribution to certain/most/all (??) members of apertium-turkic
- Finish setting up Apertium Turkic
- Tidy up release text
TODO (Camera-ready)
Ilnar
- 0.1.0 stable release
- Go through FIXMEs and CHECKs with JNW to see how many we can clear up
- Discuss дағы <cnjcoo>/<postadv> thing with JNW
- Agreed about handling it in lexc, but I'd prefer to do it after the release --selimcan 19:12, 1 July 2013 (UTC)
- Did we ever implement the <adj>+е<cop> stuff we discussed? Both -kaz and -tat need it.
- Implemented as of r45395. Some disambiguation might be needed though (<adj.subst>+cop at the end of a sentence?) --selimcan 10:23, 29 June 2013 (UTC)
 
 
- tat→kaz eval
- Get Kazakh speakers (казашки) to post-edit the Kazakh output
 
- You mention that the output of the morphological analyser is ambiguous for Kazakh. Is it similar for Tatar? Since the MT system works in both directions this information should be provided.
- Provide the effort in terms of person months devoted to build the current system (+ The effort required to increase the coverage to 95%?)
- We should give this information. The amount of time should be calculated in full-time months. So it's probably around 4-5 ? —FMT (We should exclude the HFST bug in this.)
 
- calculate gain in person hours using system over translating by hand?
Algorithm for checking dictionaries (as part of the testvocing)
- Go through entries in bidix
- Get rid of duplications, FIXME's, alternatively spelled variants (handling them in lexc instead)
 
- Look up bidix stems in the lexc files and make sure every one of them is in there (so that we don't lose any coverage)
- Try to get rid of FIXME's for stems in lexc's
- Have a quick look at continuation lexicons and make sure they follow standards (some of the lexicon names in tat.lexc in particular)
- Run testvoc (either for each different class separately -- commenting out all other root lexicons -- or without modifying anything if it doesn't take forever)
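The "look up bidix stems in lexc" step of the algorithm above can be roughed out in Python. This is only a sketch under stated assumptions: the bidix is taken to use the usual `<e><p><l>…</l><r>…</r></p></e>` layout, lexc entries are taken to look like `upper:lower CONT ;`, the helper names are invented for this example, and the inline sample data (including the Tatar right-hand sides) is illustrative rather than copied from the real dictionaries.

```python
import re
import xml.etree.ElementTree as ET

# Upper side of a lexc entry such as:  қауіп:қау%{y%}п N1 ; ! "danger"
LEXC_ENTRY = re.compile(
    r"^\s*(?P<upper>(?:%.|[^%:\s!;])+)"  # upper side, allowing %-escapes
    r"(?::(?:%.|[^%\s!;])*)?"            # optional :lower side
    r"\s+\S+\s*;"                        # continuation class and ';'
)


def lexc_stems(lexc_text):
    """Collect the upper (lemma) side of every stem entry in a lexc text."""
    stems = set()
    for line in lexc_text.splitlines():
        match = LEXC_ENTRY.match(line)
        if match:
            stems.add(re.sub(r"%(.)", r"\1", match.group("upper")))
    return stems


def bidix_lemmas(bidix_xml, side="l"):
    """Collect lemmas from one side ('l' or 'r') of a bidix."""
    lemmas = set()
    for entry in ET.fromstring(bidix_xml).iter("e"):
        node = entry.find("p/" + side)
        if node is None:
            continue
        parts = [node.text or ""]
        for child in node:
            if child.tag == "b":  # <b/> marks a space inside a multiword
                parts.append(" " + (child.tail or ""))
        lemma = "".join(parts).strip()
        if lemma:
            lemmas.add(lemma)
    return lemmas


def missing_from_lexc(bidix_xml, lexc_text, side="l"):
    """Bidix lemmas on the given side that have no lexc stem entry."""
    return sorted(bidix_lemmas(bidix_xml, side) - lexc_stems(lexc_text))


BIDIX = """<dictionary><section id="main" type="standard">
<e><p><l>құқ<s n="n"/></l><r>хокук<s n="n"/></r></p></e>
<e><p><l>жи<s n="v"/><s n="tv"/></l><r>җый<s n="v"/><s n="tv"/></r></p></e>
</section></dictionary>"""

KAZ_LEXC = """LEXICON Verbs
жи:жи V-TV ; ! ""
"""

print(missing_from_lexc(BIDIX, KAZ_LEXC))
```

The comparison is on lemma strings only; it does not check that the part-of-speech tags in the bidix match the continuation class in lexc, which testvoc would still have to catch.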
In addition
- If a Tatar noun marked with 'Use/MT' is not used in kaz-tat.dix, get rid of it in tat.lexc
Fran
- Help calculate BLEU score and explain why it's stupid.
JNW
- new style file
- fix some of the last few bits of Tatar phonology?
- Explain that the irregular A-stem verbs in Tatar are a lexical property of the verb
- do the two verb classes in Tatar differ for some reason other than the last vowel in the verb stem? (That is, is class membership a lexical property or a phonological property, and if the latter, why is this a lexicon distinction and not a morphophonological issue?)
 
- fix all references to revision numbers after we have a stable release that we're testing on
- Minor stuff
- 2nd -> write 2^(nd) or second
- table 1 -> Table 1
- Fig. 1 -> Figure 1
- based on context rules the -> based on context rules, the
- structural transfer module which -> module, which
- CG and Constraint Grammar forms used
- Table 4 does not appear in running text.
- The contents of Sections 2 and 3 do not correspond to the description given in Section 1.
 
- justification/motivation for the system
- gain in person hours using system over translating by hand
- For a marginalised language community to be able to communicate externally without going through the majority language.
- An MT system gives a good opportunity to work on other language resources, e.g. morph. analyser, disambiguator, etc. --- these resources can be used for other things.
- The time-effort : reward pay off is high. With a few months, we can get an effective system. This would not be the case for unrelated languages.
- Why going through Russian isn't "better": (This is retarded, we should probably put in a comment on why this is stupid. —FMT)
- Translate documents/sources/etc. efficiently without finding/paying someone to translate it to Russian first
 
- more text on wikipedia in Kazakh (and generally more "human knowledge" available, maybe):
- kk.wiki 200,000 articles, tt.wiki <50k
- kk:Астана vs. tt:Астана.
 
 
incorporated above
- justification stuff
- The gain (if any) when using the current system followed by post-editing compared to translating from scratch
- Could we do a small time-based experiment? E.g. time taken to translate 2000 words with/without the system? —FMT
 
 
- BLEU score
- We can give the BLEU score with a footnote saying that it is not comparable with anything because our translations are posteditions. —FMT
- BLEU is stupid, but we can give the numbers, see previous. In the presentation I'm going to have a slide about why BLEU is stupid. —FMT
 
unincorporated
- add latin-alphabet transcriptions? ☹
- Problem is, a transcription would hide many of the "phonological"/orthographic problems we've had to deal with —Firespeaker 18:15, 1 June 2013 (UTC)
- Only one reviewer complained about this. If someone's really interested in the paper and doesn't know Cyrillic, they can work around it—figuring out Cyrillic to the extent needed to figure out what we're talking about really isn't that hard. It'll take up much more space than it's worth, I think. —Firespeaker 18:15, 1 June 2013 (UTC)
 
- Lemma-by-lemma analysis
- We could evaluate the system on a lemma-to-lemma basis too, but could you think of examples where we might get the wrong lemma but the right surface form ? —FMT
- In some cases of ambiguity this could happen. —JNW
 
- take out comparative phonology stuff?
TODO (Original submission)
Ideal benchmarks:
- document rules in the rlx with example sentences
- more like 100-150 (currently ~40) disambiguation rules in -kaz
Ilnar
- Development corpus (lots and lots of text)
- Work on increasing coverage (via lexc) and trimmed coverage (via dix) to 90%
- Work on making sure testvoc passes
- i.e., corpus testvoc
 
- add rules: disambiguation (CG), lexical selection, and transfer.
 
- Test corpus (about 10 pages; don't base rules on this text!)
- Make a gold standard translation/correct some tests for error-rate testing. Done.
 
- Paper
- Add affiliation to paper
- Help JNW come up with some more contrastive stuff (see below / FIXME: Ilnars in paper)
- Tatar equivalent of барайын деп жатырмын "I'm planning on going"?
- What was it that you noticed with -ғалы/-гелі (and its correspondent in Tatar)?
 
- Find some exemplary bidix entries for figure 2.
- New example for table 3
- maybe Kazakh equivalent of original sentence: "Ауа райы бүгін өте/әбден жақсы, жылы."
- maybe "Ол енді ол дыбысты анығырақ ести бастады" (some good ambiguity). Unfortunately, current output is "Ул иңне ул тавышны аныграк ишетә башлады". Could we fix this?
 
 
Fran
- Delegate out error-rate testing tasks
- new version of Table 2
JNW
- Work on last few issues in -tat twol
- Write up background
- Contrastive analysis of Kazakh and Tatar
- phonological differences (a generalised summary, 2 or 3 small specific examples)
- orthographical differences (a generalised summary, 1 or 2 small specific examples)
- lexical and morphological differences (2 or 3 specific examples)
- morphotactic differences (2 or 3 specific examples)
- syntactic differences (2 or 3 specific examples)
 
- Coverage stuff
- divide corpora into 10 pieces and run coverage for each to get stddev
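The "divide corpora into 10 pieces" step can be sketched in Python as follows. The sketch assumes the corpus has already been run through the analyser and is in Apertium stream format, where each lexical unit is `^surface/analysis$` and unknown words come back as `^word/*word$`; escaped `$` characters inside units are ignored for simplicity, and all function names are invented for this example.

```python
import re
import statistics

LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def is_known(unit):
    """A unit is unknown when its analysis starts with '*'."""
    return "/*" not in unit


def split_evenly(items, pieces):
    """Split items into `pieces` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), pieces)
    chunks, start = [], 0
    for k in range(pieces):
        end = start + size + (1 if k < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def coverage_per_piece(analysed, pieces=10):
    """Naive coverage (known units / all units) for each piece."""
    units = LEXICAL_UNIT.findall(analysed)
    if len(units) < pieces:
        raise ValueError("need at least one lexical unit per piece")
    return [
        sum(map(is_known, chunk)) / len(chunk)
        for chunk in split_evenly(units, pieces)
    ]


def coverage_summary(analysed, pieces=10):
    """Mean and sample standard deviation of per-piece coverage."""
    values = coverage_per_piece(analysed, pieces)
    return statistics.mean(values), statistics.stdev(values)


# Synthetic input: 10 known units, then 5 known/unknown pairs.
SAMPLE = " ".join(["^a/a<n>$"] * 10 + ["^a/a<n>$", "^b/*b$"] * 5)
print(coverage_summary(SAMPLE))
```

`statistics.stdev` is the sample standard deviation; `statistics.pstdev` would be the alternative if the 10 pieces are treated as the whole population.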
 
Over-all
Abstract 1 2 3 3.1 3.2 3.3 3.4 4 4.1 4.2 4.3 4.4 4.5 5 5.1 6 Acknowledgements References
Notes
- ↑ Currently whether it is in fractions or not is not taken into account

