Difference between revisions of "Kazakh and Tatar/TODO"
		
		
		
		
		
		
		
				
		
		
		
		
		
		
		
	
Edit by Firespeaker (minor): clean up some of the outdated parts
Revision as of 21:22, 19 August 2015
Goals
In both directions, apertium-kaz-tat has at least 15,000 top stems, 95% coverage on all the corpora we have, and a Word Error Rate (WER) of no more than 15% on any randomly selected text. It has full coverage of Абай жолы. Бірінші кітап and disambiguates it fully with at least 90% precision. There is a unified, documented testing framework covering morphotactics (specs for each type II LEXICON), morphophonology, coverage, CG performance, and regression and pending tests aimed mainly at transfer rules, plus a “gold standard” parallel corpus for measuring WER and for always having something to work on. Tests are fast (slow parts are decoupled). Testvoc is clean.
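The WER target can be checked against a gold-standard translation with a word-level edit distance. The following is a minimal sketch, not the project's actual evaluation tool: the function name `word_error_rate` is hypothetical, and tokenization is assumed to be plain whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are whitespace-separated; punctuation is not
    treated specially.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A text meets the goal when `word_error_rate(post_edited, machine_output) <= 0.15`, where both arguments are the full texts as strings.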
Road map
- coverage (more stems and better morphology)
- add "internationalisms" spectie has put in /dev
- see also: Kazakh and Tatar/Remaining unanalysed forms
 
- constraint grammar
- transfer
- lexical selection
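Progress on the coverage item can be measured from analyser output. The sketch below assumes the output is in Apertium stream format, where each lexical unit looks like `^surface/analysis1/analysis2$` and an unknown word is marked with a leading `*` in its analysis. The function names are hypothetical, and backslash-escaped characters in the stream are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters (e.g. \/ or \$) are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def unknown_forms(analysed: str) -> list:
    """Surface forms whose analysis starts with '*', i.e. unanalysed words."""
    return [
        surface
        for surface, analyses in LEXICAL_UNIT.findall(analysed)
        if analyses.startswith("*")
    ]


def coverage(analysed: str) -> float:
    """Share of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        raise ValueError("no lexical units found")
    return 1 - len(unknown_forms(analysed)) / len(units)
```

The list returned by `unknown_forms` is the kind of material that would feed the "Remaining unanalysed forms" page.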
General TODO
- s/fut3/vol/
- 0 itself and numbers containing it aren't analyzed (in both directions)
- This is only true for the transducers in apertium-kaz-tat; the apertium-kaz and apertium-tat ones work fine.
 
- A number with a following . is analyzed incorrectly and therefore not generated:
- When apertium (not hfst-proc) is used, this is the case for any number at the end of a line, because the deformatter automatically appends a "." to the end of the sentence.
 
/apertium-kaz$ echo "21." | hfst-proc kaz.automorf.hfst
^21./21.<num>$
- Turn the instrumental case into a clitic postposition, leaving only 6 cases, which are the same in both Tatar and Kazakh (see [[1]] and the log from 12.03.2013 for reference)
- update the t1x files accordingly (i.e. get rid of the rules for handling instrumental case)
 
- Revise continuations of gerunds
- жігіт% %{М%}ен
- Declension of Tatar nouns ending with -и.
- A separate cont.class for verbs which have causative forms ending with -дыр/-дер
- Isn't this the default for <v><iv>?
 
- A "location-cases" cont. class for some of the postpositions and location adverbs (e.g. "бире")
- What do you mean? —Firespeaker 16:20, 6 February 2013 (UTC)
 
- Better disambiguation
- көр%<v%>%<tv%>%<imp%>%<p2%>%<sg%>:гөр # ; ! "" Dir/LR gets trimmed
- ма не - мыни thing
- handle gna_cond + DA<postadv> issue in lexc, not in CG
- Handle the sentences from the paper in transfer, not in CG
- Some nouns in Tatar (and Kazakh) lexc seem to be in NLEX and NLEX-RUS. This is fine for analysis, but which form is generated? There should be some ! Dir/.. filtering somewhere in there.
- Consider турындагы - should it still be tagged as postposition?
- How should verbs with inner inflection be handled? Sometimes a Kazakh verb is translated with a multiword expression and vice versa (e.g. әуреле > башын әйләндер).
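The trailing-period item in the list above could be turned into a small regression check. This is a sketch under stated assumptions: it scans analyser output in Apertium stream format and flags numeral analyses whose lemma has swallowed a final period, as in `^21./21.<num>$`. The function name is hypothetical, and the "good" output used for comparison in the usage note is an assumption, not something the page specifies.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")
# A numeral analysis whose lemma ends in a period, e.g. 21.<num>
NUM_WITH_PERIOD = re.compile(r"^\d+\.<num>")


def numerals_swallowing_period(analysed: str) -> list:
    """Surface forms with at least one analysis of the shape digits + '.' + <num>."""
    flagged = []
    for surface, analyses in LEXICAL_UNIT.findall(analysed):
        if any(NUM_WITH_PERIOD.match(a) for a in analyses.split("/")):
            flagged.append(surface)
    return flagged
```

Run on the output shown in the list, the check flags `21.`; an output that splits the number from the period, for example `^21/21<num>$^./.<sent>$` (assumed here for illustration), yields an empty list.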
Algorithm for checking dictionaries (as part of the testvocing)
- Go through entries in bidix
- Get rid of duplications, FIXME's, alternatively spelled variants (handling them in lexc instead)
 
- Look up bidix stems in the lexc files and make sure every one of them is in there (and that we don't lose any coverage)
- Try to get rid of FIXME's for stems in lexc's
- Have a quick look at continuation lexicons and make sure they follow standards (some of the lexicon names in tat.lexc in particular)
- Run testvoc (either for each different class separately -- commenting out all other root lexicons -- or without modifying anything if it doesn't take forever)
- If a Tatar noun marked with 'Use/MT' is not used in kaz-tat.dix, get rid of it in tat.lexc
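The first steps of the procedure above (finding duplicate bidix entries and bidix stems missing from lexc) can be partly automated. The sketch below is illustrative only. It assumes bidix entries of the form `<e><p><l>lemma<s n="tag"/></l><r>...</r></p></e>` and lexc stem entries of the form `lemma:stem CONT ;`. Multiword lemmas containing `<b/>`, regex entries, and lexc escapes other than a bare `%` are not handled, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def side_key(side):
    """(lemma, tags) for an <l> or <r> element; multiwords with <b/> not handled."""
    lemma = (side.text or "").strip()
    tags = tuple(s.get("n") for s in side.findall("s"))
    return lemma, tags


def bidix_pairs(bidix_xml):
    """One (left, right) pair per <e> entry that has a <p> with <l> and <r>."""
    root = ET.fromstring(bidix_xml)
    pairs = []
    for entry in root.iter("e"):
        left, right = entry.find("p/l"), entry.find("p/r")
        if left is not None and right is not None:
            pairs.append((side_key(left), side_key(right)))
    return pairs


def lexc_lemmas(lexc_text):
    """Upper-side strings of lexc entries, assuming 'lemma:stem CONT ;' lines."""
    lemmas = set()
    for line in lexc_text.splitlines():
        line = line.split("!", 1)[0].strip()  # drop lexc comments
        if not line or line.startswith("LEXICON") or ";" not in line:
            continue
        upper = line.split()[0].split(":", 1)[0]
        lemmas.add(upper.replace("%", ""))
    return lemmas


def check_bidix(bidix_xml, source_lexc_text):
    """Return (duplicate entries, left-side lemmas missing from the lexc)."""
    pairs = bidix_pairs(bidix_xml)
    duplicates = sorted(p for p, n in Counter(pairs).items() if n > 1)
    known = lexc_lemmas(source_lexc_text)
    missing = sorted({left[0] for left, _ in pairs if left[0] not in known})
    return duplicates, missing
```

To check the other direction, the same comparison would be run on the right-side lemmas against the target-language lexc; the FIXME and testvoc steps still need manual review.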
Notes

