Kazakh and Tatar
- The Kazakh transducer has 36,595 stems and ~94.5% coverage over random corpora
- The Tatar transducer has 55,702 stems and ~91% coverage over random corpora
$ echo "бұл аударушымен татарша жазылған тексттер қазақша аударып оқыса болады" | apertium -d . kaz-tat
бу аударучы белән татарча язылган *тексттер казакъча аударып укыша була(hrm)
You will need:
- hfst (svn ≥r1916)
Information on what remains to be done for this pair can be found at the /TODO list.
We work on the monolingual transducers (apertium-kaz and apertium-tat) separately, and use a special process to import into the pair trimmed transducers that contain only the words found in the bidix. The following documents this process.
In order to add a new word (and its translation equivalent) to the Kazakh-Tatar translator, you have to do the following:
- add an entry in the bilingual dictionary —
- add an entry in the Kazakh monolingual dictionary —
apertium-kaz.kaz.lexc file, which, as the name indicates, is in the
- add an entry in the Tatar monolingual dictionary —
- in the incubator/apertium-kaz-tat directory in a terminal, run the update-morphs.bash script and recompile.
This script runs the trim-lexc.py script, which lives in /trunk/apertium-tools, and copies its output to the kaz-tat directory, renaming the files to fit the naming conventions.
In addition, it copies .twol and .rlx files from the apertium-kaz and apertium-tat directories.
The same workflow applies to any other pair involving Kazakh and Tatar: if you need to change something in the .lexc, .twol or .rlx files of these languages, you do so in the apertium-kaz and apertium-tat directories respectively, and then copy these files to the "bilingual" directory you are working on (either with the update-morphs.bash script or manually).
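The idea of trimming a lexicon down to the words found in the bidix can be illustrated with a short Python sketch. This is not the actual trim-lexc.py, whose implementation is not described here; the function names, the `stem_lexicons` parameter and the sample entries are assumptions made for illustration. It only handles plain `lemma:surface CONT ;` entries and treats the first `!` on a line as the start of a comment.

```python
def lexc_lemma(line):
    """Return the unescaped lemma of a 'lemma:surface CONT ;' entry, else None."""
    code = line.split("!", 1)[0]        # drop the trailing lexc comment
    if ":" not in code or ";" not in code:
        return None                     # not a lemma:surface entry
    lemma = code.split(":", 1)[0].strip()
    return lemma.replace("% ", " ")     # lexc escapes a space as "% "


def trim_lexc(lines, bidix_lemmas, stem_lexicons):
    """Keep stem entries only if their lemma is in bidix_lemmas.

    Only entries inside the lexicons named in stem_lexicons (a hypothetical
    parameter) are candidates for removal; every other line is kept as-is.
    """
    kept = []
    current = None                      # name of the lexicon being read
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("LEXICON"):
            parts = stripped.split()
            current = parts[1] if len(parts) > 1 else None
            kept.append(line)
            continue
        lemma = lexc_lemma(line) if current in stem_lexicons else None
        if lemma is None or lemma in bidix_lemmas:
            kept.append(line)
    return kept


# Illustrative input: "stem:stem" is a made-up entry that is not in the bidix.
sample = [
    "LEXICON Root",
    "Adverbs ;",
    "LEXICON Adverbs",
    'барлық% жерде:барлық% жерде ADV ; ! "" Use/MT',
    'stem:stem ADV ; ! ""',
    "LEXICON ADV",
    "%<adv%>:0 # ;",
]
trimmed = trim_lexc(sample, {"барлық жерде"}, {"Adverbs"})
print("\n".join(trimmed))
```

Restricting removal to named stem lexicons keeps tag and continuation entries such as `%<adv%>:0 # ;` from being dropped just because they contain a colon.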
Adding language-pair-specific stems to the lexc files
Sometimes we have to translate a word into Kazakh or Tatar with two or more words, e.g.:
<e><p><l>everywhere<s n="adv"/></l><r>барлық<b/>жерде<s n="adv"/></r></p></e>
In order to make it work, we will need to add барлық жерде as a single adverb in kaz.lexc, like this:
барлық% жерде:барлық% жерде ADV ; ! ""
These are, of course, two words in Kazakh (a determiner and a noun), and we do not want the standalone Kazakh morphological analyzer to treat them as a single word. To record that the entry was added for a specific language pair, we mark it with Use/MT at the end of the line:
барлық% жерде:барлық% жерде ADV ; ! "" Use/MT
That way, we can collaborate on a single file across many pairs, and at the same time we are able to "clean up" the lexicon for standalone use if needed.
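One possible way to do such a clean-up is sketched below in Python. It is not part of the Apertium tooling described above; the function names are hypothetical, and the sketch assumes that the marker always appears as a separate word in the comment after the first `!` on the line.

```python
def is_mt_only(line, marker="Use/MT"):
    """True if the lexc comment (text after the first '!') carries the marker."""
    _, sep, comment = line.partition("!")
    return bool(sep) and marker in comment.split()


def strip_mt_entries(lines, marker="Use/MT"):
    """Return the lines with pair-specific (marked) entries removed."""
    return [line for line in lines if not is_mt_only(line, marker)]


# Illustrative input: "stem:stem" is a made-up entry without the marker.
lexicon = [
    "LEXICON Adverbs",
    'барлық% жерде:барлық% жерде ADV ; ! "" Use/MT',
    'stem:stem ADV ; ! ""',
]
standalone = strip_mt_entries(lexicon)
print("\n".join(standalone))
```

Matching the marker as a whole word in the comment, rather than anywhere on the line, avoids removing an entry whose lemma or gloss merely contains the same characters.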