Kazakh and Tatar
This is a language pair translating between Kazakh and Tatar. The pair is currently located in trunk.
General information
- The Kazakh transducer has 36,595 stems and ~94.5% coverage over random corpora
- The Tatar transducer has 55,702 stems and ~91% coverage over random corpora
Demonstration
$ echo "Бұл аударушымен татарша жазылған мәтіндерді қазақшаға аударып оқуға болады." | apertium -d . kaz-tat
Бу тәрҗемәче белән татарча язылган текстләрне казакъчага аударып укыуга була.
Installation
You will need:
- hfst (svn ≥r1916)
- foma
- flex
- apertium
- lttoolbox (svn ≥r46087)
- CG
- apertium-lex-tools
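Before building, it can help to check which of these tools are already on the PATH. The following is a minimal sketch, not part of the pair itself. The function name missing_tools is hypothetical, and the command names passed to it are assumptions about which binary each package installs. It does not check the required revisions.

```shell
# Report which of the given commands are not on PATH.
# The command names used below are assumptions about what each package installs;
# required versions/revisions are NOT checked.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
    done
}

missing_tools hfst-lexc foma flex apertium lt-comp cg-comp lrx-comp
```

If nothing is printed, all of the listed commands were found.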
Developers
Information on what remains to be done for this pair can be found at the /TODO list.
Development workflow
We work on the monolingual transducers (apertium-kaz and apertium-tat) individually, and use a special process to import into the pair trimmed transducers that contain only the words found in the bidix. The following documents this process.
Adding words
In order to add a new word (and its translation equivalent) to the Kazakh-Tatar translator, you have to do the following:
- add an entry to the bilingual dictionary, the apertium-kaz-tat.kaz-tat.dix file in the trunk/apertium-kaz-tat directory,
- add an entry to the Kazakh monolingual dictionary, the apertium-kaz.kaz.lexc file, which, as the name indicates, is in the languages/apertium-kaz directory,
- run make in languages/apertium-kaz,
- add an entry to the Tatar monolingual dictionary, the apertium-tat.tat.lexc file in languages/apertium-tat,
- run make in languages/apertium-tat,
- cd to apertium-kaz-tat and run make.
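Once the dictionary entries have been added, the three build steps above can be sketched as a small shell helper. This is only an illustration: the function name rebuild_kaz_tat is hypothetical, and it assumes a checkout root that contains languages/ and trunk/ side by side. The MAKE variable is an assumption added so that a different make command can be substituted.

```shell
# Hypothetical helper: run make in both monolingual packages, then in the pair.
# $1 is the checkout root, assumed to contain languages/ and trunk/.
# MAKE may be set to another command; it defaults to make.
rebuild_kaz_tat() {
    root="$1"
    for dir in languages/apertium-kaz languages/apertium-tat trunk/apertium-kaz-tat; do
        echo "==> make in $dir"
        # Stop at the first failing build so the pair is not built on stale files.
        "${MAKE:-make}" -C "$root/$dir" || return 1
    done
}
```

It would be called with the checkout root as its only argument, for example rebuild_kaz_tat "$HOME/apertium", where that path is a placeholder.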
You have to have configured the Kazakh-Tatar translator with the --with-lang1 and --with-lang2 options for the last step to work (see here for more details on this). The build will then fetch the changed files automatically, trim them and compile them.
There is no longer any need to run a special trimmer script and import its output into apertium-kaz-tat manually.
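A configuration step along these lines is sketched below. It assumes that the pair is configured through an autogen.sh script in the apertium-kaz-tat directory and that each option takes the path to the corresponding monolingual package. The function name configure_kaz_tat and the example paths are hypothetical.

```shell
# Hypothetical helper: configure the pair against local monolingual checkouts.
# Run it from the apertium-kaz-tat directory.
# $1: path to apertium-kaz (assumed), $2: path to apertium-tat (assumed).
configure_kaz_tat() {
    kaz_dir="$1"
    tat_dir="$2"
    ./autogen.sh --with-lang1="$kaz_dir" --with-lang2="$tat_dir"
}
```

For the layout described above, a call might look like configure_kaz_tat ../../languages/apertium-kaz ../../languages/apertium-tat, followed by make.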
The same workflow applies to any other pair involving Kazakh or Tatar: if you need to change something in the .lexc, .twol or .rlx files of these languages, you do so in the apertium-kaz and apertium-tat directories respectively, compile the monolingual packages, and then compile the translator.
Adding language-pair-specific stems to the lexc files
Sometimes we have to translate a word into Kazakh or Tatar with two or more words, e.g.:
<e><p><l>everywhere<s n="adv"/></l><r>барлық<b/>жерде<s n="adv"/></r></p></e>
In order to make it work, we need to add барлық жерде as a single adverb in kaz.lexc, like this:
барлық% жерде:барлық% жерде ADV ; ! ""
But these are, of course, two words in Kazakh (a determiner and a noun), and we don't want the standalone Kazakh morphological analyzer to contain them as one word. To record that this entry was added for a specific language pair, we mark it with Use/MT at the end of the line:
барлық% жерде:барлық% жерде ADV ; ! "" Use/MT
That way, we can collaborate on a single file across many pairs, while at the same time being able to "clean up" the lexicon for standalone use if needed.
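As an illustration of such a clean-up, the sketch below prints a lexc file without the entries that end in Use/MT. It is an assumption about how the marker could be used, not a description of how the actual build treats it, and the function name strip_use_mt is hypothetical.

```shell
# Print a lexc file without the entries marked Use/MT at the end of the line.
# A sketch only: the real build may handle these entries differently.
strip_use_mt() {
    grep -v 'Use/MT[[:space:]]*$' "$1"
}
```

For example, strip_use_mt apertium-kaz.kaz.lexc > standalone.lexc would write the filtered lexicon to standalone.lexc, a placeholder file name.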