Difference between revisions of "Kazakh and Tatar"
Firespeaker (talk | contribs) |
Revision as of 23:31, 26 January 2013
This is a language pair translating between Kazakh and Tatar. The pair is currently located in incubator, but it is expected that it will soon be moved to staging.
General information
- The Kazakh transducer has 36,595 stems and ~94.5% coverage over random corpora
- The Tatar transducer has 55,702 stems and ~91% coverage over random corpora
Demonstration
$ echo "бұл аударушымен татарша жазылған тексттер қазақша аударып оқыса болады" | apertium -d . kaz-tat
бу аударучы белән татарча язылган *тексттер казакъча аударып укыша була
(hrm)
Installation
You will need:
- hfst (svn ≥r1916)
- foma
- flex
- apertium
- lttoolbox
Developers
Information on what remains to be done for this pair can be found at the /TODO list.
Development workflow
We work on the transducers (apertium-kaz and apertium-tat) individually, and use a special process to import them into the pair as transducers that contain only the words found in the bidix. The following documents this process.
Adding words
In order to add a new word (and its translation equivalent) to the Kazakh-Tatar translator, you have to do the following:
- add an entry to the bilingual dictionary, the apertium-kaz-tat.kaz-tat.dix file in the incubator/apertium-kaz-tat directory,
- add an entry to the Kazakh monolingual dictionary, the apertium-kaz.kaz.lexc file, which, as the name indicates, is in the incubator/apertium-kaz directory,
- add an entry to the Tatar monolingual dictionary, the apertium-tat.tat.lexc file in incubator/apertium-tat,
- cd to the incubator/apertium-kaz-tat directory in a terminal,
- run the update-morphs.bash script and recompile.
This script runs the trim-lexc.py script, which lives in /trunk/apertium-tools, and copies its output files over to the kaz-tat directory, renaming them to fit the naming conventions.
In addition, it copies .twol and .rlx files from the apertium-kaz and apertium-tat directories.
The same workflow applies to any other pair involving Kazakh or Tatar: if you need to change something in the .lexc, .twol or .rlx files of these languages, you do so in the apertium-kaz and apertium-tat directories respectively, and then copy these files to the "bilingual" directory you are working on (either using the update-morphs.bash script or manually).
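To make the idea of "transducers that contain only the words found in the bidix" more concrete, here is a rough Python sketch of trimming a lexc stem list against a bidix. It is not the actual trim-lexc.py, whose internals are not described here; the function names bidix_lemmas, lexc_lemma and trim_lexc are hypothetical, and the lexc parsing is deliberately simplified (it assumes one "lemma:surface CONT ;" entry per line and ignores escapes other than "% ").

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Optional, Set


def bidix_lemmas(bidix_xml: str, side: str = "l") -> Set[str]:
    """Collect lemmas from one side ("l" or "r") of the bidix entries."""
    root = ET.fromstring(bidix_xml)
    lemmas = set()
    for node in root.iter(side):
        parts = [node.text or ""]
        for child in node:
            if child.tag == "b":       # <b/> stands for a space in the lemma
                parts.append(" ")
            parts.append(child.tail or "")
        lemmas.add("".join(parts).strip())
    return lemmas


def lexc_lemma(line: str) -> Optional[str]:
    """Return the lemma of a 'lemma:surface CONT ;' line, or None otherwise."""
    code = line.split("!", 1)[0].strip()   # drop the trailing comment
    if not code.endswith(";") or ":" not in code:
        return None                        # header, blank line, etc.
    return code.split(":", 1)[0].replace("% ", " ")


def trim_lexc(lines: Iterable[str], keep: Set[str]) -> List[str]:
    """Keep non-stem lines and those stem lines whose lemma is in `keep`."""
    result = []
    for line in lines:
        lemma = lexc_lemma(line)
        if lemma is None or lemma in keep:
            result.append(line)
    return result
```

In this sketch, the set returned by bidix_lemmas for the Kazakh side of the bidix would be passed as keep when trimming the Kazakh lexc file; which side is Kazakh depends on the direction of the bidix.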
Adding language-pair-specific stems to the lexc files
Sometimes we have to translate a word into Kazakh or Tatar with two or more words, e.g.:
<e><p><l>everywhere<s n="adv"/></l><r>барлық<b/>жерде<s n="adv"/></r></p></e>
In order to make it work, we will need to add барлық жерде as a single adverb in kaz.lexc, like this:
барлық% жерде:барлық% жерде ADV ; ! ""
But these are, of course, two words in Kazakh (a determiner and a noun), and we don't want the standalone Kazakh morphological analyzer to contain them as one word. To record that this entry was added for a specific language pair, we mark it with Use/MT at the end of the line:
барлық% жерде:барлық% жерде ADV ; ! "" Use/MT
That way, we can collaborate on a single file across many pairs, while at the same time being able to "clean up" the lexicon for standalone use if needed.
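As an illustration of such a clean-up, the following Python sketch drops every lexc line whose comment carries the Use/MT mark. The helper names is_mt_only and strip_mt_entries are hypothetical and are not part of any existing Apertium tool; the sketch assumes the mark always appears as a separate token after the "!" comment character, as in the example above.

```python
from typing import Iterable, List


def is_mt_only(line: str) -> bool:
    """True if the line's '!' comment contains the Use/MT mark."""
    _, sep, comment = line.partition("!")
    return bool(sep) and "Use/MT" in comment.split()


def strip_mt_entries(lines: Iterable[str]) -> List[str]:
    """Return the lexc lines without the pair-specific (Use/MT) entries."""
    return [line for line in lines if not is_mt_only(line)]
```

Applied to the two variants of the барлық жерде line shown above, only the one without the mark would survive, so a standalone analyzer built from the filtered file would not contain the multiword adverb.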