Difference between revisions of "Kazakh and Tatar"

From Apertium
(Move info about editing kaz.lexc file and using the trimming script to the apertium-kaz/page)
m (Undo revision 43081 by Ilnar.salimzyan (Talk) Actually, let's keep it, instead only refer to it from the apertium-kaz's page)

Revision as of 20:10, 9 August 2013

This is a language pair translating between Kazakh and Tatar. The pair is currently located in trunk.

General information

Demonstration

  • $ echo "бұл аударушымен татарша жазылған мәтіндер қазақша аударып оқыса болады" | apertium -d . kaz-tat
бу аударучы белән татарча язылган текстләр казакъча аударып укыша була

Installation

You will need:

Developers

Information on what remains to be done for this pair can be found at the /TODO list.

Development workflow

We develop the monolingual transducers (apertium-kaz and apertium-tat) separately, and use a dedicated process to import trimmed copies into the pair: transducers that contain only the words found in the bidix. This process is documented below.

Adding words

To add a new word (and its translation equivalent) to the Kazakh-Tatar translator, do the following:

  1. add an entry to the bilingual dictionary: the apertium-kaz-tat.kaz-tat.dix file in the staging/apertium-kaz-tat directory,
  2. add an entry to the Kazakh monolingual dictionary: the apertium-kaz.kaz.lexc file, which, as the name indicates, is in the incubator/apertium-kaz directory,
  3. add an entry to the Tatar monolingual dictionary: the apertium-tat.tat.lexc file in incubator/apertium-tat,
  4. cd to the staging/apertium-kaz-tat directory in a terminal,
  5. run the update-morphs.bash script and recompile.

The update-morphs.bash script runs trim-lexc.py, which is located in /trunk/apertium-tools, and copies its output into the kaz-tat directory, renaming the files to fit the pair's naming conventions.

In addition, it copies the .twol and .rlx files from the apertium-kaz and apertium-tat directories.

The same workflow applies to any other pair involving Kazakh or Tatar: if you need to change something in the .lexc, .twol or .rlx files of these languages, make the change in the apertium-kaz or apertium-tat directory respectively, and then copy the files to the "bilingual" directory you are working on (either with the update-morphs.bash script or manually).
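
To illustrate what "only the words found in the bidix" means, here is a minimal Python sketch of the trimming idea. It is not the actual trim-lexc.py, whose internals are not described on this page. The function names bidix_lemmas and trim_stem_lines are hypothetical, the sketch assumes every stem entry has the form lemma:stem CLASS ; and the алма entry is made up for the example.

```python
import xml.etree.ElementTree as ET


def bidix_lemmas(bidix_xml, side="l"):
    """Collect lemmas from one side ("l" or "r") of a bidix fragment.

    <b/> is read as a space; <s n="..."/> tags are ignored.
    """
    root = ET.fromstring(bidix_xml)
    lemmas = set()
    for node in root.iter(side):
        parts = [node.text or ""]
        for child in node:
            if child.tag == "b":
                parts.append(" ")
            parts.append(child.tail or "")
        lemmas.add("".join(parts).strip())
    return lemmas


def trim_stem_lines(stem_lines, lemmas):
    """Keep only stem entries whose lemma occurs in `lemmas`.

    Assumption: `stem_lines` come from a stem lexicon only. Lines that do
    not look like "lemma:stem CLASS ;" (blank lines, comments) are kept.
    """
    kept = []
    for line in stem_lines:
        entry = line.split("!", 1)[0]  # drop the trailing comment
        if ":" not in entry or ";" not in entry:
            kept.append(line)
            continue
        # "% " is an escaped space in lexc
        lemma = entry.split(":", 1)[0].strip().replace("% ", " ")
        if lemma in lemmas:
            kept.append(line)
    return kept


bidix = ('<section><e><p><l>everywhere<s n="adv"/></l>'
         '<r>барлық<b/>жерде<s n="adv"/></r></p></e></section>')
stems = [
    'барлық% жерде:барлық% жерде ADV ; ! "" Use/MT\n',
    'алма:алма N1 ; ! "apple"\n',  # hypothetical entry, absent from the bidix
]
trimmed = trim_stem_lines(stems, bidix_lemmas(bidix, side="r"))
```

In this sketch only the first stem line survives, because алма has no counterpart in the bidix fragment. A real trimmer also has to deal with affix lexicons and other lexc constructs that are ignored here.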

Adding language-pair-specific stems to the lexc files

Sometimes a single word has to be translated into Kazakh or Tatar with two or more words, e.g.:

<e><p><l>everywhere<s n="adv"/></l><r>барлық<b/>жерде<s n="adv"/></r></p></e>

To make this work, we need to add барлық жерде as a single adverb in kaz.lexc, like this:

барлық% жерде:барлық% жерде ADV ; ! ""

However, these are clearly two words in Kazakh (a determiner and a noun), and we don't want the standalone Kazakh morphological analyzer to contain them as one word. To indicate that the entry was added for a specific language pair, we mark it with Use/MT at the end of the line:

барлық% жерде:барлық% жерде ADV ; ! "" Use/MT
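
The entries above differ from the bidix entry only in how the space is written: the bidix uses <b/>, while lexc escapes the space with %. A minimal Python sketch of that correspondence follows; the helper names lexc_entry and bidix_side are hypothetical, and only spaces are escaped (other characters that are special in lexc are not handled).

```python
def lexc_entry(lemma, cont_class, pair_specific=False):
    """Build a 'lemma:stem CLASS ; ! ""' line, escaping spaces as '% '."""
    escaped = lemma.replace(" ", "% ")
    line = f'{escaped}:{escaped} {cont_class} ; ! ""'
    if pair_specific:
        # Mark the entry as added for a language pair
        line += " Use/MT"
    return line


def bidix_side(lemma, pos):
    """Build the content of an <l> or <r> element, writing spaces as <b/>."""
    return lemma.replace(" ", "<b/>") + f'<s n="{pos}"/>'


entry = lexc_entry("барлық жерде", "ADV", pair_specific=True)
side = bidix_side("барлық жерде", "adv")
```

With these inputs, entry is identical to the Use/MT line shown above, and side matches the content of the <r> element in the bidix example.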

That way, we can collaborate on a single file across many pairs, but at the same time we are able to "clean up" the lexicon for standalone use if needed.
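
Such a "clean up" could look roughly like the following Python sketch, which drops every line whose trailing comment contains Use/MT. The function name strip_pair_specific is hypothetical, and the sketch assumes that the first ! on a line starts the comment (an escaped %! inside an entry would break this assumption). The алма entry is made up for the example.

```python
def strip_pair_specific(lexc_text):
    """Return lexc_text without the lines marked Use/MT in their comment."""
    kept = []
    for line in lexc_text.splitlines(keepends=True):
        # Everything after the first '!' is treated as the comment
        _, sep, comment = line.partition("!")
        if sep and "Use/MT" in comment:
            continue
        kept.append(line)
    return "".join(kept)


lexicon = (
    'барлық% жерде:барлық% жерде ADV ; ! "" Use/MT\n'
    'алма:алма N1 ; ! "apple"\n'  # hypothetical ordinary entry
)
standalone = strip_pair_specific(lexicon)
```

Here standalone keeps only the алма line, so the pair-specific multiword adverb does not end up in the standalone analyzer.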