Difference between revisions of "Kazakh and Tatar"

From Apertium

Revision as of 23:32, 26 January 2013

This is a language pair translating between Kazakh and Tatar. The pair is currently located in incubator, but it is expected that it will soon be moved to staging.

General information

Demonstration

  • $ echo "бұл аударушымен татарша жазылған тексттер қазақша аударып оқыса болады" | apertium -d . kaz-tat
бу аударучы белән татарча язылган *тексттер казакъча аударып укыша була (hrm)

Installation

You will need:

  • hfst (svn ≥r1916)
    • foma
      • flex
  • apertium
    • lttoolbox

Developers

Information on what remains to be done for this pair can be found at the /TODO list.

Development workflow

We work on the monolingual transducers (apertium-kaz and apertium-tat) separately, and use a special process to import trimmed copies into the pair, containing only the words found in the bidix. This process is documented below.

Adding words

In order to add a new word (and its translation equivalent) to the Kazakh-Tatar translator, you have to do the following:

  1. add an entry to the bilingual dictionary, the apertium-kaz-tat.kaz-tat.dix file in the incubator/apertium-kaz-tat directory,
  2. add an entry to the Kazakh monolingual dictionary, the apertium-kaz.kaz.lexc file, which, as the name indicates, is in the incubator/apertium-kaz directory,
  3. add an entry to the Tatar monolingual dictionary, the apertium-tat.tat.lexc file in incubator/apertium-tat,
  4. cd to the incubator/apertium-kaz-tat directory in a terminal,
  5. run the update-morphs.bash script and recompile.

This script runs the trim-lexc.py script, which is located in /trunk/apertium-tools, and copies its output files over to the kaz-tat directory, renaming them to fit the conventions.

In addition, it copies .twol and .rlx files from the apertium-kaz and apertium-tat directories.

The same workflow applies to any other pair involving Kazakh or Tatar: if you need to change something in the .lexc, .twol or .rlx files of these languages, you do so in the apertium-kaz and apertium-tat directories respectively, and then copy these files to the "bilingual" directory you are working on (either with the update-morphs.bash script or manually).
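
The idea of keeping only the words found in the bidix can be illustrated with a short Python sketch. This is not the code of trim-lexc.py, whose actual behaviour is not documented here; the function names bidix_lemmas and keep_bidix_stems are hypothetical, and the sketch assumes stem lines of the simple "lemma:stem CLASS ;" form, with spaces in the lemma escaped as "% ".

```python
import xml.etree.ElementTree as ET


def bidix_lemmas(dix_xml, side="l"):
    """Collect the lemmas on one side ('l' or 'r') of the bidix entries.

    Hypothetical helper for illustration; not part of apertium-tools.
    """
    root = ET.fromstring(dix_xml)
    lemmas = set()
    for elem in root.iter(side):
        parts = [elem.text or ""]
        for child in elem:
            if child.tag == "b":
                # <b/> stands for a space inside a multiword lemma
                parts.append(" ")
            # text following a child element still belongs to the lemma
            parts.append(child.tail or "")
        lemmas.add("".join(parts).strip())
    return lemmas


def keep_bidix_stems(lexc_lines, lemmas):
    """Keep stem lines whose lemma is in `lemmas`; pass other lines through.

    Assumption: a stem line looks like 'lemma:stem CLASS ;' and lines
    without a colon (e.g. LEXICON headers) are kept unchanged.
    """
    kept = []
    for line in lexc_lines:
        if ":" not in line:
            kept.append(line)
            continue
        # undo the lexc escaping of spaces before comparing with the bidix
        lemma = line.split(":", 1)[0].replace("% ", " ")
        if lemma in lemmas:
            kept.append(line)
    return kept
```

A real trimming step would also have to handle continuation lexicons, comments and other entry shapes, which this sketch deliberately ignores.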

Adding language-pair-specific stems to the lexc files

Sometimes we have to translate a word into Kazakh or Tatar with two or more words, e.g.:

<e><p><l>everywhere<s n="adv"/></l><r>барлық<b/>жерде<s n="adv"/></r></p></e>

In order to make it work, we will need to add барлық жерде as a single adverb in kaz.lexc, like this:

барлық% жерде:барлық% жерде ADV ; ! ""
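
Note that the space inside the lemma is escaped with a percent sign. As a minimal sketch, assuming the simple "lemma:stem CLASS ;" entry shape shown above, such a line could be produced by a small helper; the name lexc_stem_entry is hypothetical and not part of any Apertium tool.

```python
def lexc_stem_entry(lemma, cont_class, gloss=""):
    """Build a lexc stem line, escaping spaces in the lemma as '% '.

    Hypothetical helper; only covers the entry shape used on this page.
    """
    escaped = lemma.replace(" ", "% ")
    return f'{escaped}:{escaped} {cont_class} ; ! "{gloss}"'


print(lexc_stem_entry("барлық жерде", "ADV"))
```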

However, these are really two words in Kazakh (a determiner and a noun), and we don't want the standalone Kazakh morphological analyzer to treat them as one word. To record that this entry was added for a specific language pair, we mark it with Use/MT at the end of the line:

барлық% жерде:барлық% жерде ADV ; ! "" Use/MT

That way, we can collaborate on a single file across many pairs, while still being able to "clean up" the lexicon for standalone use if needed.
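
Such a clean-up could be as simple as dropping every line that ends with the Use/MT mark. The following is a minimal sketch under that assumption; the function name strip_use_mt is hypothetical and this is not an existing Apertium script.

```python
def strip_use_mt(lexc_lines):
    """Return the lexc lines that do not end with the Use/MT mark.

    Assumption: the mark is always the last token on the line,
    as in the example above.
    """
    return [line for line in lexc_lines
            if not line.rstrip().endswith("Use/MT")]
```

Applied to the two example lines on this page, it would keep the unmarked entry and drop the one marked with Use/MT.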