Difference between revisions of "Kazakh and Tatar"

m (→‎Adding words: kaz-tat moved to staging -- corrected pathnames to reflect this change)
(Directories)
 
(22 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{TOCD}}
{{TOCD}}
This is a language pair translating between [[Kazakh]] and [[Tatar]]. The pair is currently located in [https://apertium.svn.sourceforge.net/svnroot/apertium/incubator/apertium-kaz-tat/ incubator], but it is expected that it will soon be moved to staging.
This is a language pair translating between [[Kazakh]] and [[Tatar]]. The pair is currently located in [https://github.com/apertium/apertium-kaz-tat trunk].


== General information ==
== General information ==
Line 7: Line 7:


=== Demonstration ===
=== Demonstration ===
*<code>$ echo "бұл аударушымен татарша жазылған мәтіндер қазақша аударып оқыса болады" | apertium -d . kaz-tat</code>
*<code>$ echo "Бұл аударушымен қазақша жазылған мәтіндерді татаршаға аударып оқуға болады." | apertium -d . kaz-tat</code>
: <code>бу аударучы белән татарча язылган текстләр казакъча аударып укыша була</code>
: <code>Бу тәрҗемәче белән казакъча язылган текстларны татарчага тәрҗемә итеп укуга була.</code>


== Installation ==
== Installation ==
You will need:
You will need:
* hfst (svn ≥r1916)
* [[hfst]] (svn ≥r1916)
** foma
** foma
*** flex
*** flex
* apertium
* [[Minimal_installation_from_SVN|apertium]]
** lttoolbox
** lttoolbox (svn ≥r46087)
* [[CG]]
** [http://icu-project.org/download/ ICU]
* [[Constraint-based lexical selection module|apertium-lex-tools]]
* apertium-kaz and apertium-tat

If you are using a Debian-based distro, the easiest way to get those dependencies is to install them with apt-get from [[User:Tino Didriksen]]'s [[Prerequisites for Debian|repository]].


== Developers ==
== Developers ==
Line 22: Line 28:


=== Development workflow ===
=== Development workflow ===

We work on the transducers ([[apertium-kaz]] and [[apertium-tat]]) individually, and use a special process to import to the pair transducers that contain only the words found in the bidix. The following documents this process.
We work on the transducers ([[apertium-kaz]] and [[apertium-tat]]) individually, and use a special process to import to the pair transducers that contain only the words found in the bidix. The following documents this process.


==== Adding words ====
==== Adding words ====
In order to add a new word (and its translation equivalent) to the Kazakh-Tatar translator, you have to do the following:
In order to add a new word (and its translation equivalent) to the Kazakh-Tatar translator, you have to do the following:
# add an entry in the bilingual dictionary — <code>apertium-kaz-tat.kaz-tat.dix</code> file in <code>staging/apertium-kaz-tat</code> directory,
# add an entry in the bilingual dictionary — <code>apertium-kaz-tat.kaz-tat.dix</code> file in <code>apertium-kaz-tat</code> directory,
# add an entry in the Kazakh monolingual dictionary — <code>apertium-kaz.kaz.lexc</code> file, which, as the name indicates, is in the <code>incubator/apertium-kaz</code> directory,
# add an entry in the Kazakh monolingual dictionary — <code>apertium-kaz.kaz.lexc</code> file, which, as the name indicates, is in the <code>apertium-kaz</code> directory,
# add an entry in the Tatar monolingual dictionary — <code>apertium-tat.tat.lexc</code> file in <code>incubator/apertium-tat</code>,
# run <code>make</code> in <code>apertium-kaz</code>
# <code>cd</code> to the <code>staging/apertium-kaz-tat</code> directory in terminal,
# add an entry in the Tatar monolingual dictionary — <code>apertium-tat.tat.lexc</code> file in <code>apertium-tat</code>,
# run <code>update-morphs.bash</code> script and recompile.
# run <code>make</code> in <code>apertium-tat</code>
# <code>cd</code> to <code>apertium-kaz-tat</code> and run <code>make</code>.


You have to have configured Kazakh-Tatar translator with the <code>--with-lang1</code> and <code>--with-lang2</code> options for the last step to work (see [[Minimal installation from SVN#For language pairs that depend on monolingual packages (apertium-XYZ)|here]] for more details on this). It will fetch changed files automatically, trim them and compile them.
This script runs the <code>trim-lexc.py</code> script, which itself lies in <code>/trunk/apertium-tools</code>, and copies its output over to the kaz-tat directory, renaming the files to fit the conventions.


There is '''no need''' anymore to run a special trimmer script and to import its output into <code>apertium-kaz-tat</code> manually.
In addition, it copies .twol and .rlx files from the apertium-kaz and apertium-tat directories.


The same workflow applies for any other pair involving Kazakh or Tatar — if you need to change something in .lexc, .twol or .rlx files of these languages, you do so in apertium-kaz and apertium-tat directories respectively, and then copy these files to the "bilingual" directory you are working on (either using update-morphs.bash script or manually).
The same workflow applies for any other pair involving Kazakh or Tatar — if you need to change something in .lexc, .twol or .rlx files of these languages, you do so in apertium-kaz and apertium-tat directories respectively, compile monolingual packages, and then compile the translator.


==== Adding language-pair-specific stems to the lexc files ====
==== Adding language-pair-specific stems to the lexc files ====

Latest revision as of 01:53, 10 March 2018

This is a language pair translating between Kazakh and Tatar. The pair is currently located in trunk.

General information

Demonstration

  • $ echo "Бұл аударушымен қазақша жазылған мәтіндерді татаршаға аударып оқуға болады." | apertium -d . kaz-tat
Бу тәрҗемәче белән казакъча язылган текстларны татарчага тәрҗемә итеп укуга була.

Installation

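You will need:

  • hfst (svn ≥r1916)
      • foma
          • flex
  • apertium
      • lttoolbox (svn ≥r46087)
  • CG
      • ICU
  • apertium-lex-tools
  • apertium-kaz and apertium-tat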

If you are using a Debian-based distro, the easiest way to get these dependencies is to install them with apt-get from Tino Didriksen's repository.

Developers

Information on what remains to be done for this pair can be found at the /TODO list.

Development workflow

We work on the monolingual transducers (apertium-kaz and apertium-tat) separately. The pair then imports trimmed versions of them that contain only the words found in the bilingual dictionary (bidix). The process is documented below.

Adding words

In order to add a new word (and its translation equivalent) to the Kazakh-Tatar translator, you have to do the following:

  1. add an entry to the bilingual dictionary, the apertium-kaz-tat.kaz-tat.dix file in the apertium-kaz-tat directory,
  2. add an entry to the Kazakh monolingual dictionary, the apertium-kaz.kaz.lexc file in the apertium-kaz directory,
  3. run make in apertium-kaz,
  4. add an entry to the Tatar monolingual dictionary, the apertium-tat.tat.lexc file in the apertium-tat directory,
  5. run make in apertium-tat,
  6. cd to apertium-kaz-tat and run make.

The Kazakh-Tatar translator must have been configured with the --with-lang1 and --with-lang2 options for the last step to work (see "Minimal installation from SVN", section "For language pairs that depend on monolingual packages (apertium-XYZ)", for details). The build then fetches the changed files automatically, trims them, and compiles them.
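The order of the steps above can be sketched in Python. This is only an illustration, not part of the pair's build system: the helper names (build_steps, configure_command, rebuild) are hypothetical, and it assumes that the three directories are sibling checkouts under one root and that the pair is configured through ./autogen.sh with relative paths.

```python
import subprocess
from pathlib import Path


def configure_command():
    # Assumption: apertium-kaz and apertium-tat are checked out next to
    # apertium-kaz-tat, so relative paths can be passed to the two options.
    return [
        "./autogen.sh",
        "--with-lang1=../apertium-kaz",
        "--with-lang2=../apertium-tat",
    ]


def build_steps(root):
    """Return (directory, command) pairs in the order used by the workflow."""
    root = Path(root)
    return [
        (root / "apertium-kaz", ["make"]),      # rebuild the Kazakh transducer
        (root / "apertium-tat", ["make"]),      # rebuild the Tatar transducer
        (root / "apertium-kaz-tat", ["make"]),  # trim and compile the pair
    ]


def rebuild(root, dry_run=True):
    """Print each step; run it as well when dry_run is False."""
    for directory, command in build_steps(root):
        print(f"(cd {directory} && {' '.join(command)})")
        if not dry_run:
            # check=True stops at the first failing make.
            subprocess.run(command, cwd=directory, check=True)
```

Calling rebuild(".") only prints the commands; rebuild(".", dry_run=False) runs them and stops at the first failure, so a broken monolingual build is noticed before the pair is compiled.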

It is no longer necessary to run a special trimmer script and import its output into apertium-kaz-tat manually.

The same workflow applies to any other pair involving Kazakh or Tatar: if you need to change something in the .lexc, .twol or .rlx files of these languages, make the change in the apertium-kaz or apertium-tat directory respectively, compile the monolingual package, and then compile the translator.

Adding language-pair-specific stems to the lexc files

Sometimes we have to translate a word into Kazakh or Tatar with two or more words, e.g.:

<e><p><l>everywhere<s n="adv"/></l><r>барлық<b/>жерде<s n="adv"/></r></p></e>
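As an illustration of the entry format, the following Python sketch builds such an entry from plain strings, writing each space inside a lemma as a <b/> element. The helper names side and bidix_entry are hypothetical, and the sketch assumes that both sides carry the same single tag.

```python
from xml.sax.saxutils import escape


def side(tag, lemma, pos):
    # A space inside a lemma is written as <b/> in the bilingual dictionary.
    words = "<b/>".join(escape(word) for word in lemma.split(" "))
    return f'<{tag}>{words}<s n="{pos}"/></{tag}>'


def bidix_entry(left, right, pos):
    # Assumption: left and right side share one part-of-speech tag.
    return f"<e><p>{side('l', left, pos)}{side('r', right, pos)}</p></e>"


print(bidix_entry("everywhere", "барлық жерде", "adv"))
```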

To make this work, we need to add барлық жерде as a single adverb in kaz.lexc, like this:

барлық% жерде:барлық% жерде ADV ; ! ""

However, these are two words in Kazakh (a determiner and a noun), and we do not want the standalone Kazakh morphological analyzer to treat them as one word. To record that this entry was added for a specific language pair, we mark it with Use/MT at the end of the line:

барлық% жерде:барлық% жерде ADV ; ! "" Use/MT
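The two forms of the line can be produced by a small Python helper. This is a sketch under assumptions: lexc_entry is a hypothetical name, and the lexical and surface sides of the entry are taken to be identical, as in the example above. The space inside the stem is escaped with %.

```python
def lexc_entry(lemma, cont_class, pair_specific=False):
    # In lexc a literal space inside a stem is escaped as "% ".
    escaped = lemma.replace(" ", "% ")
    line = f'{escaped}:{escaped} {cont_class} ; ! ""'
    if pair_specific:
        # Mark the entry as added for a specific language pair.
        line += " Use/MT"
    return line


print(lexc_entry("барлық жерде", "ADV"))
print(lexc_entry("барлық жерде", "ADV", pair_specific=True))
```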

That way, we can collaborate on a single file across many pairs, and at the same time we can "clean up" the lexicon for standalone use if needed.
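One way such a clean-up could be done is sketched below in Python. The function strip_pair_specific is hypothetical and is not an existing Apertium tool. It drops every line whose comment (the part after the first !) contains Use/MT, and it does not handle an escaped %! inside a stem.

```python
def strip_pair_specific(lexc_text, marker="Use/MT"):
    """Return lexc_text without the lines whose comment contains marker."""
    kept = []
    for line in lexc_text.splitlines():
        # Everything after the first "!" is treated as the comment.
        _, found, comment = line.partition("!")
        if found and marker in comment:
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "\n".join([
    'барлық% жерде:барлық% жерде ADV ; ! "" Use/MT',
    'жер:жер N1 ; ! "place"',  # hypothetical entry, for illustration only
])
print(strip_pair_specific(sample))
```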