Difference between revisions of "Kazakh and Tatar"

Latest revision as of 01:53, 10 March 2018

General information

The Kazakh transducer has 36,595 stems and ~94.5% coverage over random corpora
The Tatar transducer has 55,702 stems and ~91% coverage over random corpora

Demonstration

$ echo "Бұл аударушымен қазақша жазылған мәтіндерді татаршаға аударып оқуға болады." | apertium -d . kaz-tat

Бу тәрҗемәче белән казакъча язылган текстларны татарчага тәрҗемә итеп укуга була.

Installation

You will need:

- hfst (svn ≥r1916)
  - foma
    - flex
- apertium
  - lttoolbox (svn ≥r46087)
- CG
  - ICU
- apertium-lex-tools
- apertium-kaz and apertium-tat

If you are using a Debian-based distro, the easiest way to get these dependencies is to install them with apt-get from Tino Didriksen's repository.

Developers

Information on what remains to be done for this pair can be found at the /TODO list.

Development workflow

We work on the transducers (apertium-kaz and apertium-tat) individually, and use a special process to import into the pair trimmed transducers that contain only the words found in the bidix. The following documents this process.

Adding words

In order to add a new word (and its translation equivalent) to the Kazakh-Tatar translator, you have to do the following:

1. Add an entry to the bilingual dictionary, the apertium-kaz-tat.kaz-tat.dix file in the apertium-kaz-tat directory.
2. Add an entry to the Kazakh monolingual dictionary, the apertium-kaz.kaz.lexc file, which, as the name indicates, is in the apertium-kaz directory.
3. Run make in apertium-kaz.
4. Add an entry to the Tatar monolingual dictionary, the apertium-tat.tat.lexc file in apertium-tat.
5. Run make in apertium-tat.
6. cd to apertium-kaz-tat and run make.

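The Kazakh-Tatar translator must have been configured with the --with-lang1 and --with-lang2 options for the last step to work (see "Minimal installation from SVN", section "For language pairs that depend on monolingual packages (apertium-XYZ)", for more details). Running make in apertium-kaz-tat then fetches the changed files automatically, trims them and compiles them.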

There is no longer any need to run a special trimmer script and import its output into apertium-kaz-tat manually.

The same workflow applies to any other pair involving Kazakh or Tatar: if you need to change something in the .lexc, .twol or .rlx files of these languages, you do so in the apertium-kaz and apertium-tat directories respectively, compile the monolingual packages, and then compile the translator.
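
As a rough sketch of the build order described above, the three make runs can be wrapped in a small shell function. The function name rebuild_kaz_tat, the MAKE override and the assumption that the three checkouts sit side by side in one parent directory are illustrative only and not part of the packages; the autogen.sh line in the comment is likewise an assumed one-time setup with example paths.

```shell
#!/bin/sh
# Assumed one-time setup, run inside apertium-kaz-tat (paths are examples):
#   ./autogen.sh --with-lang1=../apertium-kaz --with-lang2=../apertium-tat

# Rebuild both monolingual packages first, then the pair that depends on them.
# $1 is the parent directory holding the three checkouts (default: current dir).
rebuild_kaz_tat() {
    base=${1:-.}
    make_cmd=${MAKE:-make}   # overridable, e.g. MAKE=true for a dry run
    for dir in apertium-kaz apertium-tat apertium-kaz-tat; do
        echo "==> $dir"
        # Stop at the first directory that is missing or fails to build.
        (cd "$base/$dir" && $make_cmd) || return 1
    done
}
```

After editing the dictionaries, calling rebuild_kaz_tat with the parent directory of the checkouts runs the builds in the required order. The function stops at the first failure, so a broken monolingual package is never trimmed into the pair.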

Adding language-pair-specific stems to the lexc files

Sometimes we have to translate a word into Kazakh or Tatar with two or more words, e.g.:

<e><p><l>everywhere<s n="adv"/></l><r>барлық<b/>жерде<s n="adv"/></r></p></e>

In order to make it work, we will need to add барлық жерде as a single adverb in kaz.lexc, like this:

барлық% жерде:барлық% жерде ADV ; ! ""

But these are, of course, two words in Kazakh (a determiner and a noun), and we don't want the standalone Kazakh morphological analyzer to contain them as one word. To record that this entry was added for a specific language pair, we mark it with Use/MT at the end of the line:

барлық% жерде:барлық% жерде ADV ; ! "" Use/MT

That way, we can collaborate on a single file across many pairs, while at the same time being able to "clean up" the lexicon for standalone use if needed.
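
One possible way to do such a clean-up, given here only as a sketch, is to filter out the marked lines. The function name strip_use_mt is hypothetical, and the filter assumes that the Use/MT marker always sits in the comment after "!" on the same line as the entry it marks.

```shell
#!/bin/sh
# Print a lexc file without the entries marked for machine translation only.
# Assumption: the marker appears after the "!" comment sign on the entry's own line.
strip_use_mt() {
    sed '/!.*Use\/MT/d' "$1"
}
```

For example, strip_use_mt apertium-kaz.kaz.lexc prints the lexicon without the pair-specific multiword entries, and the output can be redirected to a separate file for standalone use.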

@@ Line 1: / Line 1: @@
+{{TOCD}}
-This is a language pair translating between [[Kazakh]] and [[Tatar]].
+This is a language pair translating between [[Kazakh]] and [[Tatar]]. The pair is currently located in [https://github.com/apertium/apertium-kaz-tat trunk].
-== General TODO ==
+== General information ==
+* The [[Apertium-kaz|Kazakh transducer]] has {{#lst:apertium-kaz/stats|stems}} stems and [[apertium-kaz/stats|~{{:apertium-kaz/stats/average}}%]] coverage over random corpora
+* The [[Apertium-tat|Tatar transducer]] has {{#lst:apertium-tat/stats|stems}} stems and [[apertium-tat/stats|~{{:apertium-tat/stats/average}}%]] coverage over random corpora
+=== Demonstration ===
-See [[/Work_plan]].
+*<code>$ echo "Бұл аударушымен қазақша жазылған мәтіндерді татаршаға аударып оқуға болады." | apertium -d . kaz-tat</code>
+: <code>Бу тәрҗемәче белән казакъча язылган текстларны татарчага тәрҗемә итеп укуга була.</code>
+== Installation ==
-# Declination of Tatar nouns ending with -и.
+You will need:
-# <s>Set up <code>bidix-with-context.sh</code> script (see <code>apertium-kaz-tat/dev/bidix</code>; seems to be very useful, requires another script from spectie)</s>.
+* [[hfst]] (svn ≥r1916)
-# <s>Add some of the short wikipedia-article-like texts I have for evaluation into <code>texts</code> (should be ~200 words).</s>
+** foma
-# Implement cont. class for compound/multiword nouns which already have possessive ending (<px3sp>), e.g. ''Қытай Халық Республикасы''.
+*** flex
-## This continuation class should link only to CASE (but consider that some of them can have plural form: ''ишегаллары'').
+* [[Minimal_installation_from_SVN|apertium]]
-# Add "ярты", "ярым" and "чирек" as numerals, but don't link them to common numerals cont. class.
+** lttoolbox (svn ≥r46087)
-# (Lexical selection rule): ''сондай-ақ'' > ''шулай-ук''
+* [[CG]]
-# <s>Fix roman numerals:
+** [http://icu-project.org/download/ ICU]
-## add them to tat.lexc too;
+* [[Constraint-based lexical selection module|apertium-lex-tools]]
-## change <code>LEXICON NUM-ROMAN</code> to something like this: <code>%<num%>%<ord%>: # ; </code>.</s>
+* apertium-kaz and apertium-tat
-# Add transfer rule(s) to handle instrumental case of all parts-of-speech which are subject to substantivation, not only of nouns (this is one of the things which make testvoc results look bad)
-# A separate cont.class for verbs which have causative forms ending with -дыр/-дер
-# A "location-cases" cont. classes for some of the postpositions and location adverbs (e.g. "бире")
-# '<code>Natinfl</code> cont. class in tat.lexc
-# Fix "дыр<mod_ind>" thing (it doesn't pass bidix right now)
-# '''Pronouns'''
-## check cont. classes (note: if it looks like an overgeneration, and me is not sure about it, overgenerate in both lexc's)
-## translate pronouns from kaz.lexc, add them to bidix and add equivalents into tat.lexc
-## ^нигез/ни<prn><itg><px2pl><nom>
-# '''Determiners'''
-## "unify" cont. classes and tags
-## add stems
-# '''Adjectives'''
-## personal clitics after adjectives are not implemented yet
-# '''Translating between classes'''
-##
-# '''General stuff'''
-## Copula suffixes: have for kazakh -ø = p3.sp, and for tatar -ø = p3.sp and -lAr = p3.pl. When translating kaz->tat there is no problem, when translating tat->kaz, pl->sp.
+If you are using a Debian-based distro, the easiest way to get those dependencies is to install them with apt-get from [[User:Tino Didriksen]]'s [[Prerequisites for Debian|repository]].
-=== Twol related stuff ===
+== Developers ==
-# <s>Current: <code>^миллион<num><subst><dat>$ --> миллионге</code> Should be: <code>^миллион<num><subst><dat>$ --> миллионға</code></s>
+Information on what remains to be done for this pair can be found at the [[/TODO]] list.
-# <s>Current: <code>^сөйле<v><tv><coop><ger_past><loc>$ --> сөйлесгенде</code> Should be: <code>^сөйле<v><tv><coop><ger_past><loc>$ --> сөйлескенде</code></s>
-# Deletion of soft sign "ь" before vowels in Tatar (see comments at the end of the <code>apertium-tat/apertium-tat.tat.twol</code> file)
-# <s>Kazakh: <code>^ойна<v><tv><ifi><p1><pl>$ --> ойнадык</code> Should be: ''ойнадыҚ''</s>
-# (tat) ''*аенда'', generating ''музейе'' instead of correct ''музее''
-# *жатқандығын
-# безнекенеме (accusative case before clitics); безнекенгәме
-# <s>*журналистерді - *журналистеріне - *журналистерді
-#* something like <tt>т:0 <=> :с/:0 _ %{L%}:/:0</tt></s> (<tt>r40597</tt>)
-# <s>(kaz) *Назарбаевтың</s> (<tt>r40594</tt>)
-# АКШ-*тың  НАТО-*ның
-:: This is a problem with lexc, not twol —[[User:Firespeaker|Firespeaker]] 06:15, 20 August 2012 (UTC)
-# <s>(kaz) *организмдер / организм<n><pl><nom> = организмдар</s> (<tt>r40597</tt>)
-# (kaz) процесс, процесі/процессі, процесінің/процессінің
-# <s>(kaz) автомобиль<n><attr>, *автомобильдер // автомобиль<n><pl><nom> = автомобильлер
-:: As far as I can tell, автомобильдер is the most common form.  The form автомобилдер also seems to be used, but doesn't look super formal, and автомобильлер seems to only be attested in "Kazakh" because Nissan seems to like to write in Noğay for its Kazakh-speaking audience. —[[User:Firespeaker|Firespeaker]] 06:10, 20 August 2012 (UTC)
-::: The thing is that the form we are generating is автомобильлер. - [[User:Francis Tyers|Francis Tyers]] 07:08, 20 August 2012 (UTC)</s> (<tt>r40704</tt>)
-# <s>^организмге/*организмге$ *организмнің *организмнен</s> (<tt>r40705</tt>)
-# <s>*тарихынан *тарихы ...</s> (≤<tt>r40705</tt>)
-# (tat) generates ''укыу''
-# (tat) generates ''айендә''
-=== International vocabulary ===
+=== Development workflow ===
+We work on the transducers ([[apertium-kaz]] and [[apertium-tat]]) individually, and use a special process to import to the pair transducers that contain only the words found in the bidix. The following documents this process.
-<pre>
-*терроризмге  *массивіндегі  *террорлық *Факті
-*кодекстің *терроризмге  *Полицейлер *журналистерді  «*АНТИТЕРРОРЛЫҚ »
-*полицейлер  *антитеррорлық   *режим   *полицейлер   *журналистерді
-*автоматты  *автобустар   *полицейлер   *журналистеріне  *сайттың
+==== Adding words ====
-*технологиялар *компьютер  *мобильді *техникаларға *интернет *объектілерін
+In order to add a new word (and its translation equivalent) to the Kazakh-Tatar translator, you have to do the following:
-*радиациялық   *сантехник *проблемасы *веб-*сайттар *позитивті *алгебра
+# add an entry in the bilingual dictionary — <code>apertium-kaz-tat.kaz-tat.dix</code> file in <code>apertium-kaz-tat</code> directory,
+# add an entry in the Kazakh monolingual dictionary — <code>apertium-kaz.kaz.lexc</code> file, which, as the name indicates, is in the  <code>apertium-kaz</code> directory,
+# run <code>make</code> in <code>apertium-kaz</code>
+# add an entry in the Tatar monolingual dictionary — <code>apertium-tat.tat.lexc</code> file in <code>apertium-tat</code>,
+# run <code>make</code> in <code>apertium-tat</code>
+# <code>cd</code> to <code>apertium-kaz-tat</code> and run <code>make</code>.
+You have to have configured Kazakh-Tatar translator with the <code>--with-lang1</code> and <code>--with-lang2</code> options for the last step to work (see [[Minimal installation from SVN#For language pairs that depend on monolingual packages (apertium-XYZ)|here]] for more details on this). It will fetch changed files automatically, trim them and compile them.
-*коалициялық
+There is '''no need''' anymore to run a special trimmer script and to import its output into <code>apertium-kaz-tat</code> manually.
-*иммиграциялық *дипломатиялық *стратегиялық
+The same workflow applies for any other pair involving Kazakh or Tatar — if you need to change something in .lexc, .twol or .rlx files of these languages, you do so in apertium-kaz and apertium-tat directories respectively, compile monolingual packages, and then compile the translator.
-*станциясында
+==== Adding language-pair-specific stems to the lexc files ====
-</pre>
+Sometimes we have to translate a word into Kazakh or Tatar with two or more words, e.g.:
+<pre><e><p><l>everywhere<s n="adv"/></l><r>барлық<b/>жерде<s n="adv"/></r></p></e></pre>
-===Proper nouns===
+In order to make it work, we will need to add ''барлық жерде'' as a single adverb in <code>kaz.lexc</code>, like this:
+<pre>барлық% жерде:барлық% жерде ADV ; ! ""</pre>
-<pre>
+But, certainly, these are two words in Kazakh — a determiner and a noun — and we don't want that the standalone Kazakh morphological analyzer would contain them as one word. In order to know that this entry was added for a specific language pair, we mark it with <code>Use/MT</code> at the end of the line:
+<pre>барлық% жерде:барлық% жерде ADV ; ! "" Use/MT</pre>
-*Гонконгтан
+That way, we can collaborate on one single file across many pairs, but in the same time we are able to "clean up" the lexicon for standalone use if needed.
-</pre>
-=== Discuss first ===
-# There is only one formal form (<frm>) in Tatar, which can be both sg and plural. But in Kazakh there are two forms. Should I pretend as if in Tatar it *were* the same and duplicate the same form with a different tag or should I handle it in transfer?
-# Consider ''турындагы'' - should it still be tagged as postposition?
-# How to handle verbs with inner inflection (sometimes a Kazakh verb is translated with a multiword and vice versa. E.g. ''әуреле > башын әйләндер'')
-----
-Part-of-speech related TODO's and DONE's can be found here:
-* [[/Postadvebs|/Postadverbs]]
-* [[/Postpositions]]
-To run tests, use <code>aq-regtest</code> utility from [[Apertium-quality]] tools. E.g. <pre>aq-regtest -d . kaz-tat http://wiki.apertium.org/wiki/Special:Export/Kazakh_and_Tatar/Postadvebs</pre>
-== Done ==
-; But keep an eye on this
-* Numerals
-** kaz <num><subst>(<px3>) in fractions<ref>Currently whether it is in fractions or not is not taken into account</ref> = tat <num><subst>(<px3>)
-** kaz <num><coll><advl> = tat <num><coll>
-** kaz <num><coll><subst> = tat <num><subst>
-== Notes ==
-<references/>
-== See also ==
-* [[/Pending tests]]
-* [[/Regression tests]]
 [[Category:Kazakh and Tatar|*]]
