Difference between revisions of "Apertium-kaz"
Firespeaker (talk | contribs) |
Revision as of 01:58, 10 August 2017
Kazakh - қазақ тілі | |
---|---|
language transducer | |
Coverage: | ~94.5% |
Stems: | 36,595 |
Vanilla stems: | 27,433 |
Paradigms: | |
Location: | apertium-kaz (languages) |
Families: | Turkic languages |
Areas: | Languages of Central Asia, Languages of the former Soviet Union |
Lang info | Kazakh |
Apertium-kaz is a morphological analyser/generator and CG tagger for Kazakh, currently under development. It is intended to be compatible with transducers for other Turkic languages so that translation between them is possible. It is used in the following language pairs:
- Kazakh and Tatar
- English and Kazakh
- Kyrgyz and Kazakh
- Kazakh and Karakalpak
- Khalkha and Kazakh
- Kazakh and Russian
Installation
Apertium-kaz is currently located in languages/apertium-kaz.
Dependency tree
- hfst (svn ≥r1916)
- apertium
- lttoolbox
- VISL-CG3
For spell checking
If you're compiling the apertium-kaz spell checker, you'll additionally need these dependencies:
- hfst-ospell (./configure --enable-zhfst)
  - see Installation; it is installable from Tino's repositories
- corevoikko/libvoikko/src/tools/voikkospell (./configure --enable-hfst)
You'll want to configure apertium-kaz with --enable-ospell and then, after running make, copy kaz.zhfst to ~/.voikko/3/kk.zhfst
Then you can do this:
$ echo "қазақша билмеймін" | tr ' ' '\n' | voikkospell -d kk -s
C: қазақша
W: билмеймін
S: билеймін
S: білмеймін
S: билемеймін
S: бөлмеймін
S: билемейміз
Current State
- Number of stems: 36,595
- Disambiguation rules: 150
- Coverage: ~94.5%
corpus | words | coverage |
---|---|---|
Әуезов | 155K | ~92.89% |
bible | 577K | ~95.29% |
azattyq2010 | 3.2M | ~95.07% |
wp2013 | 18.2M | ~90.10% |
quran | 107K | ~96.71% |
udhr | 1.5K | ~96.86% |
Developers
We have several language pairs involving Kazakh, and every pair contains a lexc, twol and rlx file for this language. But we don't work on these files directly. Instead, we edit the kaz.lexc, kaz.twol and kaz.rlx files located in apertium-kaz, and then import them into the language pair directories using a script (update-morphs.bash in the language-pair directories). This script merely copies the twol and rlx files, since they don't have to be tweaked for a particular language pair, but "trims" the lexc file, leaving only those stems that are also found in the bilingual dictionary of the pair being imported into.
For further details on how this works and a step-by-step guide (taking the Kazakh-Tatar pair as an example), see Kazakh_and_Tatar#Development_workflow.
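As a rough illustration of the trimming idea described above, the following Python sketch keeps only those lexc entries whose lemma appears in a set of bidix lemmas. It is not the actual logic of update-morphs.bash: the function names lemma_of and trim_lexc, the LEXICON Nouns header and the simplified line parsing are all assumptions made for this example.

```python
def lemma_of(line):
    """Return the lemma (left side) of a lexc entry line, or None.

    Simplifying assumption: an entry looks like 'lemma:stem CLASS ;',
    optionally followed by '!' comments.
    """
    body = line.split("!", 1)[0].strip()  # drop trailing comments
    if ":" not in body or not body.endswith(";"):
        return None  # not an entry (e.g. a LEXICON header)
    # In lexc, '%' escapes the following character (e.g. '% ' is a space).
    return body.split(":", 1)[0].replace("%", "")


def trim_lexc(lines, bidix_lemmas):
    """Keep non-entry lines and entries whose lemma is in bidix_lemmas."""
    kept = []
    for line in lines:
        lemma = lemma_of(line)
        if lemma is None or lemma in bidix_lemmas:
            kept.append(line)
    return kept


# Hypothetical input: a tiny lexc fragment and the lemmas found in a bidix.
lexc_lines = [
    "LEXICON Nouns",
    "орын:ор%{y%}н N1 ;",
    "геолог:геолог N5 ;",
    'ауа% райы:ауа% рай N-COMPOUND-PX ; ! "weather,climate"',
]
trimmed = trim_lexc(lexc_lines, {"орын", "ауа райы"})
```

Here геолог is dropped because it is not in the (hypothetical) bidix lemma set, while the header line and the two matching entries are kept.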
A natural consequence of this approach is that any change made to apertium-kaz will affect all other Kazakh-to-X or X-to-Kazakh language pairs. This requires some rules to be followed. Here are some of them.
- Do not change tags or the order of tags, and do not make any other change that is certain to break things in other pairs, without first discussing it on the apertium-turkic mailing list. If you are absolutely sure that the change is necessary (e.g. it was discussed earlier and was already due), at least notify others via the list.
- If you encounter a word in the lexc that seems to be miscategorized, or a multiword that is in reality a combination of two lexemes (like барлық жерде), do not delete it! Mark it with Use/MT instead.
- Write descriptive commit messages.
Guidelines for adding stems
General
- Before adding a stem, be sure it does not already exist in lexc.
- Many stems exhibit a voicing alternation like п/б, к/г, қ/ғ. This is processed automatically by twol, but these stems must be added with the voiceless consonant (п, к, қ), e.g. тақ:тақ V-TV ;
  - Stems from Russian that end with one of the voiced consonants (б, г), such as геолог, should be entered as spelled, but should be put in the right category for foreign words (e.g., if a noun, then N5).
- Words that have an inserted ‹ы› or ‹і› in some forms should get %{y%} in that spot on the right side, e.g. орын:ор%{y%}н N1 ;
  - Words that are commonly written in both forms (e.g., орнында and орынында) need special treatment: add ! Dir/LR after the form that should not be generated (i.e., the form that is the non-normative version), and add ! Err/Orth after it too if it should be considered a spelling mistake.
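The voicing guideline can be checked mechanically, at least roughly. The Python sketch below flags entries whose right-side stem ends in a voiced consonant from the alternation pairs (б, г, ғ) unless the entry is in the N5 class, and suggests the voiceless spelling. It is only a heuristic under stated assumptions: the function name suggest_voiceless and the simplified 'lemma:stem CLASS ;' parsing are made up for this example, and only the N5 exception mentioned above is modelled.

```python
# Voiced consonant -> voiceless counterpart, from the п/б, к/г, қ/ғ pairs.
VOICED_TO_VOICELESS = {"б": "п", "г": "к", "ғ": "қ"}


def suggest_voiceless(line):
    """Return a suggested voiceless stem for a suspicious entry, else None.

    Assumes entries of the form 'lemma:stem CLASS ;' with optional
    '!' comments; anything else is ignored.
    """
    body = line.split("!", 1)[0].strip()  # drop trailing comments
    if ":" not in body or not body.endswith(";"):
        return None
    parts = body.split(":", 1)[1].rstrip(";").split()
    if len(parts) < 2:
        return None
    *stem_parts, cont_class = parts  # the last token is the continuation class
    stem = " ".join(stem_parts)
    last = stem[-1]
    # Foreign-word nouns (N5) are entered as spelled, so they are not flagged.
    if last in VOICED_TO_VOICELESS and cont_class != "N5":
        return stem[:-1] + VOICED_TO_VOICELESS[last]
    return None
```

For instance, a (hypothetical) entry тағ:тағ V-TV ; would be flagged with the suggestion тақ, while геолог:геолог N5 ; passes. Any hit still needs a human decision, since the check knows nothing about the word's origin.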
Verbs
- There should be no infinitival final -у or -ю. It is best to take the part of the verb before -GAн or -DI in those forms.
  - Infinitives ending in -ю should end in ‹й› instead, e.g. ‹сүю› should be entered as сүй
  - Some verbs have a "hidden" ‹ы› or ‹і› under the ‹у›, for example ері, аршы, аңды, etc. These verb stems should be added with the ‹ы› or ‹і›.
  - Of course, verbs with ‹у› in the stem should keep the ‹у›, like жу, қу, жау, etc.
- Do not add passive or cooperative forms of verb stems (e.g., ‹тартыл› is passive of ‹тарт›, and ‹тартыс› is cooperative) unless absolutely needed for translation. In this case, put ! Use/MT ! Der/Pass or ! Use/MT ! Der/Coop after the entry, respectively.
- If you add a causative form of a verb (e.g., ‹отырғыз› is causative of ‹отыр›), put ! Der/Caus after it.
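Because the infinitive alone does not reveal whether a stem has a "hidden" ‹ы› or ‹і›, the best a script can do is list candidate stems for a human to check against the -GAн or -DI forms. The Python sketch below does just that. The function name stem_candidates is hypothetical, and the infinitive forms used in the examples (еру, аршу, жуу) are assumed here rather than taken from this page.

```python
def stem_candidates(infinitive):
    """List possible lexc verb stems for an infinitive in -у or -ю.

    The result is a list of candidates, not a decision: the right one
    has to be confirmed against the -GAн or -DI forms of the verb.
    """
    if infinitive.endswith("ю"):
        # -ю infinitives get a stem ending in ‹й› (сүю -> сүй).
        return [infinitive[:-1] + "й"]
    if infinitive.endswith("у"):
        base = infinitive[:-1]
        # Either the bare base is the stem, or a hidden ‹ы›/‹і› follows it.
        return [base, base + "ы", base + "і"]
    # No infinitival ending found: return the form unchanged.
    return [infinitive]
```

With the assumed infinitive еру, the candidates are ер, еры and ері, so a person still has to pick ері. With жуу, the first candidate жу keeps the ‹у› that belongs to the stem.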
Nouns
- Some nouns end in ‹ә›, and have interesting or inconsistent-looking phonology, like күнә, кінә. These should be added with the right side missing its ‹ә› and in the class N1-Ә, e.g. күнә:күн N1-Ә ;
- Nouns from Russian should be classified as N5
  - especially if the last vowel is ‹и› or ‹у›
  - especially if they end with a consonant that would normally be voiced before a vowel-initial suffix in Kazakh words (п, к)
- Nouns that are compounds ending in a possessive form (like ‹ауа райы›) should be categorised into the N-COMPOUND-PX category and entered without the possessive ending on the right side, e.g. ауа% райы:ауа% рай N-COMPOUND-PX ; ! "weather,climate"
- If you're adding a noun that can also be used as an adjective, think whether it's actually an adjective or actually a noun and add it to the right category. You'll want to subcategorise it correctly so that e.g. if it's a noun it can also take the <attr> tag.
Adjectives
- If you're adding an adjective that can also be used as a noun, think whether it's actually an adjective or actually a noun and add it to the right category. You'll want to subcategorise it correctly so that e.g. if it's an adjective it can also take the <subst> tag.
Adverbs
- If you want to add an adverb, first think whether the word is really an adjective that can be used like an adverb. If this is the case, then add it as an adjective in the appropriate adjective class that can take the <advl> tag. In the bidix, you'll want to translate the <adj> and the <adj><advl> forms differently.