Apertium-kaz

Kazakh - қазақ тілі
language transducer
Coverage: ~94.5%
Stems: 36,595
Vanilla stems: 27,433
Paradigms:
Location: apertium-kaz (languages)
Families: Turkic languages
Areas: Languages of Central Asia, Languages of the former Soviet Union
Lang info Kazakh

Apertium-kaz is a morphological analyser/generator and CG tagger for Kazakh, currently under development. It is intended to be compatible with transducers for other Turkic languages so that text can be translated between them. It is used in the following language pairs:

Installation

Apertium-kaz is currently located in languages/apertium-kaz.

To use

  • Install Apertium core tools.
  • You can install apertium-kaz from the same repo as the core tools, but it doesn't give you access to the source code. For developer access, it is recommended to get the code from the repo above and compile it by running the following commands within the apertium-kaz directory that you checked out.
    ./autogen.sh; make
  • Now test it by running some Kazakh words through the analyser:
    echo "бірнеше қазақша сөздер" | apertium -d . kaz-morph
    • You should get some analyses, like this:
    ^бірнеше/бірнеше<det><qnt>$ ^қазақша/қазақша<adv>/қазақша<n><nom>/қазақша<n><attr>/қазақша<n><nom><cop><aor><p3><pl>/қазақша<n><nom><cop><aor><p3><sg>$ ^сөздер/сөз<n><pl><nom>/сөз<n><pl><nom><cop><aor><p3><pl>/сөз<n><pl><nom><cop><aor><p3><sg>$^./.<sent>$
    • Note that . specifies the directory where the compiled transducer is. The line above assumes you're running the command from within the apertium-kaz/ directory.
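If you run many words through the analyser, it can help to post-process the output. Below is a minimal Python sketch (not part of apertium-kaz) that splits stream-format output like the one shown above into lexical units and lists the surface forms that were not recognised, i.e. those whose only analysis starts with *. The function names are made up for illustration, and the regular expression assumes that no unit contains an escaped ^, $ or /.

```python
import re

# One lexical unit looks like ^surface/analysis1/analysis2$.
# Assumption: no escaped ^, $ or / occurs inside a unit.
UNIT = re.compile(r"\^(.*?)\$")

def parse_units(stream):
    """Return a list of (surface, [analyses]) pairs."""
    units = []
    for body in UNIT.findall(stream):
        surface, *analyses = body.split("/")
        units.append((surface, analyses))
    return units

def unknown_words(stream):
    """Surface forms whose analyses are all marked as unknown with '*'."""
    return [surface for surface, analyses in parse_units(stream)
            if all(a.startswith("*") for a in analyses)]

sample = "^бірнеше/бірнеше<det><qnt>$ ^сөздер/сөз<n><pl><nom>$ ^foo/*foo$^./.<sent>$"
print(unknown_words(sample))  # ['foo']
```

You could feed it the output of the apertium -d . kaz-morph command, for example by reading it from standard input.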

Dependency tree

This is for reference and isn't generally needed for most users:

  • hfst (svn ≥r1916)
  • apertium
    • lttoolbox
  • VISL-CG3

For spell checking

If you're compiling the apertium-kaz spell checker, you'll additionally need these dependencies:

  • hfst-ospell (./configure --enable-zhfst)
    • see Installation, it is installable from Tino's repositories
  • corevoikko/libvoikko/src/tools/voikkospell (./configure --enable-hfst)

You'll want to configure apertium-kaz with --enable-ospell and then after making it, copy kaz.zhfst to ~/.voikko/3/kk.zhfst

Then you can do this:

$ echo "қазақша билмеймін" | tr ' ' '\n' | voikkospell -d kk -s
C: қазақша
W: билмеймін
S: билеймін
S: білмеймін
S: билемеймін
S: бөлмеймін
S: билемейміз
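Judging from the transcript above, C: marks a word accepted as correct, W: a word flagged as wrong, and S: a suggestion for the preceding wrong word. Under that assumption (it is read off the output, not taken from the voikkospell documentation), a small Python sketch can group the suggestions by word. The function name is hypothetical.

```python
def parse_voikkospell(output):
    """Map each checked word to (is_correct, [suggestions])."""
    results = {}
    current = None  # the most recent word flagged with W:
    for line in output.splitlines():
        tag, _, word = line.partition(": ")
        if tag == "C":
            results[word] = (True, [])
            current = None
        elif tag == "W":
            results[word] = (False, [])
            current = word
        elif tag == "S" and current is not None:
            results[current][1].append(word)
    return results

transcript = """C: қазақша
W: билмеймін
S: билеймін
S: білмеймін
S: билемеймін
S: бөлмеймін
S: билемейміз"""

for word, (ok, suggestions) in parse_voikkospell(transcript).items():
    print(word, "OK" if ok else suggestions)
```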

Current State


  • Number of stems: 36,595
  • Disambiguation rules: 150
  • Coverage: ~94.5%

corpus         words    coverage
Әуезов         155K     ~92.89%
bible          577K     ~95.29%
azattyq2010    3.2M     ~95.07%
wp2013         18.2M    ~90.10%
quran          107K     ~96.71%
udhr           1.5K     ~96.86%
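The headline coverage of ~94.5% appears to match the unweighted mean of the six per-corpus figures. That is an observation from the numbers, not something this page states. The Python sketch below reproduces the arithmetic and, for comparison, also computes a mean weighted by corpus size. The word counts are the rounded values from the table, so the weighted figure is only approximate.

```python
# (corpus, size in words, coverage in percent), copied from the table above.
# Sizes are the rounded K/M values, so they are approximations.
corpora = [
    ("Әуезов", 155_000, 92.89),
    ("bible", 577_000, 95.29),
    ("azattyq2010", 3_200_000, 95.07),
    ("wp2013", 18_200_000, 90.10),
    ("quran", 107_000, 96.71),
    ("udhr", 1_500, 96.86),
]

def mean_coverage(rows):
    """Unweighted mean: every corpus counts equally."""
    return sum(cov for _, _, cov in rows) / len(rows)

def weighted_coverage(rows):
    """Mean weighted by the number of words in each corpus."""
    total = sum(words for _, words, _ in rows)
    return sum(words * cov for _, words, cov in rows) / total

print(round(mean_coverage(corpora), 1))      # 94.5
print(round(weighted_coverage(corpora), 1))  # 91.0
```

The weighted figure is lower because wp2013, by far the largest corpus, has the lowest coverage.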

Developers

Guidelines for adding stems

An overview of the process

If you see that a wordform is not supported by apertium-kaz and you want to add it, you have to figure out three things:

  1. what the stem of the word is (to be exact, what the left-hand side and the right-hand side of the entry should be),
  2. whether or not that stem is already in apertium-kaz, and
  3. (if it isn't or it isn't analysed as something that you expect) which continuation lexicon (read: paradigm) you should assign the stem to.

Here is an example of a word already in the apertium-kaz.kaz.lexc file:

кітап:кітап N1 ; ! "book"

As in this example, in most cases the left-hand side and the right-hand side of the entry are the same. The left-hand side is the underlying form; the right-hand side is the surface form. The continuation lexicon in this example is N1. Whatever comes after the exclamation mark '!' is a comment. Glosses are a good thing to have, but technically they are only comments, and thus optional.

Here is an example where the left and right hand sides are not the same:

күн% тәртібі:күн% тәртіп N-COMPOUND-PX ; ! ""

It is implemented this way so that forms like "күн тәртіптері" can also be analysed as forms of the word "күн тәртібі".

The example above also shows that spaces in a word have to be escaped with %. The same goes for the hyphen:

мән%-жай:мән%-жай N1 ; ! ""
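If you add many stems, the escaping is easy to get wrong by hand. Here is a minimal Python sketch that builds an entry line in the format of the examples above. The helper names are hypothetical, and only the two characters mentioned here (space and hyphen) are escaped; any other special lexc characters are out of scope.

```python
def escape_lexc(text):
    """Prefix spaces and hyphens with %, as in the examples above.
    Assumption: no other characters in the stem need escaping."""
    return text.replace(" ", "% ").replace("-", "%-")

def make_entry(lemma, stem, lexicon, gloss=""):
    """Build a line of the form: lemma:stem LEXICON ; ! "gloss"."""
    return f'{escape_lexc(lemma)}:{escape_lexc(stem)} {lexicon} ; ! "{gloss}"'

print(make_entry("кітап", "кітап", "N1", "book"))
print(make_entry("күн тәртібі", "күн тәртіп", "N-COMPOUND-PX"))
print(make_entry("мән-жай", "мән-жай", "N1"))
```

The lexicon name is passed separately, so the hyphens in names like N-COMPOUND-PX are left alone.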

General

  • Before adding a stem, be sure it does not already exist in lexc. A good way to do that is to look up stem(s) you want to add with lt-proc kaz.automorf.bin. In some cases, you'll see that the stem isn't analysed at all:

^foo/*foo$

In other cases, it will be analysed, but as something other than what you want to add it as:

^Жол/жол<adj>$ ^жөндеуші/жөнде<v><tv><gpr_pot>$^./.<sent>$

(assuming that you want to add "Жол жөндеуші" as a company name, which it happens to be).

Another, probably more relevant example:

apertium-kaz$ echo "қабылдау" | apertium -d . kaz-tagger 
^қабылдау/қабылда<v><tv><ger><nom>$^./.<sent>$

(supposing that some other forms of the word, e.g. case-marked forms like "қабылдауды", weren't analysed (see the next paragraph) and you therefore looked up қабылдау in kaz.automorf.bin). Looking up the *stem* (note: the stem, not the surface form) with the lt-proc kaz.automorf.bin command before adding it to the lexc file gives you a chance to save some work and to avoid adding the same thing twice.

In the third case, you will see that the stem is already there, is linked to the right lexicon, but some surface forms of the word are not analysed. This means that either there is a problem with the phonology part, or you've discovered some affix currently not supported by apertium-kaz. Both issues have to be documented/reported (the simplest way would be just to add an 'ISSUES' file to apertium-kaz and commit it).

  • Provide a commit message saying what you did. At a bare minimum, "adding more stems" is okay, but "a" or "ф" is not. Try to be more informative though; e.g. "added stems from story, mostly NP-TOP and NP-ANT" or similar.
  • Many stems exhibit a voicing alternation like п/б, к/г, қ/ғ. This is processed automatically by twol, but these stems must be added with the voiceless consonant (п, к, қ), e.g. тақ:тақ V-TV ;
    • Stems from Russian that end with one of the voiced consonants (б, г), such as геолог should be entered as spelled, but should be put in the right category for foreign words (e.g., if a noun, then N5).
  • Words that have an inserted ‹ы› or ‹і› in some forms should get %{y%} in that spot on the right side, e.g. орын:ор%{y%}н N1 ;.
    • Words that are commonly written in both forms (e.g., орнында and орынында) need special treatment: add ! Dir/LR after the form that should not be generated (i.e., the form that is the non-normative version), and add ! Err/Orth after it too if it should be considered a spelling mistake.
  • Any changes to continuation classes should be discussed on the apertium-turkic mailing list.
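The voicing guideline above lends itself to a simple check before you commit. The following Python sketch is a heuristic only: it warns when a stem ends in a voiced consonant (б, г, ғ) and is not in N5. Treating N5 as the only foreign-word class is a simplification made for this example, and the function name is hypothetical.

```python
# Voiced consonants and the voiceless counterparts stems should be entered with.
VOICED_TO_VOICELESS = {"б": "п", "г": "к", "ғ": "қ"}

def check_final_consonant(stem, lexicon):
    """Return a warning string, or None if the stem looks fine.
    Assumption: N5 is the only class for foreign words (true for nouns only)."""
    last = stem[-1]
    if last in VOICED_TO_VOICELESS and lexicon != "N5":
        return (f"'{stem}' ends in voiced '{last}'; enter it with "
                f"'{VOICED_TO_VOICELESS[last]}' or, if it is a Russian loan, use N5")
    return None

for stem, lexicon in [("тақ", "V-TV"), ("геолог", "N5"), ("геолог", "N1")]:
    print(stem, lexicon, check_final_consonant(stem, lexicon) or "ok")
```

A warning is a prompt to double-check the entry by hand, not proof that it is wrong.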

Most likely, a word not yet covered by apertium-kaz will be an open-class word. Below are some comments on the open-class word lexicons.

Verbs

  • Categorise correctly according to IV or TV status:
    • IV = intransitive verbs; TV = transitive verbs
    • If the verb can take a direct object with -НЫ, then it's TV; otherwise it is IV
    • For phrasal verbs (e.g., "қабыл ал", "пайда бол", "мойынға ал"), do not categorise the verb according to its elements; instead treat it as a single verb (here TV, IV and TV, respectively).
  • There should be no infinitival final -у or -ю. It is best to take the part of the verb before -GAн or -DI in those forms.
    • Infinitives ending in -ю should end in ‹й› instead, e.g. ‹сүю› should be entered as сүй
    • Some verbs have a "hidden" ‹ы› or ‹і› under the ‹у›, for example ері, аршы, аңды, etc. These verb stems should be added with the ‹ы› or ‹і›.
    • Of course, verbs with ‹у› in the stem should keep the ‹у›, like жу, қу, жау, etc.
  • Do not add passive or cooperative forms of verb stems (e.g., ‹тартыл› is passive of ‹тарт›, and ‹тартыс› is cooperative) unless absolutely needed for translation. In this case, put ! Use/MT ! Der/Pass or ! Use/MT ! Der/Coop after the entry, respectively.
  • If you add a causative form of a verb (e.g., ‹отырғыз› is causative of ‹отыр›), put ! Der/Caus after it.

Nouns

  • Some nouns end in ‹ә›, and have interesting or inconsistent-looking phonology, like күнә, кінә. These should be added with the right side missing its ‹ә› and in the class N1-Ә. E.g., күнә:күн N1-Ә ;
  • Nouns from Russian should be classified as N5
    • especially if the last vowel is ‹и› or ‹у›
    • especially if they end with a consonant that would normally be voiced before a vowel-initial suffix in Kazakh words (п, к)
  • Nouns that are compounds ending in a possessive form (like ‹ауа райы›) should be categorised into the N-COMPOUND-PX category and entered without the possessive ending on the right side, e.g. ауа% райы:ауа% рай N-COMPOUND-PX ; ! "weather,climate"
  • If you're adding a noun that can also be used as an adjective, think whether it's actually an adjective or actually a noun and add it to the right category. You'll want to subcategorise it correctly so that e.g. if it's a noun it can also take the <attr> tag.

Adjectives

  • The basic categorisation of an adjective depends on whether it takes comparative morphology (-ЫрАҚ), can be substantivised (acts like a noun), and/or can be adverbialised (acts like an adverb). Be sure to put the adjective in the right category according to what those categories allow.
  • If you're adding an adjective that can also be used as a noun, think whether it's actually an adjective or actually a noun and add it to the right category. You'll want to subcategorise it correctly so that e.g. if it's an adjective it can also take the <subst> tag.

Adverbs

  • If you want to add an adverb, first think whether the word is really an adjective that can be used like an adverb. If this is the case, then add it as an adjective in the appropriate adjective class that can take the <advl> tag. In the bidix, you'll want to translate the <adj> and the <adj><advl> forms differently.

Additional tags

In a .lexc file, after the '!' you will also see Dir/LR, Dir/RL, Err/Orth and Use/MT comments. The meaning of them is as follows:

Dir/LR means: analyse this surface form, but don't generate it. Here is a good example:

сұхбат:сұқбат N1 ; ! "conversation/interview" Dir/LR
сұхбат:сұхбат N1 ; ! "conversation/interview"

In other words, Dir/LR marks alternative spellings of a word. If the alternative spelling isn't just alternative, but actually erroneous (but occurs quite commonly so that you want to support it), it is marked with the Err/Orth tag:

орын:ор%{y%}н N1 ; ! "place,seat"
орын:орын N1 ; ! "place,seat"  ! Dir/LR ! Err/Orth

"Орыны" for example, is considered erroneous spelling of "орын<n><px3sp><nom>". Such markings will allow us to produce better spell checkers.

In the examples above, if you don't mark either of the stems with Dir/LR, then the Kazakh generator (if we personify it a bit), given a string like "^сұхбат<n><nom>$" as input, won't know which surface form to choose and will output both, separated with a slash: сұхбат/сұқбат.
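A check for this situation can be sketched in Python. The code below parses stem lines in the simplified format used on this page and reports every left-hand side that has more than one right-hand side not marked Dir/LR, i.e. the cases where the generator would output several forms. It is a rough illustration, not a real lexc parser: it assumes one entry per line, no unescaped ! or : inside stems, and flags written somewhere after the first !. The function names are hypothetical.

```python
import re
from collections import defaultdict

FLAGS = ("Dir/LR", "Dir/RL", "Err/Orth", "Use/MT")

def parse_entry(line):
    """Split one stem line into left side, right side, lexicon and flags."""
    body, _, comment = line.partition("!")
    body = body.strip().rstrip(";").strip()
    pair, lexicon = body.rsplit(None, 1)      # lexicon is the last token
    parts = re.split(r"(?<!%):", pair, maxsplit=1)  # split on an unescaped colon
    flags = {flag for flag in FLAGS if flag in comment}
    return {"left": parts[0], "right": parts[-1], "lexicon": lexicon, "flags": flags}

def ambiguous_generation(lines):
    """Left sides with more than one surface form that may be generated."""
    surfaces = defaultdict(set)
    for line in lines:
        entry = parse_entry(line)
        if "Dir/LR" not in entry["flags"]:
            surfaces[(entry["left"], entry["lexicon"])].add(entry["right"])
    return {key: forms for key, forms in surfaces.items() if len(forms) > 1}

entries = [
    'сұхбат:сұқбат N1 ; ! "conversation/interview"',  # Dir/LR deliberately left out
    'сұхбат:сұхбат N1 ; ! "conversation/interview"',
]
for (lemma, lexicon), forms in ambiguous_generation(entries).items():
    print(lemma, lexicon, sorted(forms))
```

With Dir/LR restored on the first line, the function returns an empty dictionary.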

As the name suggests, Dir/RL has the meaning opposite to Dir/LR: 'generate this surface form, but do not analyse it'. You won't see it much in a lexc file and almost certainly won't need to mark a stem you add as Dir/RL. Here is an example though:

да:%~да CC ; ! "also" Dir/RL

The conjunction ^да<cnj$ gets generated as "~да". This is necessary for a somewhat hacky way of handling the vowel harmony (read: making sure that the "да" gets rendered as "де" when the preceding word has front vowels) in cases where the standard way of handling the vowel harmony (read: twol) fails because the preceding word is unknown.

Use/MT (at least, in its original usage) marks (compound) words which are needed for translation, but probably shouldn't be in a "vanilla" Kazakh transducer:

қайда% болса% сонда:қайда% болса% сонда PRON-IND ; ! "anywhere" Use/MT

It has also been used to mark words which the person who added them wasn't sure how to classify. Such words will be reviewed later.

Full inventory of lexicons the stems can be linked to

It is useful to distinguish two classes of lexicons:

  1. lexicons which are only used as continuations for the other lexicons, and
  2. lexicons which are continuations for stems.

Here is an attempt to document the lexicons of the second kind found in the apertium-kaz.kaz.lexc file, so that (1) people can add stems to a lexc file without having to read the lexc file itself, and (2) we can re-evaluate our decisions:

Nouns:

  • N1
  • N-COMPOUND-PX
  • N5
  • N1-ABBR
  • N-INFL-INKI

Proper nouns:

  • NP-ANT-F: feminine anthroponyms
  • NP-ANT-M: masculine anthroponyms
  • NP-COG-OB: family names ending with -ов or -ев
  • NP-COG-IN: family names ending with -ин
  • NP-COG-M: family names not ending with -ов, -ев or -ин; masculine. Example: Галицкий
  • NP-COG-F: family names not ending with -ов, -ев or -ин; feminine. Example: Толстая
  • NP-COG-MF: family names not ending with -ов, -ев or -ин which are both masculine and feminine:
  • NP-PAT-VICH: patronyms ending with -вич (and which can thus also take the -вна ending): Васильевич:Василье NP-PAT-VICH ; ! ""
    • (could be derived from anthroponyms automatically?)
  • NP-TOP: toponyms (in particular, river names should go here too)
  • NP-TOP-ASSR: names of former and future Soviet socialist republics ending with СР: Қырғыз% КСР:Қырғыз% КСР%{э%}%{й%} NP-TOP-ASSR ;
  • NP-ORG: organization names
  • NP-ORG-LAT: organization names written in Latin characters. Example: Microsoft
  • NP-AL: proper names not belonging to any of the above NP-* classes. Example: Восток

Verbs:

  • V-TV
  • V-IV
  • Vinfl-AUX

Adjectives:

  • A1: adjectives which can be adverbialised and have a comparative form. Example: жақсы.
    • Test 1: can the word in question modify a verb? "Жақсы оқиды" OK? A: yes.
    • Test 2: does it have a comparative form? "Жақсырақ" OK? A: yes.
    • ==> жақсы A1
  • A2: adjectives which cannot be adverbialised, but which do have a comparative form. Example: лайық:лайық A2 ; ! ""
  • A3: adjectives which can neither be adverbialised nor have a comparative form
  • A4: originally intended for adjectives like социал or (tat.) биологик = (kaz.) биологиялық, which the author of this classification thought would never be substantivised. They have since been seen substantivised, so this class is considered deprecated.

The whole purpose of introducing subclasses of adjectives was to avoid overgenerating forms which do not exist.

If you're unsure which adjective lexicon to select, pick A1.

  • A6:
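The two tests given for A1 to A3 above amount to a small decision table, which can be written down as a Python sketch. The function name is hypothetical. The combination "can modify a verb but has no comparative" is not covered by the list, so the sketch falls back to A1, following the advice to pick A1 when unsure. A4 and A6 are left out.

```python
def adjective_lexicon(can_modify_verb, has_comparative):
    """Pick A1, A2 or A3 from the answers to Test 1 and Test 2."""
    if can_modify_verb and has_comparative:
        return "A1"
    if not can_modify_verb and has_comparative:
        return "A2"
    if not can_modify_verb and not has_comparative:
        return "A3"
    # Adverbialisable but without a comparative: not described above,
    # so use the documented default for unclear cases.
    return "A1"

# жақсы: "Жақсы оқиды" is fine and "Жақсырақ" exists.
print(adjective_lexicon(can_modify_verb=True, has_comparative=True))  # A1
```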

Adverbs:

  • ADV
  • ADV-ITG
  • ADV-WITH-KI
  • ADV-WITH-KI-I
  • ADV-LANG