Difference between revisions of "Apertium-kaz"

Revision as of 10:28, 30 October 2017

Kazakh - қазақ тілі
language transducer
Coverage: ~94.5%
Stems: 36,595
Vanilla stems: 27,433
Paradigms:
Location: apertium-kaz (languages)
Families: Turkic languages
Areas: Languages of Central Asia, Languages of the former Soviet Union
Lang info Kazakh

Apertium-kaz is a morphological analyser/generator and CG tagger for Kazakh, currently under development. It is intended to be compatible with transducers for other Turkic languages, so that text can be translated between them. It's used in the following language pairs:

Installation

Apertium-kaz is currently located in languages/apertium-kaz.

To use

  • Install Apertium core tools.
  • You can install apertium-kaz from the same repo as the core tools, but what's recommended is to get it from the repo above and compile by running
    ./autogen.sh; make
  • Now test it by running some Kazakh words through the analyser:
    echo "бірнеше қазақша сөздер" | apertium -d . kaz-morph
    • You should get some analyses, like this:
    ^бірнеше/бірнеше<det><qnt>$ ^қазақша/қазақша<adv>/қазақша<n><nom>/қазақша<n><attr>/қазақша<n><nom><cop><aor><p3><pl>/қазақша<n><nom><cop><aor><p3><sg>$ ^сөздер/сөз<n><pl><nom>/сөз<n><pl><nom><cop><aor><p3><pl>/сөз<n><pl><nom><cop><aor><p3><sg>$^./.<sent>$


Dependency tree

This is for reference and isn't generally needed for most users:

  • hfst (svn ≥r1916)
  • apertium
    • lttoolbox
  • VISL-CG3

For spell checking

If you're compiling the apertium-kaz spell checker, you'll additionally need these dependencies:

  • hfst-ospell (./configure --enable-zhfst)
    • see Installation, it is installable from Tino's repositories
  • corevoikko/libvoikko/src/tools/voikkospell (./configure --enable-hfst)

You'll want to configure apertium-kaz with --enable-ospell and then after making it, copy kaz.zhfst to ~/.voikko/3/kk.zhfst

Then you can do this:

$ echo "қазақша билмеймін" | tr ' ' '\n' | voikkospell -d kk -s
C: қазақша
W: билмеймін
S: билеймін
S: білмеймін
S: билемеймін
S: бөлмеймін
S: билемейміз

Current State

  • Number of stems: 36,595
  • Disambiguation rules: 150
  • Coverage: ~94.5%

corpus         words    coverage
Әуезов         155K     ~92.89%
bible          577K     ~95.29%
azattyq2010    3.2M     ~95.07%
wp2013         18.2M    ~90.10%
quran          107K     ~96.71%
udhr           1.5K     ~96.86%
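
The relation between the per-corpus figures and the headline coverage of ~94.5% is not stated on this page. As a rough check, the sketch below computes both an unweighted mean and a word-count-weighted mean of the six listed corpora. The word counts are the rounded values from the table, so the weighted figure is only approximate, and the idea that the headline number is a simple mean is an assumption, not something the page confirms.

```python
# Coverage (%) and approximate word counts, copied from the table above.
coverage = {
    "Әуезов": 92.89,
    "bible": 95.29,
    "azattyq2010": 95.07,
    "wp2013": 90.10,
    "quran": 96.71,
    "udhr": 96.86,
}
words = {
    "Әуезов": 155_000,
    "bible": 577_000,
    "azattyq2010": 3_200_000,
    "wp2013": 18_200_000,
    "quran": 107_000,
    "udhr": 1_500,
}

# Unweighted mean: every corpus counts equally.
simple_mean = sum(coverage.values()) / len(coverage)

# Weighted mean: every token counts equally, so large corpora dominate.
weighted_mean = sum(coverage[c] * words[c] for c in coverage) / sum(words.values())

print(f"unweighted mean: {simple_mean:.1f}%")
print(f"weighted mean:   {weighted_mean:.1f}%")
```

The unweighted mean rounds to the headline figure, while the weighted mean is pulled down by wp2013, which is by far the largest corpus.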

Developers

Guidelines for adding stems

An overview of the process

If you see that a wordform is not supported by apertium-kaz and you want to add it, you have to figure out three things:

  1) what the stem of the word is (and also the right-hand side in the entry)
  2) which continuation lexicon (read: paradigm) you should assign to it
  3) whether or not that stem is already in apertium-kaz

Here is an example of a word already in the apertium-kaz.kaz.lexc file:

кітап:кітап N1 ; ! "book"

As in this example, in most cases the left-hand side and the right-hand side of the entry will be the same. The left-hand side might be referred to as the 'stem'. The continuation lexicon is N1. Whatever comes after '!' is a comment. Glosses are good to have, but technically they are only comments, and thus optional.

An example where the left-hand and right-hand sides are not the same:

күн% тәртібі:күн% тәртіп N-COMPOUND-PX ; ! ""

This has been done so that forms like "күн тәртіптері" can also be analysed.

The example above also shows that spaces in a word have to be escaped with %. The same applies to the hyphen:

мән%-жай:мән%-жай N1 ; ! ""
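
The escaping rule can be sketched as a small helper that builds an entry line from its parts. This is only an illustration: the function names escape_lexc and lexc_entry are hypothetical and not part of apertium-kaz, and the helper escapes only the two characters mentioned above (space and hyphen), not every character that is special in lexc.

```python
def escape_lexc(text: str) -> str:
    """Prefix spaces and hyphens with %, as in the entries shown above."""
    return "".join("%" + ch if ch in " -" else ch for ch in text)


def lexc_entry(left: str, right: str, lexicon: str, gloss: str = "") -> str:
    """Format one entry as: left:right LEXICON ; ! "gloss"."""
    return f'{escape_lexc(left)}:{escape_lexc(right)} {lexicon} ; ! "{gloss}"'


# The three entries quoted in this section, rebuilt from unescaped input.
print(lexc_entry("кітап", "кітап", "N1", "book"))
print(lexc_entry("күн тәртібі", "күн тәртіп", "N-COMPOUND-PX"))
print(lexc_entry("мән-жай", "мән-жай", "N1"))
```

Right-hand sides that already contain lexc notation such as %{y%} pass through unchanged, since the helper touches only spaces and hyphens.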

General

  • Before adding a stem, be sure it does not already exist in lexc. A good way to do that is to look up stem(s) you want to add with lt-proc kaz.automorf.bin. In some cases, you'll see that the stem isn't analysed at all:

^foo/*foo$

In other cases, it will be analysed, but as something other than what you want to add it as:

^Жол/жол<adj>$ ^жөндеуші/жөнде<v><tv><gpr_pot>$^./.<sent>$

(Assuming that you want to add "Жол жөндеуші" as a company name, which it happens to be.)

Another, probably more relevant example:

apertium-kaz$ echo "қабылдау" | apertium -d . kaz-tagger 
^қабылдау/қабылда<v><tv><ger><nom>$^./.<sent>$

I'm pretty sure that you shouldn't add қабылдау as a noun. Or, rather, that you wouldn't after you see the above analysis.

In the third case, you will see that the stem gets the analysis you expect, but you will have noticed that some of its wordforms are not analysed. This means that either there is a problem with the phonology part, or you've discovered an affix currently not supported by apertium-kaz. Both kinds of issue have to be documented/reported (the simplest way would be just to add an 'ISSUES' file to apertium-kaz and commit it).
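
The first two cases can be told apart mechanically from the analyser output. The sketch below is a minimal illustration, assuming the output has the simple ^surface/analysis/...$ shape shown in the examples above; the function name classify and the status labels are hypothetical, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit looks like ^surface/analysis1/analysis2$
UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def classify(stream: str, wanted_tag: str):
    """Return (surface, status) pairs for every lexical unit in the stream."""
    results = []
    for surface, rest in UNIT.findall(stream):
        analyses = rest.split("/")
        if analyses[0].startswith("*"):
            # An asterisk marks a form the analyser does not know.
            status = "unknown"
        elif any(f"<{wanted_tag}>" in analysis for analysis in analyses):
            status = "has wanted tag"
        else:
            status = "analysed differently"
        results.append((surface, status))
    return results


print(classify("^foo/*foo$", "n"))
print(classify("^қабылдау/қабылда<v><tv><ger><nom>$^./.<sent>$", "n"))
```

For қабылдау the only analysis is a verbal one, so a request for the noun tag reports it as analysed differently, which is the signal to think twice before adding it as a noun.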

  • Provide a commit message saying what you did. At a bare minimum, "adding more stems" is okay, but "a" or "ф" is not. Try to be more informative though; e.g. "added stems from story, mostly NP-TOP and NP-ANT" or similar.
  • Many stems exhibit a voicing alternation like п/б, к/г, қ/ғ. This is processed automatically by twol, but these stems must be added with the voiceless consonant (п, к, қ), e.g. тақ:тақ V-TV ;
    • Stems from Russian that end with one of the voiced consonants (б, г), such as геолог, should be entered as spelled, but should be put in the right category for foreign words (e.g., if a noun, then N5).
  • Words that have an inserted ‹ы› or ‹і› in some forms should get %{y%} in that spot on the right side, e.g. орын:ор%{y%}н N1 ;.
    • Words that are commonly written in both forms (e.g., орнында and орынында) need special treatment: add ! Dir/LR after the form that should not be generated (i.e., the form that is the non-normative version), and add ! Err/Orth after it too if it should be considered a spelling mistake.
  • Any changes to continuation classes should be discussed on the apertium-turkic mailing list.

Most likely, a word not covered by apertium-kaz already will be an open class word. Below are some comments on the open-class word lexicons.

Verbs

  • Categorise correctly according to IV or TV status:
    • IV = intransitive verbs; TV = transitive verbs
    • If the verb can take a direct object with -НЫ, then it is TV; otherwise it is IV
    • For phrasal verbs (e.g., "қабыл ал", "пайда бол", "мойынға ал"), do not categorise them according to their elements; instead treat each one as a single verb (TV, IV, TV).
  • There should be no infinitival final -у or -ю. It is best to take the part of the verb before -GAн or -DI in those forms.
    • Infinitives ending in -ю should end in ‹й› instead, e.g., ‹сүю› should be entered as сүй
    • Some verbs have a "hidden" ‹ы› or ‹і› under the ‹у›, for example ері, аршы, аңды, etc. These verb stems should be added with the ‹ы› or ‹і›.
    • Of course, verbs with ‹у› in the stem should keep the ‹у›, like жу, қу, жау, etc.
  • Do not add passive or cooperative forms of verb stems (e.g., ‹тартыл› is passive of ‹тарт›, and ‹тартыс› is cooperative) unless absolutely needed for translation. In this case, put ! Use/MT ! Der/Pass or ! Use/MT ! Der/Coop after the entry, respectively.
  • If you add a causative form of a verb (e.g., ‹отырғыз› is causative of ‹отыр›), put ! Der/Caus after it.
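
The advice to take the part of the verb before -GAн can be sketched as a simple suffix strip. This is an illustration only: it assumes -GAн surfaces as one of ған, ген, қан or кен, the function name is hypothetical, and the participle forms used as input are examples chosen here rather than forms quoted on this page. The IV/TV decision still has to be made by hand.

```python
# Assumed surface realisations of the -GAн suffix.
GAN_SUFFIXES = ("ған", "ген", "қан", "кен")


def stem_from_participle(form: str) -> str:
    """Strip a -GAн suffix from a participle and return the bare verb stem."""
    for suffix in GAN_SUFFIXES:
        if form.endswith(suffix) and len(form) > len(suffix):
            return form[: -len(suffix)]
    raise ValueError(f"{form!r} does not end in a -GAн suffix")


# Participles of the verbs mentioned above give the stems to enter in lexc.
for form in ("сүйген", "еріген", "жуған", "тартқан"):
    print(form, "->", stem_from_participle(form))
```

Starting from the participle rather than the infinitive avoids guessing whether a "hidden" ‹ы› or ‹і› sits under the ‹у›, because the participle shows the stem vowel directly.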

Nouns

  • Some nouns end in ‹ә›, and have interesting or inconsistent-looking phonology, like күнә, кінә. These should be added with the right side missing its ‹ә› and in the class N1-Ә. E.g., күнә:күн N1-Ә ;
  • Nouns from Russian should be classified as N5
    • especially if the last vowel is ‹и› or ‹у›
    • especially if they end with a consonant that would normally be voiced before a vowel-initial suffix in Kazakh words (п, к)
  • Nouns that are compounds ending in a possessive form (like ‹ауа райы›) should be categorised into the N-COMPOUND-PX category and entered without the possessive ending on the right side, e.g. ауа% райы:ауа% рай N-COMPOUND-PX ; ! "weather,climate"
  • If you're adding a noun that can also be used as an adjective, think whether it's actually an adjective or actually a noun and add it to the right category. You'll want to subcategorise it correctly so that e.g. if it's a noun it can also take the <attr> tag.

Adjectives

  • The basic categorisation of an adjective depends on whether it takes comparative morphology (-ЫрАҚ), can be substantivised (acts like a noun), and/or can be adverbialised (acts like an adverb). Be sure to put the adjective in the right category according to what those categories allow.
  • If you're adding an adjective that can also be used as a noun, think whether it's actually an adjective or actually a noun and add it to the right category. You'll want to subcategorise it correctly so that e.g. if it's an adjective it can also take the <subst> tag.

Adverbs

  • If you want to add an adverb, first think whether the word is really an adjective that can be used like an adverb. If this is the case, then add it as an adjective in the appropriate adjective class that can take the <advl> tag. In the bidix, you'll want to translate the <adj> and the <adj><advl> forms differently.