Difference between revisions of "Apertium-kaz"

Revision as of 19:54, 30 October 2017

Installation

Apertium-kaz is currently located in languages/apertium-kaz.

To use

Install Apertium core tools.
You can install apertium-kaz from the same repo as the core tools, but the recommended way is to get it from the repo above and compile it by running
./autogen.sh; make
Now test it by running some Kazakh words through the analyser:
echo "бірнеше қазақша сөздер" | apertium -d . kaz-morph
- You should get some analyses, like this:
^бірнеше/бірнеше<det><qnt>$ ^қазақша/қазақша<adv>/қазақша<n><nom>/қазақша<n><attr>/қазақша<n><nom>+е<cop><aor><p3><pl>/қазақша<n><nom>+е<cop><aor><p3><sg>$ ^сөздер/сөз<n><pl><nom>/сөз<n><pl><nom>+е<cop><aor><p3><pl>/сөз<n><pl><nom>+е<cop><aor><p3><sg>$^./.<sent>$
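If you want to inspect analyser output like the above programmatically, the following is a minimal Python sketch. It assumes the usual `^surface/analysis/...$` layout shown in the example and ignores escaped characters; the names `LEXICAL_UNIT` and `parse_stream` are made up for illustration and are not part of Apertium.

```python
import re

# One lexical unit looks like ^surface/analysis1/analysis2$
# (simplification: escaped "/" or "$" inside a unit are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def parse_stream(text):
    """Return a list of (surface, [analyses]) pairs found in analyser output."""
    units = []
    for match in LEXICAL_UNIT.finditer(text):
        surface = match.group(1)
        analyses = [a for a in match.group(2).split("/") if a]
        units.append((surface, analyses))
    return units


sample = (
    "^бірнеше/бірнеше<det><qnt>$ "
    "^сөздер/сөз<n><pl><nom>/сөз<n><pl><nom>+е<cop><aor><p3><pl>$"
    "^./.<sent>$"
)
for surface, analyses in parse_stream(sample):
    print(surface, len(analyses))
```

Each pair holds the surface form and the list of its analyses, so ambiguous words are easy to spot by the length of the list.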

Dependency tree

This is for reference and isn't generally needed for most users:

hfst (svn ≥r1916)
apertium
- lttoolbox
VISL-CG3

For spell checking

If you're compiling the apertium-kaz spell checker, you'll additionally need these dependencies:

hfst-ospell (./configure --enable-zhfst)
- see Installation, it is installable from Tino's repositories
corevoikko/libvoikko/src/tools/voikkospell (./configure --enable-hfst)

You'll want to configure apertium-kaz with --enable-ospell and then after making it, copy kaz.zhfst to ~/.voikko/3/kk.zhfst

Then you can do this:

$ echo "қазақша билмеймін" | tr ' ' '\n' | voikkospell -d kk -s
C: қазақша
W: билмеймін
S: билеймін
S: білмеймін
S: билемеймін
S: бөлмеймін
S: билемейміз
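The `C:`, `W:` and `S:` prefixes in the output above mark correct words, misspelt words and suggestions. A minimal Python sketch for grouping such output, assuming exactly the line layout shown above (the function name `parse_voikkospell` is hypothetical):

```python
def parse_voikkospell(output):
    """Group 'C:', 'W:' and 'S:' lines into one record per checked word."""
    results = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("C: "):
            results.append({"word": line[3:], "correct": True, "suggestions": []})
        elif line.startswith("W: "):
            results.append({"word": line[3:], "correct": False, "suggestions": []})
        elif line.startswith("S: ") and results:
            # A suggestion belongs to the most recently seen word
            results[-1]["suggestions"].append(line[3:])
    return results


sample_output = """C: қазақша
W: билмеймін
S: билеймін
S: білмеймін
"""
for record in parse_voikkospell(sample_output):
    print(record["word"], record["correct"], record["suggestions"])
```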

Current State

Number of stems: 36,595
Disambiguation rules: 150
Coverage: ~94.5%

corpus	words	coverage
Әуезов	155K	~92.89%
bible	577K	~95.29%
azattyq2010	3.2M	~95.07%
wp2013	18.2M	~90.10%
quran	107K	~96.71%
udhr	1.5K	~96.86%

Developers

Guidelines for adding stems

An overview of the process

If you see that a wordform is not supported by apertium-kaz and you want to add it, you have to figure out three things:

what the stem of the word is (to be exact, what the left-hand side and the right-hand side of the entry should be),
whether or not that stem is already in apertium-kaz, and
(if it isn't there, or if the existing entry isn't what you need) which continuation lexicon (read: paradigm) you should assign the stem to.

Here is an example of a word already in the apertium-kaz.kaz.lexc file:

кітап:кітап N1 ; ! "book"

As in this example, in most cases the left-hand side and the right-hand side of the entry are the same. The left-hand side is the underlying form, the right-hand side is the surface form. The continuation lexicon in this example is N1. What comes after the exclamation mark '!' is a comment. Glosses are a good thing to have, but technically they are only comments, and thus optional.

Here is an example where the left and right hand sides are not the same:

күн% тәртібі:күн% тәртіп N-COMPOUND-PX ; ! ""

It is implemented this way so that forms like "күн тәртіптері" can also be analysed as forms of the word "күн тәртібі".

The example above also shows that spaces in a word have to be escaped with %. The same goes for the hyphen:

мән%-жай:мән%-жай N1 ; ! ""
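The escaping rule can be sketched in Python as follows. This is only an illustration: the helper names `lexc_escape` and `lexc_entry` are hypothetical, and the sketch handles only the two characters mentioned above (space and hyphen), not other special characters or archiphonemes.

```python
def lexc_escape(text):
    """Escape spaces and hyphens with % (the two cases described above)."""
    return text.replace(" ", "% ").replace("-", "%-")


def lexc_entry(left, right, lexicon, gloss=""):
    """Format one lexc line: left:right LEXICON ; ! "gloss"."""
    return f'{lexc_escape(left)}:{lexc_escape(right)} {lexicon} ; ! "{gloss}"'


print(lexc_entry("кітап", "кітап", "N1", "book"))
print(lexc_entry("күн тәртібі", "күн тәртіп", "N-COMPOUND-PX"))
print(lexc_entry("мән-жай", "мән-жай", "N1"))
```

The three calls reproduce the three example entries from this section.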

General

Before adding a stem, be sure it does not already exist in lexc. A good way to do that is to look up stem(s) you want to add with lt-proc kaz.automorf.bin. In some cases, you'll see that the stem isn't analysed at all:

^foo/*foo$

In other cases, it will be analysed, but as something other than what you want to add it as:

^Жол/жол<adj>$ ^жөндеуші/жөнде<v><tv><gpr_pot>$^./.<sent>$

(assuming that you want to add "Жол жөндеуші" as a company name, which it happens to be).

Another, probably more relevant example:

apertium-kaz$ echo "қабылдау" | apertium -d . kaz-tagger 
^қабылдау/қабылда<v><tv><ger><nom>$^./.<sent>$

(supposing that some other forms of the word, say with case affixes such as "қабылдауды", weren't analysed (see the next paragraph), and so you looked up қабылдау in kaz.autogen.bin). Looking up the *stem* (note: the stem, not the surface form) with the lt-proc kaz.autogen.bin command before adding it to the lexc file gives you a chance to save some work and to avoid adding the same thing twice.

In the third case, you will see that the stem is already there, is linked to the right lexicon, but some surface forms of the word are not analysed. This means that either there is a problem with the phonology part, or you've discovered some affix currently not supported by apertium-kaz. Both issues have to be documented/reported (the simplest way would be just to add an 'ISSUES' file to apertium-kaz and commit it).
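The first two outcomes described above (not analysed at all, or analysed only as something else) can be told apart mechanically. Below is a minimal Python sketch, assuming one lexical unit of analyser output per call; `classify_lookup` and its return labels are hypothetical names, and `wanted_tag` is whatever tag you expect for the stem you plan to add.

```python
def classify_lookup(unit, wanted_tag):
    """Classify one lexical unit such as '^foo/*foo$'.

    Returns 'unknown' if the analyser only produced a '*' marker,
    'has_wanted' if some analysis contains wanted_tag (e.g. '<np>'),
    and 'other_only' if the word is analysed but never with wanted_tag.
    """
    body = unit.strip()
    if body.startswith("^") and body.endswith("$"):
        body = body[1:-1]
    analyses = body.split("/")[1:]  # drop the surface form
    if not analyses or all(a.startswith("*") for a in analyses):
        return "unknown"
    if any(wanted_tag in a for a in analyses):
        return "has_wanted"
    return "other_only"


print(classify_lookup("^foo/*foo$", "<n>"))
print(classify_lookup("^Жол/жол<adj>$", "<np>"))
print(classify_lookup("^қабылдау/қабылда<v><tv><ger><nom>$", "<v>"))
```

The third outcome (stem present, but some surface forms not analysed) still needs a human to decide whether the cause is phonology or a missing affix.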

Provide a commit message saying what you did. At a bare minimum, "adding more stems" is okay, but "a" or "ф" is not. Try to be more informative though; e.g. "added stems from story, mostly NP-TOP and NP-ANT" or similar.
Many stems exhibit a voicing alternation like п/б, к/г, қ/ғ. This is processed automatically by twol, but these stems must be added with the voiceless consonant (п, к, қ), e.g. тақ:тақ V-TV ;
- Stems from Russian that end with one of the voiced consonants (б, г), such as геолог should be entered as spelled, but should be put in the right category for foreign words (e.g., if a noun, then N5).
Words that have an inserted ‹ы› or ‹і› in some forms should get %{y%} in that spot on the right side, e.g. орын:ор%{y%}н N1 ;.
- Words that are commonly written in both forms (e.g., орнында and орынында) need special treatment: add ! Dir/LR after the form that should not be generated (i.e., the form that is the non-normative version), and add ! Err/Orth after it too if it should be considered a spelling mistake.
Any changes to continuation classes should be discussed on the apertium-turkic mailing list.
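The voicing rule from the list above can be sketched as a small Python helper. It is an illustration only: `normalise_native_stem` is a hypothetical name, it looks only at the final letter, and it must not be applied to Russian loans such as геолог, which are entered as spelled.

```python
# Voiced consonant -> voiceless partner, as listed in the guideline above
VOICED_TO_VOICELESS = {"б": "п", "г": "к", "ғ": "қ"}


def normalise_native_stem(stem):
    """Replace a stem-final voiced consonant with its voiceless partner."""
    if stem and stem[-1] in VOICED_TO_VOICELESS:
        return stem[:-1] + VOICED_TO_VOICELESS[stem[-1]]
    return stem


# E.g. a stem cut from a form with a vowel-initial suffix, where the
# final consonant shows up voiced, is normalised before being added.
print(normalise_native_stem("кітаб"))
print(normalise_native_stem("сөз"))
```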

Most likely, a word not covered by apertium-kaz already will be an open class word. Below are some comments on the open-class word lexicons.

Verbs

Categorise correctly according to IV or TV status:
- IV = intransitive verbs; TV = transitive verbs
- If the verb can take a direct object with -НЫ, then it's TV; otherwise it is IV
- For phrasal verbs (e.g., "қабыл ал", "пайда бол", "мойынға ал"), do not categorise the verb according to its elements; instead treat it as a single verb (TV, IV and TV respectively for these examples).
There should be no infinitival final -у or -ю. It is best to take the part of the verb before -GAн or -DI in those forms.
- Infinitives ending in -ю should end in ‹й› instead, e.g. ‹сүю› should be entered as сүй
- Some verbs have a "hidden" ‹ы› or ‹і› under the ‹у›, for example ері, аршы, аңды, etc. These verb stems should be added with the ‹ы› or ‹і›.
- Of course, verbs with ‹у› in the stem should keep the ‹у›, like жу, қу, жау, etc.
Do not add passive or cooperative forms of verb stems (e.g., ‹тартыл› is passive of ‹тарт›, and ‹тартыс› is cooperative) unless absolutely needed for translation. In this case, put ! Use/MT ! Der/Pass or ! Use/MT ! Der/Coop after the entry, respectively.
If you add a causative form of a verb (e.g., ‹отырғыз› is causative of ‹отыр›), put ! Der/Caus after it.

Nouns

Some nouns end in ‹ә›, and have interesting or inconsistent-looking phonology, like күнә, кінә. These should be added with the right side missing its ‹ә› and in the class N1-Ә. E.g., күнә:күн N1-Ә ;
Nouns from Russian should be classified as N5
- especially if the last vowel is ‹и› or ‹у›
- especially if they end with a consonant that would normally be voiced before a vowel-initial suffix in Kazakh words (п, к)
Nouns that are compounds ending in a possessive form (like ‹ауа райы›) should be categorised into the N-COMPOUND-PX category and entered without the possessive ending on the right side, e.g. ауа% райы:ауа% рай N-COMPOUND-PX ; ! "weather,climate"
If you're adding a noun that can also be used as an adjective, think whether it's actually an adjective or actually a noun and add it to the right category. You'll want to subcategorise it correctly so that e.g. if it's a noun it can also take the <attr> tag.

Adjectives

The basic categorisation of an adjective depends on whether it takes comparative morphology (-ЫрАҚ), can be substantivised (acts like a noun), and/or can be adverbialised (acts like an adverb). Be sure to put the adjective in the right category according to what those categories allow.

If you're adding an adjective that can also be used as a noun, think whether it's actually an adjective or actually a noun and add it to the right category. You'll want to subcategorise it correctly so that e.g. if it's an adjective it can also take the <subst> tag.

Adverbs

If you want to add an adverb, first think whether the word is really an adjective that can be used like an adverb. If this is the case, then add it as an adjective in the appropriate adjective class that can take the <advl> tag. In the bidix, you'll want to translate the <adj> and the <adj><advl> forms differently.

Full inventory of lexicons the stems can be linked to

It is useful to distinguish two classes of lexicons:

lexicons which are only used as continuations for the other lexicons, and
lexicons which are continuations for stems.

Here is an attempt to document the lexicons of the second kind found in the apertium-kaz.kaz.lexc file (so that: 1. people can add stems to a lexc file without having to read the lexc file itself 2. we can re-evaluate our decisions):

Nouns:

N1
N-COMPOUND-PX
N5
N1-ABBR
N-INFL-INKI

Proper nouns:

NP-ANT-F: feminine anthroponyms
NP-ANT-M: masculine anthroponyms
NP-COG-OB: family names ending with -ов or -ев
NP-COG-IN: family names ending with -ин
NP-COG-M: family names not ending with -ов, -ев or -ин; masculine. Example: Галицкий
NP-COG-F: family names not ending with -ов, -ев or -ин; feminine. Example: Толстая
NP-COG-MF: family names not ending with -ов, -ев or -ин which are both masculine and feminine.
NP-PAT-VICH: patronyms ending with -вич (and thus which can also take the -вна ending): Васильевич:Василье NP-PAT-VICH ; ! ""
- (could be derived from anthroponyms automatically?)
NP-TOP: toponyms (in particular, river names should go here too)
NP-TOP-ASSR: names of former and future Soviet Socialist Republics ending with СР: Қырғыз% КСР:Қырғыз% КСР%{э%}%{й%} NP-TOP-ASSR ;
NP-ORG: organization names
NP-ORG-LAT: organization names written in Latin characters. Example: Microsoft
NP-AL: proper names not belonging to one of the above NP-* classes. Example: Восток

Verbs:

V-TV
V-IV
Vinfl-AUX

Adjectives:

A1: adjectives which can be adverbialised and have a comparative form. Example: жақсы.
- Test 1: can the word in question modify a verb? "Жақсы оқиды" OK? A: yes.
- Test 2: has a comparative form? "Жақсырақ" OK? A: yes
- ==> жақсы A1

A2: adjectives which cannot be adverbialized, but which do have the comparative form. Example: лайық:лайық A2 ; ! ""

A3: adjectives which can neither be adverbialized nor have comparative form

A4: initially intended for adjectives like социал or (tat.) биологик = (kaz.) биологиялық, which the author of this classification thought were never substantivized. The author has since seen them substantivized and thus considers this class deprecated.

The whole purpose of introducing subclasses of adjectives was to avoid overgenerating forms which do not exist.

If you're unsure which adjective lexicon to select, pick A1.
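The two tests above can be written down as a small decision function. This Python sketch is only a restatement of the A1/A2/A3 descriptions; the function name `adjective_lexicon` is hypothetical, and the combination that the descriptions do not cover falls back to the documented default, A1.

```python
def adjective_lexicon(can_adverbialise, has_comparative):
    """Pick A1, A2 or A3 from the answers to Test 1 and Test 2."""
    if can_adverbialise and has_comparative:
        return "A1"
    if not can_adverbialise and has_comparative:
        return "A2"
    if not can_adverbialise and not has_comparative:
        return "A3"
    # Adverbialisable but without a comparative: not covered by the
    # descriptions above, so use the "if unsure, pick A1" default.
    return "A1"


# жақсы: "Жақсы оқиды" is fine and "Жақсырақ" exists
print(adjective_lexicon(can_adverbialise=True, has_comparative=True))
```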

A6:

Adverbs:

ADV
ADV-ITG
ADV-WITH-KI
ADV-WITH-KI-I
ADV-LANG

Additional tags

In a .lexc file, after the '!' you will also see Dir/LR, Dir/RL, Err/Orth and Use/MT comments. Their meanings are as follows:

Dir/LR means: analyse this surface form, but don't generate it. Here is a good example:

сұхбат:сұқбат N1 ; ! "conversation/interview" Dir/LR
сұхбат:сұхбат N1 ; ! "conversation/interview"

In other words, Dir/LR marks alternative spellings of a word. If the alternative spelling isn't just alternative, but actually erroneous (but occurs quite commonly so that you want to support it), it is marked with the Err/Orth tag:

орын:ор%{y%}н N1 ; ! "place,seat"
орын:орын N1 ; ! "place,seat"  ! Dir/LR ! Err/Orth

"Орыны", for example, is considered an erroneous spelling of "орын<n><px3sp><nom>". Such markings will allow us to produce better spell checkers.

In the examples above, if you don't mark either of the stems with Dir/LR, then the Kazakh generator (if we personify it a bit), given a string like "^сұхбат<n><nom>$" as input, won't know which surface form to choose and will output both, separated with a slash: сұхбат/сұқбат.

As the name suggests, Dir/RL has the meaning opposite to Dir/LR: 'generate this surface form, but do not analyse it'. You won't see it much in a lexc file and almost certainly won't need it. Here is an example:

да:%~да CC ; ! "also" Dir/RL

The conjunction ^да<cnj>$ gets generated as "~да". This is necessary for a somewhat hacky way of handling vowel harmony (read: making sure that the "да" gets rendered as "де" when the preceding word has front vowels) in cases where the standard way of handling vowel harmony (read: twol) fails because the preceding word is unknown.

Use/MT (at least in its original usage) marks (compound) words which are needed for translation, but probably shouldn't be in a "vanilla" Kazakh transducer:

қайда% болса% сонда:қайда% болса% сонда PRON-IND ; ! "anywhere" Use/MT

It has also been used to mark words which the person who added them wasn't sure how to classify. Such words will be reviewed later.
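The effect of the direction tags can be summarised with a small Python sketch. It only restates the convention described above by looking for the tags in the comment part of a line; it is not how the apertium-kaz build actually processes them, and `entry_directions` is a hypothetical name.

```python
def entry_directions(lexc_line):
    """Return (analyse, generate) flags for one lexc entry line.

    Dir/LR means analyse only; Dir/RL means generate only;
    an entry with neither tag is used in both directions.
    """
    _, _, comment = lexc_line.partition("!")
    analyse = "Dir/RL" not in comment
    generate = "Dir/LR" not in comment
    return analyse, generate


lines = [
    'сұхбат:сұқбат N1 ; ! "conversation/interview" Dir/LR',
    'сұхбат:сұхбат N1 ; ! "conversation/interview"',
    'да:%~да CC ; ! "also" Dir/RL',
]
for line in lines:
    print(entry_directions(line), line)
```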

@@ Line 70: / Line 70: @@
 If you see that a wordform is not supported by apertium-kaz and you want to add it, you have to figure out three things:
-) what the stem of the word is (and also the right-hand side in the entry)
+# what the stem of the word is (to be exact, what the left-hand side and the right-hand side of the the entry should be),
+# whether or not that stem is already in apertium-kaz, and
-) which continuation lexicon (read: paradigm) you should assign to it
+# (if it isn't or it isn't something that is needed) which continuation lexicon (read: paradigm) you should assign the stem to.
-) whether or not that stem is already in apertium-kaz
-Here is an example of a word already in apertium-kaz.kaz.lexc file:
+Here is an example of a word already in the apertium-kaz.kaz.lexc file:
+<pre>
 кітап:кітап N1 ; ! "book"
+</pre>
-As in this example, in most cases, the left hand-side and the right-hand side of the entry will be the same. The left-hand side might be referred to as 'stem'. Continuation lexicon is N1. What comes after '!' are comments. Glosses are a good thing to have, but technically it is only comment, and thus optional.
+As in this example, in most cases, the left hand-side and the right-hand side of the entry are the same. The left-hand side is the underlying form, the right-hand is the surface form. Continuation lexicon in this example is N1. What comes after the exclamation mark '!' are comments. Glosses are a good thing to have, but technically they are only a comment, and thus optional.
-An example where the left and right hand sides are not the same:
+Here is an example where the left and right hand sides are not the same:
+<pre>
 күн% тәртібі:күн% тәртіп N-COMPOUND-PX ; ! ""
+</pre>
-This has been done so so that forms like "күн тәртіптері" can also be analyzed.
+This has been implemented in that way so that forms like "күн тәртіптері" can also be analysed as forms of the word "күн тәртібі".
 The example above also shows that spaces in a word have to be escaped with %. So is the hyphen sign:
+<pre>
 мән%-жай:мән%-жай N1 ; ! ""
+</pre>
 ==== General ====
@@ Line 100: / Line 106: @@
 ^Жол/жол<adj>$ ^жөндеуші/жөнде<v><tv><gpr_pot>$^./.<sent>$
-(Assuming that you want to add "Жол жөндеуші as a company name, which it happens to be).
+(assuming that you want to add "Жол жөндеуші as a company name, which it happens to be).
 Another, probably more relevant example:
@@ Line 109: / Line 115: @@
 </pre>
+(supposing that some other forms of the word, say with case affixes, like e.g. "қабылдауды" weren't analysed (see the next paragraph) and thus you looked up қабылдау in <code>kaz.autogen.bin</code>). Looking the *stem* up (note: not the surface form, the stem) with the <code>lt-proc kaz.autogen.bin</code> command before adding it to the lexc file gives you a chance to save some work and to avoid addiing the same thing twice.
-I'm pretty sure that you shouldn't add қабылдау as a noun. Or, rather, that you wouldn't after you see the above analysis.
-In the third case, you will see that the stem gets the analysis you expect, but you'd seen that some of it wordforms were not analysed. This means that either there is a problem with the phonology part, or you've discovered some affix currently not supported by apertium-kaz. Both issues have to be documented/reported (the simplest way would be just to add an 'ISSUES' file to apertium-kaz and commit it).
+In the third case, you will see that the stem is already there, is linked to the right lexicon, but some surface forms of the word are not analysed. This means that either there is a problem with the phonology part, or you've discovered some affix currently not supported by apertium-kaz. Both issues have to be documented/reported (the simplest way would be just to add an 'ISSUES' file to apertium-kaz and commit it).
 * Provide a commit message saying what you did.  At a bare minimum, "adding more stems" is okay, but "a" or "ф" is not.  Try to be more informative though; e.g. "added stems from story, mostly NP-TOP and NP-ANT" or similar.
@@ Line 150: / Line 156: @@
 * If you want to add an adverb, first think whether the word is really an adjective that can be used like an adverb.  If this is the case, then add it as an adjective in the appropriate adjective class that can take the {{tag|advl}} tag.  In the bidix, you'll want to translate the {{tag|adj}} and the {{tag|adj}}{{tag|advl}} forms differently.
+=== Full inventory of lexicons the stems can be linked to ===
+It is useful to distinguish two classes of lexicons:
+# lexicons which are only used as continuations for the other lexicons, and
+# lexicons which are continuations for stems.
+Here is an attempt to document the lexicons of the second kind found in the <code>apertium-kaz.kaz.lexc</code> file (so that: 1. people can add stems to a lexc file without having to read the lexc file itself 2. we can re-evaluate our decisions):
+Nouns:
+** N1
+** N-COMPOUND-PX
+** N5
+** N1-ABBR
+** N-INFL-INKI
+Proper nouns:
+* NP-ANT-F: feminine anthroponyms
+* NP-ANT-M: masculine anthroponyms
+* NP-COG-OB: family names ending with -ов or -ев
+* NP-COG-IN: family names ending with -ин
+* NP-COG-M: family name not ending with -ов, -ев or -in; masculine. Example: Галицкий
+* NP-COG-F: family name not ending with -ов, -ев or -in; feminine. Example: Толстая
+* NP-COG-MF: family names not ending with -ов, -ев or -in which are both masculine and feminine:
+* NP-PAT-VICH: patronyms ending with -вич (and thus which can also take the -вна ending): <code>Васильевич:Василье NP-PAT-VICH ; ! ""</code>
+** (could be derived from anthroponyms automatically?)
+** NP-TOP: toponyms (in particular, river names should go here too)
+** NP-TOP-ASSR: former and future soviet socialistic republic names ending with СР: <code>Қырғыз% КСР:Қырғыз% КСР%{э%}%{й%} NP-TOP-ASSR ;</code>
+** NP-ORG: organization names
+** NP-ORG-LAT: organization names written in Latin character. Example: Microsoft
+** NP-AL: proper names not belonging to one of the above NP-* classes. Example: Восток
+Verbs:
+* V-TV
+* V-IV
+* Vinfl-AUX
+Adjectives:
+* A1: adjectives which can be adverbialised and have a comparative form. Example: жақсы.
+** Test 1: can the word in question modify verb? "Жақсы оқиды" OK? A: yes.
+** Test 2: has a comparative form? "Жақсырақ" OK? A: yes
+** ==> жақсы A1
+* A2: adjectives which cannot be adverbialized, but which do have the comparative form. Example: <code>лайық:лайық A2 ; ! ""</code>
+* A3: adjectives which can neither be adverbialized nor have comparative form
+* A4: initially: adjectives like социал or (tat.) ''биологик'' = (kaz.) ''биологиялық'' which the author of this classification of adjectives thought to never substantivize, but have seen them substativized since then and thus considers deprecated.
+The whole purpose of introducing subclasses of adjectives was to avoid overgenerating forms which do not exist.
+If you're unsure which adjective lexicon to select, pick A1.
+* A6:
+Adverbs:
+* ADV
+* ADV-ITG
+* ADV-WITH-KI
+* ADV-WITH-KI-I
+* ADV-LANG
+=== Additional tags ===
+In a .lexc file, after the '!' you will also see <code>Dir/LR</code>, <code>Dir/RL</code>, <code>Err/Orth</code> and <code>Use/MT</code> comments. The meaning of them is as follows:
+'''<code>Dir/LR</code>''' means: analyse this surface form, but don't generate it. Here is a good example:
+<pre>
+сұхбат:сұқбат N1 ; ! "conversation/interview" Dir/LR
+сұхбат:сұхбат N1 ; ! "conversation/interview"
+</pre>
+In other words, <code>Dir/LR</code> marks alternative spellings of a word. If the alternative spelling isn't just alternative, but actually erroneous (but occurs quite commonly so that you want to support it), it is marked with the '''<code>Err/Orth</code>''' tag:
+<pre>
+орын:ор%{y%}н N1 ; ! "place,seat"
+орын:орын N1 ; ! "place,seat"  ! Dir/LR ! Err/Orth
+</pre>
+"Орыны" for example, is considered erroneous spelling of "орын<n><px3sp><nom>". Such markings will allow us to produce better spell checkers.
+In the examples above, if you don't mark either of the stems with <code>Dir/LR</code>, then the Kazakh generator, (if we personify it a bit) given a string like "^сұхбат<n><nom>$ for input, won't know which surface form to choose and will output both, separated with a slash: сұхбат/сұқбат.
+As the name suggests, '''<code>Dir/RL</code>''' has the meaning opposite to <code>Dir/LR</code>: 'generate this surface form, but do not analyse it'. You won't see it much in a lexc file and almost certainly won't need it. Here is an example:
+<pre>
+да:%~да CC ; ! "also" Dir/RL
+</pre>
+The conjunction ^да<cnj$ gets generated as "~да". This is necessary for a somewhat hacky way of handling the vowel harmony (read: making sure that the "да" gets rendered as "де" when the preceding word has front vowels) in cases where the standard way of handling the vowel harmony (read: [[twol]]) fails because the preceding word is unknown.
+'''Use/MT''' marks words (at least, in its original usage) marks (compound) words which are needed for translation, but probably shouldn't be in a "vanilla" Kazakh transducer:
+<pre>
+қайда% болса% сонда:қайда% болса% сонда PRON-IND ; ! "anywhere" Use/MT
+</pre>
+It has been also used to mark words which the person who added them wasn't sure how to classify. Such words will be reviewed later.
 [[Category:Tools]]

Kazakh - қазақ тілі
language transducer
Coverage:	~94.5%
Stems:	36,595
Vanilla stems:	27,433
Paradigms:
Location:	apertium-kaz (languages)
Families:	Turkic languages
Areas:	Languages of Central Asia, Languages of the former Soviet Union
Lang info	Kazakh
