Difference between revisions of "Apertium-kaz"
Revision as of 19:54, 30 October 2017
Kazakh - қазақ тілі | |
---|---|
language transducer | |
Coverage: | ~94.5% |
Stems: | 36,595 |
Vanilla stems: | 27,433 |
Paradigms: | |
Location: | apertium-kaz (languages) |
Families: | Turkic languages |
Areas: | Languages of Central Asia, Languages of the former Soviet Union |
Lang info | Kazakh |
Apertium-kaz is a morphological analyser/generator and CG tagger for Kazakh, currently under development. It is intended to be compatible with transducers for other Turkic languages so that they can be translated between. It's used in the following language pairs:
- Kazakh and Tatar
- English and Kazakh
- Kyrgyz and Kazakh
- Kazakh and Karakalpak
- Khalkha and Kazakh
- Kazakh and Russian
Installation
Apertium-kaz is currently located in languages/apertium-kaz.
To use
- Install Apertium core tools.
- You can install apertium-kaz from the same repo as the core tools, but what's recommended is to get it from the repo above and compile by running
./autogen.sh; make
- Now test it by running some Kazakh words through the analyser:
echo "бірнеше қазақша сөздер" | apertium -d . kaz-morph
- You should get some analyses, like this:
^бірнеше/бірнеше<det><qnt>$ ^қазақша/қазақша<adv>/қазақша<n><nom>/қазақша<n><attr>/қазақша<n><nom>+е<cop><aor><p3><pl>/қазақша<n><nom>+е<cop><aor><p3><sg>$ ^сөздер/сөз<n><pl><nom>/сөз<n><pl><nom>+е<cop><aor><p3><pl>/сөз<n><pl><nom>+е<cop><aor><p3><sg>$^./.<sent>$
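If you want to inspect such output in a script, the following is a minimal Python sketch of splitting it into surface forms and their readings. The function name parse_stream is hypothetical (it is not part of Apertium), and the sketch assumes the simple ^surface/reading/reading$ layout shown above; escaped characters allowed by the full stream format are not handled.

```python
import re

# One lexical unit looks like ^surface/reading1/reading2$.
# Simplification (assumption): no escaped "/" or "$" inside a unit.
UNIT = re.compile(r"\^([^$]*)\$")


def parse_stream(text):
    """Return a list of (surface, [readings]) pairs found in text."""
    units = []
    for match in UNIT.finditer(text):
        # The first field is the surface form, the rest are readings.
        surface, *readings = match.group(1).split("/")
        units.append((surface, readings))
    return units


if __name__ == "__main__":
    sample = "^сөздер/сөз<n><pl><nom>/сөз<n><pl><nom>+е<cop><aor><p3><sg>$^./.<sent>$"
    for surface, readings in parse_stream(sample):
        print(surface, readings)
```

Each reading keeps its tags as plain text, so ambiguity shows up simply as more than one reading per surface form.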
Dependency tree
This is for reference and isn't generally needed for most users:
- hfst (svn ≥r1916)
- apertium
- lttoolbox
- VISL-CG3
For spell checking
If you're compiling the apertium-kaz spell checker, you'll additionally need these dependencies:
- hfst-ospell (./configure --enable-zhfst)
- see Installation, it is installable from Tino's repositories
- corevoikko/libvoikko/src/tools/voikkospell (./configure --enable-hfst)
You'll want to configure apertium-kaz with --enable-ospell and then after making it, copy kaz.zhfst to ~/.voikko/3/kk.zhfst
Then you can do this:
$ echo "қазақша билмеймін" | tr ' ' '\n' | voikkospell -d kk -s
C: қазақша
W: билмеймін
S: билеймін
S: білмеймін
S: билемеймін
S: бөлмеймін
S: билемейміз
Current State
- Number of stems: 36,595
- Disambiguation rules: 150
- Coverage: ~94.5%
corpus | words | coverage |
---|---|---|
Әуезов | 155K | ~92.89% |
bible | 577K | ~95.29% |
azattyq2010 | 3.2M | ~95.07% |
wp2013 | 18.2M | ~90.10% |
quran | 107K | ~96.71% |
udhr | 1.5K | ~96.86% |
Developers
Guidelines for adding stems
An overview of the process
If you see that a wordform is not supported by apertium-kaz and you want to add it, you have to figure out three things:
- what the stem of the word is (to be exact, what the left-hand side and the right-hand side of the entry should be),
- whether or not that stem is already in apertium-kaz, and
- (if it isn't or it isn't something that is needed) which continuation lexicon (read: paradigm) you should assign the stem to.
Here is an example of a word already in the apertium-kaz.kaz.lexc file:
кітап:кітап N1 ; ! "book"
As in this example, in most cases the left-hand side and the right-hand side of the entry are the same. The left-hand side is the underlying form, the right-hand side is the surface form. The continuation lexicon in this example is N1. Whatever comes after the exclamation mark '!' is a comment. Glosses are a good thing to have, but technically they are only comments, and thus optional.
Here is an example where the left and right hand sides are not the same:
күн% тәртібі:күн% тәртіп N-COMPOUND-PX ; ! ""
It has been implemented this way so that forms like "күн тәртіптері" can also be analysed as forms of the word "күн тәртібі".
The example above also shows that spaces in a word have to be escaped with %. The same goes for the hyphen:
мән%-жай:мән%-жай N1 ; ! ""
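The escaping rule can be illustrated with a small Python sketch that assembles an entry line from its parts. The helper names escape_lexc and lexc_entry are hypothetical, and the sketch only escapes the two characters mentioned above (space and hyphen); lexc has other special characters that it does not attempt to cover.

```python
def escape_lexc(text):
    """Prefix spaces and hyphens with %, as in the entries above.

    Assumption: only these two characters need escaping here; other
    lexc special characters are left untouched.
    """
    return "".join("%" + ch if ch in " -" else ch for ch in text)


def lexc_entry(left, right, lexicon, gloss=""):
    """Build one line: left:right LEXICON ; ! "gloss"."""
    return f'{escape_lexc(left)}:{escape_lexc(right)} {lexicon} ; ! "{gloss}"'


if __name__ == "__main__":
    print(lexc_entry("кітап", "кітап", "N1", "book"))
    print(lexc_entry("күн тәртібі", "күн тәртіп", "N-COMPOUND-PX"))
    print(lexc_entry("мән-жай", "мән-жай", "N1"))
```

The three calls reproduce the three example entries shown in this section.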
General
- Before adding a stem, be sure it does not already exist in lexc. A good way to do that is to look up the stem(s) you want to add with lt-proc kaz.automorf.bin. In some cases, you'll see that the stem isn't analysed at all:
^foo/*foo$
In some cases, it will be analysed, but as something other than what you want to add it as:
^Жол/жол<adj>$ ^жөндеуші/жөнде<v><tv><gpr_pot>$^./.<sent>$
(assuming that you want to add "Жол жөндеуші" as a company name, which it happens to be).
Another, probably more relevant example:
apertium-kaz$ echo "қабылдау" | apertium -d . kaz-tagger
^қабылдау/қабылда<v><tv><ger><nom>$^./.<sent>$
(supposing that some other forms of the word, say with case affixes such as "қабылдауды", weren't analysed (see the next paragraph) and thus you looked up қабылдау in kaz.autogen.bin). Looking the *stem* up (note: not the surface form, the stem) with the lt-proc kaz.autogen.bin command before adding it to the lexc file gives you a chance to save some work and to avoid adding the same thing twice.
In the third case, you will see that the stem is already there, is linked to the right lexicon, but some surface forms of the word are not analysed. This means that either there is a problem with the phonology part, or you've discovered some affix currently not supported by apertium-kaz. Both issues have to be documented/reported (the simplest way would be just to add an 'ISSUES' file to apertium-kaz and commit it).
- Provide a commit message saying what you did. At a bare minimum, "adding more stems" is okay, but "a" or "ф" is not. Try to be more informative though; e.g. "added stems from story, mostly NP-TOP and NP-ANT" or similar.
- Many stems exhibit a voicing alternation like п/б, к/г, қ/ғ. This is processed automatically by twol, but these stems must be added with the voiceless consonant (п, к, қ), e.g. тақ:тақ V-TV ;
  - Stems from Russian that end with one of the voiced consonants (б, г), such as геолог, should be entered as spelled, but should be put in the right category for foreign words (e.g., if a noun, then N5).
- Words that have an inserted ‹ы› or ‹і› in some forms should get %{y%} in that spot on the right side, e.g. орын:ор%{y%}н N1 ;
  - Words that are commonly written in both forms (e.g., орнында and орынында) need special treatment: add ! Dir/LR after the form that should not be generated (i.e., the form that is the non-normative version), and add ! Err/Orth after it too if it should be considered a spelling mistake.
- Any changes to continuation classes should be discussed on the apertium-turkic mailing list.
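The "look it up before adding it" check from the first guideline above can be sketched in Python as follows. This is only an illustration: classify_lookup is a hypothetical helper, the wanted tag (for example <np> for a proper noun) is supplied by the caller, and the input is assumed to be a single analysed unit copied from the analyser output. It cannot detect the third case (stem present, but some surface forms unanalysed), which needs more than one lookup.

```python
def classify_lookup(unit, wanted_tag):
    """Classify one analysed unit such as '^foo/*foo$'.

    Returns:
      'unknown'        - the analyser has no reading at all (marked with *)
      'present'        - some reading already carries wanted_tag
      'other-analysis' - analysed, but not as what you want to add
    """
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("expected a unit of the form ^surface/reading$")
    # Drop the ^ and $ delimiters, then split surface from readings.
    surface, *readings = body[1:-1].split("/")
    if not readings or all(r.startswith("*") for r in readings):
        return "unknown"
    if any(wanted_tag in r for r in readings):
        return "present"
    return "other-analysis"


if __name__ == "__main__":
    print(classify_lookup("^foo/*foo$", "<n>"))
    print(classify_lookup("^Жол/жол<adj>$", "<np>"))
    print(classify_lookup("^қабылдау/қабылда<v><tv><ger><nom>$", "<v>"))
```

Only the 'unknown' and 'other-analysis' outcomes suggest that a new entry may be needed; 'present' means the stem is most likely already covered.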
Most likely, a word not covered by apertium-kaz already will be an open class word. Below are some comments on the open-class word lexicons.
Verbs
- Categorise correctly according to IV or TV status:
- IV = intransitive verbs; TV = transitive verbs
- If the verb can take a direct object with -НЫ, then it's TV; otherwise it is IV
- For phrasal verbs (e.g., "қабыл ал", "пайда бол", "мойынға ал"), do not categorise the verb according to its elements; instead treat it as a single verb (TV, IV and TV, respectively, for these examples).
- There should be no infinitival final -у or -ю. It is best to take the part of the verb before -GAн or -DI in those forms.
  - Infinitives ending in -ю should end in ‹й› instead, e.g. ‹сүю› should be entered as сүй
  - Some verbs have a "hidden" ‹ы› or ‹і› under the ‹у›, for example ері, аршы, аңды, etc. These verb stems should be added with the ‹ы› or ‹і›.
  - Of course, verbs with ‹у› in the stem should keep the ‹у›, like жу, қу, жау, etc.
- Do not add passive or cooperative forms of verb stems (e.g., ‹тартыл› is passive of ‹тарт›, and ‹тартыс› is cooperative) unless absolutely needed for translation. In this case, put ! Use/MT ! Der/Pass or ! Use/MT ! Der/Coop after the entry, respectively.
- If you add a causative form of a verb (e.g., ‹отырғыз› is causative of ‹отыр›), put ! Der/Caus after it.
Nouns
- Some nouns end in ‹ә›, and have interesting or inconsistent-looking phonology, like күнә, кінә. These should be added with the right side missing its ‹ә› and in the class N1-Ә. E.g., күнә:күн N1-Ә ;
- Nouns from Russian should be classified as N5
  - especially if the last vowel is ‹и› or ‹у›
  - especially if they end with a consonant that would normally be voiced before a vowel-initial suffix in Kazakh words (п, к)
- Nouns that are compounds ending in a possessive form (like ‹ауа райы›) should be categorised into the N-COMPOUND-PX category and entered without the possessive ending on the right side, e.g. ауа% райы:ауа% рай N-COMPOUND-PX ; ! "weather,climate"
- If you're adding a noun that can also be used as an adjective, think whether it's actually an adjective or actually a noun and add it to the right category. You'll want to subcategorise it correctly so that e.g. if it's a noun it can also take the <attr> tag.
Adjectives
- The basic categorisation of an adjective depends on whether it takes comparative morphology (-ЫрАҚ), can be substantivised (acts like a noun), and/or can be adverbialised (acts like an adverb). Be sure to put the adjective in the right category according to what those categories allow.
- If you're adding an adjective that can also be used as a noun, think whether it's actually an adjective or actually a noun and add it to the right category. You'll want to subcategorise it correctly so that e.g. if it's an adjective it can also take the <subst> tag.
Adverbs
- If you want to add an adverb, first think whether the word is really an adjective that can be used like an adverb. If this is the case, then add it as an adjective in the appropriate adjective class that can take the <advl> tag. In the bidix, you'll want to translate the <adj> and the <adj><advl> forms differently.
Full inventory of lexicons the stems can be linked to
It is useful to distinguish two classes of lexicons:
- lexicons which are only used as continuations for the other lexicons, and
- lexicons which are continuations for stems.
Here is an attempt to document the lexicons of the second kind found in the apertium-kaz.kaz.lexc file (so that: 1. people can add stems to a lexc file without having to read the lexc file itself; 2. we can re-evaluate our decisions):
Nouns:
- N1
- N-COMPOUND-PX
- N5
- N1-ABBR
- N-INFL-INKI
Proper nouns:
- NP-ANT-F: feminine anthroponyms
- NP-ANT-M: masculine anthroponyms
- NP-COG-OB: family names ending with -ов or -ев
- NP-COG-IN: family names ending with -ин
- NP-COG-M: family names not ending with -ов, -ев or -ин; masculine. Example: Галицкий
- NP-COG-F: family names not ending with -ов, -ев or -ин; feminine. Example: Толстая
- NP-COG-MF: family names not ending with -ов, -ев or -ин which are both masculine and feminine:
- NP-PAT-VICH: patronyms ending with -вич (and which can thus also take the -вна ending): Васильевич:Василье NP-PAT-VICH ; ! ""
  - (could be derived from anthroponyms automatically?)
- NP-TOP: toponyms (in particular, river names should go here too)
- NP-TOP-ASSR: former and future Soviet Socialist Republic names ending with СР: Қырғыз% КСР:Қырғыз% КСР%{э%}%{й%} NP-TOP-ASSR ;
- NP-ORG: organization names
- NP-ORG-LAT: organization names written in Latin characters. Example: Microsoft
- NP-AL: proper names not belonging to one of the above NP-* classes. Example: Восток
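The rules for the family-name lexicons in the list above can be summarised in a short Python sketch. The function cognomen_lexicon is hypothetical and only looks at the written ending; it will misfile any surname whose ending merely looks like -ов, -ев or -ин, so treat its answer as a first guess.

```python
def cognomen_lexicon(surname, gender):
    """Suggest an NP-COG-* lexicon for a family name.

    gender is 'm', 'f' or 'mf' and is only consulted when the name
    does not end in -ов, -ев or -ин.
    """
    if surname.endswith(("ов", "ев")):
        return "NP-COG-OB"
    if surname.endswith("ин"):
        return "NP-COG-IN"
    by_gender = {"m": "NP-COG-M", "f": "NP-COG-F", "mf": "NP-COG-MF"}
    return by_gender[gender]


if __name__ == "__main__":
    print(cognomen_lexicon("Галицкий", "m"))
    print(cognomen_lexicon("Толстая", "f"))
```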
Verbs:
- V-TV
- V-IV
- Vinfl-AUX
Adjectives:
- A1: adjectives which can be adverbialised and have a comparative form. Example: жақсы.
- Test 1: can the word in question modify verb? "Жақсы оқиды" OK? A: yes.
- Test 2: has a comparative form? "Жақсырақ" OK? A: yes
- ==> жақсы A1
- A2: adjectives which cannot be adverbialized, but which do have the comparative form. Example: лайық:лайық A2 ; ! ""
- A3: adjectives which can neither be adverbialized nor have a comparative form
- A4: deprecated. Initially intended for adjectives like социал or (tat.) биологик = (kaz.) биологиялық, which the author of this classification thought would never be substantivized; they have since been seen substantivized.
The whole purpose of introducing subclasses of adjectives was to avoid overgenerating forms which do not exist.
If you're unsure which adjective lexicon to select, pick A1.
- A6:
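The two tests used above to choose between A1, A2 and A3 can be written down as a small decision function. This is a sketch under stated assumptions: adjective_lexicon is a hypothetical helper, the answers to the tests have to come from a speaker, and the combination "can be adverbialised but has no comparative" is not described above, so the sketch falls back to A1 for it, following the "if unsure, pick A1" advice.

```python
def adjective_lexicon(can_adverbialise=None, has_comparative=None):
    """Suggest an adjective lexicon from the two tests.

    Pass None for a test whose answer you do not know.
    """
    if can_adverbialise is None or has_comparative is None:
        return "A1"  # unsure: pick A1
    if has_comparative:
        return "A1" if can_adverbialise else "A2"
    if not can_adverbialise:
        return "A3"
    # Adverbialisable but no comparative: not covered by the list
    # above, so fall back to the default (assumption).
    return "A1"


if __name__ == "__main__":
    # жақсы: "Жақсы оқиды" OK, "Жақсырақ" OK
    print(adjective_lexicon(can_adverbialise=True, has_comparative=True))
```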
Adverbs:
- ADV
- ADV-ITG
- ADV-WITH-KI
- ADV-WITH-KI-I
- ADV-LANG
Additional tags
In a .lexc file, after the '!' you will also see Dir/LR, Dir/RL, Err/Orth and Use/MT comments. Their meaning is as follows:
Dir/LR means: analyse this surface form, but don't generate it. Here is a good example:
сұхбат:сұқбат N1 ; ! "conversation/interview" Dir/LR
сұхбат:сұхбат N1 ; ! "conversation/interview"
In other words, Dir/LR marks alternative spellings of a word. If the alternative spelling isn't just alternative, but actually erroneous (but occurs quite commonly so that you want to support it), it is marked with the Err/Orth tag:
орын:ор%{y%}н N1 ; ! "place,seat"
орын:орын N1 ; ! "place,seat" ! Dir/LR ! Err/Orth
"Орыны", for example, is considered an erroneous spelling of "орын<n><px3sp><nom>". Such markings will allow us to produce better spell checkers.
In the examples above, if you don't mark either of the stems with Dir/LR, then the Kazakh generator (if we personify it a bit), given a string like "^сұхбат<n><nom>$" as input, won't know which surface form to choose and will output both, separated with a slash: сұхбат/сұқбат.
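The effect of Dir/LR on generation can be mimicked with a toy Python model. This is not how the real generator works (that is a compiled transducer); parse_entry and generate are hypothetical helpers that only read entry lines like the ones above and ignore morphology and phonology entirely.

```python
def parse_entry(line):
    """Split 'left:right LEXICON ; ! comment' into its parts."""
    entry, _, comment = line.partition("!")
    # Drop the trailing ';' and separate the lexicon name from the stem pair.
    pair, lexicon = entry.strip().rstrip(";").strip().rsplit(None, 1)
    left, right = pair.split(":", 1)
    return {"left": left, "right": right, "lexicon": lexicon,
            "dir_lr": "Dir/LR" in comment}


def generate(lines, lemma):
    """Join with '/' every right side the toy generator may output."""
    entries = [parse_entry(line) for line in lines]
    forms = [e["right"] for e in entries
             if e["left"] == lemma and not e["dir_lr"]]
    return "/".join(forms)


if __name__ == "__main__":
    unmarked = ['сұхбат:сұхбат N1 ; ! "conversation/interview"',
                'сұхбат:сұқбат N1 ; ! "conversation/interview"']
    marked = [unmarked[0], unmarked[1] + " Dir/LR"]
    print(generate(unmarked, "сұхбат"))  # both spellings
    print(generate(marked, "сұхбат"))    # only the unmarked spelling
```

With neither entry marked, the toy generator returns both spellings; once one entry carries Dir/LR, only the other spelling is left for generation, while analysis of both would remain possible.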
As the name suggests, Dir/RL has the meaning opposite to Dir/LR: 'generate this surface form, but do not analyse it'. You won't see it much in a lexc file and almost certainly won't need it. Here is an example:
да:%~да CC ; ! "also" Dir/RL
The conjunction ^да<cnj>$ gets generated as "~да". This is needed for a somewhat hacky way of handling vowel harmony (that is, making sure that "да" gets rendered as "де" when the preceding word has front vowels) in cases where the standard way of handling vowel harmony (twol) fails because the preceding word is unknown.
Use/MT, at least in its original usage, marks (compound) words which are needed for translation, but probably shouldn't be in a "vanilla" Kazakh transducer:
қайда% болса% сонда:қайда% болса% сонда PRON-IND ; ! "anywhere" Use/MT
It has also been used to mark words which the person who added them wasn't sure how to classify. Such words will be reviewed later.