Apertium-uzb

From Apertium
Revision as of 17:49, 8 March 2018 by Ilnar.salimzyan (talk | contribs) (s/sourceforge link/github link/)
Uzbek - oʻzbek tili, ўзбек тили
language transducer
Coverage: ~82.9%
Stems: 34,470
Vanilla stems: 34,465
Paradigms: 1
Location:
Families: Turkic languages
Areas: Languages of Central Asia, Languages of the former Soviet Union
Lang info Uzbek

Apertium-uzb is a morphological analyser/generator and CG tagger for Uzbek, currently under development. It is intended to be compatible with transducers for other Turkic languages so that text can be translated between them. It's used in the following language pairs:

A recent version of the transducer is running live at turkic.apertium.org, where its analysis and generation capabilities can be tested.

Current State

  • Number of stems: 34,470
  • Disambiguation rules: 48
  • Coverage: ~82.9%

corpus      words   coverage
wikipedia   1.2M    ~82.9%

Installation

Apertium-uzb is currently located in languages/apertium-uzb.

Developers

oʻ and gʻ

The Uzbek letters ‹oʻ› and ‹gʻ› are properly written using ‹ʻ›, Unicode character U+02BB, "modifier letter turned comma". Using any other apostrophe-like character in these letters is incorrect.
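As a rough illustration of that rule, the following sketch rewrites apostrophe look-alikes that directly follow o or g into U+02BB. The list of look-alikes and the function name are assumptions made for this example, not part of apertium-uzb, and the heuristic may need adjusting for words where a different apostrophe legitimately follows o or g.

```python
import re

# U+02BB MODIFIER LETTER TURNED COMMA, the character prescribed above.
TURNED_COMMA = "\u02bb"

# Characters often typed in its place (assumed, non-exhaustive list):
# ASCII apostrophe, grave accent, U+2018, U+2019, U+02BC.
LOOKALIKES = "'`\u2018\u2019\u02bc"

# Match a look-alike only when it directly follows o or g (either case),
# so apostrophes after other letters are left untouched.
_PATTERN = re.compile(r"(?<=[oOgG])[" + re.escape(LOOKALIKES) + r"]")


def normalise_turned_comma(text: str) -> str:
    """Return text with look-alikes after o/g replaced by U+02BB."""
    return _PATTERN.sub(TURNED_COMMA, text)


print(normalise_turned_comma("o'zbek tili, tog`"))
```

Restricting the replacement to the position after o and g is a deliberate choice here: an apostrophe after any other letter is not part of ‹oʻ› or ‹gʻ› and is left as typed.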

Dealing with Cyrillic and Latin

Plan A

There will be two separate sets of lexc and twol files (.lat and .cyr), each with its own continuation lexica and rules, though a single twol file may be enough given how simple the rules are. There will also be a master .dix in Latin script, with the Cyrillic forms given in comments in a standardised format (the other way around is also possible).

There will also be a simple script that checks the master .dix for entries lacking a Cyrillic comment in the standard format, generates the missing comments automatically, updates the Cyrillic dix, and marks each converted word with "TOCHECK" (or something similar) in a comment. Someone then goes through everything marked "TOCHECK", fixes it where needed, and removes the marker.

This is how we can trivially "convert" the dix to Cyrillic, and even convert the stems in lexc when we copy/update it from -uzb.

Plan B

The Cyrillic lexc and dix will be generated from the Latin-script ones.

A script will take all the stems from the dix and automatically convert them to Cyrillic, updating a three-column text-file database (Latin, Cyrillic, Checked). The Checked column will have two states: TOCHECK and GOOD. This will allow a checker to fix the output of the conversion script for corner cases (mostly Russian words), which automatic conversion is not expected to get right on its own.

Another script will then generate a Cyrillic version of dix and lexc from the Latin-script versions, using the above mentioned database.
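The database update described in Plan B could look roughly like the sketch below. The letter table is deliberately incomplete, and all function names are hypothetical; the only details taken from the text are the three columns and the TOCHECK/GOOD states. Entries already marked GOOD are kept as they are, and everything else is (re)converted and marked TOCHECK.

```python
import csv
import io

# Assumed, incomplete Latin -> Cyrillic table (lowercase only).
# Two-character sequences are tried before single letters.
DIGRAPHS = {"sh": "ш", "ch": "ч", "o\u02bb": "ў", "g\u02bb": "ғ"}
SINGLES = dict(zip("abdefghijklmnopqrstuvxyz",
                   "абдефгҳижклмнопқрстувхйз"))


def transliterate(latin):
    """Naive conversion; unknown characters are passed through unchanged."""
    out, i = [], 0
    while i < len(latin):
        pair = latin[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(SINGLES.get(latin[i], latin[i]))
            i += 1
    return "".join(out)


def load_db(fileobj):
    """Read rows of the form Latin<TAB>Cyrillic<TAB>Checked."""
    reader = csv.reader(fileobj, delimiter="\t")
    return {row[0]: (row[1], row[2]) for row in reader if len(row) == 3}


def update_db(db, stems):
    """Convert every stem that is not already marked GOOD."""
    for stem in stems:
        _, status = db.get(stem, ("", "TOCHECK"))
        if status != "GOOD":
            db[stem] = (transliterate(stem), "TOCHECK")
    return db


def dump_db(db, fileobj):
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    for latin in sorted(db):
        writer.writerow([latin, *db[latin]])


# A hand-checked Russian loanword survives; new stems come out as TOCHECK.
existing = io.StringIO("kompyuter\tкомпьютер\tGOOD\n")
db = update_db(load_db(existing), ["kompyuter", "kitob", "shahar"])
out = io.StringIO()
dump_db(db, out)
print(out.getvalue())
```

The loanword in the example shows why the Checked column matters: the naive table would not produce the soft sign in компьютер, so the hand-corrected row must not be overwritten.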

Final Plan

We'll go with Plan B, with file names roughly like this:

apertium-uzb.uzb.lexc
apertium-uzb.uzb.twol
apertium-uzb.uzb_Cyrl.lexc
apertium-uzb.uzb_Cyrl.twol
apertium-uzb.uzb_Latn-Cyrl.tsv

To do:

  • Hand-convert the morphotactics (in lexc) and morphophonology (twol) to Cyrillic.
  • Future development of either morphotactics (in lexc) or morphophonology (twol) must be done in parallel (i.e., both Latin and Cyrillic versions will need to be updated with any changes)!
  • Write a short script that dumps all [new] stems from the lexc file and uses the transliterator script to update any entries not marked as GOOD.
  • Write a short script (that uses the lexc parsing script Sushain developed as a library?) that replaces stems in certain categories (e.g., not numbers and other truly closed categories) in the uzb_Cyrl.lexc file based on the uzb.lexc and Latn-Cyrl.tsv files.
  • Update modes file to support Cyrl stuff.
  • Write disambiguation rules to be as script-agnostic as possible. For word-specific rules, make a version for each script?
  • Develop a plan for pairs, e.g.:
    • In uzb-kir, add _Cyrl.dix, _Cyrl.lrx (could also be automatically generated), and even transfer files (???)...
    • Set up conversion of stems using the same tsv file and similar scripts.
    • Update modes file for _Cyrl
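The stem-replacement step from the to-do list might be sketched as follows. This does not use the lexc parsing script mentioned above; it assumes simple one-line entries of the form `upper:lower CONTINUATION ; ! comment`, and the continuation names (`N1`, `NUM`) and the skip list are hypothetical. A line is rewritten only when both sides of the entry are marked GOOD in the database.

```python
import re

# One-line lexc entry: upper:lower CONT ; optional comment
# (a simplifying assumption; real lexc files allow more than this).
ENTRY = re.compile(
    r"^(?P<upper>[^\s:!]+):(?P<lower>[^\s:!]+)"
    r"(?P<rest>\s+(?P<cont>\S+)\s*;.*)$"
)

# Hypothetical names of closed categories whose stems are not replaced.
SKIP_CONTINUATIONS = {"NUM"}


def convert_line(line, db):
    """Convert one lexc line (without trailing newline) using db.

    db maps Latin stem -> (Cyrillic stem, status). Lines that are not
    simple entries, belong to a skipped category, or contain a stem not
    marked GOOD are returned unchanged.
    """
    m = ENTRY.match(line)
    if m is None or m.group("cont") in SKIP_CONTINUATIONS:
        return line
    sides = []
    for side in (m.group("upper"), m.group("lower")):
        cyrillic, status = db.get(side, (None, None))
        if status != "GOOD":
            return line
        sides.append(cyrillic)
    return sides[0] + ":" + sides[1] + m.group("rest")


db = {
    "kitob": ("китоб", "GOOD"),
    "shahar": ("шаҳар", "TOCHECK"),
    "bir": ("бир", "GOOD"),
}
latin_lexc = """LEXICON Nouns
kitob:kitob N1 ; ! "book"
shahar:shahar N1 ; ! "city"
bir:bir NUM ; ! "one"
"""
cyrl_lexc = "\n".join(convert_line(l, db) for l in latin_lexc.splitlines())
print(cyrl_lexc)
```

Leaving unchecked stems in Latin script, rather than inserting an unverified conversion, makes them easy to spot in the generated Cyrillic file; whether that is the desired behaviour is a design decision for the actual script.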