Difference between revisions of "Apertium-kaz"

From Apertium
m
(add info on importing lexc, twol and rlx files)
Line 1: Line 1:
 
{{TOCD}}
 
{{TOCD}}
  +
'''Apertium-kaz''' is a morphological analyser/generator for [[Kazakh]], currently under development. It is intended to be compatible with transducers for other [[Turkic languages]] so that they can be translated between. It's used in the following language pairs:
+
'''Apertium-kaz''' is a morphological analyser/generator and CG tagger for [[Kazakh]], currently under development. It is intended to be compatible with transducers for other [[Turkic languages]] so that they can be translated between. It's used in the following language pairs:
   
 
* [[Kazakh and Tatar]]
 
* [[Kazakh and Tatar]]
Line 8: Line 9:
   
 
== Installation ==
 
== Installation ==
  +
'''apertium-kaz''' is currently located in incubator/apertium-kaz.
+
'''Apertium-kaz''' is currently located in incubator/apertium-kaz.
   
 
=== Dependency tree ===
 
=== Dependency tree ===
  +
 
* hfst (svn ≥r1916)
 
* hfst (svn ≥r1916)
 
** foma
 
** foma
Line 16: Line 19:
 
* apertium
 
* apertium
 
** lttoolbox
 
** lttoolbox
  +
* VISL-CG
 
[[Category:Tools]]
 
   
 
== Current State ==
 
== Current State ==
  +
 
{{LangStats | lang = kaz | corpus1 = Әуезов | corpus2 = bible | corpus3 = azattyq2010 | corpus4 = wp2011 | corpus5 = quran}}
 
{{LangStats | lang = kaz | corpus1 = Әуезов | corpus2 = bible | corpus3 = azattyq2010 | corpus4 = wp2011 | corpus5 = quran}}
  +
  +
== Developers ==
  +
  +
Several language pairs involve Kazakh, and each pair contains a lexc, twol and rlx file for this language. We don't work on these files directly, however. Instead, we edit the <code>kaz.lexc</code>, <code>kaz.twol</code> and <code>kaz.rlx</code> files located in '''apertium-kaz''', and then import them into the language pair directories using a script. The script simply copies the twol and rlx files, since they don't need to be adapted to a particular language pair, but "trims" the lexc file, keeping only those stems that are also found in the bilingual dictionary of the pair being imported into.
  +
  +
For further details on how this works and a step-by-step guide (taking the Kazakh-Tatar pair as an example), see [[Kazakh_and_Tatar#Development_workflow]].
  +
 
[[Category:Tools]]

Revision as of 20:43, 9 August 2013

Apertium-kaz is a morphological analyser/generator and CG tagger for Kazakh, currently under development. It is intended to be compatible with transducers for other Turkic languages so that text can be translated between them. It is used in the following language pairs:

Installation

Apertium-kaz is currently located in incubator/apertium-kaz.

Dependency tree

  • hfst (svn ≥r1916)
    • foma
      • flex
  • apertium
    • lttoolbox
  • VISL-CG

Current State

  • Number of stems: 36,595
  • Disambiguation rules: 150
  • Coverage: ~94.5%

corpus         words   coverage
Әуезов         155K    ~92.89%
bible          577K    ~95.29%
azattyq2010    3.2M    ~95.07%
wp2011         850K    ~90.72%
quran          107K    ~96.71%

Developers

Several language pairs involve Kazakh, and each pair contains a lexc, twol and rlx file for this language. We don't work on these files directly, however. Instead, we edit the kaz.lexc, kaz.twol and kaz.rlx files located in apertium-kaz, and then import them into the language pair directories using a script. The script simply copies the twol and rlx files, since they don't need to be adapted to a particular language pair, but "trims" the lexc file, keeping only those stems that are also found in the bilingual dictionary of the pair being imported into.

For further details on how this works and a step-by-step guide (taking the Kazakh-Tatar pair as an example), see Kazakh_and_Tatar#Development_workflow.
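
The import step described in this section can be illustrated with a rough sketch. This is not the actual import script, which is not shown on this page. The function names `trim_lexc` and `import_kaz`, the regular expression used to recognise a stem entry, the destination file names, and the idea of passing the bilingual dictionary's Kazakh stems in as a ready-made set are all assumptions made for illustration only.

```python
import re
import shutil
from pathlib import Path

# Assumed shape of a lexc stem entry, e.g.:  алма:алма N1 ; ! "apple"
# Lines starting with '%', '<' or '!' (affix entries, comments) never match.
ENTRY = re.compile(r"^\s*(?P<lemma>[^\s:!;%<]+)(?::\S+)?\s+\S+\s*;")


def trim_lexc(lines, bidix_lemmas):
    """Keep every non-stem line; keep a stem entry only if its lemma
    is in bidix_lemmas (the stems found in the bilingual dictionary)."""
    kept = []
    for line in lines:
        match = ENTRY.match(line)
        if match is None or match.group("lemma") in bidix_lemmas:
            kept.append(line)
    return kept


def import_kaz(src_dir, pair_dir, bidix_lemmas):
    """Copy twol and rlx unchanged, and write a trimmed lexc into pair_dir.
    File names in pair_dir are an assumption; real pairs may name them differently."""
    src, dst = Path(src_dir), Path(pair_dir)
    dst.mkdir(parents=True, exist_ok=True)

    # twol and rlx are not pair-specific, so they are copied as they are.
    for name in ("kaz.twol", "kaz.rlx"):
        shutil.copyfile(src / name, dst / name)

    # The lexc file is trimmed down to the stems the pair can translate.
    text = (src / "kaz.lexc").read_text(encoding="utf-8")
    trimmed = trim_lexc(text.splitlines(keepends=True), bidix_lemmas)
    (dst / "kaz.lexc").write_text("".join(trimmed), encoding="utf-8")
```

A call such as `import_kaz("apertium-kaz", "apertium-kaz-tat", stems)` would then populate the pair directory, where `stems` is a set of Kazakh lemmas collected from that pair's bilingual dictionary by some other means. The directory names in this call are likewise placeholders.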