Difference between revisions of "Apertium-kir"

From Apertium
 
(16 intermediate revisions by 4 users not shown)
Line 1: Line 1:
{{TOCD}}
'''Kymorph''' is a working morphological analyser/generator for the [[Kyrgyz language]]. It is intended to be compatible with transducers for other [[Turkic languages]], so that text can be translated between them.


== Installation ==
'''kymorph''' is currently located in [[tr-ky]].


<pre>
=== Dependency tree ===
$ git clone https://github.com/apertium/apertium-kir.git
* apertium-tr-ky
** apertium-tr-az
*** apertium
**** lttoolbox
*** VISL CG3
**** cmake
**** libicu-dev
**** tmalloc (libgoogle-perftools-dev)
***** libtcmalloc-minimal0
***** libgoogle-perftools0
**** boost
*** trmorph
**** hfst (≥r1559 for kymorph)
***** openfst
***** sfst
***** foma
*** azmorph


</pre>
[[Category:Tools]]


== Current State ==
== Adding analyses to conllu ==
* Number of stems: {{:Kymorph/stems}}
* Coverage: {{:Kymorph/coverage/average}}


This guide assumes that you
{|class="wikitable"
* have <tt>apertium-kir</tt> compiled and running,
|-
* have a working recent version of python3, and
! corpus !! words !! coverage
* are saving your files in the <tt>apertium-kir/corpus</tt> directory.
|-

|| [[RFERL corpora|azattyk]] 2010
Here's what you need to do:
|align="right"| {{:RFERL corpus/ky/2010/stems}}
# Get CG3 format of conllu data
|| {{:Kymorph/coverage/rferl2010}}
#* From Annotatrix: click CG3 tab, click menu, download, rename file to <tt>.dep</tt>
|-
# Populate morphology in CG3 file from Apertium using <code>./corpora/add_morph.py</code> script:
|| [[RFERL corpora|azattyk]] 2009
#* <code>$ cat corpus.dep | ./corpora/add_morph.py > ./corpora/corpus_with_annotation.dep</code>
|align="right"| {{:RFERL corpus/ky/2009/stems}}
#* Open <code>./corpora/corpus_with_annotation.dep</code> and fix any errors
|| {{:Kymorph/coverage/rferl2009}}
# Create segmentation file with <tt>./corpora/dep_to_seg.py</tt> script:
|-
#* <code>$ cat ./corpora/corpus_with_annotation.dep | ./corpora/dep_to_seg.py > ./corpora/corpus_with_annotation.seg</code>
||bible
#* Open <code>./corpora/corpus_with_annotation.seg</code> and fix any errors
|align="right"|174K
# Convert back to CoNLL-U format with <tt>./scripts/conllise.py</tt> script:
|| {{:Kymorph/coverage/bible}}
#* <code>$ python3 ./scripts/conllise.py -e apertium-kir.kir.udx ./corpora/corpus_with_annotation.dep ./corpora/corpus_with_annotation.seg</code>
|-

||WP 2011-04
== Current State ==
|align="right"|545k
{{LangStats | lang = kir | corpus1 = azattyk2010 | corpus2 = azattyk2009 | corpus3 = bible | corpus4 = wp2011/04}}
|| {{:Kymorph/coverage/wikipedia}}
|}


== To-do ==

[[Category:Tools]]

Latest revision as of 21:20, 16 June 2024

Kymorph is a working morphological analyser/generator for the Kyrgyz language. It is intended to be compatible with transducers for other Turkic languages, so that text can be translated between them.

Installation

$ git clone https://github.com/apertium/apertium-kir.git

Adding analyses to conllu

This guide assumes that you

  • have apertium-kir compiled and running,
  • have a working recent version of python3, and
  • are saving your files in the apertium-kir/corpus directory.

Here's what you need to do:

  1. Get the CoNLL-U data in CG3 format
    • From Annotatrix: click the CG3 tab, open the menu, download the file, and rename it so that it has a .dep extension
  2. Populate the morphology in the CG3 file from Apertium using the ./corpora/add_morph.py script:
    • $ cat corpus.dep | ./corpora/add_morph.py > ./corpora/corpus_with_annotation.dep
    • Open ./corpora/corpus_with_annotation.dep and fix any errors
  3. Create the segmentation file with the ./corpora/dep_to_seg.py script:
    • $ cat ./corpora/corpus_with_annotation.dep | ./corpora/dep_to_seg.py > ./corpora/corpus_with_annotation.seg
    • Open ./corpora/corpus_with_annotation.seg and fix any errors
  4. Convert back to CoNLL-U format with the ./scripts/conllise.py script:
    • $ python3 ./scripts/conllise.py -e apertium-kir.kir.udx ./corpora/corpus_with_annotation.dep ./corpora/corpus_with_annotation.seg
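
The command-line steps above (2 to 4) can be sketched as one small helper that only builds and prints the command lines, so that the manual "fix any errors" passes can still happen between steps. This is a minimal sketch, not part of apertium-kir: the function name pipeline_commands and its parameters are hypothetical, and it assumes that step 3 reads the .dep file that step 2 wrote under ./corpora/.

```python
import shlex


def pipeline_commands(name="corpus", corpora_dir="./corpora",
                      scripts_dir="./scripts", udx="apertium-kir.kir.udx"):
    """Build the shell command lines for steps 2-4 of the guide.

    Hypothetical helper: nothing is executed, the commands are only
    returned as strings so they can be reviewed and run one at a time.
    """
    q = shlex.quote  # quote paths in case they contain spaces
    dep_in = f"{name}.dep"                                # CG3 export from Annotatrix
    dep_out = f"{corpora_dir}/{name}_with_annotation.dep"  # output of step 2
    seg_out = f"{corpora_dir}/{name}_with_annotation.seg"  # output of step 3
    return [
        # Step 2: populate morphology from Apertium.
        f"cat {q(dep_in)} | {q(corpora_dir + '/add_morph.py')} > {q(dep_out)}",
        # Step 3: create the segmentation file (assumes the step 2 output path).
        f"cat {q(dep_out)} | {q(corpora_dir + '/dep_to_seg.py')} > {q(seg_out)}",
        # Step 4: convert back to CoNLL-U.
        f"python3 {q(scripts_dir + '/conllise.py')} -e {q(udx)} {q(dep_out)} {q(seg_out)}",
    ]


if __name__ == "__main__":
    # Print each command; fix errors in the .dep and .seg files by hand
    # before running the next one.
    for step, cmd in enumerate(pipeline_commands(), start=2):
        print(f"step {step}: {cmd}")
```

Printing rather than executing is deliberate here: the guide asks for a manual check of the .dep and .seg files after steps 2 and 3, so an unattended run would skip that review.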

Current State

  • Number of stems: 14,424
  • Disambiguation rules: 16
  • Coverage: ~90.4%

corpus        words   coverage
azattyk2010   3.4M    ~92.11%
azattyk2009   4.1M    ~92.04%
bible         174K    ~92.25%
wp2011/04     545K    ~85.37%
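
The overall coverage figure (~90.4%) appears to be consistent with an unweighted mean of the four per-corpus figures. The short sketch below checks that arithmetic; it is an observation about the numbers on this page, not a documented definition of how the statistic is produced, and the name mean_coverage is hypothetical.

```python
# Per-corpus coverage percentages as listed on this page.
coverage = {
    "azattyk2010": 92.11,
    "azattyk2009": 92.04,
    "bible": 92.25,
    "wp2011/04": 85.37,
}


def mean_coverage(per_corpus):
    """Unweighted mean of the per-corpus coverage percentages."""
    return sum(per_corpus.values()) / len(per_corpus)


if __name__ == "__main__":
    # (92.11 + 92.04 + 92.25 + 85.37) / 4 = 90.4425
    print(f"~{mean_coverage(coverage):.1f}%")  # prints "~90.4%"
```

Because the mean is unweighted, the smaller Wikipedia corpus pulls the overall figure down as much as either of the much larger azattyk corpora would.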

To-do