Difference between revisions of "Apertium-kir"
Firespeaker (talk | contribs) (→Current State: fix link for reals this time) |
Firespeaker (talk | contribs) |
+ | {{TOCD}} |
||
'''Kymorph''' is a working morphological analyser/generator for the [[Kyrgyz language]]. It is intended to be compatible with transducers for other [[Turkic languages]] so that text can be translated between them. |
||
== Installation == |
||
− | '''kymorph''' is currently located in [[tr-ky]]. |
||
+ | <pre> |
||
− | === Dependency tree === |
||
+ | $ git clone https://github.com/apertium/apertium-kir.git |
||
− | * apertium-tr-ky |
||
− | ** apertium-tr-az |
||
− | *** apertium |
||
− | **** lttoolbox |
||
− | *** VISL CG3 |
||
− | **** cmake |
||
− | **** libicu-dev |
||
− | **** tmalloc (libgoogle-perftools-dev) |
||
− | ***** libtcmalloc-minimal0 |
||
− | ***** libgoogle-perftools0 |
||
− | **** boost |
||
− | *** trmorph |
||
− | **** hfst (≥r1559 for kymorph) |
||
− | ***** openfst |
||
− | ***** sfst |
||
− | ***** foma |
||
− | *** azmorph |
||
+ | </pre> |
||
⚫ | |||
+ | |||
+ | == Adding analyses to conllu == |
||
+ | |||
+ | This guide assumes that you |
||
+ | * have <tt>apertium-kir</tt> compiled and running, |
||
+ | * have a working recent version of python3, and |
||
+ | * are saving your files in the <tt>apertium-kir/</tt> directory. |
||
+ | |||
+ | # Get CG3 format of conllu data |
||
+ | #* From Annotatrix: click CG3 tab, click menu, download, rename file to <tt>.dep</tt> |
||
+ | # Populate morphology in CG3 file from Apertium using <code>add_morph.py</code> script: |
||
+ | #* <code>$ cat corpus.dep | ./corpora/add_morph.py > corpus_with_annotation.dep</code> |
||
+ | #* Open <code>corpus_with_annotation.dep</code> and fix any errors |
||
+ | # Create segmentation file with <tt>dep_to_seg.py</tt> script: |
||
+ | #* <code>$ cat corpus_with_annotation.dep | ./corpora/dep_to_seg.py > corpus_with_annotation.seg</code> |
||
+ | #* Open <code>corpus_with_annotation.seg</code> and fix any errors |
||
+ | # Convert back to CoNLL-U format with <tt>conllise.py</tt> script: |
||
+ | #* <code>$ python3 ./scripts/conllise.py -e apertium-kir.kir.udx corpus_with_annotation.dep corpus_with_annotation.seg</code> |
||
== Current State == |
||
+ | {{LangStats | lang = kir | corpus1 = azattyk2010 | corpus2 = azattyk2009 | corpus3 = bible | corpus4 = wp2011/04}} |
||
− | * Number of stems: {{:Kymorph/stems}} |
||
− | * Coverage: {{:Kymorph/coverage/rferl2010}} (2010 [[RFERL corpora]]), {{:Kymorph/coverage/bible}} ([[Kyrgyz bible]]) |
||
== To-do == |
||
Latest revision as of 14:51, 24 April 2024
Kymorph is a working morphological analyser/generator for the Kyrgyz language. It is intended to be compatible with transducers for other Turkic languages so that text can be translated between them.
Installation
$ git clone https://github.com/apertium/apertium-kir.git
Adding analyses to conllu
This guide assumes that you
- have apertium-kir compiled and running,
- have a working recent version of python3, and
- are saving your files in the apertium-kir/ directory.
- Get CG3 format of conllu data
  - From Annotatrix: click CG3 tab, click menu, download, rename file to .dep
- Populate morphology in CG3 file from Apertium using the add_morph.py script:
  - $ cat corpus.dep | ./corpora/add_morph.py > corpus_with_annotation.dep
  - Open corpus_with_annotation.dep and fix any errors
- Create segmentation file with the dep_to_seg.py script:
  - $ cat corpus_with_annotation.dep | ./corpora/dep_to_seg.py > corpus_with_annotation.seg
  - Open corpus_with_annotation.seg and fix any errors
- Convert back to CoNLL-U format with the conllise.py script:
  - $ python3 ./scripts/conllise.py -e apertium-kir.kir.udx corpus_with_annotation.dep corpus_with_annotation.seg
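The three command-line steps above could be wrapped in a small Python driver, for instance to keep the intermediate file names consistent. The following is only a sketch under stated assumptions: the function names `pipeline_steps`, `run_step` and `run_pipeline` are hypothetical and not part of apertium-kir, the script paths and file names are copied from the guide, and the script is assumed to be started from the apertium-kir/ directory. It pauses after the first two steps so that the manual "fix any errors" pass can still happen.

```python
import subprocess


def pipeline_steps(stem="corpus", udx="apertium-kir.kir.udx"):
    """Return the guide's three commands as (argv, stdin_path, stdout_path) tuples.

    The helper itself is hypothetical; only the commands come from the guide.
    """
    dep = f"{stem}.dep"
    annotated = f"{stem}_with_annotation.dep"
    seg = f"{stem}_with_annotation.seg"
    return [
        # cat corpus.dep | ./corpora/add_morph.py > corpus_with_annotation.dep
        (["./corpora/add_morph.py"], dep, annotated),
        # cat corpus_with_annotation.dep | ./corpora/dep_to_seg.py > corpus_with_annotation.seg
        (["./corpora/dep_to_seg.py"], annotated, seg),
        # python3 ./scripts/conllise.py -e <udx> <dep> <seg>  (output is not redirected)
        (["python3", "./scripts/conllise.py", "-e", udx, annotated, seg], None, None),
    ]


def run_step(argv, stdin_path, stdout_path):
    """Run one command, wiring files to stdin/stdout where the guide uses cat and >."""
    stdin = open(stdin_path, "rb") if stdin_path else None
    stdout = open(stdout_path, "wb") if stdout_path else None
    try:
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
    finally:
        for handle in (stdin, stdout):
            if handle is not None:
                handle.close()


def run_pipeline(stem="corpus"):
    """Run all steps, pausing for manual correction of each generated file."""
    for argv, stdin_path, stdout_path in pipeline_steps(stem):
        run_step(argv, stdin_path, stdout_path)
        if stdout_path is not None:
            input(f"Open {stdout_path}, fix any errors, then press Enter to continue: ")
```

Calling `run_pipeline("corpus")` would expect `corpus.dep` to exist already, that is, the CG3 export from Annotatrix described in the first step.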
Current State
- Number of stems: 14,424
- Disambiguation rules: 16
- Coverage: ~90.4%
corpus | words | coverage
---|---|---
azattyk2010 | 3.4M | ~92.11%
azattyk2009 | 4.1M | ~92.04%
bible | 174K | ~92.25%
wp2011/04 | 545K | ~85.37%
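As an arithmetic check on the figures above, the headline coverage of ~90.4% is consistent with a plain (unweighted) average of the four per-corpus coverage values. This is an observation about the numbers only; how the LangStats template actually derives the figure is not stated on this page. The sketch below uses the rounded word counts from the table, so the word-weighted value is approximate.

```python
# Figures copied from the coverage table: (corpus, words, coverage in %).
# Word counts are the rounded values shown on the page (3.4M, 4.1M, 174K, 545K).
ROWS = [
    ("azattyk2010", 3_400_000, 92.11),
    ("azattyk2009", 4_100_000, 92.04),
    ("bible", 174_000, 92.25),
    ("wp2011/04", 545_000, 85.37),
]


def unweighted_mean(rows):
    """Average the coverage percentages, giving each corpus equal weight."""
    return sum(coverage for _, _, coverage in rows) / len(rows)


def weighted_mean(rows):
    """Average the coverage percentages, weighting each corpus by its word count."""
    total_words = sum(words for _, words, _ in rows)
    return sum(words * coverage for _, words, coverage in rows) / total_words


print(f"unweighted mean coverage: {unweighted_mean(ROWS):.2f}%")
print(f"word-weighted mean coverage: {weighted_mean(ROWS):.2f}%")
```

The unweighted mean comes out at about 90.44%, matching the ~90.4% shown, while the word-weighted mean is higher (about 91.6%) because the two large azattyk corpora dominate the smaller, lower-coverage Wikipedia corpus.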