Difference between revisions of "Sorani"

From Apertium

Revision as of 14:43, 19 September 2016

Details on building a Sorani transducer and related language pairs

Grammar and Paradigms

We are largely taking the paradigms from Thackston's grammar. Amuzesh-e Zeban-e Kordi by Sayyed Mohammad Sina Ahmadi may be consulted in the future.

Loose and Close Ezafets

There are two types of ezafets in noun phrases, and it would be best to decide quickly which tags to use for each one. For the time being I have adopted <ezalos> for the loose ezafe and <ezaclo> for the close ezafe. Their vaguely Romance look might fit the theme, but it might also cause confusion later on.

ZWNJ and ه

As a rule of thumb, we neither add nor leave zero-width non-joiner (ZWNJ) characters at the end of word entries in the dictionaries. Two variants of the same letter, ه, are in use, and each can stand in for the other as defined in the ACX file. In "standard, proper" written Sorani, the variant that breaks the abjad (that is, does not join to the following letter) appears to be the one used in typing. We will use this Kurdish, abjad-breaking variant by default, with the other variations added through post-processing.

Character | Connection | Hexdump | Note
ه | هت | 87d9 aad8 000a | Connects the usual way for an abjad.
ە | ەت | 95db aad8 000a | Does not connect.
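
The hexdump column can be reproduced with a short Python sketch. It assumes the dump was produced by a tool that prints the UTF-8 bytes as little-endian 16-bit words (as the default `hexdump` output does on little-endian machines), which would explain why the byte pair D9 87 shows up as 87d9. The function name `hexdump_words` is made up for this illustration.

```python
import unicodedata

HEH = "\u0647"  # ه, joins to the following letter
AE = "\u06d5"   # ە, does not join to the following letter
TEH = "\u062a"  # ت, used here only as a following letter


def hexdump_words(text: str) -> str:
    """Show UTF-8 bytes as byte-swapped 16-bit hex words (assumed dump format)."""
    data = text.encode("utf-8")
    if len(data) % 2:
        data += b"\x00"  # pad the last word with a zero byte
    return " ".join(
        f"{data[i + 1]:02x}{data[i]:02x}" for i in range(0, len(data), 2)
    )


for letter in (HEH, AE):
    sample = letter + TEH + "\n"
    print(f"U+{ord(letter):04X}", unicodedata.name(letter), hexdump_words(sample))
```

Under that assumption, "87d9" and "95db" in the table correspond to the code points U+0647 and U+06D5 respectively.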

There are 9635 uses of the ZWNJ character in the wiki, as opposed to 66911 uses of the non-joining ە character. Not using ZWNJs appears to be the convention, but ZWNJs in input text still need to be handled. More than 4500 of the ZWNJs are at the end of words; stripping those off and letting the ACX handle the replacement might be a possibility.
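
Counting ZWNJs and stripping the word-final ones could be sketched as follows. This is only an illustration: it treats a ZWNJ as word-final when it is not followed by a letter or digit, which is an assumption rather than the rule used for the figures above, and the function names are hypothetical.

```python
import re

ZWNJ = "\u200c"

# One or more ZWNJs that are not followed by a word character or another ZWNJ.
WORD_FINAL_ZWNJ = re.compile(r"\u200c+(?![\w\u200c])")


def count_zwnj(text: str) -> tuple[int, int]:
    """Return (total ZWNJs, ZWNJs at the end of a word)."""
    total = text.count(ZWNJ)
    word_final = sum(len(m.group()) for m in WORD_FINAL_ZWNJ.finditer(text))
    return total, word_final


def strip_word_final_zwnj(text: str) -> str:
    """Remove word-final ZWNJs and leave word-internal ones untouched."""
    return WORD_FINAL_ZWNJ.sub("", text)


# First word ends in heh + ZWNJ; the second has a word-internal ZWNJ.
sample = "\u0646\u0627\u0645\u0647\u200c \u0645\u06cc\u200c\u06a9\u0647\u0645"
print(count_zwnj(sample))
print(strip_word_final_zwnj(sample) == sample.replace("\u0647\u200c ", "\u0647 "))
```

After stripping, the remaining bare ه at the end of a word would still need to be mapped to ە, which is the part the ACX file is expected to cover.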

Post-processing Dictionaries

It may be useful to generate variants of the letter with and without ZWNJ automatically before compiling the dictionaries. Variants with the 95db character (ە) could be considered the proper entries, with the others given an "LR" analysis so that they are analysed but not generated.

Similarly, variants with the 87d9 character (ه) in which the ZWNJs are replaced by spaces could also be generated, as these are very frequently encountered in Sorani typed on cell phones.