Details on building a Sorani transducer and related language pairs
Grammar and Paradigms
We are largely taking the paradigms from Thackston's grammar. Amuzesh-e Zeban-e Kordi by Sayyed Mohammad Sina Ahmadi may be consulted in the future.
Loose and Close Ezafets
There are two types of ezafets in noun phrases, and it would be best to decide quickly what tags to use for each one. For the time being I have adopted <ezalos> and <ezaclo>. Their vaguely Romance look might fit the theme but might cause confusion later on.
Do we need to distinguish between the demonstrative of a loose or close ezafet? For now the distinction isn't used but can be added if needed in the future.
Emphatic and Pronominal Clitics
The clitic -ish is emphatic, giving the meaning "too" or "even" to the substantive in question. Possession is marked by pronominal clitics at the very end of the word. Pronominal clitics are segmented in UD Persian, so this might be something to raise in the UD issues. It seems most intuitive to leave the word as one unit, as neither of these types of clitics is meaningful when orthographically independent.
For analysis, two paradigms could be enough: one for all words ending in a vowel and one for all words ending in a consonant. This over-simplification might allow readings like "me of theirs" or "them of mine", but it is likely to save many paradigm entries.
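A minimal sketch of how entries could be routed to one of the two clitic paradigms by their final letter. The paradigm names and the letter sets are assumptions made for illustration, not part of the existing dictionaries. Stems ending in و or ی are reported as undecided, since those letters can stand for either a vowel or a consonant.

```python
# Letters that always write a vowel in Sorani orthography (assumed set).
VOWEL_LETTERS = {
    "\u0627",  # ا
    "\u06d5",  # ە
    "\u06ce",  # ێ
    "\u06c6",  # ۆ
}
# Letters that can write either a vowel or a consonant.
AMBIGUOUS_LETTERS = {
    "\u0648",  # و
    "\u06cc",  # ی
}


def choose_clitic_paradigm(stem):
    """Return a hypothetical paradigm name based on the stem's last letter."""
    if not stem:
        raise ValueError("empty stem")
    last = stem[-1]
    if last in VOWEL_LETTERS:
        return "clitics_after_vowel"
    if last in AMBIGUOUS_LETTERS:
        return "undecided"
    return "clitics_after_consonant"


if __name__ == "__main__":
    for stem in ("\u0646\u0627\u0645\u06d5", "\u0646\u0627\u0646"):
        print(stem, choose_clitic_paradigm(stem))
```

Entries reported as undecided would need to be assigned by hand or from a transcription.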
Pronominal Clitics and Adpositions
When an adposition is used with a pronominal clitic, we will definitely have to segment it. ✓
Most vocalic-ending prepositions seem to also change vowels when used with certain pronominal clitics. ✓
What will the proper tags for nominals with pronominal clitics be? It looks intuitive that they would be in a close ezafet, and this is what they would look like after segmentation. However, information about absoluteness/definiteness is on the initial nominal, and not the pronoun. كتاوم, meaning "my book", is analyzed as follows:
"<كتاوم>" "كتاو" n sg abs "من" prn pers p1 sg
"My book" is clearly different from simply putting the words "book" and "I" after each other, but it remains to be decided how to mark this. A convenient solution might be to somehow mark the personal pronoun differently, maybe with a tag like <clt>, to differentiate it from the usual pronouns.
RESOLVED: Nominals will have possessor agreement like Turkish, while the clitics on adpositions will be segmented.
What about situations where the pronominal clitic precedes the preposition, and is attached to the verb before?
This seems to occur frequently with light verbs too. We will probably need an intermediary paradigm to add the pronominal clitics for things like pirsiyarakam daka.
While this occurs with proper light verbs, it does not seem to happen with compounds such as rawastaan. These take only modal prefixes before the verb proper, e.g. radawastam.
Existential Copulae and Possessor Agreement
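This could make for a number of good transfer rules, e.g. Parat haya -> You have money. The pattern would be something like nominal<p1xsg> + vbhaver, resulting in prnpers<p1sg> + have + nominal. Note that the tags are given for the first person singular, while the example itself has the second person singular clitic -t.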
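The reordering described above can be sketched as a toy function over (lemma, tags) pairs. This is only an illustration of the pattern, not an Apertium transfer rule: the data structure is hypothetical, and tags such as p2xsg and p2sg are assumed by analogy with the p1xsg and p1sg tags mentioned in the note.

```python
import re

# Possessor agreement tags of the shape p1xsg, p2xpl, ... (assumed scheme).
POSSESSOR_TAG = re.compile(r"^p([123])x(sg|pl)$")


def transfer_existential(units):
    """Rewrite [nominal<pNx..>, vbhaver] as [prnpers<pN..>, have, nominal].

    `units` is a list of (lemma, tags) pairs. Anything that does not match
    the pattern is returned unchanged.
    """
    if len(units) != 2:
        return units
    (noun, noun_tags), (_verb, verb_tags) = units
    if "vbhaver" not in verb_tags:
        return units
    possessor = next((t for t in noun_tags if POSSESSOR_TAG.match(t)), None)
    if possessor is None:
        return units
    person, number = POSSESSOR_TAG.match(possessor).groups()
    remaining = [t for t in noun_tags if t != possessor]
    return [
        ("prnpers", [f"p{person}{number}"]),
        ("have", ["vbhaver"]),
        (noun, remaining),
    ]


if __name__ == "__main__":
    source = [("para", ["n", "p2xsg"]), ("haya", ["vbhaver"])]
    print(transfer_existential(source))
```

For Parat haya the possessor tag moves off the noun and becomes the subject pronoun, which is the shape a real transfer rule would need to produce.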
In cases such as هاتن (hatin, to come), how will we decide which variant to output? Can a speaker from Sulaymaniyah understand the rest of Sorani speakers, and vice versa?
More Vocalic Endings
The only vowel-final paradigm the analyzer has as of now is for words ending with ە. There are probably other vowel endings that we need to handle; ێ in particular might behave differently. ی should also be looked at to see how nominals ending with it are inflected.
ZWNJ and ه
As a rule of thumb we are not adding or leaving zero-width non-joiner (ZWNJ) characters at the end of word entries in the dictionaries. Two variants of the same character, ه, are used, and these can stand for each other as defined in the ACX file. In "standard, proper" written Sorani it seems that the variant that does not join to the following letter is the one used in typing. We will use this Kurdish, non-joining variant by default, with the other variants handled through postprocessing.
|Character||Followed by ت||Hex dump||Joining behaviour|
|ه||هت||87d9 aad8 000a||Joins to the following letter in the usual way.|
|ە||ەت||95db aad8 000a||Does not join to the following letter.|
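The hex values in the table are consistent with the UTF-8 bytes of each letter followed by ت and a newline, printed as byte-swapped 16-bit words the way a default hexdump shows them. On that reading, 87d9 corresponds to U+0647 (ARABIC LETTER HEH) and 95db to U+06D5 (ARABIC LETTER AE). The sketch below reproduces the values; the function name is made up for illustration.

```python
import unicodedata

HEH = "\u0647"  # ه, joins to the following letter
AE = "\u06d5"   # ە, does not join to the following letter
TEH = "\u062a"  # ت


def hexdump_words(text):
    """Format text plus a newline as little-endian 16-bit hex words."""
    data = (text + "\n").encode("utf-8")
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of words
    words = [f"{data[i + 1]:02x}{data[i]:02x}" for i in range(0, len(data), 2)]
    return " ".join(words)


if __name__ == "__main__":
    for letter in (HEH, AE):
        code_point = f"U+{ord(letter):04X}"
        print(code_point, unicodedata.name(letter), hexdump_words(letter + TEH))
```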
There are 9635 uses of ZWNJ characters in the wiki, as opposed to 66911 non-joining ە characters. Not using ZWNJs appears to be the convention, but dealing with ZWNJs is definitely an issue that needs to be resolved as well. More than 4500 of the ZWNJs are at the end of words; stripping those off and letting the ACX handle the replacement might be a possibility.
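Stripping word-final ZWNJs could be done as a preprocessing step along these lines. This is a sketch under the assumption that "end of word" means a ZWNJ not followed by another letter or digit; word-internal ZWNJs are left untouched.

```python
import re

ZWNJ = "\u200c"

# One or more ZWNJs that are not followed by a word character or another ZWNJ.
TRAILING_ZWNJ = re.compile(ZWNJ + r"+(?![\w" + ZWNJ + r"])")


def strip_trailing_zwnj(text):
    """Remove ZWNJs at the end of words, keeping word-internal ones."""
    return TRAILING_ZWNJ.sub("", text)


if __name__ == "__main__":
    sample = "\u0646\u0627\u0645\u0647" + ZWNJ + " \u0628\u0648\u0648"
    print(ZWNJ in strip_trailing_zwnj(sample))
```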
A possibly useful thing to do would be to automatically add variants of the letter with and without ZWNJ before compiling the dictionaries. Variants with the 95db character (ە, U+06D5) could be considered the proper entry, with the others given an "LR" analysis.
Similarly, variants with the 87d9 character (ه, U+0647) in which the ZWNJs are replaced by spaces could also be generated, as these are very frequently encountered in Sorani typed on cell phones.
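A rough sketch of how such variants could be generated from a proper entry before compiling. The function name and the exact set of variants are assumptions; the output would still have to be written into the dictionary format with the "LR" restriction.

```python
ZWNJ = "\u200c"
AE = "\u06d5"   # ە, used in the proper entry
HEH = "\u0647"  # ه, used in the variants


def lr_variants(entry):
    """Return analysis-only spelling variants for an entry written with U+06D5."""
    if AE not in entry:
        return []
    # Every ە typed as ه followed by a ZWNJ.
    joined = entry.replace(AE, HEH + ZWNJ)
    # Same, but without a ZWNJ at the very end of the word.
    no_final = joined.rstrip(ZWNJ)
    # Word-internal ZWNJs typed as spaces, as often seen on cell phones.
    spaced = no_final.replace(ZWNJ, " ")
    variants = []
    for candidate in (joined, no_final, spaced):
        if candidate != entry and candidate not in variants:
            variants.append(candidate)
    return variants


if __name__ == "__main__":
    for variant in lr_variants("\u0646\u0627\u0645\u06d5\u06a9\u06d5"):
        print(repr(variant))
```

Entries without ە produce no variants, so the function can be run over the whole word list.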