Current Status: In Progress
Project: Extend lttoolbox to have the power of HFST
Guidelines
- Every rule in the dictionary file must be compatible with the HFST twolc engine and must not introduce ambiguities.
- The XML tags for archiphonemes and rules must be well defined and distinct from the existing lttoolbox tags.
- Every rule entry should carry a comment that briefly explains the morphophonological transformation performed by the twol compiler.
- The design is still in the development stage and may need significant modifications after it is implemented on the existing language pairs.
- The design must be robust enough to support all types of rules, namely:
  - Phonologically conditioned deletion
  - Morphologically conditioned deletion
  - Phonologically conditioned symbol change
  - Morphologically conditioned symbol change
  - Phonologically conditioned insertion
  - Morphologically conditioned insertion
Alphabets
<alphabet>аӑеёӗиоуӳыэюябвгджзклмнпрсҫтфхцчшщйьъАӐЕЁӖИОУӲЫЭЮЯБВГДЖЗКЛМНПРСҪТФХЦЧШЩЙЬЪ<ar n="A">ae</ar></alphabet>
The characters inside an ar element list all the surface forms that the archiphoneme can be realized as.
Tag/Symbol | Meaning
ar | archiphoneme
n | archiphoneme name
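As a rough illustration of how a tool might read this element, the sketch below separates the plain alphabet characters from the archiphoneme declarations. It uses Python's standard xml.etree.ElementTree module; the function name read_alphabet and the shortened sample alphabet are hypothetical and not part of lttoolbox.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the real alphabet would list every letter of the language.
ALPHABET_XML = '<alphabet>abc<ar n="A">ae</ar>xyz</alphabet>'

def read_alphabet(xml_text):
    """Return (plain_letters, archiphonemes) from an <alphabet> element.

    Hypothetical helper: archiphonemes maps each ar name to the list of
    surface characters given inside that ar element.
    """
    root = ET.fromstring(xml_text)
    letters = list(root.text or "")
    archiphonemes = {}
    for ar in root.findall("ar"):
        archiphonemes[ar.get("n")] = list(ar.text or "")
        # Characters that follow the closing </ar> are ordinary letters too.
        letters.extend(ar.tail or "")
    return letters, archiphonemes

letters, archiphonemes = read_alphabet(ALPHABET_XML)
print("".join(letters))
print(archiphonemes)
```

Here the archiphoneme A is reported with the two surface forms a and e, while the remaining characters stay in the ordinary alphabet.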
<sets>
<set n="Vowels">aeiou</set>
<set n="BackVow">bcdfg</set>
</sets>
Tag/Symbol | Meaning
set | set/group of alphabets
n | set name
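One possible way to hand such sets to the twol compiler is to render them as a twolc-style Sets section. The sketch below assumes the usual twolc layout of one `Name = members ;` line per set; the function name sets_to_twolc is hypothetical, and the output has not been checked against hfst-twolc.

```python
import xml.etree.ElementTree as ET

SETS_XML = """<sets>
<set n="Vowels">aeiou</set>
<set n="BackVow">bcdfg</set>
</sets>"""

def sets_to_twolc(xml_text):
    """Render a <sets> element as text in the style of a twolc Sets section.

    Hypothetical helper: each character of a set becomes one symbol.
    """
    root = ET.fromstring(xml_text)
    lines = ["Sets"]
    for s in root.findall("set"):
        members = " ".join(s.text or "")
        lines.append(f"{s.get('n')} = {members} ;")
    return "\n".join(lines)

twolc_sets = sets_to_twolc(SETS_XML)
print(twolc_sets)
```

Treating every character as a separate symbol is a simplification; multi-character symbols would need an explicit separator in the XML.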
Definitions
<defs>
<def n="Vowel_Group"><gr><set n="Vowels"/></gr></def>
<def n="BackVow_Group"><gr><set n="Vowels"/></gr><set n="BackVow"/></def>
</defs>
Tag/Symbol | Meaning
def | group/combination of alphabet sets
n | definition name
Diacritics
<sets>
<set n="Vowels">aeiou</set>
<set n="BackVow">bcdfg</set>
</sets>
Tag/Symbol | Meaning
set | set/group of alphabets
n | set name
Rule Definitions
<sets>
<set n="Vowels">aeiou</set>
<set n="BackVow">bcdfg</set>
</sets>
Tag/Symbol | Meaning
set | set/group of alphabets
n | set name
<rules>
<rule c="Back vowel harmony for archiphoneme A">
<m><ar n="A"/></m><s>a</s>
<context dir="e"><l_c><set n="BackVow"/></l_c><r_c></r_c></context>
</rule>
<rule c="Only hyphen in vowel boundaries and caps">
<m><ar n="hyph?"/></m><s>-</s>
<context dir="f"><l_c><set n="Vowels"/></l_c><r_c></r_c></context>
</rule>
<rule c="Back vowel harmony for archiphoneme A">
<m><ar n="A"/></m><s>a</s>
<context dir="b"><l_c><set n="BackVow"/></l_c><r_c></r_c></context>
</rule>
<rule c="Back vowel harmony for archiphoneme A">
<m><ar n="A"/></m><s>a</s>
<context dir="ne"><l_c><set n="BackVow"/></l_c><r_c></r_c></context>
</rule>
</rules>
Tag/Symbol | Meaning
rule | twol rule
c | comment
m | morphotactic side
s | surface side
context | context for transformation
dir | direction constraint
f | a:b => _ ; If the symbol pair a:b appears, it must be in context _
b | a:b <= _ ; If lexical a appears in context _, it must correspond to surface b
e | a:b <=> _ ; Lexical a corresponds to surface b in context _, and the pair a:b appears only in that context
ne | a:b /<= _ ; Lexical a never corresponds to b in context _
r_c | right context
l_c | left context
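To make the mapping from the dir attribute to the twolc rule operators concrete, here is a minimal sketch that renders one rule element as a twolc-style rule string. The function names, the `%{A%}` spelling for archiphonemes, and the use of 0 for an empty surface side are assumptions made for illustration; the output is a plain string rendering and has not been compiled with hfst-twolc.

```python
import xml.etree.ElementTree as ET

# dir attribute values and the twolc operators they stand for (from the table).
DIR_TO_OPERATOR = {"f": "=>", "b": "<=", "e": "<=>", "ne": "/<="}

RULE_XML = """<rule c="Back vowel harmony for archiphoneme A">
<m><ar n="A"/></m><s>a</s>
<context dir="e"><l_c><set n="BackVow"/></l_c><r_c></r_c></context>
</rule>"""

def side_to_text(elem):
    """Render an <m>, <l_c> or <r_c> element as space-separated symbols."""
    parts = []
    if elem.text and elem.text.strip():
        parts.append(elem.text.strip())
    for child in elem:
        if child.tag == "ar":
            # Assumption: archiphonemes are written as %{NAME%} in twolc.
            parts.append("%{" + child.get("n") + "%}")
        elif child.tag == "set":
            parts.append(child.get("n"))
    return " ".join(parts)

def rule_to_twolc(xml_text):
    """Hypothetical converter from one <rule> element to a twolc-style rule."""
    rule = ET.fromstring(xml_text)
    lexical = side_to_text(rule.find("m"))
    # Assumption: an empty surface side means deletion, written as 0.
    surface = (rule.find("s").text or "0").strip()
    ctx = rule.find("context")
    operator = DIR_TO_OPERATOR[ctx.get("dir")]
    left = side_to_text(ctx.find("l_c"))
    right = side_to_text(ctx.find("r_c"))
    context = " ".join(p for p in (left, "_", right) if p)
    return f'"{rule.get("c")}"\n{lexical}:{surface} {operator} {context} ;'

twolc_rule = rule_to_twolc(RULE_XML)
print(twolc_rule)
```

Changing dir to f, b or ne in the sample swaps only the operator in the rendered rule, which is the whole effect of the direction constraint in this sketch.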
Regular Expression Syntax
In hfst-twolc, regular expression operators are applied in the following order of precedence, from strongest to weakest binding:
- Unary operators: ^INTEGER, $, $., \, ~, *, +
- Concatenation
- Binary operators: |, &, -
It also uses the enclosing constructions (...) and [...], which override all other precedence.
We therefore separate the regular expressions in lttoolbox into two categories and use different types of tags for regex operators and enclosing operators.
Regular Expression Operators
Tag/Symbol | Meaning
re | regular expression operator tag
n | regular expression operator name
gr | grouping operator
gro | grouping operator (optional)