Twol rules in lttoolbox

Latest revision as of 16:58, 10 June 2018

Current Status: In Progress
Project: Extend lttoolbox to have the power of HFST

Guidelines

  • Every rule in the dictionary file must be compatible with the HFST twolc engine and must not introduce any ambiguity.
  • The XML tags for archiphonemes and rules must be well defined and distinct from the tags that already exist in lttoolbox.
  • Every rule entry should carry a comment that briefly explains the morphophonological transformation performed by the twol compiler.

Design

  • The design is still under development and may need significant changes once it has been implemented on the existing language pairs.
  • The design must be robust enough to support all types of rules, namely:
    • Phonologically conditioned deletion
    • Morphologically conditioned deletion
    • Phonologically conditioned symbol change
    • Morphologically conditioned symbol change
    • Phonologically conditioned insertion
    • Morphologically conditioned insertion

Alphabets

<alphabet>аӑеёӗиоуӳыэюябвгджзклмнпрсҫтфхцчшщйьъАӐЕЁӖИОУӲЫЭЮЯБВГДЖЗКЛМНПРСҪТФХЦЧШЩЙЬЪ<ar n="A">ae</ar></alphabet>

The symbols within the ar tags list every surface form in which the archiphoneme can be realised.

Tag/Symbol  Meaning
ar          archiphoneme
n           archiphoneme name
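
As a rough illustration of how such an alphabet element might be consumed, the sketch below reads the plain symbols and the archiphoneme realisations and prints them as an hfst-twolc Alphabet section. The sample input is shortened to Latin letters, the function names are hypothetical, and writing an archiphoneme as %{A%} is an assumed naming convention rather than something this page specifies.

```python
import xml.etree.ElementTree as ET

# Shortened sample input (assumption: same shape as the alphabet element above).
ALPHABET_XML = '<alphabet>abc<ar n="A">ae</ar></alphabet>'


def parse_alphabet(xml_text):
    """Return (plain_symbols, archiphonemes) from an <alphabet> element."""
    root = ET.fromstring(xml_text)
    # Text before the first child element holds the ordinary symbols.
    # Text placed after an <ar> element is not handled in this sketch.
    plain = list((root.text or "").strip())
    # Each <ar> maps an archiphoneme name to its possible surface symbols.
    archi = {ar.get("n"): list((ar.text or "").strip())
             for ar in root.findall("ar")}
    return plain, archi


def to_twolc_alphabet(plain, archi):
    """Render a twolc Alphabet section; the %{X%} spelling is an assumption."""
    pairs = [f"%{{{name}%}}:{s}" for name, surf in archi.items() for s in surf]
    return "Alphabet\n  " + " ".join(plain + pairs) + " ;"


plain, archi = parse_alphabet(ALPHABET_XML)
print(to_twolc_alphabet(plain, archi))
```

Each archiphoneme contributes one lexical:surface pair per listed realisation, which is how the possible surface forms would be declared to twolc under these assumptions.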

Sets

<sets>
  <set n="Vowels">aeiou</set>
  <set n="BackVow">bcdfg</set>
</sets>

Tag/Symbol  Meaning
set         set/group of alphabet symbols
n           set name
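
The sets above could plausibly be turned into an hfst-twolc Sets section along the following lines. This is only a sketch: the function name is hypothetical, and it assumes every symbol in a set is a single character.

```python
import xml.etree.ElementTree as ET

SETS_XML = """
<sets>
  <set n="Vowels">aeiou</set>
  <set n="BackVow">bcdfg</set>
</sets>
"""


def sets_to_twolc(xml_text):
    """Render each <set> as a 'Name = s1 s2 ... ;' line of a twolc Sets section."""
    root = ET.fromstring(xml_text.strip())
    lines = ["Sets"]
    for s in root.findall("set"):
        # Assumption: one character per symbol, so joining with spaces splits them.
        symbols = " ".join((s.text or "").strip())
        lines.append(f"  {s.get('n')} = {symbols} ;")
    return "\n".join(lines)


print(sets_to_twolc(SETS_XML))
```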

Definitions

<defs>
  <def n="Vowel_Group"><gr><set n="Vowels"/></gr></def>
  <def n="BackVow_Group"><gr><set n="Vowels"/></gr><set n="BackVow"/></def>
</defs>

Tag/Symbol  Meaning
def         group/combination of alphabet sets
n           definition name
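
One possible reading of these definitions is sketched below: each def becomes a line of a twolc Definitions section, with gr rendered as a [ ... ] group and gro as an optional ( ... ) group. That mapping, the function names, and the assumption that set names match those declared under Sets are all illustrative and not confirmed by this page.

```python
import xml.etree.ElementTree as ET

DEFS_XML = """
<defs>
  <def n="Vowel_Group"><gr><set n="Vowels"/></gr></def>
  <def n="BackVow_Group"><gr><set n="Vowels"/></gr><set n="BackVow"/></def>
</defs>
"""


def render(node):
    """Recursively render a definition body as a twolc regular expression."""
    if node.tag == "set":
        return node.get("n")
    inner = " ".join(render(child) for child in node)
    if node.tag == "gr":
        return f"[ {inner} ]"   # assumed mapping: gr -> grouping brackets
    if node.tag == "gro":
        return f"( {inner} )"   # assumed mapping: gro -> optional group
    return inner                # <def> itself: concatenation of its children


def defs_to_twolc(xml_text):
    root = ET.fromstring(xml_text.strip())
    lines = ["Definitions"]
    for d in root.findall("def"):
        lines.append(f"  {d.get('n')} = {render(d)} ;")
    return "\n".join(lines)


print(defs_to_twolc(DEFS_XML))
```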

Diacritics

<sets>
  <set n="Vowels">aeiou</set>
  <set n="BackVow">bcdfg</set>
</sets>

Tag/Symbol  Meaning
set         set/group of alphabet symbols
n           set name

Rule Definitions

<sets>
  <set n="Vowels">aeiou</set>
  <set n="BackVow">bcdfg</set>
</sets>

Tag/Symbol  Meaning
set         set/group of alphabet symbols
n           set name

Rules

<rules>
  <rule c="Back vowel harmony for archiphoneme A">
    <m><ar n="A"/></m><s>a</s>
    <context dir="e"><l_c><set n="BackVow"/></l_c><r_c></r_c></context>
  </rule>
  <rule c="Only hyphen in vowel boundaries and caps">
    <m><ar n="hyph?"/></m><s>-</s>
    <context dir="f"><l_c><set n="Vowels"/></l_c><r_c></r_c></context>
  </rule>
  <rule c="Back vowel harmony for archiphoneme A">
    <m><ar n="A"/></m><s>a</s>
    <context dir="b"><l_c><set n="BackVow"/></l_c><r_c></r_c></context>
  </rule>
  <rule c="Back vowel harmony for archiphoneme A">
    <m><ar n="A"/></m><s>a</s>
    <context dir="ne"><l_c><set n="BackVow"/></l_c><r_c></r_c></context>
  </rule>
</rules>

Tag/Symbol  Meaning
rule        twol rule
c           comment
m           morphotactic (lexical) side
s           surface side
context     context for the transformation
dir         direction constraint (one of f, b, e, ne)
f           a:b => _ ; if the symbol pair a:b appears, it must be in context _
b           a:b <= _ ; if lexical a appears in context _, it must correspond to surface b
e           a:b <=> _ ; lexical a corresponds to surface b in context _ and only there
ne          a:b /<= _ ; lexical a never corresponds to surface b in context _
l_c         left context
r_c         right context
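
To make the dir values concrete, the sketch below converts a single rule element into an hfst-twolc rule string. It assumes well-formed XML (self-closing ar and set elements), contexts that contain only set references, and the %{A%} spelling for archiphonemes; the function names and the OPERATORS table name are hypothetical.

```python
import xml.etree.ElementTree as ET

RULE_XML = """
<rule c="Back vowel harmony for archiphoneme A">
  <m><ar n="A"/></m><s>a</s>
  <context dir="e"><l_c><set n="BackVow"/></l_c><r_c></r_c></context>
</rule>
"""

# dir attribute -> twolc rule operator, following the table above.
OPERATORS = {"f": "=>", "b": "<=", "e": "<=>", "ne": "/<="}


def side(node):
    """Render one context side: set names if present, else any literal text."""
    if node is None:
        return ""
    parts = [child.get("n") for child in node if child.tag == "set"]
    return " ".join(parts) or (node.text or "").strip()


def rule_to_twolc(xml_text):
    rule = ET.fromstring(xml_text.strip())
    ar = rule.find("m/ar")
    lexical = f"%{{{ar.get('n')}%}}"          # assumed archiphoneme spelling
    surface = (rule.findtext("s") or "").strip()
    ctx = rule.find("context")
    op = OPERATORS[ctx.get("dir")]
    left, right = side(ctx.find("l_c")), side(ctx.find("r_c"))
    # Empty context sides are simply omitted around the underscore.
    context = " ".join(p for p in (left, "_", right) if p)
    return f'"{rule.get("c")}"\n{lexical}:{surface} {op} {context} ;'


print(rule_to_twolc(RULE_XML))
```

The comment attribute doubles as the twolc rule name here, which fits the guideline that every rule entry should be commented.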

Regular Expression Syntax

In hfst-twolc, regular expression operators are applied in the following order of precedence, from strongest to weakest:

  • Unary operators: ^INTEGER, $, $., \, ~, *, +
  • Concatenation
  • Binary operators: |, &, -

It also has enclosing constructions such as (...) and [...], which override this order of precedence.

We therefore separate the regular expressions in lttoolbox into two categories and use different types of tags for regex operators and for enclosing operators.

Regular Expression Operators

Tag/Symbol  Meaning
re          regular expression operator tag
n           regular expression operator name
gr          grouping operator
gro         grouping operator (optional)