Difference between revisions of "Twol rules in lttoolbox"

From Apertium

Latest revision as of 16:58, 10 June 2018

Current Status: In Progress
Project: Extend lttoolbox to have the power of HFST

Guidelines

  • Every rule in the dictionary file must be compatible with the HFST twolc engine and must not introduce any ambiguity.
  • The XML tags for archiphonemes and rules must be well defined and distinct from the existing tags in lttoolbox.
  • Every rule entry should carry a comment that briefly explains the morphophonological transformation performed by the twol compiler.

Design

  • The design is still under development and may need significant changes once it is tried on the existing language pairs.
  • The design must be robust enough to support all types of rules, namely:
    • Phonologically conditioned deletion
    • Morphologically conditioned deletion
    • Phonologically conditioned symbol change
    • Morphologically conditioned symbol change
    • Phonologically conditioned insertion
    • Morphologically conditioned insertion

Alphabets

<alphabet>аӑеёӗиоуӳыэюябвгджзклмнпрсҫтфхцчшщйьъАӐЕЁӖИОУӲЫЭЮЯБВГДЖЗКЛМНПРСҪТФХЦЧШЩЙЬЪ<ar n="A">ae</ar></alphabet>

The symbols inside an ar tag list all the surface forms that the archiphoneme can be realised as. In the example above, the archiphoneme A can surface as either a or e.

Tag/Symbol    Meaning
ar            archiphoneme
n             archiphoneme name
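
To make the role of the ar tag concrete, the following is a minimal sketch of how a compiler might expand an alphabet entry into lexical:surface symbol pairs. It is only an illustration: the function name symbol_pairs and the shortened alphabet are assumptions, not part of lttoolbox, and letters that follow an ar element are ignored for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened alphabet in the proposed syntax (hypothetical; the full letter list is omitted).
ALPHABET_XML = '<alphabet>abc<ar n="A">ae</ar></alphabet>'


def symbol_pairs(alphabet_xml):
    """Return (plain_letters, pairs), where pairs are 'NAME:surface' strings."""
    root = ET.fromstring(alphabet_xml)
    # The text before the first child element holds the ordinary letters.
    letters = list((root.text or "").strip())
    pairs = []
    for ar in root.findall("ar"):
        name = ar.get("n")
        # Each character inside <ar> is one possible surface realisation.
        for surface in (ar.text or "").strip():
            pairs.append(f"{name}:{surface}")
    return letters, pairs


letters, pairs = symbol_pairs(ALPHABET_XML)
print(letters)  # ['a', 'b', 'c']
print(pairs)    # ['A:a', 'A:e']
```

Each pair corresponds to one lexical:surface correspondence that a two-level rule can then constrain.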

Sets

<sets>
  <set n="Vowels">aeiou</set>
  <set n="BackVow">bcdfg</set>
</sets>

Tag/Symbol    Meaning
set           set/group of alphabets
n             set name
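
As a rough illustration of how these sets relate to hfst-twolc, the sketch below rewrites the sets element as the text of a twolc Sets section. The function name sets_to_twolc is an assumption made for this example, and every character is treated as one symbol, so multi-character symbols are not handled.

```python
import xml.etree.ElementTree as ET

SETS_XML = """
<sets>
  <set n="Vowels">aeiou</set>
  <set n="BackVow">bcdfg</set>
</sets>
"""


def sets_to_twolc(sets_xml):
    """Render <sets> as a twolc-style Sets section (one symbol per character)."""
    root = ET.fromstring(sets_xml.strip())
    lines = ["Sets"]
    for s in root.findall("set"):
        # twolc separates the members of a set with spaces.
        members = " ".join((s.text or "").strip())
        lines.append(f"{s.get('n')} = {members} ;")
    return "\n".join(lines)


print(sets_to_twolc(SETS_XML))
# Sets
# Vowels = a e i o u ;
# BackVow = b c d f g ;
```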

Definitions

<defs>
  <def n="Vowel_Group"><gr><set n="Vowels"/></gr></def>
  <def n="BackVow_Group"><gr><set n="Vowels"/></gr><set n="BackVow"/></def>
</defs>

Tag/Symbol    Meaning
def           group/combination of alphabet sets
n             definition name

Diacritics

<sets>
  <set n="Vowels">aeiou</set>
  <set n="BackVow">bcdfg</set>
</sets>

Tag/Symbol    Meaning
set           set/group of alphabets
n             set name

Rule Definitions

<sets>
  <set n="Vowels">aeiou</set>
  <set n="BackVow">bcdfg</set>
</sets>

Tag/Symbol    Meaning
set           set/group of alphabets
n             set name

Rules

<rules>
  <rule c="Back vowel harmony for archiphoneme A">
    <m><ar n="A"/></m><s>a</s>
    <context dir="e"><l_c><set n="BackVow"/></l_c><r_c></r_c></context>
  </rule>
  <rule c="Only hyphen in vowel boundaries and caps">
    <m><ar n="hyph?"/></m><s>-</s>
    <context dir="f"><l_c><set n="Vowels"/></l_c><r_c></r_c></context>
  </rule>
  <rule c="Back vowel harmony for archiphoneme A">
    <m><ar n="A"/></m><s>a</s>
    <context dir="b"><l_c><set n="BackVow"/></l_c><r_c></r_c></context>
  </rule>
  <rule c="Back vowel harmony for archiphoneme A">
    <m><ar n="A"/></m><s>a</s>
    <context dir="ne"><l_c><set n="BackVow"/></l_c><r_c></r_c></context>
  </rule>
</rules>

Tag/Symbol    Meaning
rule          twol rule
c             comment
m             morphotactic side
s             surface side
context       context for transformation
dir           direction constraint
f             a:b => _ ; if the symbol pair a:b appears, it must be in context _
b             a:b <= _ ; if lexical a appears in context _, it must correspond to surface b
e             a:b <=> _ ; lexical a corresponds to surface b in context _ and only there (both => and <= hold)
ne            a:b /<= _ ; lexical a never corresponds to surface b in context _
r_c           right context
l_c           left context
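
The correspondence between the dir values and the twolc rule operators can be sketched as a small converter. This is an illustration under stated assumptions rather than the actual compiler: the function rule_to_twolc, the helper side, and the OPERATORS mapping are hypothetical names, the rule is written with self-closing ar and set elements so that a standard XML parser accepts it, and a context side is rendered simply as the names of the elements it contains.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the dir attribute to twolc rule operators (taken from the table above).
OPERATORS = {"f": "=>", "b": "<=", "e": "<=>", "ne": "/<="}

RULE_XML = """
<rule c="Back vowel harmony for archiphoneme A">
  <m><ar n="A"/></m><s>a</s>
  <context dir="e"><l_c><set n="BackVow"/></l_c><r_c></r_c></context>
</rule>
"""


def side(elem):
    """Render one context side as the space-separated names of its child elements."""
    return " ".join(child.get("n") for child in elem)


def rule_to_twolc(rule_xml):
    """Render a single <rule> as a twolc-style rule: name line, then the rule body."""
    rule = ET.fromstring(rule_xml.strip())
    lexical = rule.find("m/ar").get("n")
    surface = rule.find("s").text
    context = rule.find("context")
    operator = OPERATORS[context.get("dir")]
    left = side(context.find("l_c"))
    right = side(context.find("r_c"))
    # Empty context sides are dropped so that no stray spaces appear.
    parts = [f"{lexical}:{surface}", operator, left, "_", right]
    body = " ".join(p for p in parts if p)
    return f'"{rule.get("c")}"\n{body} ;'


print(rule_to_twolc(RULE_XML))
# "Back vowel harmony for archiphoneme A"
# A:a <=> BackVow _ ;
```

Changing dir to f, b or ne in the same rule would only change the operator in the output to =>, <= or /<= respectively.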

Regular Expression Syntax

In hfst-twolc, regular expression operators are applied in the following order of precedence, from highest to lowest:

  • Unary operators: ^INTEGER, $, $., \, ~, *, +
  • Concatenation
  • Binary operators: |, &, -

Bracketed constructions override this precedence: [...] groups an expression and (...) makes it optional.
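
The effect of this ordering can be tried out with an analogy in Python's re module, which follows the same general rule that postfix operators bind tighter than concatenation, and concatenation binds tighter than |. This is only an analogy and not hfst-twolc syntax: Python writes concatenation without spaces, uses (?:...) for grouping, and has no counterpart for operators such as &, - or $. The helper name full is an assumption made for this sketch.

```python
import re


def full(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# A unary (postfix) operator binds tighter than concatenation: ab* is a(b*), not (ab)*.
print(full(r"ab*", "abbb"))  # True
print(full(r"ab*", "abab"))  # False

# Concatenation binds tighter than a binary operator: ab|c is (ab)|c, not a(b|c).
print(full(r"ab|c", "ab"))   # True
print(full(r"ab|c", "ac"))   # False

# Explicit grouping overrides the default precedence.
print(full(r"(?:ab)*", "abab"))   # True
print(full(r"a(?:b|c)", "ac"))    # True
```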

We therefore split the regular expression constructs in lttoolbox into two categories and use different types of tags for regex operators and for enclosing operators.

Regular Expression Operators

Tag/Symbol    Meaning
re            regular expression operator tag
n             regular expression operator name
gr            grouping operator
gro           grouping operator (optional)
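
A possible reading of the enclosing tags is that gr corresponds to the twolc grouping brackets [ ] and gro to the optional brackets ( ). The sketch below renders a def element on that assumption. The names BRACKETS, render and def_to_twolc, and the OptVowel definition, are hypothetical and used only for illustration; the re operator tag is not handled because its syntax is not specified here.

```python
import xml.etree.ElementTree as ET

# Assumed bracket pair for each enclosing tag: gr -> [ ... ], gro -> ( ... ).
BRACKETS = {"gr": ("[", "]"), "gro": ("(", ")")}


def render(elem):
    """Render one element of the proposed syntax as a twolc-style expression."""
    if elem.tag in BRACKETS:
        opening, closing = BRACKETS[elem.tag]
        inner = " ".join(render(child) for child in elem)
        return f"{opening} {inner} {closing}"
    # Anything else (for example <set n="..."/>) is referenced by its name.
    return elem.get("n")


def def_to_twolc(def_xml):
    """Render a <def> element as a line of a twolc-style Definitions section."""
    definition = ET.fromstring(def_xml)
    expression = " ".join(render(child) for child in definition)
    return f"{definition.get('n')} = {expression} ;"


print(def_to_twolc('<def n="BackVow_Group"><gr><set n="Vowels"/></gr><set n="BackVow"/></def>'))
# BackVow_Group = [ Vowels ] BackVow ;
print(def_to_twolc('<def n="OptVowel"><gro><set n="Vowels"/></gro></def>'))
# OptVowel = ( Vowels ) ;
```

Keeping the enclosing tags separate from the operator tag lets nesting in the XML stand in for bracket nesting, so no precedence rules are needed when reading the dictionary file.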