Working with twol

Latest revision as of 12:26, 26 September 2016

Guidelines

  • Every rule should have a comment giving the input from the morphotactics (e.g. те>{A}т{ь}) and the expected output from the phonology (e.g. тет), with no exceptions. The current format for this is as follows:
!@ т е >:0 {A}:0 т {ь}:0
  • Define your alphabet in full. It should contain all your alphabetic symbols, as well as your archiphonemes with their default realisations.
  • Some limitations:
    • You can't insert from 0; i.e., you can't write a rule like 0:n and expect to be able to insert n. You must specify an archiphoneme (like {n}) where you need insertion to happen.
    • You shouldn't have two characters as the input or output of a rule. So instead of having a rule like cc:sh, you will need two rules: c:s and c:h, with appropriate contexts.
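The alphabet and the two-rule split described above can be sketched as follows. This is only an illustration: the symbols, the realisations given for {n}, and the triggering context (a morpheme boundary followed by i) are all made up for the example and are not taken from any real language pair.

```
Alphabet
  a c h i s          ! plain symbols (a real alphabet lists every letter)
  %{n%}:0 %{n%}:n    ! archiphoneme {n}: assumed to delete by default, n where a rule requires it
  c:s c:h            ! the two single-symbol changes used below
  %>:0 ;             ! morpheme boundary, never realised on the surface

Rules

!@ a c:s c:h >:0 i
"First c of cc becomes s before a boundary and i (hypothetical context)"
c:s <=> _ c: %>: i ;

!@ a c:s c:h >:0 i
"Second c of cc becomes h before a boundary and i (hypothetical context)"
c:h <=> c: _ %>: i ;
```

Each rule changes exactly one symbol, and the two contexts do not overlap, so the pair of rules together gives the effect that a single cc:sh rule would have had.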

Writing rules

Deciding on the input symbol

Deciding on the input symbol is the first problem to solve when writing a twol rule. This is the symbol that is used in the morphology, and it forms the first part of any rule (the left-hand side of a pair such as x:y).

For anyone familiar with phonology, the input symbol is roughly the same idea as the underlying form. If you have a good understanding of what this is, then you can probably skip much of this document; everyone else should keep reading.

In simple cases, the input symbol is simply the form of the character that occurs most widely. For example, if x changes to y in some restricted condition, then x is the input symbol, since it occurs in the most unrestricted environment, and y is a conditioned output symbol. See #Phonologically conditioned symbol change for how to write this kind of rule.
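A minimal sketch of the x-to-y case might look like this, assuming (purely for illustration) that the restricted condition is "immediately before z":

```
Alphabet
  x y z
  x:y ;              ! x may surface as y; everywhere else it stays x

Rules

!@ x:y z
"x becomes y before z (hypothetical condition)"
x:y <=> _ z: ;
```

The input symbol x is the one written in the morphology; y only ever appears on the output side of the pair.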

In some cases, it's not clear what the input symbol should be, since its output forms occur in restricted environments. Vowel harmony is a good example of this. For these cases, you should probably choose an archiphoneme. Apertium convention is to put archiphonemes in {}s, with archiphonemes that primarily change form in upper case (e.g., %{A%}) and archiphonemes that delete in lower case (e.g., %{y%}). Note that the {}s have to be escaped with %. See ... for more information on how to write this kind of rule.
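As a rough sketch of the archiphoneme approach, the fragment below treats {A} as a harmonising vowel. The vowel and consonant inventories, the example stem, and the choice of e as the realisation elsewhere are assumptions made for the example, not a description of any particular language.

```
Alphabet
  a e o ö k l t
  %{A%}:a %{A%}:e    ! {A} has no surface form of its own; it is always a or e
  %>:0 ;

Sets
  BackVow = a o ;
  Cns     = k l t ;

Rules

!@ t o k >:0 {A}:a
"Archiphoneme A is realised as a after a back vowel (hypothetical harmony)"
%{A%}:a <=> :BackVow [ :Cns | %>: ]* _ ;
```

Because the rule uses <=>, {A} surfaces as a only in the stated context; in every other position the remaining pair from the alphabet, {A}:e, is the one that applies. Note that the braces are escaped with % in the rule file but written bare in the !@ comment, following the comment format shown in the guidelines.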

When you want to delete a character, the input character is simply the character that gets deleted. See #Phonologically conditioned deletion for more information.
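A deletion rule can be sketched as below. The condition used here (a stem-final e is dropped before a suffix that begins with a vowel) is invented for the example.

```
Alphabet
  a e i k t
  e:0                ! e may be deleted
  %>:0 ;

Sets
  Vowel = a e i ;

Rules

!@ k a t e:0 >:0 i
"Stem-final e is deleted before a vowel-initial suffix (hypothetical condition)"
e:0 <=> _ %>: Vowel: ;
```

The input side of the pair is the ordinary character e, so nothing special needs to be written in the morphology.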

When you want to insert a character, you again have to use an archiphoneme, since otherwise you have no input character. To learn more about writing this sort of rule, see #Phonologically conditioned insertion.
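A sketch of insertion through an archiphoneme follows, assuming for illustration that {n} is placed at the start of a suffix in the morphology and should surface as n only between two vowels:

```
Alphabet
  a i k t
  %{n%}:0 %{n%}:n    ! {n} is deleted unless a rule requires n
  %>:0 ;

Sets
  Vowel = a i ;

Rules

!@ k a t a >:0 {n}:n i
!@ k a t >:0 {n}:0 i
"Archiphoneme n surfaces as n between vowels (hypothetical condition)"
%{n%}:n <=> :Vowel %>: _ :Vowel ;
```

From the point of view of the rule this is not insertion from 0 at all: the morphology supplies {n} as the input symbol, and the rule only decides whether it is realised as n or as nothing.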

Types of rules

Phonologically conditioned deletion

Morphologically conditioned deletion

Phonologically conditioned symbol change

There are several types of this:

  • one character to another
  • multiple characters to one character
  • one character to multiple characters
  • unknown input character to one of a number of characters

Morphologically conditioned symbol change

Phonologically conditioned insertion

Morphologically conditioned insertion

See also

  • Archiphonemes