Difference between revisions of "Starting a new language with lttoolbox"
{{TOCD}} |
:''For information on how to install lttoolbox, see [[lttoolbox]] and [[minimal installation from SVN]]'' |
|||
This page describes how to start a new language with [[lttoolbox]]. Because lttoolbox is not well suited to agglutinative languages, or to languages with complex and regular morphophonology (or at least no one has yet written a dictionary from scratch using lttoolbox for one of these languages), we will work on a language with simpler and less regular morphology. We encourage people to use lttoolbox wherever possible: it has a straightforward syntax, offers useful validation features, and is a canonical part of Apertium, so no extra software needs to be installed.
|||
==Preliminaries== |
|||
A morphological transducer in lttoolbox typically consists of a single file, a <code>.dix</code> file. This file defines both how morphemes in the language are joined together (''morphotactics'') and what changes happen when they are joined (''morphographemics'' or ''morphophonology''). For example:
|||
* Morphotactics: wolf<n><pl> → wolf + s |
|||
* Morphographemics: wolf + s → wolves |
|||
These two phenomena are treated in the same file. |
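As a rough sketch of how both phenomena end up in one file, the ''wolf'' example could be encoded as shown here. The paradigm name <code>wol/f__n</code> and the point at which the stem is cut are assumptions made for illustration; the elements used are introduced step by step in the Lexicon section.

<pre>
<pardef n="wol/f__n">
  <!-- singular: "f" is added on both the surface and the lexical side -->
  <e><p><l>f</l><r>f<s n="n"/><s n="sg"/></r></p></e>
  <!-- plural: the surface side gets "ves", the lexical side keeps the "f" of the lemma -->
  <e><p><l>ves</l><r>f<s n="n"/><s n="pl"/></r></p></e>
</pardef>

<!-- the lexical entry supplies only the part shared by all forms -->
<e lm="wolf"><i>wol</i><par n="wol/f__n"/></e>
</pre>

The choice of which suffix follows which tags is the morphotactics; the fact that the plural is spelled ''wolves'' rather than ''wolfs'' is the morphographemics. Both are expressed by the same pair of entries.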
|||
==The language== |
|||
The language we will be modelling is Upper Sorbian, a Slavic language spoken in Germany. There is a limited grammar available in English [http://serbscina.w.interia.pl/iso/eindex.htm here], and that is what we will be basing our analysis on. The part of speech we're going to look at for this small tutorial is nouns. Nouns in Upper Sorbian have seven cases (nominative, genitive, dative, accusative, locative, instrumental, vocative), three numbers (singular, dual, plural) and three genders (masculine, feminine, neuter). Like other Slavic languages, the category of animacy is distinguished in the masculine.<ref>This description is simplistic; the reality is more complicated, but it will do for a tutorial.</ref> |
|||
===Paradigms=== |
|||
Here we give four example paradigms; these will form the basis of our implementation. |
|||
;Masculine animate (''nan'' "father") |
|||
{|class=wikitable
! !! Singular !! Dual !! Plural
|-
| Nominative || nan || nan'''aj''' || nan'''ojo'''
|-
| Genitive || nan'''a''' || nan'''ow''' || nan'''ow'''
|-
| Dative || nan'''ej''' || nan'''omaj''' || nan'''am'''
|-
| Accusative || nan'''a''' || nan'''ow''' || nan'''ow'''
|-
| Instrumental || nan'''om''' || nan'''omaj''' || nan'''ami'''
|-
| Locative || nan'''je''' || nan'''omaj''' || nan'''ach'''
|-
| Vocative || nan'''o'''! || nan'''aj'''! || nan'''ojo'''!
|}
|||
;Masculine inanimate (''hrěch'' "sin") |
|||
The differences from the masculine animate paradigm are indicated in blue. |
|||
{|class=wikitable
! !! Singular !! Dual !! Plural
|-
| Nominative || hrěch || hrěch'''aj''' || <span style="background-color:#cceeff">hrěch'''i'''</span>
|-
| Genitive || hrěch'''a''' || hrěch'''ow''' || hrěch'''ow'''
|-
| Dative || hrěch'''ej''' || hrěch'''omaj''' || hrěch'''am'''
|-
| Accusative || <span style="background-color:#cceeff">hrěch</span> || <span style="background-color:#cceeff">hrěch'''aj'''</span> || <span style="background-color:#cceeff">hrěch'''i'''</span>
|-
| Instrumental || hrěch'''om''' || hrěch'''omaj''' || hrěch'''ami'''
|-
| Locative || <span style="background-color:#cceeff">hrěch'''u'''</span> || hrěch'''omaj''' || hrěch'''ach'''
|-
| Vocative || hrěch'''o'''! || hrěch'''aj'''! || <span style="background-color:#cceeff">hrěch'''i'''</span>!
|}
|||
;Feminine (''wróna'' "crow") |
|||
The parts in common with the masculine paradigms are highlighted in green. |
|||
{|class=wikitable
! !! Singular !! Dual !! Plural
|-
| Nominative || wrón'''a''' || wrón'''je''' || wrón'''y'''
|-
| Genitive || wrón'''u''' || <span style="background-color:#ccffcc">wrón'''ow'''</span> || <span style="background-color:#ccffcc">wrón'''ow'''</span>
|-
| Dative || wrón'''je''' || <span style="background-color:#ccffcc">wrón'''omaj'''</span> || <span style="background-color:#ccffcc">wrón'''am'''</span>
|-
| Accusative || wrón'''u''' || wrón'''je''' || wrón'''y'''
|-
| Instrumental || wrón'''u''' || <span style="background-color:#ccffcc">wrón'''omaj'''</span> || <span style="background-color:#ccffcc">wrón'''ami'''</span>
|-
| Locative || wrón'''je''' || <span style="background-color:#ccffcc">wrón'''omaj'''</span> || <span style="background-color:#ccffcc">wrón'''ach'''</span>
|-
| Vocative || wrón'''a'''! || wrón'''je'''! || wrón'''u'''!
|}
|||
;Neuter (''trašidło'' "monster") |
|||
Forms in common with both the masculine and feminine paradigms are highlighted in red. |
|||
{|class=wikitable
! !! Singular !! Dual !! Plural
|-
| Nominative || trašidł'''o''' || trašidł'''e''' || trašidł'''a'''
|-
| Genitive || trašidł'''a''' || <span style="background-color:#ffcccc">trašidł'''ow'''</span> || <span style="background-color:#ffcccc">trašidł'''ow'''</span>
|-
| Dative || trašidł'''u''' || <span style="background-color:#ffcccc">trašidł'''omaj'''</span> || <span style="background-color:#ffcccc">trašidł'''am'''</span>
|-
| Accusative || trašidł'''o''' || trašidł'''e''' || trašidł'''a'''
|-
| Instrumental || trašidł'''om''' || <span style="background-color:#ffcccc">trašidł'''omaj'''</span> || <span style="background-color:#ffcccc">trašidł'''ami'''</span>
|-
| Locative || trašidł'''e''' || <span style="background-color:#ffcccc">trašidł'''omaj'''</span> || <span style="background-color:#ffcccc">trašidł'''ach'''</span>
|-
| Vocative || trašidł'''o'''! || trašidł'''e'''! || trašidł'''a'''!
|}
|||
==Lexicon== |
|||
Given the description above, how do we start to write a morphological description in [[lttoolbox]]? First we need a file: open a text editor and save an empty document named <code>hsb.dix</code>.
|||
===The basics=== |
|||
;The skeleton |
|||
The basic skeleton of an lttoolbox dictionary looks like the following: |
|||
<pre>
<dictionary>
  <alphabet>abc...</alphabet>
  <sdefs>
    ...
  </sdefs>
  <pardefs>
    ...
  </pardefs>
  <section id="main" type="standard">
    ...
  </section>
</dictionary>
</pre>
|||
Type this into the file. It gives the outline of the main parts of our morphology: the alphabet (used for tokenisation); the symbols (or ''tags''), which give us useful mnemonics for grammatical features; the {{tag|pardefs}} section, which holds our inflectional paradigms; and finally the main section of the file, which contains our lexical items.
|||
;Symbol (tag) definitions |
|||
We start with the list of symbols that will encode our grammatical features (part of speech, gender, number, case). The page [[list of symbols]] gives some common tags in Apertium. Generally we try to tag features that have the same name across languages in the same way; for example, the tag for "nominative" is {{tag|nom}}, regardless of whether we are talking about Romanian, Serbo-Croatian, Icelandic or Albanian. Symbols are defined in the {{tag|sdefs}} section with {{tag|sdef}} elements.
|||
<pre>
<sdefs>
  <sdef n="n" c="Noun"/>
  <sdef n="ma" c="Masculine (animate)"/>
  <sdef n="mi" c="Masculine (inanimate)"/>
  <sdef n="nt" c="Neuter"/>
  <sdef n="f" c="Feminine"/>
  <sdef n="sg" c="Singular"/>
  <sdef n="du" c="Dual"/>
  <sdef n="pl" c="Plural"/>
  <sdef n="nom" c="Nominative"/>
  <sdef n="gen" c="Genitive"/>
  <sdef n="dat" c="Dative"/>
  <sdef n="acc" c="Accusative"/>
  <sdef n="ins" c="Instrumental"/>
  <sdef n="loc" c="Locative"/>
  <sdef n="voc" c="Vocative"/>
</sdefs>
</pre>
|||
The <code>c</code> attribute of each symbol definition stands for "comment". It is optional, but convenient if you have a lot of tags and want a quick reference to what they mean.
|||
;Our first paradigm |
|||
Once the symbols are defined, the next step is to write our first paradigm. We'll start with the paradigm for ''nan'' "father". There is a convention in Apertium that each major paradigm identifier is made up of at least the name of an exemplar word and its part of speech. In this case we will also add the gender.
|||
A paradigm is made up of a series of entries. Each entry has a ''pair'' ({{tag|p}}), which in turn has a ''left'' side ({{tag|l}}) and a ''right'' side ({{tag|r}}). Normally, the [[surface form]] is found on the left and the [[lexical form]] on the right. |
|||
We can use the symbols we defined earlier with {{tag|sdef}} tags by calling them with the {{tag|s}} element. |
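To make the structure of a single entry concrete before looking at the whole paradigm, here is one entry on its own, with comments spelling out what each side contributes. The comments are explanatory additions and are not required by lttoolbox.

<pre>
<!-- <l> holds the surface suffix; <r> holds the tags of the lexical form -->
<e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="gen"/></r></p></e>
<!-- combined with the stem "nan": surface "nanow", lexical "nan<n><ma><pl><gen>" -->
</pre>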
|||
<pre>
<pardefs>
  <pardef n="nan__n_ma">
    <e><p><l></l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="nom"/></r></p></e>
    <e><p><l>a</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="gen"/></r></p></e>
    <e><p><l>ej</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="dat"/></r></p></e>
    <e><p><l>a</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="acc"/></r></p></e>
    <e><p><l>om</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="ins"/></r></p></e>
    <e><p><l>je</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="loc"/></r></p></e>
    <e><p><l>o</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="voc"/></r></p></e>
    <e><p><l>aj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="nom"/></r></p></e>
    <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="gen"/></r></p></e>
    <e><p><l>omaj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="dat"/></r></p></e>
    <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="acc"/></r></p></e>
    <e><p><l>omaj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="ins"/></r></p></e>
    <e><p><l>omaj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="loc"/></r></p></e>
    <e><p><l>aj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="voc"/></r></p></e>
    <e><p><l>ojo</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="nom"/></r></p></e>
    <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="gen"/></r></p></e>
    <e><p><l>am</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="dat"/></r></p></e>
    <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="acc"/></r></p></e>
    <e><p><l>ami</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="ins"/></r></p></e>
    <e><p><l>ach</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="loc"/></r></p></e>
    <e><p><l>ojo</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="voc"/></r></p></e>
  </pardef>
</pardefs>
</pre>
|||
;Using the paradigm |
|||
Now that we've defined a paradigm, we can add a word that uses it. The obvious choice is ''nan'', since the paradigm is named after it.
|||
<pre>
<section id="main" type="standard">
  <e lm="nan"><i>nan</i><par n="nan__n_ma"/></e>
</section>
</pre>
|||
The {{tag|e}} element is the same as in the paradigm, but in the case of lexical entries (as opposed to morphological entries), it commonly contains the attribute <code>lm</code> "lemma". The {{tag|i}} tag stands for "invariant" and means that the left side is the same as the right side. |
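In other words, {{tag|i}} is a shorthand for a pair whose two sides are identical. The two entries in this sketch should behave the same way; only one of them would go in the dictionary.

<pre>
<!-- shorthand, as used in this tutorial -->
<e lm="nan"><i>nan</i><par n="nan__n_ma"/></e>

<!-- the same entry with the pair written out in full -->
<e lm="nan"><p><l>nan</l><r>nan</r></p><par n="nan__n_ma"/></e>
</pre>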
|||
So by this point we should have a whole dictionary with a single word in it. Save the file. |
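Put together, the file at this point might look like the following sketch. It is abbreviated: only the symbols and paradigm entries needed for ''nan'' and ''nana'' are shown, and the remaining entries are the ones listed in the full paradigm. The XML declaration and the contents of {{tag|alphabet}} are assumptions for illustration; the alphabet here lists only the letters that occur in this tutorial's four nouns, not the full Upper Sorbian alphabet.

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <!-- assumption: letters of the four example nouns only -->
  <alphabet>ACDEĚHIJŁMNOÓRŠTUWYacdeěhijłmnoórštuwy</alphabet>
  <sdefs>
    <sdef n="n" c="Noun"/>
    <sdef n="ma" c="Masculine (animate)"/>
    <sdef n="sg" c="Singular"/>
    <sdef n="nom" c="Nominative"/>
    <sdef n="gen" c="Genitive"/>
  </sdefs>
  <pardefs>
    <pardef n="nan__n_ma">
      <e><p><l></l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="nom"/></r></p></e>
      <e><p><l>a</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="gen"/></r></p></e>
      <!-- the remaining 19 entries of the paradigm go here -->
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="nan"><i>nan</i><par n="nan__n_ma"/></e>
  </section>
</dictionary>
</pre>

Every symbol referenced by an {{tag|s}} element has a matching {{tag|sdef}}, which is what the validation step checks.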
|||
===Compiling=== |
|||
Once you've saved the file, you can go to the command line and validate it. Assuming that the file is called <code>hsb.dix</code>, the following command checks it against the dictionary DTD:
|||
<pre>
$ apertium-validate-dictionary hsb.dix
</pre>
|||
If the dictionary is valid, you should get no output. |
|||
This validation step is a major benefit over related software (e.g. [[HFST]]). If you leave out a symbol definition, the validation script will complain with a message such as the following:
|||
<pre>
$ apertium-validate-dictionary hsb.dix
hsb.dix:25: element s: validity error : IDREF attribute n references an unknown ID "nom"
hsb.dix:33: element s: validity error : IDREF attribute n references an unknown ID "nom"
hsb.dix:41: element s: validity error : IDREF attribute n references an unknown ID "nom"
Document hsb.dix does not validate against /home/fran/local/share/apertium/dix.dtd
</pre>
|||
In this case, it's best to go back and check that all your symbols are defined. |
|||
Assuming that our dictionary is valid, we can move to the next step and compile it. |
|||
<pre>
$ lt-comp lr hsb.dix hsb-mor.bin
main@standard 29 45
$ lt-comp rl hsb.dix hsb-gen.bin
main@standard 29 45
</pre>
|||
The <code>lr</code> and <code>rl</code> in the compilation command stand for "left to right" and "right to left", respectively. Presuming that we have our surface form on the left and our lexical form on the right, compiling <code>lr</code> will make a morphological ''analyser'', and compiling <code>rl</code> will make a ''generator''. |
|||
===Usage=== |
|||
{{see-also|lttoolbox}} |
|||
We can then test them both as follows: |
|||
<pre>
$ echo "nanow" | lt-proc hsb-mor.bin
^nanow/nan<n><ma><du><gen>/nan<n><ma><du><acc>/nan<n><ma><pl><gen>/nan<n><ma><pl><acc>$
$ echo "^nan<n><ma><pl><gen>$" | lt-proc -g hsb-gen.bin
nanow
</pre>
|||
To get a full listing of the dictionary, the command <code>lt-expand</code> can be used: |
|||
<pre>
$ lt-expand hsb.dix
nan:nan<n><ma><sg><nom>
nana:nan<n><ma><sg><gen>
nanej:nan<n><ma><sg><dat>
nana:nan<n><ma><sg><acc>
nanom:nan<n><ma><sg><ins>
nanje:nan<n><ma><sg><loc>
nano:nan<n><ma><sg><voc>
nanaj:nan<n><ma><du><nom>
nanow:nan<n><ma><du><gen>
...
</pre>
|||
We've got everything in place for building the dictionary. Now on to our next word. |
|||
==Organising paradigms== |
|||
The obvious thing to do when adding the word ''hrěch'' "sin" would be to duplicate the <code>nan__n_ma</code> paradigm, changing the gender tag and those surface forms that differ. We would then end up with a new paradigm, something like:
|||
<pre>
<pardef n="hrěch__n_mi">
  <e><p><l></l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="nom"/></r></p></e>
  <e><p><l>a</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="gen"/></r></p></e>
  <e><p><l>ej</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="dat"/></r></p></e>
  <e><p><l></l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="acc"/></r></p></e>
  <e><p><l>om</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="ins"/></r></p></e>
  <e><p><l>u</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="loc"/></r></p></e>
  <e><p><l>o</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="voc"/></r></p></e>
  <e><p><l>aj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="nom"/></r></p></e>
  <e><p><l>ow</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="gen"/></r></p></e>
  <e><p><l>omaj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="dat"/></r></p></e>
  <e><p><l>aj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="acc"/></r></p></e>
  <e><p><l>omaj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="ins"/></r></p></e>
  <e><p><l>omaj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="loc"/></r></p></e>
  <e><p><l>aj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="voc"/></r></p></e>
  <e><p><l>i</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="nom"/></r></p></e>
  <e><p><l>ow</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="gen"/></r></p></e>
  <e><p><l>am</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="dat"/></r></p></e>
  <e><p><l>i</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="acc"/></r></p></e>
  <e><p><l>ami</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="ins"/></r></p></e>
  <e><p><l>ach</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="loc"/></r></p></e>
  <e><p><l>i</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="voc"/></r></p></e>
</pardef>
</pre>
|||
We add an entry in the main section: |
|||
<pre>
<e lm="hrěch"><i>hrěch</i><par n="hrěch__n_mi"/></e>
</pre>
|||
All is fine, and it's a good place to start, but if we look at the tables above, the paradigm for ''nan'' and the paradigm for ''hrěch'' share many suffixes. We can call paradigms from other paradigms, so why should we duplicate them? |
|||
Instead, we first split the common suffixes out into a separate paradigm. Let's call it <code>common__m</code> (for common masculine suffixes).
|||
<pre>
<pardef n="common__m">
  <e><p><l></l><r><s n="sg"/><s n="nom"/></r></p></e>
  <e><p><l>a</l><r><s n="sg"/><s n="gen"/></r></p></e>
  <e><p><l>ej</l><r><s n="sg"/><s n="dat"/></r></p></e>
  <e><p><l>om</l><r><s n="sg"/><s n="ins"/></r></p></e>
  <e><p><l>o</l><r><s n="sg"/><s n="voc"/></r></p></e>
  <e><p><l>aj</l><r><s n="du"/><s n="nom"/></r></p></e>
  <e><p><l>ow</l><r><s n="du"/><s n="gen"/></r></p></e>
  <e><p><l>omaj</l><r><s n="du"/><s n="dat"/></r></p></e>
  <e><p><l>omaj</l><r><s n="du"/><s n="ins"/></r></p></e>
  <e><p><l>omaj</l><r><s n="du"/><s n="loc"/></r></p></e>
  <e><p><l>aj</l><r><s n="du"/><s n="voc"/></r></p></e>
  <e><p><l>ow</l><r><s n="pl"/><s n="gen"/></r></p></e>
  <e><p><l>am</l><r><s n="pl"/><s n="dat"/></r></p></e>
  <e><p><l>ami</l><r><s n="pl"/><s n="ins"/></r></p></e>
  <e><p><l>ach</l><r><s n="pl"/><s n="loc"/></r></p></e>
</pardef>
</pre>
|||
(Note: We don't include the part of speech or gender, as that is different depending on the lemma.) |
|||
With this "common" paradigm available, we can simplify both the <code>nan__n_ma</code> and <code>hrěch__n_mi</code> paradigms as follows:
|||
<pre>
<pardef n="nan__n_ma">
  <e><p><l></l><r><s n="n"/><s n="ma"/></r></p><par n="common__m"/></e>
  <e><p><l>a</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="acc"/></r></p></e>
  <e><p><l>je</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="loc"/></r></p></e>
  <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="acc"/></r></p></e>
  <e><p><l>ojo</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="nom"/></r></p></e>
  <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="acc"/></r></p></e>
  <e><p><l>ojo</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="voc"/></r></p></e>
</pardef>

<pardef n="hrěch__n_mi">
  <e><p><l></l><r><s n="n"/><s n="mi"/></r></p><par n="common__m"/></e>
  <e><p><l></l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="acc"/></r></p></e>
  <e><p><l>u</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="loc"/></r></p></e>
  <e><p><l>aj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="acc"/></r></p></e>
  <e><p><l>i</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="nom"/></r></p></e>
  <e><p><l>i</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="acc"/></r></p></e>
  <e><p><l>i</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="voc"/></r></p></e>
</pardef>
</pre>
|||
Factoring out common suffixes makes paradigms more maintainable but also more complicated to understand. The features of the language, the depth of the description and the intuitions of the person writing the dictionary will dictate to what extent parts can be factored out in this way. |
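To make the chaining concrete, the sketch here traces how one form is assembled when <code>nan__n_ma</code> calls <code>common__m</code>. The two entries are taken from the paradigms in this section; the comments are explanatory additions.

<pre>
<!-- step 1, from nan__n_ma: nothing on the surface side, <n><ma> on the lexical side -->
<e><p><l></l><r><s n="n"/><s n="ma"/></r></p><par n="common__m"/></e>

<!-- step 2, from common__m: "ow" on the surface side, <pl><gen> on the lexical side -->
<e><p><l>ow</l><r><s n="pl"/><s n="gen"/></r></p></e>

<!-- together, for the stem "nan": surface "nanow", lexical "nan<n><ma><pl><gen>" -->
</pre>

Each side is simply concatenated in order, which is why the part of speech and gender have to come from the calling paradigm and the number and case from the shared one.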
|||
Now try and add the other two words to the dictionary, along with their inflectional paradigms. A solution can be found on the [[Talk:Starting a new language with lttoolbox|talk page]]. |
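One detail these two words bring up is that their nominative singular already carries a suffix (''wrón-a'', ''trašidł-o''), so the stem in the main section is shorter than the lemma. A possible way to handle this, sketched here with a hypothetical paradigm name and only three of the 21 forms, is to put the stem in {{tag|i}} and let the paradigm add the lemma ending on the lexical side. The solution on the talk page may be organised differently.

<pre>
<pardef n="wrón/a__n_f">
  <!-- the "a" on the right-hand side restores the lemma "wróna" in the lexical form -->
  <e><p><l>a</l><r>a<s n="n"/><s n="f"/><s n="sg"/><s n="nom"/></r></p></e>
  <e><p><l>u</l><r>a<s n="n"/><s n="f"/><s n="sg"/><s n="gen"/></r></p></e>
  <e><p><l>je</l><r>a<s n="n"/><s n="f"/><s n="sg"/><s n="dat"/></r></p></e>
  <!-- the remaining forms follow the same pattern -->
</pardef>

<e lm="wróna"><i>wrón</i><par n="wrón/a__n_f"/></e>
</pre>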
|||
You can also try adding the alternative forms (for example ''hrěchu'' as a possible genitive singular of ''hrěch''). |
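One way to add such an alternative is sketched here, under the assumption that ''hrěchu'' should be recognised by the analyser while the generator keeps producing ''hrěcha''. The extra entry is marked with <code>r="LR"</code>, which restricts it to the left-to-right (analysis) direction.

<pre>
<!-- inside <pardef n="hrěch__n_mi">: analysed as genitive singular, never generated -->
<e r="LR"><p><l>u</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="gen"/></r></p></e>
</pre>

Without the restriction, the same lexical form would map to two surface forms in the generator, so it is worth deciding for each alternative which form should be the one that gets generated.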
|||
==Notes== |
|||
<references/> |
|||
==Further reading== |
|||
==See also== |
* [[Monodix basics]] |
|||
* [[Starting a new language with lttoolbox]] |
|||
* [[Starting a new language with HFST]] |
|||
[[Category:Documentation]] |
[[Category:Documentation in English]] |
|||
[[Category:Quickstart]] |
Latest revision as of 12:10, 26 September 2016
- For information on how to install lttoolbox, see lttoolbox and minimal installation from SVN
This page is going to describe how to start a new language with lttoolbox. As lttoolbox is not really suited to agglutinative languages, or languages with complex and regular morphophonology (or at least no-one has written a dictionary from scratch using lttoolbox for one of these languages yet), we're going to work on one with simpler and less regular morphology. We particularly encourage people to use lttoolbox wherever possible; it has a straightforward syntax, has some very useful features for validation and is a canonical part of Apertium, not requiring any special software to be installed.
Preliminaries[edit]
A morphological transducer in lttoolbox has typically one file, a .dix
file. This defines both how morphemes in the language are joined together, morphotactics, and how changes happen when these morphemes are joined together, morphographemics (or morphophonology). For example,
- Morphotactics: wolf<n><pl> → wolf + s
- Morphographemics: wolf + s → wolves
These two phenomena are treated in the same file.
The language[edit]
The language we will be modelling is Upper Sorbian, a Slavic language spoken in Germany. There is a limited grammar available in English here, and that is what we will be basing our analysis on. The part of speech we're going to look at for this small tutorial is nouns. Nouns in Upper Sorbian have seven cases (nominative, genitive, dative, accusative, locative, instrumental, vocative), three numbers (singular, dual, plural) and three genders (masculine, feminine, neuter). Like other Slavic languages, the category of animacy is distinguished in the masculine.[1]
Paradigms[edit]
Here we give four example paradigms; these will form the basis of our implementation.
- Masculine animate (nan "father")
Singular | Dual | Plural | |
---|---|---|---|
Nominative | nan | nanaj | nanojo |
Genitive | nana | nanow | nanow |
Dative | nanej | nanomaj | nanam |
Accusative | nana | nanow | nanow |
Instrumental | nanom | nanomaj | nanami |
Locative | nanje | nanomaj | nanach |
Vocative | nano! | nanaj! | nanojo! |
- Masculine inanimate (hrěch "sin")
The differences from the masculine animate paradigm are indicated in blue.
Singular | Dual | Plural | |
---|---|---|---|
Nominative | hrěch | hrěchaj | hrěchi |
Genitive | hrěcha | hrěchow | hrěchow |
Dative | hrěchej | hrěchomaj | hrěcham |
Accusative | hrěch | hrěchaj | hrěchi |
Instrumental | hrěchom | hrěchomaj | hrěchami |
Locative | hrěchu | hrěchomaj | hrěchach |
Vocative | hrěcho! | hrěchaj! | hrěchi! |
- Feminine (wróna "crow")
The parts in common with the masculine paradigms are highlighted in green.
Singular | Dual | Plural | |
---|---|---|---|
Nominative | wróna | wrónje | wróny |
Genitive | wrónu | wrónow | wrónow |
Dative | wrónje | wrónomaj | wrónam |
Accusative | wrónu | wrónje | wróny |
Instrumental | wrónu | wrónomaj | wrónami |
Locative | wrónje | wrónomaj | wrónach |
Vocative | wróna! | wrónje! | wrónu! |
- Neuter (trašidło "monster")
Forms in common with both the masculine and feminine paradigms are highlighted in red.
Singular | Dual | Plural | |
---|---|---|---|
Nominative | trašidło | trašidłe | trašidła |
Genitive | trašidła | trašidłow | trašidłow |
Dative | trašidłu | trašidłomaj | trašidłam |
Accusative | trašidło | trašidłe | trašidła |
Instrumental | trašidłom | trašidłomaj | trašidłami |
Locative | trašidłe | trašidłomaj | trašidłach |
Vocative | trašidło! | trašidłe! | trašidła! |
Lexicon[edit]
Given the description above, how do we start to write a morphological description in lttoolbox? Well, first we start with our filename, hsb.dix
, so open up a text editor and save an empty document with that name.
The basics[edit]
- The skeleton
The basic skeleton of an lttoolbox dictionary looks like the following:
<dictionary> <alphabet>abc...</alphabet> <sdefs> ... </sdefs> <pardefs> ... </pardefs> <section id="main" type="standard"> ... </section> </dictionary>
So type this up into the file, and this gives the outline of our the main parts of our morphology: the alphabet (used for tokenisation); the symbols (or tags), which give us useful mnemonics for grammatical features; the <pardefs>
section, which gives our inflectional paradigms; and finally the main section of the file, which contains our lexical items.
- Symbol (tag) definitions
The first thing we'll start with is the list of symbols which are going to encode our grammatical features (part-of-speech, gender, number, case). The page list of symbols gives some common tags in Apertium. Generally we try and keep features which are named the same thing among languages tagged the same; thus, for example, the tag for "nominative" will be <nom>
, regardless of if we are talking about Romanian, Serbo-Croatian, Icelandic or Albanian. Symbols are defined in the <sdefs>
section with <sdef>
elements.
<sdefs> <sdef n="n" c="Noun"/> <sdef n="ma" c="Masculine (animate)"/> <sdef n="mi" c="Masculine (inanimate)"/> <sdef n="nt" c="Neuter"/> <sdef n="f" c="Feminine"/> <sdef n="sg" c="Singular"/> <sdef n="du" c="Dual"/> <sdef n="pl" c="Plural"/> <sdef n="nom" c="Nominative"/> <sdef n="gen" c="Genitive"/> <sdef n="dat" c="Dative"/> <sdef n="acc" c="Accusative"/> <sdef n="ins" c="Instrumental"/> <sdef n="loc" c="Locative"/> <sdef n="voc" c="Vocative"/> </sdefs>
The c
after each symbol definition stands for comment and is optional but quite convenient if you have a lot of tags and want a quick reference to what they mean.
- Our first paradigm
After we've defined our symbols, then the next thing to do is to write our first paradigm. We'll start with the paradigm for nan "father". There is a convention in Apertium that each major paradigm identifier is made up of at least the name of an exemplar word and its part of speech. In this case we will also add the gender.
A paradigm is made up of a series of entries. Each entry has a pair (<p>
), which in turn has a left side (<l>
) and a right side (<r>
). Normally, the surface form is found on the left and the lexical form on the right.
We can use the symbols we defined earlier with <sdef>
tags by calling them with the <s>
element.
<pardefs> <pardef n="nan__n_ma"> <e><p><l></l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="nom"/></r></p></e> <e><p><l>a</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="gen"/></r></p></e> <e><p><l>ej</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="dat"/></r></p></e> <e><p><l>a</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="acc"/></r></p></e> <e><p><l>om</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="ins"/></r></p></e> <e><p><l>je</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="loc"/></r></p></e> <e><p><l>o</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="voc"/></r></p></e> <e><p><l>aj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="nom"/></r></p></e> <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="gen"/></r></p></e> <e><p><l>omaj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="dat"/></r></p></e> <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="acc"/></r></p></e> <e><p><l>omaj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="ins"/></r></p></e> <e><p><l>omaj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="loc"/></r></p></e> <e><p><l>aj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="voc"/></r></p></e> <e><p><l>ojo</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="nom"/></r></p></e> <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="gen"/></r></p></e> <e><p><l>am</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="dat"/></r></p></e> <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="acc"/></r></p></e> <e><p><l>ami</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="ins"/></r></p></e> <e><p><l>ach</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="loc"/></r></p></e> <e><p><l>ojo</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="voc"/></r></p></e> </pardef> </pardefs>
- Using the paradigm
Now that we've defined a paradigm, we can add a word that uses it. The obvious choice is "nan", being as that is the name of the paradigm.
<section id="main" type="standard"> <e lm="nan"><i>nan</i><par n="nan__n_ma"/></e> </section>
The <e>
element is the same as in the paradigm, but in the case of lexical entries (as opposed to morphological entries), it commonly contains the attribute lm
"lemma". The <i>
tag stands for "invariant" and means that the left side is the same as the right side.
So by this point we should have a whole dictionary with a single word in it. Save the file.
Compiling[edit]
Once you've saved the file, you can go to the command line and try to validate it. Presuming that the file is called hsb.dix
, then the following will check it against the definition:
$ apertium-validate-dictionary hsb.dix
If the dictionary is valid, you should get no output.
This is a major benefit over related software (e.g. HFST). If you leave out a symbol definition, then you will get an angry message from the validation script, such as the following:
$ apertium-validate-dictionary hsb.dix hsb.dix:25: element s: validity error : IDREF attribute n references an unknown ID "nom" hsb.dix:33: element s: validity error : IDREF attribute n references an unknown ID "nom" hsb.dix:41: element s: validity error : IDREF attribute n references an unknown ID "nom" Document hsb.dix does not validate against /home/fran/local/share/apertium/dix.dtd
In this case, it's best to go back and check that all your symbols are defined.
Assuming that our dictionary is valid, we can move to the next step and compile it.
$ lt-comp lr hsb.dix hsb-mor.bin main@standard 29 45 $ lt-comp rl hsb.dix hsb-gen.bin main@standard 29 45
The lr
and rl
in the compilation command stand for "left to right" and "right to left", respectively. Presuming that we have our surface form on the left and our lexical form on the right, compiling lr
will make a morphological analyser, and compiling rl
will make a generator.
Usage[edit]
- See also: lttoolbox
We can then test them both as follows:
$ echo "nanow" | lt-proc hsb-mor.bin ^nanow/nan<n><ma><du><gen>/nan<n><ma><du><acc>/nan<n><ma><pl><gen>/nan<n><ma><pl><acc>$ $ echo "^nan<n><ma><pl><gen>$" | lt-proc -g hsb-gen.bin nanow
To get a full listing of the dictionary, the command lt-expand
can be used:
$ lt-expand hsb.dix nan:nan<n><ma><sg><nom> nana:nan<n><ma><sg><gen> nanej:nan<n><ma><sg><dat> nana:nan<n><ma><sg><acc> nanom:nan<n><ma><sg><ins> nanje:nan<n><ma><sg><loc> nano:nan<n><ma><sg><voc> nanaj:nan<n><ma><du><nom> nanow:nan<n><ma><du><gen> ...
We've got everything in place for building the dictionary. Now on to our next word.
Organising paradigms[edit]
The obvious thing to do when adding the word hrěch "sin" would be to duplicate the nan__n_ma
paradigm but change the gender and the surface forms, which are different. Then we would end up with a new paradigm, something like:
<pardef n="hrěch__n_mi"> <e><p><l></l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="nom"/></r></p></e> <e><p><l>a</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="gen"/></r></p></e> <e><p><l>ej</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="dat"/></r></p></e> <e><p><l></l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="acc"/></r></p></e> <e><p><l>om</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="ins"/></r></p></e> <e><p><l>u</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="loc"/></r></p></e> <e><p><l>o</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="voc"/></r></p></e> <e><p><l>aj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="nom"/></r></p></e> <e><p><l>ow</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="gen"/></r></p></e> <e><p><l>omaj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="dat"/></r></p></e> <e><p><l>oj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="acc"/></r></p></e> <e><p><l>omaj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="ins"/></r></p></e> <e><p><l>omaj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="loc"/></r></p></e> <e><p><l>aj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="voc"/></r></p></e> <e><p><l>i</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="nom"/></r></p></e> <e><p><l>ow</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="gen"/></r></p></e> <e><p><l>am</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="dat"/></r></p></e> <e><p><l>i</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="acc"/></r></p></e> <e><p><l>ami</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="ins"/></r></p></e> <e><p><l>ach</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="loc"/></r></p></e> <e><p><l>i</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="voc"/></r></p></e> </pardef>
We add an entry in the main section:
<e lm="hrěch"><i>hrěch</i><par n="hrěch__n_mi"/></e>
This works, and it's a good place to start, but if we look at the tables above, the paradigms for nan and hrěch share many suffixes. Paradigms can call other paradigms, so why duplicate the shared entries?
As an alternative, the first thing we do is to split out the common suffixes into a separate paradigm. Let's call it common__m (for common masculine suffixes).
<pardef n="common__m">
  <e><p><l></l><r><s n="sg"/><s n="nom"/></r></p></e>
  <e><p><l>a</l><r><s n="sg"/><s n="gen"/></r></p></e>
  <e><p><l>ej</l><r><s n="sg"/><s n="dat"/></r></p></e>
  <e><p><l>om</l><r><s n="sg"/><s n="ins"/></r></p></e>
  <e><p><l>o</l><r><s n="sg"/><s n="voc"/></r></p></e>
  <e><p><l>aj</l><r><s n="du"/><s n="nom"/></r></p></e>
  <e><p><l>ow</l><r><s n="du"/><s n="gen"/></r></p></e>
  <e><p><l>omaj</l><r><s n="du"/><s n="dat"/></r></p></e>
  <e><p><l>omaj</l><r><s n="du"/><s n="ins"/></r></p></e>
  <e><p><l>omaj</l><r><s n="du"/><s n="loc"/></r></p></e>
  <e><p><l>aj</l><r><s n="du"/><s n="voc"/></r></p></e>
  <e><p><l>ow</l><r><s n="pl"/><s n="gen"/></r></p></e>
  <e><p><l>am</l><r><s n="pl"/><s n="dat"/></r></p></e>
  <e><p><l>ami</l><r><s n="pl"/><s n="ins"/></r></p></e>
  <e><p><l>ach</l><r><s n="pl"/><s n="loc"/></r></p></e>
</pardef>
(Note: We don't include the part of speech or gender here, as those differ depending on the lemma.)
Now, with this "common" paradigm available, we can simplify both the nan__n_ma and hrěch__n_mi paradigms as follows:
<pardef n="nan__n_ma">
  <e><p><l></l><r><s n="n"/><s n="ma"/></r></p><par n="common__m"/></e>
  <e><p><l>a</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="acc"/></r></p></e>
  <e><p><l>je</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="loc"/></r></p></e>
  <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="acc"/></r></p></e>
  <e><p><l>ojo</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="nom"/></r></p></e>
  <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="acc"/></r></p></e>
  <e><p><l>ojo</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="voc"/></r></p></e>
</pardef>

<pardef n="hrěch__n_mi">
  <e><p><l></l><r><s n="n"/><s n="mi"/></r></p><par n="common__m"/></e>
  <e><p><l></l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="acc"/></r></p></e>
  <e><p><l>u</l><r><s n="n"/><s n="mi"/><s n="sg"/><s n="loc"/></r></p></e>
  <e><p><l>oj</l><r><s n="n"/><s n="mi"/><s n="du"/><s n="acc"/></r></p></e>
  <e><p><l>i</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="nom"/></r></p></e>
  <e><p><l>i</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="acc"/></r></p></e>
  <e><p><l>i</l><r><s n="n"/><s n="mi"/><s n="pl"/><s n="voc"/></r></p></e>
</pardef>
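To see why the factored nan__n_ma paradigm still yields all 21 forms, it can help to model the expansion by hand. The following is a minimal Python sketch, not a description of how lttoolbox is implemented: the tuple layout and the `expand` helper are assumptions made purely for illustration. An entry that calls another paradigm contributes its own suffix and tags, followed by those of every entry in the called paradigm.

```python
# Illustrative model only: each entry is (suffix, tags, called_paradigm).
# The data layout and expand() are assumptions for this sketch; they are
# not part of lttoolbox.
PARDEFS = {
    "common__m": [
        ("", "<sg><nom>", None),
        ("a", "<sg><gen>", None),
        ("ej", "<sg><dat>", None),
        ("om", "<sg><ins>", None),
        ("o", "<sg><voc>", None),
        ("aj", "<du><nom>", None),
        ("ow", "<du><gen>", None),
        ("omaj", "<du><dat>", None),
        ("omaj", "<du><ins>", None),
        ("omaj", "<du><loc>", None),
        ("aj", "<du><voc>", None),
        ("ow", "<pl><gen>", None),
        ("am", "<pl><dat>", None),
        ("ami", "<pl><ins>", None),
        ("ach", "<pl><loc>", None),
    ],
    "nan__n_ma": [
        ("", "<n><ma>", "common__m"),  # entry that calls common__m
        ("a", "<n><ma><sg><acc>", None),
        ("je", "<n><ma><sg><loc>", None),
        ("ow", "<n><ma><du><acc>", None),
        ("ojo", "<n><ma><pl><nom>", None),
        ("ow", "<n><ma><pl><acc>", None),
        ("ojo", "<n><ma><pl><voc>", None),
    ],
}


def expand(name, pardefs):
    """Yield (suffix, tags) for every path through the paradigm `name`."""
    for suffix, tags, call in pardefs[name]:
        if call is None:
            yield suffix, tags
        else:
            # A paradigm call appends the called paradigm's suffixes and tags.
            for sub_suffix, sub_tags in expand(call, pardefs):
                yield suffix + sub_suffix, tags + sub_tags


forms = [f"nan{suffix}:nan{tags}" for suffix, tags in expand("nan__n_ma", PARDEFS)]
print(len(forms))  # 21: 15 shared entries plus 6 specific to nan__n_ma
for form in forms[:3]:
    print(form)
```

The first lines printed after the count are `nan:nan<n><ma><sg><nom>`, `nana:nan<n><ma><sg><gen>` and `nanej:nan<n><ma><sg><dat>`, in the same surface:analysis format that lt-expand uses.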
Factoring out common suffixes makes paradigms more maintainable but also more complicated to understand. The features of the language, the depth of the description and the intuitions of the person writing the dictionary will dictate to what extent parts can be factored out in this way.
Now try to add the other two words to the dictionary, along with their inflectional paradigms. A solution can be found on the talk page.
You can also try adding the alternative forms (for example hrěchu as a possible genitive singular of hrěch).
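Adding an alternative form means that one surface form can map to more than one analysis. The short Python sketch below illustrates this with a hand-picked subset of the hrěch entries; the `analyses` helper and the data layout are hypothetical and only model the idea, they are not lttoolbox code.

```python
from collections import defaultdict

# Hand-picked subset of hrěch__n_mi entries as (suffix, tags) pairs,
# plus the alternative genitive singular in -u. Illustrative only.
entries = [
    ("a", "<n><mi><sg><gen>"),
    ("u", "<n><mi><sg><gen>"),  # alternative genitive singular
    ("u", "<n><mi><sg><loc>"),
]


def analyses(stem, lemma, entries):
    """Map each surface form to the list of analyses that produce it."""
    table = defaultdict(list)
    for suffix, tags in entries:
        table[stem + suffix].append(lemma + tags)
    return dict(table)


table = analyses("hrěch", "hrěch", entries)
print(table["hrěchu"])
# ['hrěch<n><mi><sg><gen>', 'hrěch<n><mi><sg><loc>']
```

With the extra entry in place, hrěchu is ambiguous between genitive and locative singular, while hrěcha remains an unambiguous genitive.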
Notes
- This description is simplistic; the reality is more complicated, but it will do for a tutorial.