Starting a new language with lttoolbox

For information on how to install lttoolbox, see lttoolbox and minimal installation from SVN.

This page describes how to start a new language with lttoolbox. Because lttoolbox is not well suited to agglutinative languages, or to languages with complex and regular morphophonology (see starting a new language with HFST), we're going to work on a language with simpler and less regular morphology.

Preliminaries

A morphological transducer in lttoolbox typically consists of one file, a .dix file. This file defines both how morphemes in the language are joined together (morphotactics) and how the morphemes change when they are joined (morphographemics, or morphophonology). For example:

  • Morphotactics: wolf<n><pl> → wolf + s
  • Morphographemics: wolf + s → wolves

These two phenomena are treated in the same file.
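
As a rough illustration of how both phenomena end up in one file, the sketch below is a minimal .dix for the English example above. It is not part of the Upper Sorbian dictionary: the paradigm name wol/f__n and the split of wolf into a stem wol plus a variable part are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"  c="Noun"/>
    <sdef n="sg" c="Singular"/>
    <sdef n="pl" c="Plural"/>
  </sdefs>
  <pardefs>
    <!-- Hypothetical paradigm: the stem "wol" is shared, "f" and "ves" vary -->
    <pardef n="wol/f__n">
      <!-- surface "f" corresponds to lexical "f" + noun, singular -->
      <e><p><l>f</l><r>f<s n="n"/><s n="sg"/></r></p></e>
      <!-- surface "ves" corresponds to lexical "f" + noun, plural -->
      <e><p><l>ves</l><r>f<s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <!-- The lemma "wolf" is the stem "wol" followed by the paradigm -->
    <e lm="wolf"><i>wol</i><par n="wol/f__n"/></e>
  </section>
</dictionary>
```

Here the morphotactics (a noun takes either <sg> or <pl>) and the morphographemics (f alternates with ves) are both expressed by the two entries of a single paradigm.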

The language

The language we will be modelling is Upper Sorbian, a Slavic language spoken in Germany. A limited grammar is available in English, and that is what we will base our analysis on. The part of speech we're going to look at in this small tutorial is the noun. Nouns in Upper Sorbian have seven cases (nominative, genitive, dative, accusative, locative, instrumental, vocative), three numbers (singular, dual, plural) and three genders (masculine, feminine, neuter). As in other Slavic languages, animacy is distinguished in the masculine.

Paradigms

Here we give four example paradigms, which will form the basis of our implementation.

Masculine animate (nan "father")

Case          Singular  Dual      Plural
Nominative    nan       nanaj     nanojo
Genitive      nana      nanow     nanow
Dative        nanej     nanomaj   nanam
Accusative    nana      nanow     nanow
Instrumental  nanom     nanomaj   nanami
Locative      nanje     nanomaj   nanach
Vocative      nano!     nanaj!    nanojo!

Masculine inanimate (hrěch "sin")

The differences from the masculine animate paradigm are indicated in blue.

Case          Singular  Dual       Plural
Nominative    hrěch     hrěchaj    hrěchi
Genitive      hrěcha    hrěchow    hrěchow
Dative        hrěchej   hrěchomaj  hrěcham
Accusative    hrěch     hrěchaj    hrěchi
Instrumental  hrěchom   hrěchomaj  hrěchami
Locative      hrěchu    hrěchomaj  hrěchach
Vocative      hrěcho!   hrěchaj!   hrěchi!

Feminine (wróna "crow")

The parts in common with the masculine paradigms are highlighted in green.

Case          Singular  Dual      Plural
Nominative    wróna     wrónje    wróny
Genitive      wrónu     wrónow    wrónow
Dative        wrónje    wrónomaj  wrónam
Accusative    wrónu     wrónje    wróny
Instrumental  wrónu     wrónomaj  wrónami
Locative      wrónje    wrónomaj  wrónach
Vocative      wróna!    wrónje!   wrónu!

Neuter (trašidło "monster")

Forms in common with both the masculine and feminine paradigms are highlighted in red.

Case          Singular   Dual         Plural
Nominative    trašidło   trašidłe     trašidła
Genitive      trašidła   trašidłow    trašidłow
Dative        trašidłu   trašidłomaj  trašidłam
Accusative    trašidło   trašidłe     trašidła
Instrumental  trašidłom  trašidłomaj  trašidłami
Locative      trašidłe   trašidłomaj  trašidłach
Vocative      trašidło!  trašidłe!    trašidła!

Lexicon

So, given the description above, how do we start to write a morphological description in lttoolbox? First we choose a filename, hsb.dix: open a text editor and save an empty document with that name.

The basics

The skeleton

The basic skeleton of an lttoolbox dictionary looks like the following:


<dictionary>
  <alphabet>abc...</alphabet>
  <sdefs>
    ...
  </sdefs>
  <pardefs>
    ...
  </pardefs>
  <section id="main" type="standard">
    ...
  </section>
</dictionary>

Type this into the file. It gives the outline of the main parts of our morphology: the alphabet (used for tokenisation); the symbols (or tags), which give us useful mnemonics for grammatical features; the <pardefs> section, which contains our inflectional paradigms; and finally the main section of the file, which contains our lexical items.
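
For Upper Sorbian, the <alphabet> element might be filled in along the following lines. The letter inventory here is an assumption and should be checked against a reference grammar; q, v and x are included only on the assumption that they occur in loanwords.

```xml
<!-- Assumed letter inventory for Upper Sorbian, upper case then lower case -->
<alphabet>ABCČĆDEĚFGHIJKŁLMNŃOÓPQRŘSŠTUVWXYZŽabcčćdeěfghijkłlmnńoópqrřsštuvwxyzž</alphabet>
```

Digraphs such as ch and dź need no separate entry in this sketch, since each of their component characters is already listed.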

Symbol (tag) definitions

The first thing we'll define is the list of symbols that encode our grammatical features (part of speech, gender, number, case). The page list of symbols gives some common tags in Apertium. In general we try to give features that are shared between languages the same tag; for example, the tag for "nominative" is <nom>, regardless of whether we are talking about Romanian, Serbo-Croatian, Icelandic or Albanian. Symbols are defined in the <sdefs> section with <sdef> elements:


<sdefs>
  <sdef n="n"     c="Noun"/>

  <sdef n="ma"    c="Masculine (animate)"/>
  <sdef n="mi"    c="Masculine (inanimate)"/>
  <sdef n="nt"    c="Neuter"/>
  <sdef n="f"     c="Feminine"/>

  <sdef n="sg"    c="Singular"/>
  <sdef n="du"    c="Dual"/>
  <sdef n="pl"    c="Plural"/>

  <sdef n="nom"   c="Nominative"/>
  <sdef n="gen"   c="Genitive"/>
  <sdef n="dat"   c="Dative"/>
  <sdef n="acc"   c="Accusative"/>
  <sdef n="ins"   c="Instrumental"/>
  <sdef n="loc"   c="Locative"/>
  <sdef n="voc"   c="Vocative"/>
</sdefs>

Our first paradigm!

After defining our symbols, the next step is to write our first paradigm. We'll start with the paradigm for nan "father". By convention in Apertium, each major paradigm identifier is made up of at least an exemplar word and its part of speech. In this case we will also add the gender.

  <pardef n="nan__n_ma">
    <!-- <l> holds the ending added to the stem "nan"; <r> holds the tags of the analysis -->
    <e><p><l></l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="nom"/></r></p></e>
    <e><p><l>a</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="gen"/></r></p></e>
    <e><p><l>ej</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="dat"/></r></p></e>
    <e><p><l>a</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="acc"/></r></p></e>
    <e><p><l>om</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="ins"/></r></p></e>
    <e><p><l>je</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="loc"/></r></p></e>
    <e><p><l>o</l><r><s n="n"/><s n="ma"/><s n="sg"/><s n="voc"/></r></p></e>

    <e><p><l>aj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="nom"/></r></p></e>
    <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="gen"/></r></p></e>
    <e><p><l>omaj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="dat"/></r></p></e>
    <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="acc"/></r></p></e>
    <e><p><l>omaj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="ins"/></r></p></e>
    <e><p><l>omaj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="loc"/></r></p></e>
    <e><p><l>aj</l><r><s n="n"/><s n="ma"/><s n="du"/><s n="voc"/></r></p></e>

    <e><p><l>ojo</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="nom"/></r></p></e>
    <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="gen"/></r></p></e>
    <e><p><l>am</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="dat"/></r></p></e>
    <e><p><l>ow</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="acc"/></r></p></e>
    <e><p><l>ami</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="ins"/></r></p></e>
    <e><p><l>ach</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="loc"/></r></p></e>
    <e><p><l>ojo</l><r><s n="n"/><s n="ma"/><s n="pl"/><s n="voc"/></r></p></e>
  </pardef>
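
A paradigm does nothing until a lexical entry refers to it. As a sketch of what the main section could then contain, assuming the paradigm name nan__n_ma from above and that the whole word nan serves as the stem:

```xml
<section id="main" type="standard">
  <!-- lm: the lemma; <i>: the invariant stem; <par>: the paradigm supplying endings and tags -->
  <e lm="nan"><i>nan</i><par n="nan__n_ma"/></e>
</section>
```

Other masculine animate nouns that inflect like nan could reuse the same paradigm by adding further <e> entries with their own stems.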

Compiling

Paradigms

Analysis and generation

Troubleshooting

Notes


Further reading

See also