Difference between revisions of "Starting a new language with HFST"


Revision as of 20:34, 31 March 2011

For information on how to install HFST, see HFST

This page is going to describe how to start a new language with HFST. There are some great references out there to the lexc and twol formalisms, for example the FSMBook, but a lot of them deal with the proprietary Xerox implementations, not the free HFST implementation.

While the actual formalisms are more or less identical, the commands used to compile them are not necessarily the same. HFST follows a much more Unix-like philosophy, and we're going to take advantage of this. Since most Indo-European languages and isolating languages can be handled fairly easily in lttoolbox, we're going to work with a language from outside that family, one whose more complex morphology isn't easily handled in lttoolbox.

Preliminaries

A morphological transducer in HFST has two principal files. One is a lexc file, which defines how the morphemes of the language are joined together (morphotactics). The other can be a twol (two-level rules) or xfst (sequential rewrite rules) file, which describes what changes happen when these morphemes are joined together (morphographemics, or morphophonology). For example:

Morphotactics: wolf<n><pl> → wolf + s
Morphographemics: wolf + s → wolves
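
To make the division of labour concrete, here is a minimal Python sketch of the same two stages. It is not how HFST works internally: the lookup table and the single spelling rule are hand-written stand-ins for what the lexc and twol files would compile to, and all the names are made up for this illustration.

```python
# Toy two-stage pipeline: morphotactics first, then morphographemics.
# MORPHOTACTICS and the spelling rule below are illustrative stand-ins,
# not anything produced by HFST.

MORPHOTACTICS = {
    "wolf<n><pl>": "wolf + s",  # analysis -> sequence of morphemes
}


def morphographemics(morphemes):
    """Join two morphemes, applying one toy spelling rule: f + s -> ves."""
    stem, suffix = morphemes.split(" + ")
    if stem.endswith("f") and suffix == "s":
        return stem[:-1] + "ves"
    return stem + suffix


def generate(analysis):
    """Map an analysis string to a surface form via both stages."""
    return morphographemics(MORPHOTACTICS[analysis])


print(generate("wolf<n><pl>"))
```

The point of the split is that the first table knows nothing about spelling, and the second function knows nothing about tags.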

Here we're going to deal with twol, the two-level rules. If you're interested in xfst, there is a nice tutorial on the Foma site.

In the next sections we're going to start with the lexicon (lexc file) then progress onto the morphographemics (twol file).

The language

The language we're going to model today (well, start to model) is Turkmen, a Turkic language spoken in Turkmenistan. We're going to try to model the basic inflection (number, case) of nouns. Turkmen nouns inflect for six cases, two numbers, and possession. Suffixes can have different forms depending on whether they attach to a stem ending in a vowel or a stem ending in a consonant.

Vowel harmony

Simplifying a lot,[1] we can say that stems in Turkmen are one of two types: back-vowel stems or front-vowel stems. Back-vowel stems, such as mugallym "teacher", have only back vowels, and front-vowel stems, such as kädi "pumpkin", have only front vowels. The back vowels in Turkmen are: a, y, o, and u. The front vowels are: ä, e, i, ö, and ü.

So, when adding a suffix to a stem, we need to know what vowels are in the stem in order to choose the right vowel to put in the suffix.
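
As a rough illustration of that check, the following Python sketch classifies a stem under the simplified two-class view. One assumption is ours rather than the tutorial's: the function lets the last vowel of the stem decide, which only matters for stems that mix the two sets.

```python
BACK_VOWELS = set("ayou")
FRONT_VOWELS = set("äeiöü")


def harmony_class(stem):
    """Return 'back' or 'front' for a stem.

    Assumption: the last vowel in the stem decides the class.
    """
    for ch in reversed(stem.lower()):
        if ch in BACK_VOWELS:
            return "back"
        if ch in FRONT_VOWELS:
            return "front"
    raise ValueError("stem contains no vowel: %r" % stem)


for word in ("mugallym", "kädi"):
    print(word, harmony_class(word))
```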

Number

Number in Turkmen can either be undefined (where there is no suffix) or plural, where the suffix is -lar or -ler. The first is used with back vowels, and the second with front vowels.
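
The plural rule is small enough to sketch directly in Python. This is only an illustration of the rule as stated, with the same simplifying assumption that the last vowel of the stem decides between the two suffixes.

```python
BACK_VOWELS = set("ayou")
FRONT_VOWELS = set("äeiöü")


def pluralise(stem):
    """Attach -lar to back-vowel stems and -ler to front-vowel stems."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return stem + "lar"
        if ch in FRONT_VOWELS:
            return stem + "ler"
    raise ValueError("stem contains no vowel: %r" % stem)


print(pluralise("maşgala"))
print(pluralise("esger"))
```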

Case

We use a more compact representation below to show the suffixes for case. In between { and } are vowel alternations in the suffixes, and in between ( and ) are epentheses.

V = after a vowel-final stem, C = after a consonant-final stem.

Case          Suffix (V)          Suffix (C)     Usage                                   Example (V)   Example (C)
Nominative    (none)              (none)         Indicates the subject of the sentence   pagta         gazan
Genitive      -n{y,i,u,ü}ň        -{y,i,u,ü}ň    Indicates possession                    pagtanyň      gazanyň
Dative        -{a,ä} , -n{a,e}    -{a,e}         Indirect object (directed action)       pagta         gazana
Accusative    -n{y,i}             -{y,i}         Direct object                           pagtany       gazany
Inessive      -(n)d{a,e}          -d{a,e}        Time/place                              pagtada       gazanda
Instrumental  -(n)d{a,e}n         -d{a,e}n       Origin                                  pagtadan      gazandan
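
If the compact notation is unfamiliar, it may help to see it expanded mechanically. The Python sketch below lists every surface variant that a pattern stands for; the function name and the regular expression are our own, and the code only understands the two bracket types described above.

```python
import itertools
import re

# One token is either {a,b,...} (alternation), (x) (epenthesis) or plain text.
TOKEN = re.compile(r"\{([^}]*)\}|\(([^)]*)\)|([^{}()]+)")


def expand_pattern(pattern):
    """List every surface variant encoded by a compact suffix pattern."""
    choices = []
    for alternation, epenthesis, literal in TOKEN.findall(pattern.lstrip("-")):
        if alternation:
            choices.append(alternation.split(","))
        elif epenthesis:
            choices.append(["", epenthesis])  # the segment may be absent
        else:
            choices.append([literal])
    return ["".join(parts) for parts in itertools.product(*choices)]


print(expand_pattern("-(n)d{a,e}n"))
print(expand_pattern("-n{y,i,u,ü}ň"))
```

Which of the listed variants is actually used for a given stem is exactly what the morphographemic rules have to decide.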

Full paradigm

Note: This does not include the possessive.

maşgala "family"

Case          Singular      Plural
Nominative    maşgala       maşgalalar
Genitive      maşgalanyň    maşgalalaryň
Dative        maşgala       maşgalalara
Accusative    maşgalany     maşgalalary
Inessive      maşgalada     maşgalalarda
Instrumental  maşgaladan    maşgalalardan

esger "soldier"

Case          Singular      Plural
Nominative    esger         esgerler
Genitive      esgeriň       esgerleriň
Dative        esgere        esgerlere
Accusative    esgeri        esgerleri
Inessive      esgerde       esgerlerde
Instrumental  esgerden      esgerlerden
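
Part of these paradigms can be reproduced with a few lines of Python, which is a handy sanity check before writing any lexc. The sketch below is deliberately limited: it only handles bases that end in a consonant (which covers every plural form, since -lar and -ler end in r), and it assumes the unrounded genitive variants -yň and -iň. The function names and the suffix table layout are ours.

```python
BACK_VOWELS = set("ayou")
FRONT_VOWELS = set("äeiöü")

# (back, front) variants for consonant-final bases, from the C column above.
# Assumption: the rounded genitive variants (-uň, -üň) are left out.
CASE_SUFFIXES = {
    "Nominative": ("", ""),
    "Genitive": ("yň", "iň"),
    "Dative": ("a", "e"),
    "Accusative": ("y", "i"),
    "Inessive": ("da", "de"),
    "Instrumental": ("dan", "den"),
}


def is_back(word):
    """True if the last vowel of the word is a back vowel."""
    for ch in reversed(word):
        if ch in BACK_VOWELS:
            return True
        if ch in FRONT_VOWELS:
            return False
    raise ValueError("no vowel in %r" % word)


def decline(base):
    """Attach the six case suffixes to a consonant-final base."""
    if base[-1] in BACK_VOWELS | FRONT_VOWELS:
        raise ValueError("this sketch only handles consonant-final bases")
    back = is_back(base)
    return {case: base + (b if back else f)
            for case, (b, f) in CASE_SUFFIXES.items()}


def plural_paradigm(stem):
    """Build the plural base with -lar/-ler, then decline it."""
    return decline(stem + ("lar" if is_back(stem) else "ler"))


for case, form in plural_paradigm("maşgala").items():
    print(case, form)
```

The vowel-final singular forms are left out on purpose, because the dative there does not follow from simple concatenation.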

Lexicon

So, after going through the little description above, let's start with the lexicon. The file we're going to make is called tk.lexc, and it will contain the lexicon of the transducer. So open up your text editor.

The first thing we need to define is the set of tags that we want to produce. In lttoolbox, this is done through the <sdefs> section of the .dix file.

Multichar_Symbols

<n>   ! Noun
<nom> ! Nominative
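
Declaring a tag as a multicharacter symbol means it is treated as a single symbol rather than as a sequence of characters. The Python sketch below imitates that behaviour with a simple longest-match tokeniser; it is an illustration of the idea only, not a description of how hfst-lexc is implemented.

```python
MULTICHAR_SYMBOLS = ["<n>", "<nom>"]


def tokenise(text):
    """Split text into symbols, keeping declared tags in one piece."""
    symbols = []
    i = 0
    while i < len(text):
        # Prefer the longest declared symbol starting here, else one character.
        candidates = [s for s in MULTICHAR_SYMBOLS if text.startswith(s, i)]
        symbol = max(candidates, key=len, default=text[i])
        symbols.append(symbol)
        i += len(symbol)
    return symbols


print(tokenise("esger<n><nom>"))
```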

We also need to define a Root lexicon, which is going to point to a list of stems in the lexicon NounStems. The Root lexicon is analogous to the <section id="main" type="standard"> in lttoolbox:


LEXICON Root

NounStems ;

Now let's add our two words:


LEXICON NounStems

maşgala Ninfl ; ! "family"
esger Ninfl ; ! "soldier"
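
To see how the lexicons chain together, here is a toy Python model of continuation classes. Each entry names the lexicon to visit next, and "#" ends the word. Note that Ninfl has not been defined in the text so far, so its content here is purely hypothetical: a single entry that emits the two declared tags and adds nothing to the surface form.

```python
# Each lexicon is a list of (upper, lower, continuation) entries.
LEXICONS = {
    "Root": [("", "", "NounStems")],
    "NounStems": [("esger", "esger", "Ninfl")],
    # Hypothetical placeholder: Ninfl is not defined in the text yet.
    "Ninfl": [("<n><nom>", "", "#")],
}


def walk(lexicon="Root", upper="", lower=""):
    """Yield (analysis, surface) pairs by following continuation classes."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, low, continuation in LEXICONS[lexicon]:
        yield from walk(continuation, upper + up, lower + low)


print(list(walk()))
```

Adding a noun then means adding one entry to NounStems, and every stem that points to Ninfl shares whatever inflection is defined there.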

Notes

  1. This is actually super complicated, but it will do for this didactic example.

Further reading