Difference between revisions of "Starting a new language with HFST"

From Apertium
 
Revision as of 19:38, 31 March 2011

For information on how to install HFST, see HFST

This page describes how to start a new language with HFST. There are some great references on the lexc and twol formalisms, for example the FSMBook, but many of them deal with the proprietary Xerox implementations rather than the free HFST implementation.

While the formalisms themselves are more or less identical, the commands used to compile them are not necessarily the same. HFST follows a much more Unix-like philosophy, and we're going to take advantage of that. Since most Indo-European languages and isolating languages can be dealt with fairly easily in lttoolbox, we're going to work with a language from outside that family, one whose more complex morphology isn't easily handled in lttoolbox.

Preliminaries

A morphological transducer in HFST has two principal files. One is a lexc file, which defines how the morphemes of the language are joined together (morphotactics). The other can be either a twol (two-level rules) file or an xfst (sequential rewrite rules) file; it describes the changes that happen when those morphemes are joined together (morphographemics, or morphophonology). For example:

Morphotactics: wolf<n><pl> → wolf + s
Morphographemics: wolf + s → wolves
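
To make the division of labour concrete, here is a minimal Python sketch of the same two steps. It is only a toy illustration and not how HFST works internally: the suffix table, the function names and the <sg> tag are assumptions made up for this example, and the single spelling rule covers nothing beyond the wolf case.

```python
import re

# Toy suffix table (hypothetical): maps a tag sequence to the morphemes it adds.
# <n> and <pl> come from the example above; <sg> is an assumed tag for illustration.
SUFFIXES = {
    "<n><pl>": " + s",
    "<n><sg>": "",
}


def morphotactics(analysis: str) -> str:
    """Join morphemes: 'wolf<n><pl>' becomes 'wolf + s'."""
    match = re.fullmatch(r"([^<]+)((?:<[^>]+>)+)", analysis)
    if match is None:
        raise ValueError(f"not a stem followed by tags: {analysis!r}")
    stem, tags = match.groups()
    return stem + SUFFIXES[tags]


def morphographemics(intermediate: str) -> str:
    """Apply spelling changes at the morpheme boundary: 'wolf + s' becomes 'wolves'."""
    # Toy rule: a stem-final f turns into ve before the plural s.
    changed = re.sub(r"f \+ s$", "ves", intermediate)
    # Any boundary that triggered no rule is simply removed.
    return changed.replace(" + ", "")


def generate(analysis: str) -> str:
    """Chain the two steps, the way the lexicon and the rules are combined."""
    return morphographemics(morphotactics(analysis))


print(generate("wolf<n><pl>"))  # wolves
```

The point of the sketch is only that the first step knows nothing about spelling and the second step knows nothing about tags. In HFST the same split is expressed by keeping the lexc and twol files separate.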

Here we're going to deal with twol, the two-level rules. If you're interested in xfst, there is a nice tutorial on the Foma site: http://foma.sourceforge.net/dokuwiki/doku.php?id=wiki:morphtutorial

In the next sections we're going to start with the lexicon (lexc file) and then move on to the morphographemics (twol file).

The language

The language we're going to model today (or at least start to model) is Turkmen, a Turkic language spoken in Turkmenistan. We're going to try to model the basic inflection of nouns.

Turkmen nouns inflect for six cases, two numbers, and possession. The cases are:

{|
! Case !! Suffix !! Usage
|-
| Nominative || || Indicates the subject of the sentence
|-
| Genitive || || Indicates possession
|-
| Dative || || Indirect object (directed action)
|-
| Accusative || || Direct object
|-
| Inessive || || Time/place
|-
| Instrumental || || Origin
|}


Lexicon