Difference between revisions of "Starting a new language with HFST"
Revision as of 21:02, 31 March 2011
- For information on how to install HFST, see HFST
This page is going to describe how to start a new language with HFST. There are some great references out there to the lexc and twol formalisms, for example the FSMBook, but a lot of them deal with the proprietary Xerox implementations, not the free HFST implementation.
While the actual formalisms are more or less identical, the commands used to compile them are not necessarily the same. HFST has a much more Unix-compatible philosophy, and we're going to take advantage of this. As most Indo-European languages and isolating languages can be dealt with fairly easily in lttoolbox, we're going to deal with a language that is not Indo-European, and one that has more complex morphology that isn't easily dealt with in lttoolbox.
Preliminaries
A morphological transducer in HFST has two principal files. One is a lexc file, which defines how morphemes in the language are joined together (morphotactics). The other can be a twol (two-level rules) or xfst (sequential rewrite rules) file, which describes what changes happen when these morphemes are joined together (morphographemics, or morphophonology). For example,
- Morphotactics: wolf<n><pl> → wolf + s
- Morphographemics: wolf + s → wolves
Here we're going to deal with twol, the two-level rules. If you're interested in xfst, there is a nice tutorial on the Foma site.
In the next sections we're going to start with the lexicon (lexc file) then progress onto the morphographemics (twol file).
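The split between the two stages can be sketched in a few lines of Python. This is only a toy illustration of the wolf example above: the lookup table and the `morphographemics` function are hand-written stand-ins invented for this sketch, not anything produced by lexc or twol.

```python
# Toy two-stage pipeline for the wolf example.
# Both stages are hand-written stand-ins, not output from lexc or twol.

# Stage 1, morphotactics: an analysis is mapped to a sequence of morphemes.
MORPHOTACTICS = {
    "wolf<n><pl>": ["wolf", "s"],
}


def morphographemics(morphemes):
    """Stage 2: join the morphemes, applying the f -> ves spelling change."""
    stem, *suffixes = morphemes
    if suffixes == ["s"] and stem.endswith("f"):
        return stem[:-1] + "ves"
    return stem + "".join(suffixes)


def generate(analysis):
    """Run both stages, one after the other."""
    return morphographemics(MORPHOTACTICS[analysis])


print(generate("wolf<n><pl>"))
```

The point is only that the first stage knows nothing about spelling, and the second stage knows nothing about tags.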
The language
The language we're going to model today (well, start to model) is Turkmen, a Turkic language spoken in Turkmenistan. We're going to try and model the basic inflection (number, case) of the category of nouns. The basic inflection for Turkmen nouns is: six cases, two numbers, and possessive. Suffixes can have different forms depending on whether they are attached to a vowel-final stem or a consonant-final stem.
Vowel harmony
Simplifying a lot,[1] we can say that stems in Turkmen can be one of two types, back-vowel stems, or front-vowel stems. Back-vowel stems, such as mugallym "teacher" only have back vowels, and front-vowel stems, such as kädi "pumpkin" have only front vowels. The back vowels in Turkmen are: a, y, o, and u. The front vowels are: ä, e, i, ö, and ü.
So, when adding a suffix to a stem, we need to know what vowels are in the stem in order to choose the right vowel to put in the suffix.
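As a rough sketch of the simplified picture above, a stem can be classified by looking at which vowel set its vowels come from. The function name `harmony_class` and the "mixed" and "none" labels are made up for this illustration, and the code assumes precomposed characters (ä, ö, ü as single code points).

```python
# Vowel sets exactly as listed in the text above.
BACK_VOWELS = set("ayou")
FRONT_VOWELS = set("äeiöü")


def harmony_class(stem):
    """Classify a stem as 'back', 'front', 'mixed' or 'none' (no vowels)."""
    vowels = [ch for ch in stem.lower() if ch in BACK_VOWELS | FRONT_VOWELS]
    if not vowels:
        return "none"
    if all(v in BACK_VOWELS for v in vowels):
        return "back"
    if all(v in FRONT_VOWELS for v in vowels):
        return "front"
    return "mixed"


for stem in ["mugallym", "kädi", "maşgala", "esger"]:
    print(stem, harmony_class(stem))
```

The "mixed" case is there because real stems do not always behave as neatly as the simplification suggests.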
Number
Number in Turkmen can either be undefined (where there is no suffix) or plural, where the suffix is -lar or -ler. The first is used with back-vowel stems, and the second with front-vowel stems.
Case
We use a more compact representation below to show the suffixes for case. In between { and } are vowel alternations in the suffixes, and in between ( and ) are epentheses.
Case | Suffix (V) | Suffix (C) | Usage | Example (V) | Example (C)
---|---|---|---|---|---
Nominative | | | Indicates the subject of the sentence | pagta | gazan
Genitive | -n{y,i,u,ü}ň | -{y,i,u,ü}ň | Indicates possession | pagtanyň | gazanyň
Dative | -{a,ä} , -n{a,e} | -{a,e} | Indirect object (directed action) | pagta | gazana
Accusative | -n{y,i} | -{y,i} | Direct object | pagtany | gazany
Inessive | -(n)d{a,e} | -d{a,e} | Time/place | pagtada | gazanda
Instrumental | -(n)d{a,e}n | -d{a,e}n | Origin | pagtadan | gazandan
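To make the compact notation concrete, here is a small sketch that expands a pattern into every suffix it stands for. The function `expand` is a hypothetical helper written for this page; it only understands the two conventions described above ({...} for alternations, (...) for epenthesis).

```python
import itertools
import re

# One token is either {a,b,...}, (x), or a run of plain characters.
TOKEN = re.compile(r"\{([^}]*)\}|\(([^)]*)\)|([^{}()]+)")


def expand(pattern):
    """List every concrete suffix a compact pattern stands for."""
    choices = []
    for alternation, optional, literal in TOKEN.findall(pattern.lstrip("-")):
        if alternation:
            choices.append(alternation.split(","))  # pick one vowel
        elif optional:
            choices.append(["", optional])  # epenthetic segment absent or present
        else:
            choices.append([literal])
    return ["".join(parts) for parts in itertools.product(*choices)]


print(expand("-(n)d{a,e}n"))
print(expand("-n{y,i,u,ü}ň"))
```

The expansion only lists the candidates; choosing between them still depends on the stem, which is the job of vowel harmony.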
Full paradigm
Note: This does not include the possessive.
maşgala "family" | ||
---|---|---|
Case | Singular | Plural |
Nominative | maşgala | maşgalalar |
Genitive | maşgalanyň | maşgalalaryň |
Dative | maşgala | maşgalalara |
Accusative | maşgalany | maşgalalary |
Inessive | maşgalada | maşgalalarda |
Instrumental | maşgaladan | maşgalalardan |
esger "soldier" | ||
---|---|---|
Case | Singular | Plural |
Nominative | esger | esgerler |
Genitive | esgeriň | esgerleriň |
Dative | esgere | esgerlere |
Accusative | esgeri | esgerleri |
Inessive | esgerde | esgerlerde |
Instrumental | esgerden | esgerlerden |
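The two paradigms above can be reproduced with a short sketch, which is handy for checking the tables by hand. It is deliberately simplified: it assumes harmony is decided by the last vowel, ignores rounding (so only y/i are used, never u/ü), ignores possessives, and leaves vowel-final stems unchanged in the dative, as the tables show. The names `is_back` and `decline` are invented for this illustration.

```python
BACK_VOWELS = set("ayou")
VOWELS = BACK_VOWELS | set("äeiöü")


def is_back(form):
    """Decide harmony from the last vowel (assumes the form contains a vowel)."""
    last_vowel = [ch for ch in form if ch in VOWELS][-1]
    return last_vowel in BACK_VOWELS


def decline(stem, plural=False):
    """Return {case: form} for a noun stem, under the simplifications above."""
    base = stem
    if plural:
        base += "lar" if is_back(stem) else "ler"
    back = is_back(base)
    a = "a" if back else "e"  # low vowel in the suffix
    i = "y" if back else "i"  # high vowel in the suffix (rounding ignored)
    vowel_final = base[-1] in VOWELS
    n = "n" if vowel_final else ""  # the n that appears after vowel-final stems
    return {
        "Nominative": base,
        "Genitive": base + n + i + "ň",
        "Dative": base if vowel_final else base + a,
        "Accusative": base + n + i,
        "Inessive": base + "d" + a,
        "Instrumental": base + "d" + a + "n",
    }


for case, form in decline("esger", plural=True).items():
    print(case, form)
```

None of this is how the transducer will work; it is just a quick way to see the pattern the lexc and twol files need to capture.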
Lexicon
So, after going through the little description above, let's start with the lexicon. The file we're going to make is called tk.lexc, and it will contain the lexicon of the transducer. So open up your text editor.
The basics
The first thing we need to define are the tags that we want to produce. In lttoolbox, this is done through the <sdefs> section of the .dix file.
Multichar_Symbols
%<n%> ! Noun
%<nom%> ! Nominative
%<pl%> ! Plural
The symbols < and > are reserved in lexc, so we need to escape them with %.
We also need to define a Root lexicon, which is going to point to a list of stems in the lexicon NounStems. The Root lexicon is analogous to the <section id="main" type="standard"> in lttoolbox:
LEXICON Root
NounStems ;
Now let's add our two words:
LEXICON NounStems
maşgala Ninfl ; ! "family"
esger Ninfl ; ! "soldier"
First we put the stem, then we put the paradigm (or continuation class) that it belongs to, in this case Ninfl, and finally, in a comment (the comment symbol is !) we put the translation.
And then we define the most basic inflection, that is, tagging the bare stem with <n> to indicate a noun:
LEXICON Ninfl
%<n%>: # ;
This LEXICON should go before the NounStems lexicon. The # symbol is the end-of-word boundary. It is very important to have this, as it tells the transducer where to stop.
Compiling
So, now we've got our basic lexicon, let's compile it and test it. We compile with hfst-lexc:
$ hfst-lexc tk.lexc > tk.hfst
And we can test it with hfst-fst2strings:
$ hfst-fst2strings tk.hfst
maşgala<n>:maşgala
esger<n>:esger
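If it isn't obvious where those two pairs come from, the following toy model may help. It walks the three lexica from Root until it hits #, gluing together the left and right sides of each entry along the way. This is only a mental model written for this page: the `LEXICONS` dictionary and the `paths` function are assumptions of the sketch, and HFST itself compiles the file into a finite-state transducer rather than walking it like this.

```python
# Each lexicon is a list of (analysis side, surface side, continuation) entries.
# A continuation of "#" marks the end of the word, as in lexc.
LEXICONS = {
    "Root": [("", "", "NounStems")],
    "NounStems": [
        ("maşgala", "maşgala", "Ninfl"),  # "family"
        ("esger", "esger", "Ninfl"),  # "soldier"
    ],
    "Ninfl": [("<n>", "", "#")],
}


def paths(lexicon="Root", analysis="", surface=""):
    """Yield every analysis:surface pair reachable from the given lexicon."""
    if lexicon == "#":
        yield f"{analysis}:{surface}"
        return
    for upper, lower, continuation in LEXICONS[lexicon]:
        yield from paths(continuation, analysis + upper, surface + lower)


for pair in paths():
    print(pair)
```

Adding another entry to the "Ninfl" list in this model corresponds to adding another line to LEXICON Ninfl in the lexc file.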
Continuation lexica
So, we've managed to describe that maşgala and esger are nouns, but what about the inflection? This is where continuation lexica come in. These are like paradigms in lttoolbox.
The basic morphotactics of the Turkmen noun is:
- stem plural? possessive? case
Where ? denotes optionality. We're just working with number and case here, so let's describe the inflection. First we can start with number. In the section of the file LEXICON Ninfl, add the following line:
%<n%>%<pl%>:%>l%{A%}r # ;
Phew, that looks pretty complicated!! Well, perhaps, but each part has its reason, let's describe them:
Part | Description
---|---
%<n%>%<pl%> | The part on the left side defines the analysis, in this case noun, plural. Note that this is in contrast to lttoolbox, where the analysis is usually on the right side.
: | The symbol : delimits the left and right sides (or lexical side and surface side).
%>l%{A%}r | This is the surface form, which is split into:
%> | The morpheme boundary delimiter (we'll talk about this later, but you put it in between morphemes where changes might happen).
l%{A%}r | The surface morpheme, in this case -lar or -ler.
%{A%} | An "archivowel", a placeholder for a vowel that can be either a or e.
# | The end-of-word boundary.
; | End of line.
Part of the reason it looks complicated is all of the % symbols. If we remove them it looks far more readable:
<n><pl>:>l{A}r # ;
(You need to have them though)
So, we've added the first of our inflections, the plural. We need to do two things before we can test it. First we need to add %{A%} to the Multichar_Symbols section of the file, so scroll to the top and add it. You should get something like:
Multichar_Symbols
%<n%> ! Noun
%<nom%> ! Nominative
%<pl%> ! Plural
%{A%} ! Archivowel 'a' or 'e'
Now save the file. The next thing we need to do is compile again:
$ hfst-lexc tk.lexc > tk.hfst
And then we can test:
$ hfst-fst2strings tk.hfst
maşgala<n><pl>:maşgala>l{A}r
maşgala<n>:maşgala
esger<n><pl>:esger>l{A}r
esger<n>:esger
Ok, so this is cool, but it also kind of sucks: these aren't real surface forms. We'll never see esger>l{A}r in any text. The surface form we're looking for is esgerler. So how do we get that?
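Before looking at how this is done properly, it may help to spell out what has to happen to the string. Two things: the morpheme boundary > has to disappear, and {A} has to become a or e depending on the vowels before it. A minimal Python sketch of that, assuming harmony is decided by the nearest preceding vowel (the function `realise` is invented for this illustration and is not how two-level rules are written):

```python
import re

BACK_VOWELS = set("ayou")
VOWELS = BACK_VOWELS | set("äeiöü")


def realise(intermediate):
    """Turn a form like 'esger>l{A}r' into a surface form like 'esgerler'."""
    output = []
    # Treat {A} as one symbol; everything else is read character by character.
    for symbol in re.findall(r"\{A\}|.", intermediate):
        if symbol == ">":
            continue  # the morpheme boundary is never written
        if symbol == "{A}":
            preceding = [ch for ch in output if ch in VOWELS]
            is_back = bool(preceding) and preceding[-1] in BACK_VOWELS
            symbol = "a" if is_back else "e"
        output.append(symbol)
    return "".join(output)


print(realise("esger>l{A}r"))
print(realise("maşgala>l{A}r"))
```
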
Enter twolc
Notes
- [1] This is actually supercomplicated, but for this didactic example, it'll do.
Further reading