Starting a new language with HFST (Jaunas valodas uzsakšana ar HFST)

{{TOCD}}
:''For information on how to install HFST, see [[HFST]].''

This page describes how to start a new language with [[HFST]]. There are some great references for the [[lexc]] and [[twol]] formalisms, for example the [http://www.fsmbook.com FSMBook], but many of them deal with the proprietary Xerox implementation, not the free HFST implementation.

While the formalisms themselves are more or less identical, the commands used to compile them are not always the same. HFST is much more in line with the Unix philosophy, so that is what we will use. Since most Indo-European languages and isolating languages can be handled easily with [[lttoolbox]], we will work with a language that is not from that family and that is morphologically more complex, which makes it hard to handle with [[lttoolbox]].

==Preliminaries==

A morphological transducer in HFST has two principal files. One is a <code>lexc</code> file, which defines how the morphemes of the language are joined together, the ''morphotactics''. The other can be a <code>twol</code> (two-level rules) or <code>xfst</code> (sequential rewrite rules) file. These files describe what changes happen when the morphemes are joined together, the ''morphographemics'' (or ''morphophonology''). For example,

:Morphotactics: <code>wolf<n><pl></code> → <code>wolf + s</code>
:Morphographemics: <code>wolf + s</code> → <code>wolves</code>

Here we will work with <code>twol</code>, two-level rules. If you are interested in the <code>xfst</code> approach, there is a nice [http://foma.sourceforge.net/dokuwiki/doku.php?id=wiki:morphtutorial tutorial] on the [[Foma]] site.

In the next section we will start with the lexicon (the <code>lexc</code> file) and then move on to the morphographemics (the <code>twol</code> file).

Make sure that you have [[Hfst#Compiling_HFST3|HFST3]] compiled.

==The language==

The language we are going to model today is Turkmen, a Turkic language spoken in Turkmenistan. The language pair we will be working with is Turkish--Turkmen. We are going to model the basic inflection (number, case) of nouns. The basic inflection of Turkmen nouns is: six cases, two numbers, and possession. Suffixes can have different forms depending on whether they are attached to a stem ending in a vowel or a stem ending in a consonant.

===Vowel harmony===

Simplifying a lot,<ref>This is actually much more complicated, but for this pedagogical example it will do.</ref> we can say that the stem of a Turkmen word can be one of two types: a back-vowel stem or a front-vowel stem. A back-vowel stem, as in ''mugallym'' "teacher", has only back vowels, and a front-vowel stem, as in ''kädi'' "pumpkin", has only front vowels. The back vowels in Turkmen are ''a, y, o,'' and ''u''. The front vowels are ''ä, e, i, ö,'' and ''ü''.

So, when we attach a suffix to a stem, we need to know which kind of vowels the stem has in order to choose the correct vowel to put in the suffix.

===Number===

Number in Turkmen can either be undefined (where there is no suffix) or plural, where the suffix is ''-lar'' or ''-ler''. The first is used with back vowels, and the second with front vowels.
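
As a rough illustration of the plural rule, here is a minimal Python sketch. It is not part of HFST, and the function names are made up for this example. It assumes lower-case or simply cased input and, as a simplification, decides harmony from the last vowel of the stem only.

```python
# Vowel sets as listed in the vowel harmony section.
BACK_VOWELS = set("ayou")
FRONT_VOWELS = set("äeiöü")


def has_front_harmony(stem):
    """Return True if the last vowel of the stem is a front vowel.

    Looking only at the last vowel is an assumption of this sketch.
    """
    for ch in reversed(stem.lower()):
        if ch in FRONT_VOWELS:
            return True
        if ch in BACK_VOWELS:
            return False
    raise ValueError(f"no vowel found in {stem!r}")


def plural(stem):
    """Attach -ler to front-vowel stems and -lar to back-vowel stems."""
    return stem + ("ler" if has_front_harmony(stem) else "lar")
```

With this sketch, <code>plural("maşgala")</code> gives ''maşgalalar'' and <code>plural("esger")</code> gives ''esgerler''.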

===Case===

We use a more compact representation below to show the suffixes for case. In between ''{'' and ''}'' are vowel alternations in the suffixes, and in between ''('' and '')'' are [http://en.wikipedia.org/wiki/Epenthetic epentheses].

{|class=wikitable
! Case !!colspan=2| Suffix !! Usage !!colspan=2| Example
|-
! !! V !! C !! !! V !! C
|-
| Nominative || || || Indicates the subject of the sentence || pagta || gazan
|-
| Genitive || ''-n{y,i,u,ü}ň'' || ''-{y,i,u,ü}ň'' || Indicates possession || pagta<u>nyň</u> || gazan<u>yň</u>
|-
| Dative || ''-{a,ä} , -n{a,e}'' || ''-{a,e}'' || Indirect object (directed action) || pagta || gazan<u>a</u>
|-
| Accusative || ''-n{y,i}'' || ''-{y,i}'' || Direct object || pagta<u>ny</u> || gazan<u>y</u>
|-
| Inessive || ''-(n)d{a,e}'' || ''-d{a,e}'' || Time/place || pagta<u>da</u> || gazan<u>da</u>
|-
| Instrumental || ''-(n)d{a,e}n'' || ''-d{a,e}n'' || Origin || pagta<u>dan</u> || gazan<u>dan</u>
|-
|}

===Full paradigm===

Note: This does not include the possessive.

{|class=wikitable
!colspan=3|''maşgala'' "family"
|-
! Case !! Singular !! Plural
|-
| '''Nominative''' || maşgala || maşgalalar
|-
| '''Genitive''' || maşgalanyň || maşgalalaryň
|-
| '''Dative''' || maşgala || maşgalalara
|-
| '''Accusative''' || maşgalany || maşgalalary
|-
| '''Inessive''' || maşgalada || maşgalalarda
|-
| '''Instrumental''' || maşgaladan || maşgalalardan
|-
|}

{|class=wikitable
!colspan=3|''esger'' "soldier"
|-
! Case !! Singular !! Plural
|-
| '''Nominative''' || esger || esgerler
|-
| '''Genitive''' || esgeriň || esgerleriň
|-
| '''Dative''' || esgere || esgerlere
|-
| '''Accusative''' || esgeri || esgerleri
|-
| '''Inessive''' || esgerde || esgerlerde
|-
| '''Instrumental''' || esgerden || esgerlerden
|-
|}
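
The paradigms above can be reproduced with a small Python sketch of the suffix table. This is only an illustration under stated assumptions: the names (<code>CASE_SUFFIXES</code>, <code>inflect</code>) are invented here, the rounded variants ''u/ü'' of the genitive are ignored, the epenthetic ''(n)'' is left out, and the dative of vowel-final stems is not modelled.

```python
BACK_VOWELS = set("ayou")
FRONT_VOWELS = set("äeiöü")
VOWELS = BACK_VOWELS | FRONT_VOWELS

# Suffix templates as (after a vowel, after a consonant).
# {A} stands for a/e and {I} for y/i; rounded u/ü are ignored in this sketch.
CASE_SUFFIXES = {
    "nom": ("", ""),
    "gen": ("n{I}ň", "{I}ň"),
    "dat": (None, "{A}"),  # vowel-final dative is not modelled here
    "acc": ("n{I}", "{I}"),
    "ine": ("d{A}", "d{A}"),
    "ins": ("d{A}n", "d{A}n"),
}


def is_front(stem):
    """Decide harmony from the last vowel of the stem (a simplification)."""
    for ch in reversed(stem.lower()):
        if ch in FRONT_VOWELS:
            return True
        if ch in BACK_VOWELS:
            return False
    raise ValueError(f"no vowel found in {stem!r}")


def inflect(stem, case, plural=False):
    """Return the inflected form of a noun stem for the given case."""
    if plural:
        stem += "ler" if is_front(stem) else "lar"
    after_vowel, after_consonant = CASE_SUFFIXES[case]
    template = after_vowel if stem[-1].lower() in VOWELS else after_consonant
    if template is None:
        raise NotImplementedError(f"{case} after a vowel is not modelled")
    front = is_front(stem)
    suffix = template.replace("{A}", "e" if front else "a")
    suffix = suffix.replace("{I}", "i" if front else "y")
    return stem + suffix
```

For example, <code>inflect("esger", "gen", plural=True)</code> yields ''esgerleriň'' and <code>inflect("maşgala", "acc")</code> yields ''maşgalany'', matching the tables.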

==Lexicon==

So, after going through the little description above, let's start with the lexicon. The file we're going to make is called <code>apertium-tr-tk.tk.lexc</code>, and it will contain the lexicon of the transducer. So open up your text editor.

===The basics===

The first thing we need to define are the tags that we want to produce. In [[lttoolbox]], this is done through the <code><sdefs></code> section of the <code>.dix</code> file.

<pre>
Multichar_Symbols

%<n%> ! Noun
%<nom%> ! Nominative
%<pl%> ! Plural
</pre>

The symbols <code>&lt;</code> and <code>&gt;</code> are reserved in <code>lexc</code>, so we need to escape them with <code>%</code>.

We also need to define a <code>Root</code> lexicon, which is going to point to a list of stems in the lexicon <code>NounStems</code>. The <code>Root</code> lexicon is analogous to the <code><section id="main" type="standard"></code> in [[lttoolbox]]:

<pre>

LEXICON Root

NounStems ;

</pre>

Now let's add our two words:

<pre>
LEXICON NounStems

maşgala Ninfl ; ! "family"
esger Ninfl ; ! "soldier"
</pre>

First we put the stem, then we put the ''paradigm'' (or ''continuation class'') that it belongs to, in this case <code>Ninfl</code>, and finally, in a comment (the comment symbol is <code>!</code>) we put the translation.

And define the most basic of inflection, that is, tagging the bare stem with <code><n></code> to indicate a noun:

<pre>
LEXICON Ninfl

%<n%>: # ;
</pre>

This <code>LEXICON</code> should go ''before'' the <code>NounStems</code> lexicon. The <code>#</code> symbol is the end-of-word boundary. It is very important to have this, as it tells the transducer where to stop.
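
To see how the lexica chain together, here is a toy Python model of the file so far. It is only a sketch of the idea, not how <code>hfst-lexc</code> is implemented; the <code>LEXICA</code> dictionary and the <code>expand</code> function are hypothetical names for this example.

```python
# Each lexicon maps to a list of (analysis, surface, continuation) entries.
# "#" is the end-of-word boundary, as in lexc.
LEXICA = {
    "Root": [("", "", "NounStems")],
    "NounStems": [
        ("maşgala", "maşgala", "Ninfl"),  # "family"
        ("esger", "esger", "Ninfl"),      # "soldier"
    ],
    "Ninfl": [("<n>", "", "#")],
}


def expand(lexicon="Root", analysis="", surface=""):
    """Yield every (analysis, surface) pair reachable from the given lexicon."""
    if lexicon == "#":
        yield analysis, surface
        return
    for upper, lower, continuation in LEXICA[lexicon]:
        yield from expand(continuation, analysis + upper, surface + lower)
```

Calling <code>list(expand())</code> walks from <code>Root</code> through <code>NounStems</code> and <code>Ninfl</code> to <code>#</code>, giving the pairs <code>maşgala<n>:maşgala</code> and <code>esger<n>:esger</code>.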

===Compiling===

So, now we've got our basic lexicon, let's compile it and test it. We compile with <code>hfst-lexc</code>:

<pre>
$ hfst-lexc apertium-tr-tk.tk.lexc -o tk-tr.lexc.hfst
</pre>

(If you do not have <code>hfst-lexc</code> installed, you have a problem -- probably you need to compile with <code>--enable-lexc</code>, but in the meantime you can use <code>hfst-lexc2fst</code> in place of <code>hfst-lexc</code>)

And we can test it with <code>hfst-fst2strings</code>:

<pre>
$ hfst-fst2strings tk-tr.lexc.hfst
maşgala<n>:maşgala
esger<n>:esger
</pre>

===Continuation lexica===

So, we've managed to describe that ''maşgala'' and ''esger'' are nouns, but what about the inflection? This is where ''continuation lexica'' come in. These are like ''paradigms'' in [[lttoolbox]].

The basic morphotactics of the Turkmen noun is:

:{{sc|stem}} {{sc|plural?}} {{sc|possessive?}} {{sc|case}} {{sc|copula?}}

Where <code>?</code> denotes optionality. We're just working with number and case here, so let's describe the inflection, first we can start with number. In the section of the file <code>LEXICON Ninfl</code>, add the following line:

<pre>
%<n%>%<pl%>:%>l%{A%}r # ;
</pre>

Phew, that looks pretty complicated!! Well, perhaps, but each part has its reason. Let's describe them:

{|class=wikitable
! Part !! Description
|-
| <code>%&lt;n%&gt;%&lt;pl%&gt;</code> || The part on the left side defines the analysis, in this case noun, plural. Note, this is in contrast to lttoolbox, where the analysis is usually on the right side.
|-
| <code>:</code> || The symbol <code>:</code> delimits the left and right sides (the lexical side and the surface side).
|-
| <code>%&gt;l%{A%}r</code> || This is the surface form, which is split into:
|-
|&nbsp;&nbsp;&nbsp; <code>%&gt;</code> || The morpheme boundary delimiter (we'll talk about this later, but you put it between morphemes where changes might happen).
|-
|&nbsp;&nbsp;&nbsp; <code>l%{A%}r</code> || The surface morpheme, in this case ''-lar'' or ''-ler''
|-
|&nbsp;&nbsp;&nbsp; <code>%{A%}</code> || An "archivowel"... a placeholder for a vowel that can be either ''a'' or ''e''
|-
| <code>#</code> || The end of word boundary
|-
| <code>;</code> || End of line
|-
|}

Part of the reason it looks complicated is all of the <code>%</code> symbols. If we remove them it looks far more readable:

<pre>
<n><pl>:>l{A}r # ;
</pre>

(You need to have them though)

For comparison, in lttoolbox (using <code>·</code> for the morpheme boundary and <code>&lt;s n="A"/&gt;</code> for the <code>{A}</code>), this would look something like:

<pre>
<e><p><l>·l<s n="A"/>r</l><r><s n="n"/><s n="pl"/></r></p></e>
</pre>

So, we've added the first of our inflections, the plural. We need to do two things before we can test it. First we need to add <code>%{A%}</code> to the <code>Multichar_Symbols</code> section of the file, so scroll to the top and add it, you should get something like:

<pre>
Multichar_Symbols

%<n%> ! Noun
%<nom%> ! Nominative
%<pl%> ! Plural

%{A%} ! Archivowel 'a' or 'e'
</pre>

Now save the file. The next thing we need to do is compile again:

<pre>
$ hfst-lexc apertium-tr-tk.tk.lexc > tk-tr.lexc.hfst
</pre>

And then we can test:

<pre>
$ hfst-fst2strings tk-tr.lexc.hfst
maşgala<n><pl>:maşgala>l{A}r
maşgala<n>:maşgala
esger<n><pl>:esger>l{A}r
esger<n>:esger
</pre>

Ok, so this is cool, but it also kind of sucks: these aren't real surface forms. We'll never see ''esger>l{A}r'' in any text. The surface form we're looking for is ''esgerler''. So how do we get that?

==Enter <code>twol</code>==

The idea of <code>twol</code> is to take the surface forms produced by lexc and apply rules to them to change them into real surface forms. So, this is where we change ''-l{A}r'' into ''-lar'' or ''-ler''.

What we basically want to say is "if the stem contains front vowels, then we want the front vowel alternation, if it contains back vowels then we want the back vowel alternation". And at the same time, remove the morpheme boundary. So let's give it a shot.

We're going to make a new file <code>apertium-tr-tk.tk.twol</code>.

First we need to define the alphabet:

<pre>
Alphabet
A B Ç D E Ä F G H I J Ž K L M N Ň O Ö P R S Ş T U Ü W Y Ý Z
a b ç d e ä f g h i j ž k l m n ň o ö p r s ş t u ü w y ý z
%{A%}:a ;
</pre>

You don't have to define the upper and lower case on separate lines, but it can help make it clearer.

We also want to define at this point that, whatever happens, we want to remove the morpheme boundaries <code>%&gt;</code> from the surface forms, so add the following pair after the lower case letters and before the <code>;</code>:

<pre>
%>:0
</pre>

Here, the left side is the morphotactic form, and the right side is the surface form. Doing <code>%&gt;:0</code> changes <code>%&gt;</code> into <code>0</code>, which is the same as deleting it. The <code>0</code> symbol is not output.

So, the final alphabet section will look like this:

<pre>
Alphabet
A B Ç D E Ä F G H I J Ž K L M N Ň O Ö P R S Ş T U Ü W Y Ý Z
a b ç d e ä f g h i j ž k l m n ň o ö p r s ş t u ü w y ý z
%{A%}:a %>:0 ;
</pre>

Next we need to define some "sets" to work with, these are basically for giving mnemonics to features, like "front vowel" and "back vowel" which we want to refer to later in the rules:

<pre>
Sets

Consonant = B Ç D F G H J Ž K L M N Ň P R S Ş T W Z
b ç d f g h j ž k l m n ň p r s ş t w z ;
Vowel = A E Ä I O Ö U Ü Y Ý
a e ä i o ö u ü y ý ;
FrontVowel = Ä E I Ö Ü ä e i ö ü ;
BackVowel = A Y O U a y o u ;
NonBack = Consonant FrontVowel %> ;
NonFront = Consonant BackVowel %> ;
</pre>

So now we've got everything set up, to add the rule, there is a new section, <code>Rules</code>:

<pre>
Rules
"Front harmony in suffixes"
%{A%}:e <=> FrontVowel: NonBack:* %>: NonBack:* _ ;

</pre>

The rule is basically saying: "Substitute {A} with e if the previous letters are anything except back vowels, then there is a morpheme boundary, then there are no back vowels, and at some point there is a front vowel"
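
To make the context of the rule more concrete, here is a Python sketch that mimics it with a regular expression over the lexical string. This is only an approximation for illustration: real two-level rules are compiled into transducers, the names <code>FRONT_CONTEXT</code> and <code>realise</code> are invented here, and only lower-case letters are handled.

```python
import re

# Lower-case members of the sets defined in the Sets section.
CONSONANTS = "bçdfghjžklmnňprsştwz"
FRONT_VOWELS = "äeiöü"
NON_BACK = CONSONANTS + FRONT_VOWELS + ">"

# Left context of the rule: FrontVowel: NonBack:* %>: NonBack:* _
FRONT_CONTEXT = re.compile(f"[{FRONT_VOWELS}][{NON_BACK}]*>[{NON_BACK}]*$")


def realise(lexical):
    """Turn a lexc output such as 'esger>l{A}r' into a surface form."""
    surface = []
    left = ""  # lexical symbols seen so far, used as the left context
    for symbol in re.findall(r"\{A\}|.", lexical):
        if symbol == "{A}":
            # {A}:e in the front context, otherwise the default {A}:a
            surface.append("e" if FRONT_CONTEXT.search(left) else "a")
        elif symbol != ">":
            # ordinary letters map to themselves; ">" maps to 0 (deleted)
            surface.append(symbol)
        left += symbol
    return "".join(surface)
```

Here <code>realise("esger>l{A}r")</code> gives ''esgerler'', while <code>realise("maşgala>l{A}r")</code> falls back to the default and gives ''maşgalalar''.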

Next up, to compile the rule and test it:

<pre>
$ hfst-twolc -R -i apertium-tr-tk.tk.twol -o tk-tr.twol.hfst
Reading input from apertium-tr-tk.tk.twol.
Writing output to tk-tr.twol.hfst.
Reading alphabet.
Reading sets.
Reading rules and compiling their contexts and centers.
Compiling and storing rules.
Compiling rules.
Storing rules.
</pre>

===With the power of intersecting composition!===

In order to get the final transducer, what we need to do is combine the morphotactic model (<code>lexc</code>) with the morphographemic model (<code>twol</code>). There is a way of doing this called "intersecting composition" which is fairly efficient. There is also a tool in HFST called <code>hfst-compose-intersect</code> which is what we'll be using.

<pre>
$ hfst-compose-intersect -1 tk-tr.lexc.hfst -2 tk-tr.twol.hfst -o tr-tk.autogen.hfst
</pre>

Now we can test the final transducer:

<pre>
$ hfst-fst2strings tr-tk.autogen.hfst
maşgala<n>:maşgala
maşgala<n><pl>:maşgalalar
esger<n>:esger
esger<n><pl>:esgerler
</pre>

Great!! We have the desired forms.

==Analysis and generation==

The transducer we made above was for generation, but we can't yet use it with <code>hfst-proc</code> because of the format. If we want to use it with <code>hfst-proc</code>, all we need to do is change the format, with the following command:

<pre>
$ hfst-fst2fst -O -i tr-tk.autogen.hfst -o tr-tk.autogen.hfst.ol
</pre>

Now we should be able to generate both of our plurals:

<pre>
$ echo "^maşgala<n><pl>$" | hfst-proc -g tr-tk.autogen.hfst.ol
maşgalalar
</pre>

and

<pre>
$ echo "^esger<n><pl>$" | hfst-proc -g tr-tk.autogen.hfst.ol
esgerler
</pre>

But what if we want to analyse some words? Well, then we need to ''invert'' the transducer. This swaps the left side and the right side. Let's do it in two stages so we can see the results:

<pre>
$ hfst-invert -i tr-tk.autogen.hfst -o tk-tr.automorf.hfst

$ hfst-fst2strings tk-tr.automorf.hfst
maşgala:maşgala<n>
maşgalalar:maşgala<n><pl>
esger:esger<n>
esgerler:esger<n><pl>
</pre>

As we can see, now the left side is the surface form, and the right side the analysis. Now just to convert the analyser to ''optimised lookup'' format:

<pre>
$ hfst-fst2fst -O -i tk-tr.automorf.hfst -o tk-tr.automorf.hfst.ol
</pre>

And do some analysis:

<pre>
$ echo "maşgalalar" | hfst-proc tk-tr.automorf.hfst.ol
^maşgalalar/maşgala<n><pl>$

$ echo "esgerler" | hfst-proc tk-tr.automorf.hfst.ol
^esgerler/esger<n><pl>$
</pre>
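
The relationship between the generator and the analyser can be pictured with a small Python sketch. It treats the transducer as a plain list of string pairs, which is a big simplification, and all names in it are hypothetical. The pairs are the ones printed by <code>hfst-fst2strings</code> above.

```python
from collections import defaultdict

# (analysis, surface) pairs of the generator, as listed above.
GENERATOR = [
    ("maşgala<n>", "maşgala"),
    ("maşgala<n><pl>", "maşgalalar"),
    ("esger<n>", "esger"),
    ("esger<n><pl>", "esgerler"),
]


def invert(pairs):
    """Swap the left and right side of every pair."""
    return [(right, left) for left, right in pairs]


def build_lookup(pairs):
    """Index the pairs by their left side."""
    table = defaultdict(list)
    for left, right in pairs:
        table[left].append(right)
    return table


ANALYSER = build_lookup(invert(GENERATOR))


def analyse(word):
    """Format the analyses of a word as ^surface/analysis$, or None if unknown."""
    analyses = ANALYSER.get(word)
    if not analyses:
        return None
    return "^" + "/".join([word] + analyses) + "$"
```

In this sketch <code>analyse("esgerler")</code> returns <code>^esgerler/esger<n><pl>$</code>, the same string that <code>hfst-proc</code> prints above.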

==Troubleshooting==

Here is a brief troubleshooting checklist for when you do something, but it isn't working:

* Are all your ''multicharacter symbols'' defined, including archivowels/consonants? If you think you added them, triple check. This goes for problems in <code>twol</code> as well as in <code>lexc</code>.

==Notes==
<references/>

==Further reading==

[[Category:HFST]]

Revision as of 18:50, 5 December 2011
