Difference between revisions of "Morphotactic constraints with twol"

From Apertium
This page describes how to use [[twol]] to implement morphotactic constraints. These are useful for modelling prefix and circumfix morphology.


Note: You should be comfortable with both [[lexc]] and [[twol]] before trying this. If you need a refresher, check out the [[Starting a new language with HFST|getting started guide]]!


==Prefixes==


Let's suppose you have a language like [[Avar]], where the first part of the verb stem changes for gender, for example:

{|class=wikitable
!colspan=3| бицине "to speak"
|-
| бицуна || {{sc|neu}}-иц-{{sc|aor}} || it speaks
|-
| вицуна || {{sc|msc}}-иц-{{sc|aor}} || he speaks; I (m.) speak; you (m.) speak
|-
| йицуна || {{sc|fem}}-иц-{{sc|aor}} || she speaks; I (f.) speak; you (f.) speak
|-
| рицуна || {{sc|pl}}-иц-{{sc|aor}} || they speak
|-
|}

So, the obvious thing would be just to put all of the forms in the lexicon (the <code>.lexc</code> file), like:

<pre>

LEXICON AOR

%<aor%>:уна # ;

LEXICON Verbs

бицине%<v%>%<tv%>%<nt%>:биц AOR ;
бицине%<v%>%<tv%>%<m%>:виц AOR ;
бицине%<v%>%<tv%>%<f%>:йиц AOR ;
бицине%<v%>%<tv%>%<pl%>:риц AOR ;

</pre>

But this is inefficient: every verb has to be listed four times! What we would like to do is have a single lexicon for the gender prefixes:

<pre>

LEXICON AOR

%<aor%>:уна # ;

LEXICON Prefixes

%<nt%>:б%> Verbs ;
%<m%>:в%> Verbs ;
%<f%>:й%> Verbs ;
%<pl%>:р%> Verbs ;

LEXICON Verbs

бицине%<v%>%<tv%>:иц AOR ; ! "говорить"

</pre>

This is great, but we have a problem... The tag for our prefix is in the wrong place! It comes out before the lemma instead of after the other tags:

<pre>
<nt>бицине<v><tv><aor>:б>ицуна
<m>бицине<v><tv><aor>:в>ицуна
<f>бицине<v><tv><aor>:й>ицуна
<pl>бицине<v><tv><aor>:р>ицуна
</pre>

There is no way in basic [[lexc]] to have long-distance control over which tags appear where. You will end up either with the tags in the wrong place in the string, or overgenerating.
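
To see why the tag lands at the front, here is a minimal Python sketch of how continuation lexicons concatenate their two sides. It is a toy model, not how [[lexc]] is implemented; the names <code>LEXICONS</code> and <code>expand</code> are made up for this illustration.

```python
# Toy model of lexc continuation lexicons (illustrative only).
# Each entry is (analysis side, surface side, next lexicon); "#" ends the word.
LEXICONS = {
    "Prefixes": [
        ("<nt>", "б>", "Verbs"),
        ("<m>", "в>", "Verbs"),
        ("<f>", "й>", "Verbs"),
        ("<pl>", "р>", "Verbs"),
    ],
    "Verbs": [("бицине<v><tv>", "иц", "AOR")],
    "AOR": [("<aor>", "уна", "#")],
}


def expand(lexicon, upper="", lower=""):
    """Yield every analysis:surface pair reachable from `lexicon`."""
    if lexicon == "#":
        yield f"{upper}:{lower}"
        return
    for up, low, nxt in LEXICONS[lexicon]:
        # Both sides grow strictly left to right, so whatever the prefix
        # entry contributes to the analysis is stuck at the front.
        yield from expand(nxt, upper + up, lower + low)


paths = list(expand("Prefixes"))
for path in paths:
    print(path)
```

Because each entry can only append to what came before it, an entry in <code>Prefixes</code> has no way of placing its tag after material that a later lexicon contributes.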

==Overgenerate and constrain==

So the approach we are going to take is to use [[twol]] for morphotactic constraints: we let our basic lexc file overgenerate, and then use twol to strip out the paths that we don't want. This is much like what we do with morphophonology, where we generate all the possible forms and then use twol rules to constrain the surface possibilities.

First, a convention: we're going to use feature tags in <code><nowiki>[]</nowiki></code> to indicate the presence of some prefix on the surface side. For example:

<pre>
LEXICON AOR

%<aor%>%<nt%>%[%+nt%]:уна # ;
%<aor%>%<m%>%[%+m%]:уна # ;
%<aor%>%<f%>%[%+f%]:уна # ;
%<aor%>%<pl%>%[%+pl%]:уна # ;

LEXICON Prefixes

%[%+б%]:б%> Verbs ;
%[%+в%]:в%> Verbs ;
%[%+й%]:й%> Verbs ;
%[%+р%]:р%> Verbs ;

LEXICON Verbs

бицине%<v%>%<tv%>:иц AOR ; ! "говорить"
</pre>

The prefixes are marked with their morpheme, e.g. <code><nowiki>[+б]</nowiki></code>, and the agreement tag is marked with <code><nowiki>[+nt]</nowiki></code>. So let's try compiling that...

<pre>
$ hfst-lexc ava.lexc | hfst-fst2strings
[+б]бицине<v><tv><aor><pl>[+pl]:б>ицуна
[+б]бицине<v><tv><aor><f>[+f]:б>ицуна
[+б]бицине<v><tv><aor><m>[+m]:б>ицуна
[+б]бицине<v><tv><aor><nt>[+nt]:б>ицуна
[+в]бицине<v><tv><aor><pl>[+pl]:в>ицуна
[+в]бицине<v><tv><aor><f>[+f]:в>ицуна
[+в]бицине<v><tv><aor><m>[+m]:в>ицуна
[+в]бицине<v><tv><aor><nt>[+nt]:в>ицуна
[+й]бицине<v><tv><aor><pl>[+pl]:й>ицуна
[+й]бицине<v><tv><aor><f>[+f]:й>ицуна
[+й]бицине<v><tv><aor><m>[+m]:й>ицуна
[+й]бицине<v><tv><aor><nt>[+nt]:й>ицуна
[+р]бицине<v><tv><aor><pl>[+pl]:р>ицуна
[+р]бицине<v><tv><aor><f>[+f]:р>ицуна
[+р]бицине<v><tv><aor><m>[+m]:р>ицуна
[+р]бицине<v><tv><aor><nt>[+nt]:р>ицуна
</pre>

As you can see, we massively overgenerate. So the next thing to do is write a constraint:

<pre>
Alphabet

%[%+б%]:0 %[%+в%]:0 %[%+й%]:0 %[%+р%]:0 %[%+nt%]:0 %[%+m%]:0 %[%+f%]:0 %[%+pl%]:0 ;

Rules

"Match gender prefixes with agreement tags"
Sx:0 /<= _ ;
   except
       _ (:*) Sy:0 ;
   where Sx in ( %[%+б%]  %[%+в%] %[%+й%] %[%+р%]  )
         Sy in ( %[%+nt%] %[%+m%] %[%+f%] %[%+pl%] )
   matched ;

</pre>

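This rule says: always remove paths containing any of the gender prefixes, '''except''' when a matching agreement tag is found somewhere to the right. The <code>matched</code> keyword pairs the two lists position by position, so <code><nowiki>[+б]</nowiki></code> goes with <code><nowiki>[+nt]</nowiki></code>, <code><nowiki>[+в]</nowiki></code> with <code><nowiki>[+m]</nowiki></code>, and so on.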
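
The overgenerate-and-constrain idea can be sketched in plain Python. This is only an illustration of the logic, under the assumption that each path is a flat list of symbols; it is not how twol compiles or applies the rule, and the names <code>MATCHED</code>, <code>allowed</code>, <code>overgenerated</code> and <code>kept</code> are hypothetical.

```python
from itertools import product

PREFIXES = ["[+б]", "[+в]", "[+й]", "[+р]"]      # plays the role of Sx
AGREEMENT = ["[+nt]", "[+m]", "[+f]", "[+pl]"]   # plays the role of Sy

# "matched" pairs the two lists position by position.
MATCHED = dict(zip(PREFIXES, AGREEMENT))


def allowed(symbols):
    """True if every prefix symbol has its matched agreement tag to its right."""
    for i, sym in enumerate(symbols):
        if sym in MATCHED and MATCHED[sym] not in symbols[i + 1:]:
            return False
    return True


# Overgenerate: every prefix combined with every agreement ending.
overgenerated = [
    [prefix, "бицине", "<v>", "<tv>", "<aor>", f"<{agr[2:-1]}>", agr]
    for prefix, agr in product(PREFIXES, AGREEMENT)
]

# Constrain: keep only the paths the rule does not prohibit.
kept = [path for path in overgenerated if allowed(path)]

print(len(overgenerated), "paths before,", len(kept), "after")
for path in kept:
    print("".join(path))
```

Of the sixteen combinations, only the four where the prefix and the agreement tag sit at the same position in their lists survive.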

<pre>
$ hfst-invert ava.lexc.hfst | hfst-compose-intersect -1 - -2 ava.twoc.hfst | hfst-invert | hfst-fst2strings
бицине<v><tv><aor><nt>:б>ицуна
бицине<v><tv><aor><m>:в>ицуна
бицине<v><tv><aor><f>:й>ицуна
бицине<v><tv><aor><pl>:р>ицуна
</pre>
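
Note that the bracketed feature tags no longer appear in the analyses. Each of them is paired with <code>0</code> in the <code>Alphabet</code>, so on the paths that survive they are realised as nothing. A rough Python equivalent of that clean-up, assuming the analysis is held as a plain string (the helper <code>strip_feature_tags</code> is invented for this example):

```python
import re

# Matches bracketed feature tags such as [+б] or [+nt].
FEATURE_TAG = re.compile(r"\[\+[^\]]+\]")


def strip_feature_tags(analysis):
    """Remove every [+...] feature tag, leaving the ordinary <...> tags alone."""
    return FEATURE_TAG.sub("", analysis)


cleaned = strip_feature_tags("[+б]бицине<v><tv><aor><nt>[+nt]")
print(cleaned)
```

The ordinary <code>&lt;...&gt;</code> tags are untouched, which is why the agreement information is still visible as <code>&lt;nt&gt;</code>, <code>&lt;m&gt;</code>, <code>&lt;f&gt;</code> or <code>&lt;pl&gt;</code> at the end of the analysis.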

==Troubleshooting==

; Compiling takes ages

Check the size of the compiled <code>twoc</code> file. Have you kept the symbols in the <code>Alphabet</code> to a minimum? For example, you shouldn't have <code>&lt;...&gt;</code> tags in the <code>Alphabet</code>, only <code>[...]</code> tags.


==See also==


* [[Starting a new language with HFST]]
* [[Replacement for flag diacritics]]


[[Category:HFST]]

Latest revision as of 09:10, 13 February 2017
