Morphotactic constraints with twol


This page describes how to use twol to implement morphotactic constraints. These are useful for modelling prefix and circumfix morphology.

Note: You should be comfortable with both lexc and twol before trying this. If you need a refresher, check out the getting started guide!


Let's suppose you have a language like Avar, where the first part of the verb stem changes for gender, for example:

бицине "to speak"
бицуна    neu-иц-aor    "it speaks"
вицуна    msc-иц-aor    "he speaks; I (m.) speak; you (m.) speak"
йицуна    fem-иц-aor    "she speaks; I (f.) speak; you (f.) speak"
рицуна    pl-иц-aor     "they speak"

So, the obvious thing would be just to put all of the forms in the lexicon (the .lexc file), like:


LEXICON AOR

%<aor%>:уна # ;


LEXICON Verbs

бицине%<v%>%<tv%>%<nt%>:биц AOR ;
бицине%<v%>%<tv%>%<m%>:виц AOR ;
бицине%<v%>%<tv%>%<f%>:йиц AOR ;
бицине%<v%>%<tv%>%<pl%>:риц AOR ;

But this is inefficient: every verb has to be listed four times, once per gender prefix. What we would like to do is have a single lexicon for the gender prefixes:


LEXICON AOR

%<aor%>:уна # ;

LEXICON Prefixes

%<nt%>:б%> Verbs ;
%<m%>:в%> Verbs ;
%<f%>:й%> Verbs ;
%<pl%>:р%> Verbs ;


LEXICON Verbs

бицине%<v%>%<tv%>:иц AOR ; ! "to speak"

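This is great, but we have a problem: the tag for our prefix is in the wrong place! The analysis comes out as <nt>бицине<v><tv><aor>, with the gender tag in front of the lemma instead of after the other tags.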


Basic lexc gives us no long-distance control over which tags appear where. You either end up with the tags in the wrong place in the string, or you overgenerate.

Overgenerate and constrain

So, the approach we are going to take is to use twol for morphotactic constraints: we let our basic lexc file overgenerate, and then use twol to strip out the paths that we don't want. This is much like what we do with morphophonology, where we generate all the possible forms and then use twol rules to constrain the surface possibilities.

First, some conventions. We're going to use feature tags in [] to indicate the presence of some prefix on the surface side. For example:


LEXICON AOR

%<aor%>%<nt%>%[%+nt%]:уна # ;
%<aor%>%<m%>%[%+m%]:уна # ;
%<aor%>%<f%>%[%+f%]:уна # ;
%<aor%>%<pl%>%[%+pl%]:уна # ;

LEXICON Prefixes

%[%+б%]:б%> Verbs ;
%[%+в%]:в%> Verbs ;
%[%+й%]:й%> Verbs ;
%[%+р%]:р%> Verbs ;


LEXICON Verbs

бицине%<v%>%<tv%>:иц AOR ; ! "to speak"

Each prefix is marked with its morpheme, e.g. [+б], and each agreement tag is marked with its feature, e.g. [+nt]. So let's try compiling that...

$ hfst-lexc ava.lexc | hfst-fst2strings 
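The output is not reproduced here. As a rough stand-in, the following Python sketch concatenates one entry from each of the three lexicons above, the way the continuation classes chain them. The entry lists are copied from the lexc fragments; the function name `overgenerate` and the `upper:lower` print layout are assumptions made for illustration, not actual HFST output.

```python
from itertools import product

# (upper, lower) pairs copied from the lexc entries above.
PREFIXES = [("[+б]", "б>"), ("[+в]", "в>"), ("[+й]", "й>"), ("[+р]", "р>")]
VERBS = [("бицине<v><tv>", "иц")]
AOR = [
    ("<aor><nt>[+nt]", "уна"),
    ("<aor><m>[+m]", "уна"),
    ("<aor><f>[+f]", "уна"),
    ("<aor><pl>[+pl]", "уна"),
]


def overgenerate():
    """Concatenate one entry per lexicon: Prefixes, then Verbs, then AOR."""
    paths = []
    for entries in product(PREFIXES, VERBS, AOR):
        upper = "".join(u for u, _ in entries)
        lower = "".join(l for _, l in entries)
        paths.append((upper, lower))
    return paths


paths = overgenerate()
for upper, lower in paths:
    print(f"{upper}:{lower}")
print(len(paths), "paths")
```

Every prefix is paired with every agreement tag, so a mismatched analysis such as `[+б]бицине<v><tv><aor><m>[+m]` sits alongside the correct `[+б]бицине<v><tv><aor><nt>[+nt]`.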

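Every prefix combines with every agreement tag, so the four prefixes and four aorist endings give 4 × 4 = 16 analyses, of which only four are valid. We massively overgenerate. So the next thing to do is write a constraint: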


Alphabet

%[%+б%]:0 %[%+в%]:0 %[%+й%]:0 %[%+р%]:0 %[%+nt%]:0 %[%+m%]:0 %[%+f%]:0 %[%+pl%]:0 ;

Rules


"Match gender prefixes with agreement tags"
Sx:0 /<= _ ;
   except
       _ (:*) Sy:0 ;
   where Sx in ( %[%+б%]  %[%+в%] %[%+й%] %[%+р%]  )
         Sy in ( %[%+nt%] %[%+m%] %[%+f%] %[%+pl%] )
   matched ;

This basically says: always remove paths containing any of the gender prefix markers, unless a matching agreement tag is found to the right.
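To make the logic of the rule concrete, here is a minimal Python sketch of the same check applied to an analysis string. It is only an illustration of the idea, not how twol evaluates rules; the helper names `markers`, `allowed` and `strip_markers` are hypothetical, and the pairing table simply mirrors the order of the two `where` lists under `matched`.

```python
import re

# Pairing implied by `matched`: the n-th prefix marker goes with the n-th agreement marker.
PREFIX_MARKERS = ["[+б]", "[+в]", "[+й]", "[+р]"]
AGREEMENT_MARKERS = ["[+nt]", "[+m]", "[+f]", "[+pl]"]
MATCH = dict(zip(PREFIX_MARKERS, AGREEMENT_MARKERS))

MARKER_RE = re.compile(r"\[\+[^\]]+\]")


def markers(analysis):
    """Return the [+...] markers of an analysis string, in order."""
    return MARKER_RE.findall(analysis)


def allowed(analysis):
    """True if every prefix marker has its matching agreement marker to its right."""
    found = markers(analysis)
    for i, marker in enumerate(found):
        if marker in MATCH and MATCH[marker] not in found[i + 1:]:
            return False
    return True


def strip_markers(analysis):
    """Delete the [+...] markers, mirroring the :0 pairs in the Alphabet."""
    return MARKER_RE.sub("", analysis)


good = "[+б]бицине<v><tv><aor><nt>[+nt]"
bad = "[+б]бицине<v><tv><aor><m>[+m]"
print(allowed(good), strip_markers(good))
print(allowed(bad))
```

The first analysis is kept and, once its markers are deleted, reads `бицине<v><tv><aor><nt>`; the second is rejected because `[+б]` is never followed by `[+nt]`.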

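Compile both files (assuming the rules are saved as ava.twoc), then apply the constraint. The lexicon is inverted first so that the tag side, where the [+...] markers live, faces the rules, and it is inverted back afterwards:

$ hfst-lexc ava.lexc -o ava.lexc.hfst
$ hfst-twolc -i ava.twoc -o ava.twoc.hfst
$ hfst-invert ava.lexc.hfst | hfst-compose-intersect -1 - -2 ava.twoc.hfst | hfst-invert | hfst-fst2strings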


Compiling takes ages

Check the size of the compiled twoc file. Have you kept the symbols in the Alphabet to a minimum? For example, you shouldn't have <...> tags in the Alphabet, just [...] tags.
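One quick way to check is to scan the Alphabet section for escaped <...> symbols. The Python sketch below is a rough, hypothetical helper (the name `angle_tags_in_alphabet` is not part of any Apertium or HFST tool). It assumes the Alphabet section starts with the word `Alphabet` at the beginning of a line and ends at the first `;`, and it does not handle comments or escaped semicolons.

```python
import re


def angle_tags_in_alphabet(twol_source):
    """Return escaped <...> symbols found between 'Alphabet' and the first ';'."""
    section = re.search(r"^Alphabet\b(.*?);", twol_source, flags=re.DOTALL | re.MULTILINE)
    if section is None:
        return []
    # Symbols such as %<nt%> are written with %-escaped angle brackets in twol.
    return re.findall(r"%<[^%\s]*%>", section.group(1))


sample = "Alphabet\n\n%[%+б%]:0 %[%+nt%]:0 %<nt%>:0 ;\n\nRules\n"
print(angle_tags_in_alphabet(sample))
```

An empty list means only [...] markers (or other non-tag symbols) are declared, which is what you want here.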
