Difference between revisions of "Morphotactic constraints with twol"

Latest revision as of 09:10, 13 February 2017

This page describes how to use twol to implement morphotactic constraints. These are useful for modelling prefix and circumfix morphology.

Note: You should be comfortable with both lexc and twol before trying this. If you need a refresher, check out the getting started guide!

Prefixes

Let's suppose you have a language like Avar, where the first part of the verb stem changes for gender, for example:

бицине "to speak"
бицуна	neu-иц-aor	it speaks
вицуна	msc-иц-aor	he speaks; I (m.) speak; you (m.) speak
йицуна	fem-иц-aor	she speaks; I (f.) speak; you (f.) speak
рицуна	pl-иц-aor	they speak

So, the obvious thing would be just to put all of the forms in the lexicon (the .lexc file), like:


LEXICON AOR

%<aor%>:уна # ;

LEXICON Verbs

бицине%<v%>%<tv%>%<nt%>:биц AOR ;
бицине%<v%>%<tv%>%<m%>:виц AOR ;
бицине%<v%>%<tv%>%<f%>:йиц AOR ;
бицине%<v%>%<tv%>%<pl%>:риц AOR ;

But this is inefficient: each verb like this needs four near-identical entries, one per prefix. What we would like instead is a single lexicon for the gender prefixes:


LEXICON AOR

%<aor%>:уна # ;

LEXICON Prefixes

%<nt%>:б%> Verbs ;
%<m%>:в%> Verbs ;
%<f%>:й%> Verbs ;
%<pl%>:р%> Verbs ;

LEXICON Verbs

бицине%<v%>%<tv%>:иц AOR ; ! "to speak"

This is great, but we have a problem... The tag for our prefix is in the wrong place!

<nt>бицине<v><tv><aor>:б>ицуна
<m>бицине<v><tv><aor>:в>ицуна
<f>бицине<v><tv><aor>:й>ицуна
<pl>бицине<v><tv><aor>:р>ицуна

Basic lexc gives us no long-distance control over which tags appear where. You either end up with the tags in the wrong place in the string, or you overgenerate.

Overgenerate and constrain

So, the approach we are going to take is to use twol for the morphotactic constraints. That means we let our basic lexc file overgenerate, and then use twol to strip out the paths that we don't want. This is much like what we do with morphophonology, where we generate all the possible forms and then use twol rules to constrain the surface possibilities.

First, some conventions. We're going to use feature tags in [] to indicate the presence of some prefix on the surface side. For example:

LEXICON AOR

%<aor%>%<nt%>%[%+nt%]:уна # ;
%<aor%>%<m%>%[%+m%]:уна # ;
%<aor%>%<f%>%[%+f%]:уна # ;
%<aor%>%<pl%>%[%+pl%]:уна # ;

LEXICON Prefixes

%[%+б%]:б%> Verbs ;
%[%+в%]:в%> Verbs ;
%[%+й%]:й%> Verbs ;
%[%+р%]:р%> Verbs ;

LEXICON Verbs

бицине%<v%>%<tv%>:иц AOR ; ! "to speak"

The prefixes are marked with their morpheme, e.g. [+б], and the agreement tag is marked with [+nt]. So let's try compiling that...

$ hfst-lexc ava.lexc | hfst-fst2strings 
[+б]бицине<v><tv><aor><pl>[+pl]:б>ицуна
[+б]бицине<v><tv><aor><f>[+f]:б>ицуна
[+б]бицине<v><tv><aor><m>[+m]:б>ицуна
[+б]бицине<v><tv><aor><nt>[+nt]:б>ицуна
[+в]бицине<v><tv><aor><pl>[+pl]:в>ицуна
[+в]бицине<v><tv><aor><f>[+f]:в>ицуна
[+в]бицине<v><tv><aor><m>[+m]:в>ицуна
[+в]бицине<v><tv><aor><nt>[+nt]:в>ицуна
[+й]бицине<v><tv><aor><pl>[+pl]:й>ицуна
[+й]бицине<v><tv><aor><f>[+f]:й>ицуна
[+й]бицине<v><tv><aor><m>[+m]:й>ицуна
[+й]бицине<v><tv><aor><nt>[+nt]:й>ицуна
[+р]бицине<v><tv><aor><pl>[+pl]:р>ицуна
[+р]бицине<v><tv><aor><f>[+f]:р>ицуна
[+р]бицине<v><tv><aor><m>[+m]:р>ицуна
[+р]бицине<v><tv><aor><nt>[+nt]:р>ицуна

As you can see, we massively overgenerate. So the next thing to do is write a constraint:

Alphabet

%[%+б%]:0 %[%+в%]:0 %[%+й%]:0 %[%+р%]:0 %[%+nt%]:0 %[%+m%]:0 %[%+f%]:0 %[%+pl%]:0 ;

Rules

"Match gender prefixes with agreement tags"
Sx:0 /<= _ ;
   except
       _ (:*) Sy:0 ;
   where Sx in ( %[%+б%]  %[%+в%] %[%+й%] %[%+р%]  )
         Sy in ( %[%+nt%] %[%+m%] %[%+f%] %[%+pl%] )
   matched ;

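This basically says: always remove paths that contain one of the gender prefix markers, unless the matching agreement tag appears somewhere to its right. Because of the matched keyword, the two sets are paired by position, so [+б] goes with [+nt], [+в] with [+m], [+й] with [+f] and [+р] with [+pl].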

$ hfst-invert ava.lexc.hfst | hfst-compose-intersect -1 - -2 ava.twoc.hfst | hfst-invert | hfst-fst2strings 
бицине<v><tv><aor><nt>:б>ицуна
бицине<v><tv><aor><m>:в>ицуна
бицине<v><tv><aor><f>:й>ицуна
бицине<v><tv><aor><pl>:р>ицуна
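
To make the overgenerate-and-constrain idea concrete, here is a minimal Python sketch that mimics what the lexc file and the twol rule above do together. It is only a simulation and does not use HFST: the names PREFIXES, AGREEMENT, MATCHED, overgenerate and constrain are invented for illustration, and the positional pairing of the two lists is assumed to mirror the effect of matched in the rule.

```python
from itertools import product

# Prefix markers and agreement markers, listed in the same order as the
# Sx and Sy sets of the twol rule. Pairing them by position imitates the
# "matched" keyword (an assumption of this sketch, not HFST code).
PREFIXES = ["б", "в", "й", "р"]
AGREEMENT = ["nt", "m", "f", "pl"]
MATCHED = dict(zip(PREFIXES, AGREEMENT))


def overgenerate(lemma="бицине", stem="иц", suffix="уна"):
    """Yield (prefix, tag, analysis, surface) for every prefix/tag pair,
    the way the overgenerating lexc file does."""
    for prefix, tag in product(PREFIXES, AGREEMENT):
        analysis = f"[+{prefix}]{lemma}<v><tv><aor><{tag}>[+{tag}]"
        surface = f"{prefix}>{stem}{suffix}"
        yield prefix, tag, analysis, surface


def constrain(paths):
    """Keep only the paths whose prefix marker meets its matching agreement
    marker, then drop both markers (the rule realises them as 0)."""
    for prefix, tag, analysis, surface in paths:
        if MATCHED[prefix] == tag:
            cleaned = analysis.replace(f"[+{prefix}]", "").replace(f"[+{tag}]", "")
            yield f"{cleaned}:{surface}"


if __name__ == "__main__":
    all_paths = list(overgenerate())
    print(len(all_paths), "paths before the constraint")
    for line in constrain(all_paths):
        print(line)
```

Running the sketch reports 16 paths before filtering and keeps 4 afterwards, the same pairs that survive the compose-intersect step above.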

Troubleshooting

Compiling takes ages

Check the size of the compiled twoc file. Have you kept the symbols in the Alphabet to a minimum? For example, you shouldn't have <...> tags in the Alphabet, just the [...] tags that the rules refer to.
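
If the twol file is long, a small script can do this check for you. The following is a rough sketch, not part of HFST: the helper name angle_tags_in_alphabet is hypothetical, and it assumes that the Alphabet section ends at the first ; and that tags are written in the escaped %<...%> form used on this page.

```python
import re


def angle_tags_in_alphabet(twol_source):
    """Return the %<...%> symbols declared in the Alphabet section.

    Assumes the section runs from the line starting with "Alphabet"
    up to the first ";" (an assumption of this sketch).
    """
    section = re.search(r"^Alphabet\b(.*?);", twol_source, flags=re.S | re.M)
    if section is None:
        return []
    # Escaped angle-bracket tags look like %<nt%>; [...] markers are ignored.
    return re.findall(r"%<.*?%>", section.group(1))


if __name__ == "__main__":
    sample = "Alphabet\n%<nt%> %<m%> %[%+б%]:0 %[%+nt%]:0 ;\n\nRules\n"
    for symbol in angle_tags_in_alphabet(sample):
        print("consider removing from the Alphabet:", symbol)
```

An empty result means the Alphabet contains no <...> tags, which is what you want here.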
