Morphotactic constraints with twol
Revision by Firespeaker
Latest revision as of 09:10, 13 February 2017
This page describes how to use twol to implement morphotactic constraints. These are useful for modelling prefix and circumfix morphology.
Note: You should be comfortable with both lexc and twol before trying this. If you need a refresher, check out the getting started guide!
Prefixes
Let's suppose you have a language like Avar, where the first part of the verb stem changes for gender, for example:
бицине "to speak"

Form | Gloss | Translation |
---|---|---|
бицуна | neu-иц-aor | it speaks |
вицуна | msc-иц-aor | he speaks; I (m.) speak; you (m.) speak |
йицуна | fem-иц-aor | she speaks; I (f.) speak; you (f.) speak |
рицуна | pl-иц-aor | they speak |
So, the obvious thing would be just to put all of the forms in the lexicon (the .lexc file), like:

    LEXICON AOR

    %<aor%>:уна # ;

    LEXICON Verbs

    бицине%<v%>%<tv%>%<nt%>:биц AOR ;
    бицине%<v%>%<tv%>%<m%>:виц AOR ;
    бицине%<v%>%<tv%>%<f%>:йиц AOR ;
    бицине%<v%>%<tv%>%<pl%>:риц AOR ;

But this is inefficient: every verb would need one entry per gender prefix. What we would like to do is have a single lexicon for the gender prefixes:

    LEXICON AOR

    %<aor%>:уна # ;

    LEXICON Prefixes

    %<nt%>:б%> Verbs ;
    %<m%>:в%> Verbs ;
    %<f%>:й%> Verbs ;
    %<pl%>:р%> Verbs ;

    LEXICON Verbs

    бицине%<v%>%<tv%>:иц AOR ; ! "говорить"

This is great, but we have a problem... The tag for our prefix is in the wrong place!

    <nt>бицине<v><tv><aor>:б>ицуна
    <m>бицине<v><tv><aor>:в>ицуна
    <f>бицине<v><tv><aor>:й>ицуна
    <pl>бицине<v><tv><aor>:р>ицуна

Basic lexc gives us no long-distance control over which tags appear where. You will end up either with the tags in the wrong place in the string, or overgenerating.
Overgenerate and constrain
So, the approach we are going to take is to use twol for morphotactic constraints: we let our basic lexc file overgenerate, and then strip out the paths that we don't want using twol. This is much like what we do with morphophonology, where we generate all the possible forms and then use twol rules to constrain the surface possibilities.
First, some conventions: we're going to use feature tags in [] to indicate the presence on the surface side of some prefix. For example:

    LEXICON AOR

    %<aor%>%<nt%>%[%+nt%]:уна # ;
    %<aor%>%<m%>%[%+m%]:уна # ;
    %<aor%>%<f%>%[%+f%]:уна # ;
    %<aor%>%<pl%>%[%+pl%]:уна # ;

    LEXICON Prefixes

    %[%+б%]:б%> Verbs ;
    %[%+в%]:в%> Verbs ;
    %[%+й%]:й%> Verbs ;
    %[%+р%]:р%> Verbs ;

    LEXICON Verbs

    бицине%<v%>%<tv%>:иц AOR ; ! "говорить"

The prefixes are marked with their morpheme, e.g. [+б], and the agreement tag is marked with a feature tag, e.g. [+nt]. So let's try compiling that...

    $ hfst-lexc ava.lexc | hfst-fst2strings
    [+б]бицине<v><tv><aor><pl>[+pl]:б>ицуна
    [+б]бицине<v><tv><aor><f>[+f]:б>ицуна
    [+б]бицине<v><tv><aor><m>[+m]:б>ицуна
    [+б]бицине<v><tv><aor><nt>[+nt]:б>ицуна
    [+в]бицине<v><tv><aor><pl>[+pl]:в>ицуна
    [+в]бицине<v><tv><aor><f>[+f]:в>ицуна
    [+в]бицине<v><tv><aor><m>[+m]:в>ицуна
    [+в]бицине<v><tv><aor><nt>[+nt]:в>ицуна
    [+й]бицине<v><tv><aor><pl>[+pl]:й>ицуна
    [+й]бицине<v><tv><aor><f>[+f]:й>ицуна
    [+й]бицине<v><tv><aor><m>[+m]:й>ицуна
    [+й]бицине<v><tv><aor><nt>[+nt]:й>ицуна
    [+р]бицине<v><tv><aor><pl>[+pl]:р>ицуна
    [+р]бицине<v><tv><aor><f>[+f]:р>ицуна
    [+р]бицине<v><tv><aor><m>[+m]:р>ицуна
    [+р]бицине<v><tv><aor><nt>[+nt]:р>ицуна

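To see where the sixteen paths come from, here is a minimal Python sketch that mimics how the three lexicons chain together. It is only an illustration, not how hfst-lexc works internally: the lists `PREFIXES`, `VERBS` and `AOR` and the function `generate_paths` are hypothetical stand-ins for the lexicons above.

```python
from itertools import product

# Hypothetical in-memory stand-ins for the three lexicons above;
# each entry is an (analysis side, surface side) pair.
PREFIXES = [("[+б]", "б>"), ("[+в]", "в>"), ("[+й]", "й>"), ("[+р]", "р>")]
VERBS = [("бицине<v><tv>", "иц")]
AOR = [
    ("<aor><nt>[+nt]", "уна"),
    ("<aor><m>[+m]", "уна"),
    ("<aor><f>[+f]", "уна"),
    ("<aor><pl>[+pl]", "уна"),
]


def generate_paths(*lexicons):
    """Concatenate one entry from each lexicon, in order, for every combination."""
    paths = []
    for combo in product(*lexicons):
        analysis = "".join(upper for upper, _ in combo)
        surface = "".join(lower for _, lower in combo)
        paths.append(f"{analysis}:{surface}")
    return paths


paths = generate_paths(PREFIXES, VERBS, AOR)
print(len(paths))  # 4 prefixes x 1 stem x 4 agreement endings = 16
print(paths[0])    # [+б]бицине<v><tv><aor><nt>[+nt]:б>ицуна
```

Nothing in the chain links the choice of prefix to the choice of agreement ending, so every prefix combines with every ending.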
As you can see, we massively overgenerate. So the next thing to do is write a constraint:

    Alphabet

      %[%+б%]:0 %[%+в%]:0 %[%+й%]:0 %[%+р%]:0 %[%+nt%]:0 %[%+m%]:0 %[%+f%]:0 %[%+pl%]:0 ;

    Rules

    "Match gender prefixes with agreement tags"
    Sx:0 /<= _ ;
       except
           _ (:*) Sy:0 ;
       where Sx in ( %[%+б%] %[%+в%] %[%+й%] %[%+р%] )
             Sy in ( %[%+nt%] %[%+m%] %[%+f%] %[%+pl%] )
       matched ;

This basically says: always remove paths containing any of the gender prefixes, except when there is a matching agreement tag to the right.

    $ hfst-invert ava.lexc.hfst | hfst-compose-intersect -1 - -2 ava.twoc.hfst | hfst-invert | hfst-fst2strings
    бицине<v><tv><aor><nt>:б>ицуна
    бицине<v><tv><aor><m>:в>ицуна
    бицине<v><tv><aor><f>:й>ицуна
    бицине<v><tv><aor><pl>:р>ицуна

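The effect of the rule can be sketched in plain Python. This is a rough simulation under stated assumptions, not what hfst-twolc does internally: the `MATCHED` table is read off the two `matched` lists in the rule (first with first, second with second, and so on), and the names `overgenerate`, `is_allowed` and `strip_features` are hypothetical helpers for illustration only.

```python
import re
from itertools import product

# Assumed pairing, read off the two "matched" lists in the rule.
MATCHED = {"[+б]": "[+nt]", "[+в]": "[+m]", "[+й]": "[+f]", "[+р]": "[+pl]"}

# Hypothetical stand-ins for the Prefixes lexicon and the agreement tags.
PREFIXES = {"[+б]": "б>", "[+в]": "в>", "[+й]": "й>", "[+р]": "р>"}
AGREEMENT = ["nt", "m", "f", "pl"]


def overgenerate():
    """Yield every (analysis, surface) pair the unconstrained lexicon allows."""
    for (prefix, surface_prefix), agr in product(PREFIXES.items(), AGREEMENT):
        analysis = f"{prefix}бицине<v><tv><aor><{agr}>[+{agr}]"
        yield analysis, f"{surface_prefix}ицуна"


def is_allowed(analysis):
    """Reject a path if a prefix symbol has no matching agreement symbol to its right."""
    for prefix, agreement in MATCHED.items():
        pos = analysis.find(prefix)
        if pos != -1 and agreement not in analysis[pos + len(prefix):]:
            return False
    return True


def strip_features(analysis):
    """Delete the [+...] symbols, mirroring the ':0' pairs in the Alphabet."""
    return re.sub(r"\[\+[^\]]+\]", "", analysis)


kept = [
    f"{strip_features(analysis)}:{surface}"
    for analysis, surface in overgenerate()
    if is_allowed(analysis)
]
for path in kept:
    print(path)
# бицине<v><tv><aor><nt>:б>ицуна
# бицине<v><tv><aor><m>:в>ицуна
# бицине<v><tv><aor><f>:й>ицуна
# бицине<v><tv><aor><pl>:р>ицуна
```

Of the sixteen overgenerated paths, only the four in which the prefix and the agreement tag match survive, and the bracketed symbols no longer appear in the analyses.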
Troubleshooting
- Compiling takes ages
  Check the size of the compiled twoc file. Have you kept the symbols in the Alphabet to a minimum? For example, you shouldn't have <...> tags in the Alphabet, just [...] tags.