Initial consonant mutation

From Apertium
Revision as of 18:08, 25 May 2009 by Francis Tyers

This page gives a brief overview of how initial consonant mutations are currently analysed, disambiguated and generated in Apertium. It uses as an example the apertium-ga-gd language pair.

Analysis

Analysis is taken care of by creating word-initial paradigms which simply replace the non-mutated forms with the mutated forms. For example for the initial consonant, 'b', which can be lenited as 'bh' or eclipsed as 'mb', we get the following initial mutation paradigm:

    <pardef n="initial-b">
      <e><p><l>b</l><r>b</r></p></e>
      <e><p><l>bh</l><r>b</r></p></e>
      <e><p><l>mb</l><r>b</r></p></e>
    </pardef>

This can then be applied to a word, e.g. "bulc", as follows:

    <e lm="bulc"><par n="initial-b"/><i>u</i><par n="bu/lc__n"/></e>

The initial-b paradigm supplies the initial consonant 'b' in its plain or mutated form, and the rest of the word takes the bu/lc__n paradigm. The drawback of this method is that it can sometimes cause over-analysis, but this can be handled during disambiguation, as described below.
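The effect of such a paradigm on analysis can be mimicked in a few lines of Python. This is only a toy model of the left/right pairs shown above, not how lttoolbox compiles or applies paradigms; the names INITIAL_B, LEXICON and analyse are made up for illustration, and the lexicon is assumed to contain just the one lemma.

```python
# Toy model of the "initial-b" paradigm: each pair maps a surface
# prefix (the <l> side) to its lexical prefix (the <r> side).
INITIAL_B = [("b", "b"), ("bh", "b"), ("mb", "b")]

# Hypothetical one-word lexicon standing in for the real dictionary.
LEXICON = {"bulc"}


def analyse(surface):
    """Return the lemmas reachable by undoing one initial mutation."""
    lemmas = []
    for left, right in INITIAL_B:
        if surface.startswith(left):
            candidate = right + surface[len(left):]
            if candidate in LEXICON and candidate not in lemmas:
                lemmas.append(candidate)
    return lemmas


for word in ("bulc", "bhulc", "mbulc"):
    print(word, "->", analyse(word))
```

All three surface forms come back with the lemma "bulc", which is why context is needed later to tell which mutation was actually intended.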

Disambiguation

Disambiguation of initial mutations can be done using constraint grammar (see the file apertium-ga-gd.ga-gd.rlx). The apertium-tagger is not useful for this purpose, as it cannot look at the surface forms of words, only at lexical units. A trivial illustrative example of how the constraint grammar can be used is presented below. Let's say we want to choose between a feminine and a masculine possessive determiner based on the type of mutation exhibited by the following noun, as in the following two phrases:[1]

  • a pheann — his pen
  • a haois — her age

Here the determiner, "a", can be either masculine or feminine (that is, "his" or "her"). Suppose we have the following input to the constraint grammar:

  ^a/a<det><pos><p3><mf><pl>/a<det><pos><p3><m><sg>/a<det><pos><p3><f><sg>$ ^pheann/peann<n><m><sg><nom>/peann<n><m><pl><gen>$
  ^a/a<det><pos><p3><mf><pl>/a<det><pos><p3><m><sg>/a<det><pos><p3><f><sg>$ ^haois/aois<n><f><sg><nom>$ 

First we define the lists and sets we want to work with:

LIST DetPos = (det pos);             # possessive determiner

LIST hPro = ("<h.*>"r "[aeiou].*"r); # h-prothesis
LIST Len = ("<ph.*>"r "p.*"r);       # lenition

SET DetPosF = DetPos + (f);          # feminine possessive determiner
SET DetPosM = DetPos + (m);          # masculine possessive determiner

This should be fairly straightforward. Then we write rules that say "Choose the feminine possessive when the noun that follows is subject to h-prothesis, and the masculine possessive when the noun that follows is subject to lenition":

SELECT DetPosF IF (1 hPro);
SELECT DetPosM IF (1 Len);

Applying this grammar gives:

  ^a<det><pos><p3><m><sg>$ ^peann<n><m><sg><nom>$ 
  ^a<det><pos><p3><f><sg>$ ^aois<n><f><sg><nom>$ 

The desired result.
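The logic of the two SELECT rules can be sketched in Python as follows. This is a simplified illustration, not a constraint grammar engine: the functions h_prothesis, lenition and select_determiner are hypothetical names, readings are modelled as plain tuples of tags, and only the word immediately to the right is inspected.

```python
# Readings of the determiner "a", written as tuples of tags.
DET_READINGS = [
    ("det", "pos", "p3", "mf", "pl"),
    ("det", "pos", "p3", "m", "sg"),
    ("det", "pos", "p3", "f", "sg"),
]


def h_prothesis(surface, lemma):
    """Surface form starts with 'h' and the lemma starts with a vowel."""
    return surface.startswith("h") and lemma[:1] in tuple("aeiou")


def lenition(surface, lemma):
    """Surface form starts with 'ph' and the lemma starts with 'p'."""
    return surface.startswith("ph") and lemma.startswith("p")


def select_determiner(readings, next_surface, next_lemma):
    """Keep only the readings picked out by the two SELECT rules."""
    if h_prothesis(next_surface, next_lemma):
        wanted = "f"
    elif lenition(next_surface, next_lemma):
        wanted = "m"
    else:
        return readings  # no rule applies, so the ambiguity is left alone
    selected = [r for r in readings
                if "det" in r and "pos" in r and wanted in r]
    # As with SELECT, do nothing if no reading matches the target.
    return selected or readings


print(select_determiner(DET_READINGS, "pheann", "peann"))
print(select_determiner(DET_READINGS, "haois", "aois"))
```

Note that the target requires the determiner tags and the gender tag together, which is why the "mf" plural reading is discarded in both cases.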

Generation

Overview

Generation of initial mutations takes place in two files. Each filename contains the code of the language that is being generated (ga for Irish, gd for Scottish Gaelic) immediately after "pre-" or "muta-", as in apertium-ga-gd.muta-gd.dix.

  • apertium-ga-gd.pre-.t1x — Transfer rules which add tags defining the mutation to the beginning of words which should be mutated.
  • apertium-ga-gd.muta-.dix — A post-generation dictionary which takes the tag and the initial letter of the word and outputs the mutated form.

For example, when translating the phrase "do theach" (your house) from Irish to Scottish Gaelic, the result will be "do thaigh", where the initial "t" of "taigh" has been lenited to "th". The output of apertium-transfer will be:

  ^do<det><pos><p2><mf><sg>$ ^taigh<n><m><sg><nom>$ 

This is then passed through apertium-ga-gd.pre-gd.t1x, which adds a tag, <l1>, for lenition:

  ^do<det><pos><p2><mf><sg>$ ^<l1>$^taigh<n><m><sg><nom>$ 

The morphological generator then outputs the surface forms of the words, and prepends an "alarm" signal, '~', to the mutation tag:

  do ~<l1>taigh

Finally, the mutation dictionary, apertium-ga-gd.muta-gd.dix, replaces the string ~<l1>t with 'th', which is the lenited form of 't'.

  do thaigh

apertium-ga-gd.pre-.t1x

As mentioned above, the input to this stage is:

  ^do<det><pos><p2><mf><sg>$ ^taigh<n><m><sg><nom>$ 

A simplified (although functioning) rule in this file might look something like:

<rule>
  <pattern>
    <pattern-item n="det"/>
    <pattern-item n="nom"/>
  </pattern>
  <action>
    <choose>
      <when> <!-- When the lemma of the determiner is "do",
                  apply lenition to the following noun -->
        <test>
          <equal>
            <clip pos="1" side="tl" part="lem"/>
            <lit v="do"/>
          </equal>
        </test>
        <out>
          <lu><clip pos="1" side="tl" part="whole"/></lu>
          <b/>
          <lu><lit-tag v="l1"/></lu>    <!-- Lenition -->
          <lu><clip pos="2" side="tl" part="whole"/></lu>
        </out>
      </when>
    </choose>
  </action>
</rule>

The output will be:

  ^do<det><pos><p2><mf><sg>$ ^<l1>$^taigh<n><m><sg><nom>$ 
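What this stage does to the stream can be mimicked with a small Python sketch. It is not the transfer engine: format_lu and lenite_after_do are hypothetical helpers, lexical units are modelled as (lemma, tags) pairs, and passing other determiners through unchanged is an assumption made here for completeness, since the simplified rule above does not say what happens in that case.

```python
def format_lu(lemma, tags):
    """Render a lexical unit in Apertium stream format, e.g. ^do<det>$."""
    return "^" + lemma + "".join("<%s>" % tag for tag in tags) + "$"


def lenite_after_do(det, noun):
    """Toy determiner + noun rule: insert <l1> when the determiner is 'do'."""
    det_lemma, det_tags = det
    noun_lemma, noun_tags = noun
    out = format_lu(det_lemma, det_tags) + " "
    if det_lemma == "do":
        out += "^<l1>$"  # lexical unit holding only the lenition tag
    # Assumption: any other determiner is output without a mutation tag.
    return out + format_lu(noun_lemma, noun_tags)


print(lenite_after_do(("do", ["det", "pos", "p2", "mf", "sg"]),
                      ("taigh", ["n", "m", "sg", "nom"])))
```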

apertium-ga-gd.muta-.dix

The input to this stage is:

  do ~<l1>taigh

The "rule", or rather "entry", in the mutation dictionary will look like:

    <e>
      <p>
        <l><a/><s n="l1"/>t</l>
        <r>th</r>
      </p>
      <par n="alphabet"/> 
    </e>

Here, the alphabet is defined as a paradigm which, for any given input letter, simply outputs that letter unchanged. So the entry basically says:

"When we have the alarm symbol '~', followed by a tag indicating lenition followed by a 't' and then any alphabetic character, output 'th' followed by the next character"

The output of this stage is a correctly mutated phrase:

  do thaigh
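The replacement performed by the mutation dictionary can be approximated with a regular expression in Python. This is a rough sketch rather than a description of how the post-generator works internally: MUTATIONS, ALARM and apply_mutations are made-up names, the table holds only the one entry discussed in the text, and leaving unknown tag and letter combinations untouched is an assumption.

```python
import re

# Toy mutation table: (mutation tag, initial letter) -> mutated initial.
# Only the entry described in the text is listed.
MUTATIONS = {("l1", "t"): "th"}

# '~', a tag such as <l1>, the letter to mutate, then one more
# word character (checked but not consumed).
ALARM = re.compile(r"~<(\w+)>(\w)(?=\w)")


def apply_mutations(text):
    """Replace each '~<tag>letter' with its mutated form, if one is known."""
    def replace(match):
        key = (match.group(1), match.group(2))
        # Assumption: combinations missing from the table are left as they are.
        return MUTATIONS.get(key, match.group(0))
    return ALARM.sub(replace, text)


print(apply_mutations("do ~<l1>taigh"))
```

The lookahead plays the role of the alphabet paradigm: the character after the mutated letter has to be there, but it passes through unchanged.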

  1. Note that this is not really relevant for Irish to Scottish Gaelic, as the surface forms are the same in both languages.