Scottish Gaelic and Irish

Todo

  • Add ability to analyse initial mutations to the monolingual dictionary.
-- I have most of the work done for this -- Jimregan
  • Add all closed categories to the monolingual dictionaries.
  • Improve the tagger -- write restrictions/constraints, and then retrain.
  • Perform an intersection on the monolingual dictionaries.
    • We only want stuff in the Irish analyser that we can translate into Scottish Gaelic. So, for a word to be included, it should be in the Irish monolingual dictionary and the bilingual dictionary, and its translation should be in the Scottish Gaelic monolingual dictionary. Words for which we don't have translations can just be commented out for now.
-- Count me out on this one; I will suggest using <e i="yes"> etc. instead of xml comments -- Jimregan
  • Fix up the bilingual dictionary.
    • There are some entries with unknown gender on the Scottish Gaelic side.
    • Some restrictions probably need adding.
    • Some conjunctions are marked "cnj" and not subdivided into "cnjcoo", "cnjsub" etc.
-- I'll take this one too -- Jimregan
  • Write rules to do initial mutations for generation.
  • Write some transfer rules.
    • For example to do tenses, number agreement, etc.
-- We can probably take most of this stuff from another language pair and add the consonant etc. stuff later; for the most part, adjective chunks etc. should be the same as those in at least one other pair (I'll scout around for which) -- Jimregan

Tagger

Initial mutations

As Celtic languages, both Scottish Gaelic and Irish exhibit initial consonant mutation. What follows is a brief description of how the analysis, disambiguation and generation of this phenomenon are handled in the apertium-ga-gd package.

Analysis and disambiguation

Generation

Overview

Generation of initial mutations takes place in two files. The names below leave out the code of the language being generated (ga for Irish, gd for Scottish Gaelic), which follows the final hyphen, as in apertium-ga-gd.muta-gd.dix.

  • apertium-ga-gd.pre-.t1x — Transfer rules which add tags defining the mutation to the beginning of words which should be mutated.
  • apertium-ga-gd.muta-.dix — A post-generation dictionary which takes the tag and the initial letter of the word and outputs the mutated form.

For example, when translating the phrase "do theach" (your house) from Irish to Scottish Gaelic, the result will be "do thaigh", where the "h" inserted after the initial "t" marks the lenition. The output of apertium-transfer will be:

  ^do<det><pos><p2><mf><sg>$ ^taigh<n><m><sg><nom>$ 

This is then passed through the rules in apertium-ga-gd.pre-gd.t1x, which insert a lexical unit holding the tag <l1>, for lenition, directly before the noun.

  ^do<det><pos><p2><mf><sg>$ ^<l1>$^taigh<n><m><sg><nom>$ 

The morphological generator then outputs the surface forms of the words, and prefixes the mutation tag with an "alarm" symbol (~).

  do ~<l1>taigh

Finally, the mutation dictionary, apertium-ga-gd.muta-gd.dix, replaces the string ~<l1>t with 'th', which is the lenited form of 't'.

  do thaigh

apertium-ga-gd.pre-.t1x

As mentioned above, the input to this stage is:

  ^do<det><pos><p2><mf><sg>$ ^taigh<n><m><sg><nom>$ 

A rule in this file might look something like:

<rule>
  <pattern>
    <pattern-item n="det"/>
    <pattern-item n="nom"/>
  </pattern>
  <action>
    <choose>
      <when> <!-- When the lemma of the determiner is "do",
                  apply lenition to the following noun -->
        <test>
          <equal>
            <clip pos="1" side="tl" part="lem"/>
            <lit v="do"/>
          </equal>
        </test>
        <out>
          <lu><clip pos="1" side="tl" part="whole"/></lu>
          <b/>
          <lu><lit-tag v="l1"/></lu>    <!-- Lenition -->
          <lu><clip pos="2" side="tl" part="whole"/></lu>
        </out>
      </when>
    </choose>
  </action>
</rule>

The output will be:

  ^do<det><pos><p2><mf><sg>$ ^<l1>$^taigh<n><m><sg><nom>$ 
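
The pattern items "det" and "nom" used in the rule have to be declared as categories in the same transfer file. A minimal sketch of such declarations follows. The tag patterns are assumptions inferred from the example lexical units, not copied from apertium-ga-gd, and the real definitions may well be narrower.

```xml
<!-- Sketch only: the tag patterns below are assumptions based on the
     example lexical units, not taken from apertium-ga-gd itself. -->
<section-def-cats>
  <def-cat n="det">
    <cat-item tags="det.*"/>   <!-- e.g. do<det><pos><p2><mf><sg> -->
  </def-cat>
  <def-cat n="nom">
    <cat-item tags="n.*"/>     <!-- e.g. taigh<n><m><sg><nom> -->
  </def-cat>
</section-def-cats>
```

The "lem" and "whole" parts clipped in the rule are predefined, so they need no attribute definitions of their own.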

apertium-ga-gd.muta-.dix

The input to this stage is:

  do ~<l1>taigh

The "rule", or rather entry, in the mutation dictionary will look like:

    <e>
      <p>
        <l><a/><s n="l1"/>t</l>
        <r>th</r>
      </p>
      <par n="alphabet"/> 
    </e>

Here, "alphabet" is a paradigm that, for any input letter, outputs that letter unchanged. So the entry says, in effect:

"When the alarm symbol '~' is followed by a tag indicating lenition, then a 't', then any alphabetic character, output 'th' followed by that character."

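The "alphabet" paradigm and further lenition entries could be sketched as below. This is an illustrative fragment, not the contents of the real dictionary: the paradigm is shown only in part, and the entries for 'b', 'c' and 'm' are assumed to follow the same pattern as the entry for 't'. The fragment also assumes that the symbol l1 is declared elsewhere in the dictionary.

```xml
<!-- Sketch only: a partial paradigm and illustrative entries,
     not copied from apertium-ga-gd. -->
<pardef n="alphabet">
  <!-- <i> is an identity pair: the letter is read and written unchanged -->
  <e><i>a</i></e>
  <e><i>b</i></e>
  <e><i>c</i></e>
  <!-- one such entry for each remaining letter of the alphabet -->
</pardef>

<!-- Lenition entries built like the one for 't': alarm symbol, l1 tag
     and plain consonant on the left, lenited consonant on the right -->
<e><p><l><a/><s n="l1"/>b</l><r>bh</r></p><par n="alphabet"/></e>
<e><p><l><a/><s n="l1"/>c</l><r>ch</r></p><par n="alphabet"/></e>
<e><p><l><a/><s n="l1"/>m</l><r>mh</r></p><par n="alphabet"/></e>
```

The pardef would sit in the pardefs part of the dictionary and the entries in its main section, alongside the entry for 't'.
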
See also

External links