Difference between revisions of "Scottish Gaelic and Irish"
(→Todo) |
m (Irish doesn't have synthetic adjectives) |
||
(10 intermediate revisions by one other user not shown) | |||
Line 3: | Line 3: | ||
==Todo==

===Irish dictionary===

* Add all closed categories to the monolingual dictionaries.
* '''Perform an intersection on the monolingual dictionaries.'''
** We only want stuff in the Irish analyser that we can translate into Scottish Gaelic -- so, for a word to be included, it should be in the Irish monolingual dictionary, in the bilingual dictionary, and its translation should be in the Scottish Gaelic monolingual dictionary. Words for which we don't have translations can just be commented out for now.

# <s>put the paradigms in 1 entry-per-line format</s>
# noun paradigms
## some have only one entry -- these are defective? -- e.g. <code>bá__n_m</code>
## some have three entries -- defective also? -- e.g. <code>band/ia__n_m</code>
# verb paradigms
## sort the entries so that the order makes sense
## <s>is there an imperative p1.sg ???</s>
# adjective paradigms
## some paradigms have more entries than others, e.g. <code>ca/s__adj</code> has 3, and <code>bré/an__adj</code> has 4
# are some proper nouns marked with common noun paradigms instead of proper noun paradigms?
## find out with <code>cat apertium-ga-gd.ga.dix | grep '<e lm="[A-Z]'</code>
# sort the entries in the <section id="main"> by a) part-of-speech, b) alphabetical order
# I think we're missing possessives and demonstratives, quantifiers and perhaps some definite/indefinite pronouns

==Old todo==

* '''Perform an intersection on the monolingual dictionaries. (Making them consistent)'''
** We only want stuff in the Irish analyser that we can translate into Scottish Gaelic -- so, for a word to be included, it should be in the Irish monolingual dictionary, in the bilingual dictionary, and its translation should be in the Scottish Gaelic monolingual dictionary. Words for which we don't have translations can just be commented out -- or moved to a separate file in <code>dev/</code>
* Add all missing closed categories to the monolingual dictionaries.
* Do some fixing of the bilingual dictionary
** Some restrictions probably need adding.
** Some conjunctions are marked "cnj" and not subdivided for "cnjcoo", "cnjsub" etc.
* Making constraint grammar rules more CG-like
* Write rules to do initial mutations for generation.
* Write some transfer rules.
** For example to do tenses, number agreement, etc.
::-- We can probably take most of this stuff from another language pair and add the consonant etc. stuff later; for the most part, adjective chunks etc. should be the same as those in at least one other pair (I'll scout around for which) -- [[User:Jimregan|Jimregan]]

==Tagger==

==Initial mutations==

As members of the Celtic language group, both Scottish Gaelic and Irish exhibit initial consonant mutation. This section briefly describes how the <code>apertium-ga-gd</code> package handles the analysis, disambiguation and generation of this phenomenon.

===Analysis===

Analysis is handled by word-initial paradigms which map each mutated surface form back to the non-mutated initial. For example, the initial consonant 'b' can be lenited as 'bh' or eclipsed as 'mb', which gives the following initial mutation paradigm:

<pre>
<pardef n="initial-b">
  <e><p><l>b</l><r>b</r></p></e>
  <e><p><l>bh</l><r>b</r></p></e>
  <e><p><l>mb</l><r>b</r></p></e>
</pardef>
</pre>

This can then be applied to a word, e.g. "bulc", like:

<pre>
<e lm="bulc"><par n="initial-b"/><i>u</i><par n="bu/lc__n"/></e>
</pre>

The initial consonant 'b' comes from the <code>initial-b</code> paradigm, and the rest of the word takes the <code>bu/lc__n</code> paradigm. The problem with this method is that it can sometimes cause over-analysis, but this can be taken care of with disambiguation (see below).

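The effect of the <code>initial-b</code> paradigm can be imitated in a few lines of Python. This is only a sketch of the idea, not how the Apertium analyser is implemented: the names <code>INITIAL_B</code>, <code>BULC</code> and <code>analyse</code> are made up for illustration, and the inflectional paradigm <code>bu/lc__n</code> is ignored, so only the base form is covered.

```python
# Hypothetical copy of the initial-b paradigm: surface prefix -> lexical prefix.
INITIAL_B = {"b": "b", "bh": "b", "mb": "b"}

# Hypothetical entry for "bulc": (initial paradigm, rest of the stem, lemma).
BULC = (INITIAL_B, "ulc", "bulc")


def analyse(surface, entry):
    """Return the lemma if the surface form is the entry with any allowed initial."""
    paradigm, rest, lemma = entry
    for surface_prefix in paradigm:
        if surface == surface_prefix + rest:
            return lemma
    return None


for form in ("bulc", "bhulc", "mbulc", "pulc"):
    print(form, "->", analyse(form, BULC))
```

All three mutated and unmutated forms are accepted without looking at the context, which is why a later disambiguation step is needed.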
===Disambiguation===

Initial mutations can be disambiguated using [[constraint grammar]] (see the file <code>apertium-ga-gd.ga-gd.rlx</code>). The apertium-tagger is not useful for this purpose, as it cannot look at the surface forms of words, only at lexical units. A trivial illustrative example of how the constraint grammar can be used is presented below. Let's say we want to choose between a feminine possessive determiner and a masculine possessive determiner based on the type of mutation exhibited by the following noun, for example in the following two phrases:<ref>Note that this is not really relevant for Irish to Scottish Gaelic, as the surface forms of each are the same</ref>

* a pheann -- ''his pen''
* a haois -- ''her age''

Here the determiner, "a", can be either masculine or feminine (that is, "his" or "her"). Suppose we have the following input to the constraint grammar:

<pre>
^a/a<det><pos><p3><mf><pl>/a<det><pos><p3><m><sg>/a<det><pos><p3><f><sg>$ ^pheann/peann<n><m><sg><nom>/peann<n><m><pl><gen>$
^a/a<det><pos><p3><mf><pl>/a<det><pos><p3><m><sg>/a<det><pos><p3><f><sg>$ ^haois/aois<n><f><sg><nom>$
</pre>

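First we define the lists and sets we want to work with. The two sets combine <code>DetPos</code> with a gender tag using <code>+</code>, so that each matches only readings carrying all of those tags (with <code>|</code>, which is set union, they would match every possessive determiner):

<pre>
LIST DetPos = (det pos);                # possessive determiner
LIST hPro = ("<h.*>"r "[aeiou].*"r);    # h-prothesis
LIST Len = ("<ph.*>"r "p.*"r);          # lenition

SET DetPosF = DetPos + (f);             # feminine possessive determiner
SET DetPosM = DetPos + (m);             # masculine possessive determiner
</pre>
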
This should be fairly straightforward. Then we write rules that say "choose the feminine possessive when the noun that follows is subject to h-prothesis, and the masculine possessive when the noun that follows is subject to lenition":

<pre>
SELECT DetPosF IF (1 hPro);
SELECT DetPosM IF (1 Len);
</pre>

Applying this grammar gives the desired result:

<pre>
^a<det><pos><p3><m><sg>$ ^peann<n><m><sg><nom>$
^a<det><pos><p3><f><sg>$ ^aois<n><f><sg><nom>$
</pre>

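The logic of the two <code>SELECT</code> rules can be spelled out as a toy Python sketch. It is not how the constraint grammar engine works internally, and the function names (<code>parse_cohort</code>, <code>select_possessive</code>) are invented for this illustration. It assumes cohorts in the <code>^surface/analysis/analysis$</code> format shown above and only handles the 'p' to 'ph' lenition and h-prothesis cases from the example.

```python
import re


def parse_cohort(text):
    """Split '^surface/lemma<tag>.../lemma<tag>...$' into (surface, readings)."""
    surface, *analyses = text.strip("^$").split("/")
    readings = []
    for analysis in analyses:
        lemma = analysis.split("<", 1)[0]
        tags = re.findall(r"<([^>]+)>", analysis)
        readings.append((lemma, tags))
    return surface, readings


def select_possessive(det, noun):
    """Keep only the determiner readings compatible with the noun's mutation."""
    _, det_readings = det
    noun_surface, noun_readings = noun
    lemmas = [lemma for lemma, _ in noun_readings]
    if noun_surface.startswith("h") and any(re.match(r"[aeiou]", l) for l in lemmas):
        wanted = "f"  # h-prothesis -> feminine possessive
    elif noun_surface.startswith("ph") and any(l.startswith("p") for l in lemmas):
        wanted = "m"  # lenition -> masculine possessive
    else:
        return det_readings
    selected = [r for r in det_readings if {"det", "pos", wanted} <= set(r[1])]
    # Like SELECT, do nothing if no reading matches the target.
    return selected or det_readings


det = parse_cohort("^a/a<det><pos><p3><mf><pl>/a<det><pos><p3><m><sg>/a<det><pos><p3><f><sg>$")
his_pen = parse_cohort("^pheann/peann<n><m><sg><nom>/peann<n><m><pl><gen>$")
her_age = parse_cohort("^haois/aois<n><f><sg><nom>$")
print(select_possessive(det, his_pen))
print(select_possessive(det, her_age))
```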
===Generation===

;Overview

Generation of initial mutations takes place in two files, where <math>x</math> is the code of the language that is being generated (<code>ga</code> for Irish, <code>gd</code> for Scottish Gaelic).

* <code>apertium-ga-gd.pre-<math>x</math>.t1x</code> -- Transfer rules which add tags defining the mutation to the beginning of words which should be mutated.
* <code>apertium-ga-gd.muta-<math>x</math>.dix</code> -- A post-generation dictionary which takes the tag and the initial letter of the word and outputs the mutated form.

For example, when translating the phrase "do theach" (your house) from Irish to Scottish Gaelic, the result will be do <u>th</u>aigh (where the initial mutation is underlined). The output of <code>apertium-transfer</code> will be:

<pre>
^do<det><pos><p2><mf><sg>$ ^taigh<n><m><sg><nom>$
</pre>

This is then passed through <code>apertium-ga-gd.pre-gd.t1x</code>, which adds a tag, <code><l1></code>, for lenition:

<pre>
^do<det><pos><p2><mf><sg>$ ^<l1>$^taigh<n><m><sg><nom>$
</pre>

The morphological generator then outputs the surface forms of the words, and prefixes the mutation tag with the "alarm" symbol, '~':

<pre>
do ~<l1>taigh
</pre>

Finally, the mutation dictionary, <code>apertium-ga-gd.muta-gd.dix</code>, replaces the string <code>~<l1>t</code> with 'th', which is the lenited form of 't'.

<pre>
do thaigh
</pre>

;apertium-ga-gd.pre-<math>x</math>.t1x

As mentioned above, the input to this stage is:

<pre>
^do<det><pos><p2><mf><sg>$ ^taigh<n><m><sg><nom>$
</pre>

A simplified (although functioning) rule in this file might look something like:

<pre>
<rule>
  <pattern>
    <pattern-item n="det"/>
    <pattern-item n="nom"/>
  </pattern>
  <action>
    <choose>
      <when> <!-- When the lemma of the determiner is "do",
                  apply lenition to the following noun -->
        <test>
          <equal>
            <clip pos="1" side="tl" part="lem"/>
            <lit v="do"/>
          </equal>
        </test>
        <out>
          <lu><clip pos="1" side="tl" part="whole"/></lu>
          <b/>
          <lu><lit-tag v="l1"/></lu> <!-- Lenition -->
          <lu><clip pos="2" side="tl" part="whole"/></lu>
        </out>
      </when>
    </choose>
  </action>
</rule>
</pre>

And the output will be:

<pre>
^do<det><pos><p2><mf><sg>$ ^<l1>$^taigh<n><m><sg><nom>$
</pre>

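What this rule does to the stream can be mimicked with a short Python sketch. It is only meant to make the input/output relation concrete and is not a replacement for <code>apertium-transfer</code>. The function name <code>add_lenition_tag</code> is hypothetical, and the sketch assumes that determiners and nouns can be recognised by a leading <code><det></code> or <code><n></code> tag, which is a simplification of the pattern categories in the real file.

```python
import re

# One lexical unit in the Apertium stream format: ^lemma<tag><tag>...$
LU = re.compile(r"\^([^<$]*)((?:<[^>]+>)*)\$")


def add_lenition_tag(stream):
    """Insert ^<l1>$ before a noun that directly follows the determiner 'do'."""
    units = LU.findall(stream)  # list of (lemma, tags) pairs
    out = []
    for index, (lemma, tags) in enumerate(units):
        unit = "^" + lemma + tags + "$"
        previous = units[index - 1] if index > 0 else None
        follows_do = (
            previous is not None
            and previous[0] == "do"
            and previous[1].startswith("<det>")
        )
        if follows_do and tags.startswith("<n>"):
            unit = "^<l1>$" + unit  # mutation tag as its own lexical unit
        out.append(unit)
    return " ".join(out)


print(add_lenition_tag("^do<det><pos><p2><mf><sg>$ ^taigh<n><m><sg><nom>$"))
```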
;apertium-ga-gd.muta-<math>x</math>.dix

The input to this stage is:

<pre>
do ~<l1>taigh
</pre>

The "rule", or rather "entry", in the mutation dictionary will look like:

<pre>
<e>
  <p>
    <l><a/><s n="l1"/>t</l>
    <r>th</r>
  </p>
  <par n="alphabet"/>
</e>
</pre>

Here, <code>alphabet</code> is defined as a paradigm which, for any given input letter, just outputs that letter unchanged. So the entry basically says:

:"When we have the alarm symbol '~', followed by a tag indicating lenition, followed by a 't' and then any alphabetic character, output 'th' followed by that character"

The output of this stage is a correctly mutated phrase:

<pre>
do thaigh
</pre>

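The replacement performed by the mutation dictionary can be approximated with a regular expression in Python. This is a sketch under stated assumptions rather than the actual post-generator: the <code>LENITION</code> table and the function <code>apply_lenition</code> are made up for illustration, only lower-case initials are handled, and contextual exceptions (such as 's' before certain consonants) are ignored. Unlike the dictionary entry, it only looks at the initial letter and does not check the character after it.

```python
import re

# Lenited forms of some initial consonants (illustrative, not exhaustive).
LENITION = {
    "b": "bh", "c": "ch", "d": "dh", "f": "fh", "g": "gh",
    "m": "mh", "p": "ph", "s": "sh", "t": "th",
}

# The alarm symbol '~', the <l1> tag, then the initial letter of the word.
MUTATION = re.compile(r"~<l1>(\w)")


def apply_lenition(text):
    """Replace each '~<l1>X' with the lenited form of X, or X if it has none."""
    def replace(match):
        letter = match.group(1)
        return LENITION.get(letter, letter)
    return MUTATION.sub(replace, text)


print(apply_lenition("do ~<l1>taigh"))
```

Letters without a lenited form in the table are passed through unchanged, with the alarm symbol and tag removed.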
==Testing==
Line 205: | Line 51: | ||
* [http://www.smo.uhi.ac.uk/gaidhlig/ga-ge/faclair.html Faclair Gàidhlig-Gaeilge]
[[Category: |
[[Category:Scottish Gaelic and Irish|*]] |
||
[[Category:Scottish Gaelic and Irish]] |
Latest revision as of 09:59, 8 May 2011
Todo
Irish dictionary
- put the paradigms in 1 entry-per-line format
- noun paradigms
  - some have only one entry -- these are defective? -- e.g. bá__n_m
  - some have three entries -- defective also? -- e.g. band/ia__n_m
- verb paradigms
  - sort the entries so that the order makes sense
  - is there an imperative p1.sg ???
- adjective paradigms
  - some paradigms have more entries than others, e.g. ca/s__adj has 3, and bré/an__adj has 4
- are some proper nouns marked with common noun paradigms instead of proper noun paradigms?
  - find out with cat apertium-ga-gd.ga.dix | grep '<e lm="[A-Z]'
- sort the entries in the <section id="main"> by a) part-of-speech, b) alphabetical order
- I think we're missing possessives and demonstratives, quantifiers and perhaps some definite/indefinite pronouns
Old todo
- Perform an intersection on the monolingual dictionaries. (Making them consistent)
  - We only want stuff in the Irish analyser that we can translate into Scottish Gaelic -- so, for a word to be included, it should be in the Irish monolingual dictionary, in the bilingual dictionary, and its translation should be in the Scottish Gaelic monolingual dictionary. Words for which we don't have translations can just be commented out -- or moved to a separate file in dev/
- Add all missing closed categories to the monolingual dictionaries.
- Do some fixing of the bilingual dictionary
  - Some restrictions probably need adding.
  - Some conjunctions are marked "cnj" and not subdivided for "cnjcoo", "cnjsub" etc.
- Making constraint grammar rules more CG-like
- Write rules to do initial mutations for generation.
- Write some transfer rules.
  - For example to do tenses, number agreement, etc.
Testing
See also
Notes