Talk:Compounds

Revision as of 08:34, 9 January 2009 by Francis Tyers
        


        >     gīrvāṇabhāṣārasāsvādatatparān
        >     "those who take delight in the language of the gods."
         
This is a compound with the constituents: gīrvāṇa-bhāṣā-rasa-āsvāda-tatpara + case{2nd} + number{plural}
It can be either neuter or masculine. Its default paradigm will be 'rAma' if masculine and 'puRpa' if neuter.



Ok, so, this seems to me a bit like what happens in many Germanic
languages with word compounding (although probably more productive). An
example:

   widerstandkämpferinnen

   widerstand-kämpfer-in-en

  `resistance fighter + gender{feminine} + number{plural}'

Here we could split the word into two:

   widerstand kämpferinnen

And then analyse these separately. 

So, how do we split in two? Well, we think that possibly it might be
better to have a separate "pre-analysis splitting" stage that would work
using the morphological analyser. 

For example, if "widerstandkämpferinnen" is a word not in the
dictionary, but the two other words are, we can use a splitter like
this:

 <dictionary>
   <alphabet>abcdefghijklmnopqrstuvwxyzäöüß</alphabet>
   <section id="splitter" type="inconditional">
     <e><p><l>widerstand</l><r>widerstand</r></p></e>
     <e><p><l>kämpfer</l><r>kämpfer</r></p></e>
     <e><p><l>kämpferin</l><r>kämpferin</r></p></e>
     <e><p><l>kämpferinnen</l><r>kämpferinnen</r></p></e>
   </section> 
 </dictionary>

Of course here we can also use paradigms, but the important thing is
that no analysis is performed.

 $ echo "widerstandkämpferinnen" | lt-proc splitter.bin  
 ^widerstand/widerstand$^kämpferinnen/kämpferinnen$
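
To make the idea concrete, here is a minimal sketch in Python of what such a splitter would compute. It is not how lt-proc works internally; the function names `split_compound` and `to_stream` are made up for illustration, and the word list is just the four entries from the dictionary above.

```python
from typing import List, Optional

# Surface forms copied from the splitter dictionary above.
KNOWN_FORMS = {"widerstand", "kämpfer", "kämpferin", "kämpferinnen"}


def split_compound(word: str, forms=KNOWN_FORMS) -> Optional[List[str]]:
    """Return a list of known forms that concatenate to `word`, or None.

    Longer prefixes are tried first; backtracking recovers when a long
    prefix leads to a dead end.
    """
    if word == "":
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in forms:
            rest = split_compound(word[end:], forms)
            if rest is not None:
                return [head] + rest
    return None


def to_stream(parts: List[str]) -> str:
    """Format parts the way the example output above prints them."""
    return "".join(f"^{p}/{p}$" for p in parts)


print(to_stream(split_compound("widerstandkämpferinnen")))
# ^widerstand/widerstand$^kämpferinnen/kämpferinnen$
```

Note that no analysis happens here either: each part is simply paired with itself.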

You could also use regular expressions here to perform the splits.

    <e><re>[aeiou]+[dt]+[aeiou]</re></e>

 $ echo "adaada" | lt-proc splitter.bin 
 ^ada/ada$^ada/ada$
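
The same split can be reproduced with Python's `re` module as a rough sketch. This only approximates what the `<re>` entry would do, since regular expression handling in lttoolbox may differ from Python's; `regex_split` is a hypothetical helper name.

```python
import re

# The pattern from the <re> entry above.
CHUNK = re.compile(r"[aeiou]+[dt]+[aeiou]")


def regex_split(word: str):
    """Return the matched chunks if they cover the whole word, else None."""
    chunks = CHUNK.findall(word)
    if "".join(chunks) != word:
        return None  # some part of the word was not matched
    return chunks


print(regex_split("adaada"))  # ['ada', 'ada']
```

Rejecting words that are only partly covered keeps the splitter from silently dropping characters.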

This splits the word so that its constituent parts can then be analysed
separately. Of course, this only helps if the compound word is compositional.

Once we have this split, we can add a new mode to the analyser that
processes the split forms using the existing
analyser.
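
As a rough sketch of that second stage, the Python below reads the splitter output in the `^surface/analysis$` format used above and looks each part up in an analyser. The `TOY_ANALYSER` table and its tags are invented for illustration and stand in for the real analyser; escaped characters in the stream are not handled.

```python
import re

# One lexical unit in the stream format shown above: ^surface/analysis$
UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")

# Hypothetical stand-in for the existing analyser; these entries and tags are
# invented for illustration and are not taken from any real dictionary.
TOY_ANALYSER = {
    "widerstand": "widerstand<n>",
    "kämpferinnen": "kämpferin<n><f><pl>",
}


def reanalyse(split_stream: str) -> str:
    """Replace each split part's placeholder analysis with the analyser's."""
    out = []
    for surface, _placeholder in UNIT.findall(split_stream):
        # Unknown parts are marked with '*', as Apertium does for unknown words.
        analysis = TOY_ANALYSER.get(surface, "*" + surface)
        out.append(f"^{surface}/{analysis}$")
    return "".join(out)


print(reanalyse("^widerstand/widerstand$^kämpferinnen/kämpferinnen$"))
# ^widerstand/widerstand<n>$^kämpferinnen/kämpferin<n><f><pl>$
```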

Do you think we might be able to solve the problem this way? 

== ==

        
         - Would it be problematic to have words split slightly differently?
           e.g. ^bekharim/be<prefix><ind>+kharan<vblex><pres><p1><sg>$
           Where the "standard analysis" would be:
                ^bekharim/kharan<vblex><ind><pres><p1><sg>$
        
Here is an example of prefix analysis:

 AgacCa=A-gam1<prayogaH:karwari><lakAraH:lot><puruRaH:m><vacanam:1><paxI:parasmEpaxI><XAwuH:gamLz><gaNaH:BvAxiH>

where 'A' is the prefix, and 'gam1' is the verbal root.
So the analysis is close to what you have suggested.

However, since euphonic transformations (sandhi operations) are involved, this is not just concatenation of the prefix with the root.
For example, look at the following two words, one with a prefix and one without. The analyses are identical except for the prefix part,
and the sandhi 'A + a -> A' applies:

 agacCaw=gam1<prayogaH:karwari><lakAraH:laf><puruRaH:p><vacanam:1><paxI:parasmEpaxI><XAwuH:gamLz><gaNaH:BvAxiH>
 AgacCaw=A-gam1<prayogaH:karwari><lakAraH:laf><puruRaH:p><vacanam:1><paxI:parasmEpaxI><XAwuH:gamLz><gaNaH:BvAxiH>
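
A splitter would therefore have to undo the sandhi when it strips a prefix. Here is a minimal Python sketch of that step, assuming a rule table that contains only the one rule mentioned above; `SANDHI_RULES`, `PREFIXES` and `prefix_candidates` are hypothetical names, and a real table would be much larger.

```python
# Only the rule stated above (A + a -> A) is listed; a real table would be larger.
# Each entry maps the surface junction to (end of prefix, start of stem) pairs.
SANDHI_RULES = {"A": [("A", "a")]}

PREFIXES = ["A"]  # only the prefix from the example; the full list has 22


def prefix_candidates(word: str):
    """Yield (prefix, stem) pairs that could have produced `word`."""
    for prefix in PREFIXES:
        # Plain concatenation: prefix + stem.
        if word.startswith(prefix):
            yield prefix, word[len(prefix):]
        # Sandhi: the end of the prefix merged with the start of the stem.
        for junction, pairs in SANDHI_RULES.items():
            for left, right in pairs:
                if not prefix.endswith(left):
                    continue
                merged = prefix[: len(prefix) - len(left)] + junction
                if word.startswith(merged):
                    yield prefix, right + word[len(merged):]


print(list(prefix_candidates("AgacCaw")))
# [('A', 'gacCaw'), ('A', 'agacCaw')]
```

Both candidates are produced; only the second stem, 'agacCaw', is a form the analyser knows, so the existing analyser can pick the right one.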

Further, a verb can take more than one prefix. We have examples with 5 prefixes, and there are 22 possible prefixes in total, so in principle there are up to 22^5 (about 5.15 million) combinations of five prefixes.
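
For reference, the arithmetic can be checked with a few lines of Python. This assumes prefixes may repeat and that order matters, which is the upper bound; the number of combinations that actually occur is presumably far smaller.

```python
PREFIX_COUNT = 22   # number of prefixes stated above
MAX_STACK = 5       # longest prefix sequence stated above

# Sequences of exactly five prefixes, repetition allowed.
exactly_five = PREFIX_COUNT ** MAX_STACK

# Sequences of one to five prefixes, repetition allowed.
up_to_five = sum(PREFIX_COUNT ** k for k in range(1, MAX_STACK + 1))

print(exactly_five)  # 5153632
print(up_to_five)    # 5399042
```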

== ==

I'll give a brief note on Sanskrit morphology here.
I assume you have the diagram before you for reference.

a) Sanskrit has two types of inflectional suffixes: nominal and verbal.
For nouns, a simple paradigm model is sufficient to handle this morphology. A paradigm model should also be possible for verbs. I have not implemented it that way, but I do not see any difficulty in doing so.

b) The second type of morphology involves prefixes attached to verbs.
Here the main problem is that there are euphonic changes when one or more prefixes are added.

So for example,
gam -> agacCaw
A-gam -> AgacCaw

c) The third is derivational morphology.
It is of the following types:
i) verbal roots -> nominal roots (known as kridanta)
ii) verbal roots -> verbal roots (known as sanaadi)
iii) nominal roots -> verbal root (also known as sanaadi)
iv) nominal roots -> nominal roots (known as taddhita)
v) compounds

Compounds in turn are of 6 different types, of which two are very productive. They are
a) noun+noun
b) noun+verb

The example which you have quoted from Wikipedia is a noun+noun compound.
Here we have complexity at two levels.
All the words before the final noun are in a special form.
For example, the very famous compound 'rAjapuruReNa' (I am using WX notation)
has 'rAja-' as its first part, which cannot appear in a text as an isolated form. This is a pre-compound form.
Further, the second word undergoes inflection as usual.
As of now I have handled it by writing a separate pardef for pre-compound forms.
It is invoked only when another word follows the pre-compound form.
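
The constraint can be sketched in Python as follows: every non-final part must be a pre-compound form and the final part must be an ordinary inflected form. The two toy lexicons contain only the forms from the example above, `split_noun_compound` is a hypothetical name, and sandhi at the junction is ignored. This is an illustration of the idea, not the pardef-based implementation.

```python
# Hypothetical toy lexicons in WX notation; only rAja- and puruReNa come from
# the example above.
PRE_COMPOUND_FORMS = {"rAja"}      # forms allowed only before another word
INFLECTED_FORMS = {"puruReNa"}     # ordinary inflected forms


def split_noun_compound(word: str):
    """Return [pre-compound parts..., inflected final part], or None."""
    if word in INFLECTED_FORMS:
        return [word]
    for end in range(1, len(word)):
        head = word[:end]
        if head in PRE_COMPOUND_FORMS:
            rest = split_noun_compound(word[end:])
            if rest is not None:
                return [head] + rest
    return None


print(split_noun_compound("rAjapuruReNa"))  # ['rAja', 'puruReNa']
print(split_noun_compound("rAja"))          # None
```

The second call returns None because 'rAja' on its own is not followed by another word, which matches the restriction described above.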

There is another severe problem. It has to do with the compounds which behave as an adjective.

A typical example is 'ambara', which is neuter.
The compound 'pIwa-ambara' means a yellow cloth.
But if this word refers to a lady wearing a yellow cloth, its form changes to 'pIwAmbarA', and it inflects like an 'A'-ending feminine paradigm.
However, we do not find these words in a normal dictionary.
We have developed a special module to handle these cases. Care must be taken here to avoid over-recognition and over-generation.

Coming back to the 4 types of derived roots,  it is possible that new roots get generated by applying the above rules recursively, and finally the derivation ends once an inflectional suffix is added.