Talk:Compounds

         - Would it be problematic to have words split slightly differently?
           e.g. ^bekharim/be<prefix><ind>+kharan<vblex><pres><p1><sg>$
           Where the "standard analysis" would be:
                ^bekharim/kharan<vblex><ind><pres><p1><sg>$
        
Here is an example of prefix analysis:

 AgacCa=A-gam1<prayogaH:karwari><lakAraH:lot><puruRaH:m><vacanam:1><paxI:parasmEpaxI><XAwuH:gamLz><gaNaH:BvAxiH>

where 'A' is the prefix, and 'gam1' is the verbal root.
So the analysis is close to what you have suggested.

However, since 'euphonic transformations' (sandhi operations) are involved, it is not just a concatenation of the prefix with the root.
For example, look at the following two words: one has a prefix and the other does not. The analyses are exactly the same except for the prefix part,
and the sandhi 'A + a -> A' applies at the boundary.

 agacCaw=gam1<prayogaH:karwari><lakAraH:laf><puruRaH:p><vacanam:1><paxI:parasmEpaxI><XAwuH:gamLz><gaNaH:BvAxiH>
 AgacCaw=A-gam1<prayogaH:karwari><lakAraH:laf><puruRaH:p><vacanam:1><paxI:parasmEpaxI><XAwuH:gamLz><gaNaH:BvAxiH>
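
The prefix attachment in this pair can be sketched in a few lines of Python. This is only an illustration: the table holds the single rule quoted here ('A + a -> A'), and the names SANDHI_RULES and attach_prefix are made up for the example rather than taken from any existing analyser.

```python
# Toy sandhi table in WX notation:
# (last char of prefix, first char of form) -> merged char.
# Only the rule quoted in this discussion is listed (an assumption for the
# sketch); a real table would be much larger.
SANDHI_RULES = {("A", "a"): "A"}


def attach_prefix(prefix, form):
    """Join a prefix to a verb form, merging the boundary characters if a rule applies."""
    key = (prefix[-1], form[0])
    if key in SANDHI_RULES:
        return prefix[:-1] + SANDHI_RULES[key] + form[1:]
    # No rule for this boundary: fall back to plain concatenation.
    return prefix + form


print(attach_prefix("A", "agacCaw"))  # AgacCaw
```

An analyser has to run this in the other direction (undo the sandhi to recover 'A' and 'agacCaw'), which is why plain concatenation in the dictionary is not enough.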

Further, a verb can take more than one prefix. We have examples with 5 prefixes, and there are 22 possible prefixes in total. Thus there are at most 22^5 (about 5.15 million) possible sequences of five prefixes.
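
For reference, the arithmetic behind that bound, as a small Python sketch. It assumes that prefixes may repeat and that order matters, which is the most generous reading; the real number of attested combinations is presumably far smaller.

```python
NUM_PREFIXES = 22   # total number of prefixes, as stated above
MAX_CHAIN = 5       # longest chain of prefixes seen in the examples

# Ordered sequences of exactly five prefixes (repetition allowed).
exactly_five = NUM_PREFIXES ** MAX_CHAIN

# Ordered sequences of one to five prefixes.
up_to_five = sum(NUM_PREFIXES ** k for k in range(1, MAX_CHAIN + 1))

print(exactly_five)  # 5153632
print(up_to_five)    # 5399042
```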

== ==

Examples:

      gīrvāṇabhāṣārasāsvādatatparān
      "those who take delight in the language of the gods."
         
compound with the constituents: gīrvāṇa-bhāṣā-rasa-āsvāda-tatpara + case{2nd} + number{plural}
It can be either neuter or masculine. Its default paradigm will be 'rAma' if masculine and 'puRpa' if neuter.

       widerstandkämpferinnen

       widerstand-kämpfer-in-en
 
      `resistance fighter + gender{feminine} + number{plural}'

Here we could split the word into two:

   widerstand kämpferinnen

And then analyse these separately. 

So, how do we split in two? We think it might be better to have a
separate "pre-analysis splitting" stage that would work using the
morphological analyser.

For example, if "widerstandkämpferinnen" is a word not in the
dictionary, but the two other words are, we can use a splitter like
this:

 <dictionary>
   <alphabet>abcdefghijklmnopqrstuvwxyzäöüß</alphabet>
   <section id="splitter" type="inconditional">
     <e><p><l>widerstand</l><r>widerstand</r></p></e>
     <e><p><l>kämpfer</l><r>kämpfer</r></p></e>
     <e><p><l>kämpferin</l><r>kämpferin</r></p></e>
     <e><p><l>kämpferinnen</l><r>kämpferinnen</r></p></e>
   </section> 
 </dictionary>

Of course here we can also use paradigms, but the important thing is
that no analysis is performed.

 $ echo "widerstandkämpferinnen" | lt-proc splitter.bin  
 ^widerstand/widerstand$^kämpferinnen/kämpferinnen$
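
The behaviour of such a splitter can be approximated in Python. This is a sketch under assumptions: it always takes the longest entry that matches at the current position and never backtracks, and the names split_compound and ENTRIES are invented for the example. It does not reproduce how lt-proc works internally.

```python
def split_compound(word, entries):
    """Split word left to right, always taking the longest entry that matches.

    Returns the list of parts, or None if some position has no matching entry.
    """
    parts = []
    pos = 0
    while pos < len(word):
        # Non-empty entries that match at the current position.
        candidates = [e for e in entries if e and word.startswith(e, pos)]
        if not candidates:
            return None
        longest = max(candidates, key=len)
        parts.append(longest)
        pos += len(longest)
    return parts


# The same four entries as in the splitter dictionary above.
ENTRIES = ["widerstand", "kämpfer", "kämpferin", "kämpferinnen"]
parts = split_compound("widerstandkämpferinnen", ENTRIES)
print("".join("^%s/%s$" % (p, p) for p in parts))
```

Because the longest entry wins, "kämpferinnen" is preferred over "kämpfer" and "kämpferin", which matches the output shown above.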

You could also use regular expressions here to perform the splits.

    <e><re>[aeiou]+[dt]+[aeiou]</re></e>

 $ echo "adaada" | lt-proc splitter.bin 
 ^ada/ada$^ada/ada$

This splits the word so that it can then be analysed as its constituent
parts. Of course, this only helps if the compound word is compositional.
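
The regular-expression split can be tried out in Python with the same pattern. Note that Python's re module is only a stand-in here; it is an assumption that the pattern behaves the same way inside a dictionary <re> entry, and regex_split is a name made up for the example.

```python
import re

# Same pattern as in the <re> entry above.
PATTERN = re.compile(r"[aeiou]+[dt]+[aeiou]")


def regex_split(word):
    """Return the consecutive pattern matches if they cover the whole word, else None."""
    parts = PATTERN.findall(word)
    # If anything was skipped between matches, the joined parts are shorter than the word.
    return parts if "".join(parts) == word else None


print(regex_split("adaada"))  # ['ada', 'ada']
```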

Once we have this split, we can add a new mode to the analyser that
processes the split forms using the existing analyser.
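
One way to hand the split forms on to the existing analyser is to pull the surface forms back out of the splitter output and pass them on as separate words. The Python below is a rough sketch of that glue step only: it ignores escaped characters in the stream format, and surface_forms is a hypothetical helper, not an existing Apertium tool.

```python
import re

# One lexical unit in the stream format shown above: ^surface/analysis$
UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def surface_forms(stream):
    """Extract the surface form of every ^surface/analysis$ unit, in order."""
    return [surface for surface, _analysis in UNIT.findall(stream)]


split_output = "^widerstand/widerstand$^kämpferinnen/kämpferinnen$"
# Re-join with spaces so that an ordinary analyser run sees two separate words.
print(" ".join(surface_forms(split_output)))  # widerstand kämpferinnen
```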

== ==

I'll give a brief note on Sanskrit morphology here.

I assume you have the diagram before you for reference.

a) Sanskrit has two types of suffixes at the inflectional level: nominal and verbal.
   To handle this simple morphology, a paradigm model is sufficient for nouns. For verbs also, it 
   should be possible to use a paradigm model, but I have not implemented it that way. I do not 
   see any difficulty in implementing it with the paradigm approach.

b) The second type of morphology involves the prefixes attached to verbs.
   Here the main problem is that there are some euphonic changes when one or more prefixes are added.

   So for example,
    gam -> agacCaw
    A-gam -> AgacCaw

c) The third is derivational morphology. It is of the following types:

    i) verbal roots -> nominal roots (known as kridanta)
    ii) verbal roots -> verbal roots (known as sanaadi)
    iii) nominal roots -> verbal root (also known as sanaadi)
    iv) nominal roots -> nominal roots (known as taddhita)
    v) compounds

Compounds, in turn, are of 6 different types, of which two are very productive. They are:

    a) noun+noun
    b) noun+verb

The example which you have quoted from Wikipedia is of the noun-noun compound type.
Here we have complexity at two levels.
All the words before the final noun are in a special form.
For example, the very famous compound 'rAjapuruReNa' (I am using WX notation)
has -- rAja--  as its first part, which cannot appear in a text as an isolated form. This is a pre-compound form.
Further, the second word undergoes inflection as usual.
As of now I have handled it by writing a separate pardef for the pre-compound forms.
This will be invoked only if it encounters another word following it.
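
The idea that a pre-compound form is only accepted when another word follows can be sketched in Python. Everything here is toy data and an assumption for illustration: the two sets stand in for the real pardefs (which are not shown in this discussion), no sandhi is applied at the boundary, and split_noun_compound is an invented name.

```python
# Hypothetical toy data in WX notation; the real pardefs are not shown in this discussion.
PRE_COMPOUND_FORMS = {"rAja"}      # forms that may only occur as a non-final member
INFLECTED_FORMS = {"puruReNa"}     # ordinary inflected words known to the analyser


def split_noun_compound(word):
    """Return (pre-compound form, inflected word) or None.

    A pre-compound form is accepted only when the rest of the word is itself
    a known inflected form, so 'rAja' on its own is never recognised.
    """
    for i in range(1, len(word)):
        head, rest = word[:i], word[i:]
        if head in PRE_COMPOUND_FORMS and rest in INFLECTED_FORMS:
            return head, rest
    return None


print(split_noun_compound("rAjapuruReNa"))  # ('rAja', 'puruReNa')
print(split_noun_compound("rAja"))          # None
```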

There is another severe problem. It has to do with compounds which behave as adjectives.

A typical example is 'ambara', which is neuter.
The compound 'pIwa-ambara' means a yellow cloth.
But if this word refers to a lady wearing a yellow cloth, then its form changes to 'pIwambarA', and it behaves 
like an 'A'-ending feminine paradigm.
However, we do not find these words in a normal dictionary.
We have developed a special module to handle these cases. Here care should be taken to avoid over-recognition / over-generation.

Coming back to the 4 types of derived roots, new roots can be generated by applying the above rules 
recursively, and the derivation ends once an inflectional suffix is added.

Re-compounding words isn't a special problem, per se: it can be easily done using <mlu>; knowing when to do it isn't any more significant than the same problem in other languages: noun+noun.pl in English is treated as noun.pl de noun in Spanish, without heed to whether it should actually be noun.pl de noun.pl, or if another preposition should be used, etc. -- Jimregan 00:12, 18 February 2009 (UTC)

It _can_ be done with mlu's, but it is probably nicer to do it outside of transfer... - Francis Tyers 06:51, 18 February 2009 (UTC)
can it be done with mlu's even if the compound does not exist as one word in the target dix? (echo ^prosjekt<n><nt><sg><ind>+plan<n><m><sg><ind>$ | lt-proc -g nb-nn.autogen.bin => #prosjekt, want prosjektplan) --unhammer 08:28, 15 September 2009 (UTC)
ah, you just don't add a space in the first place, no need for <mlu> --unhammer 18:14, 15 September 2009 (UTC)