Difference between revisions of "Talk:Compounds"

Latest revision as of 18:14, 15 September 2009


         - Would it be problematic to have words split slightly differently?
           e.g. ^bekharim/be<prefix><ind>+kharan<vblex><pres><p1><sg>$
           Where the "standard analysis" would be:
                ^bekharim/kharan<vblex><ind><pres><p1><sg>$
        
Here is an example of prefix analysis:

 AgacCa=A-gam1<prayogaH:karwari><lakAraH:lot><puruRaH:m><vacanam:1><paxI:parasmEpaxI><XAwuH:gamLz><gaNaH:BvAxiH>

where 'A' is the prefix, and 'gam1' is the verbal root.
So the analysis is close to what you have suggested.

However, since 'euphonic transformations' (sandhi operations) are involved, it is not just concatenation of the prefix with the root.
For example, look at the following two words: one has a prefix and the other does not. The analysis is exactly the same except for the prefix part,
and there is a sandhi 'A + a -> A'.

agacCaw=gam1<prayogaH:karwari><lakAraH:laf><puruRaH:p><vacanam:1><paxI:parasmEpaxI><XAwuH:gamLz><gaNaH:BvAxiH>
AgacCaw=A-gam1<prayogaH:karwari><lakAraH:laf><puruRaH:p><vacanam:1><paxI:parasmEpaxI><XAwuH:gamLz><gaNaH:BvAxiH>
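
To make the point about sandhi concrete, here is a minimal sketch in Python of joining a prefix to an inflected form. It is not how the analyser described here is implemented. The table `SANDHI_RULES` and the function `attach_prefix` are hypothetical names, the table holds only the single rule quoted above, and the sketch assumes one WX character per sound at the boundary.

```python
# Hypothetical sandhi table holding only the rule quoted above: 'A + a -> A'.
# A real analyser would need the full set of sandhi rules.
SANDHI_RULES = {("A", "a"): "A"}


def attach_prefix(prefix: str, form: str) -> str:
    """Join a prefix to an inflected form, applying sandhi at the boundary."""
    if prefix and form:
        merged = SANDHI_RULES.get((prefix[-1], form[0]))
        if merged is not None:
            # Replace the two boundary characters with the merged sound.
            return prefix[:-1] + merged + form[1:]
    # No rule applies, so fall back to plain concatenation.
    return prefix + form


print(attach_prefix("A", "agacCaw"))  # AgacCaw
print(attach_prefix("A", "gacCa"))    # AgacCa
```

An analyser has to run this in reverse, which is why simply stripping the prefix string is not enough: removing 'A' from 'AgacCaw' gives 'gacCaw', not 'agacCaw'.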

Further, a verb can take more than one prefix. We have examples with 5 prefixes. There are 22 possible prefixes in total, so there are at most 22^5 possible sequences of five prefixes.
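
As a rough check on that figure, the arithmetic can be spelled out in Python. This is only an upper bound: it assumes any prefix may appear in any position and may repeat, which real Sanskrit verbs certainly do not allow. The variable names are illustrative.

```python
PREFIX_COUNT = 22  # total number of prefixes, from the note above
MAX_CHAIN = 5      # longest prefix chain attested, from the note above

# Sequences of exactly five prefixes.
exactly_five = PREFIX_COUNT ** MAX_CHAIN

# Sequences of one to five prefixes, if shorter chains are counted too.
up_to_five = sum(PREFIX_COUNT ** k for k in range(1, MAX_CHAIN + 1))

print(exactly_five)  # 5153632
print(up_to_five)    # 5399042
```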

== ==

Examples:

      gīrvāṇabhāṣārasāsvādatatparān
      "those who take delight in the language of the gods."
         
compound with the constituents: gīrvāṇa-bhāṣā-rasa-āsvāda-tatpara + case{2nd} + number{plural}
It can be either neuter or masculine in gender. Its default paradigm will be 'rAma' if masculine and 'puRpa' if neuter.

       widerstandkämpferinnen

       widerstand-kämpfer-in-en
 
      `resistance fighter + gender{feminine} + number{plural}'

Here we could split the word into two:

   widerstand kämpferinnen

And then analyse these separately. 

So, how do we split in two? Well, we think that possibly it might be
better to have a separate "pre-analysis splitting" stage that would work
using the morphological analyser. 

For example, if "widerstandkämpferinnen" is a word not in the
dictionary, but the two other words are, we can use a splitter like
this:

 <dictionary>
   <alphabet>abcdefghijklmnopqrstuvwxyzäöüß</alphabet>
   <section id="splitter" type="inconditional">
     <e><p><l>widerstand</l><r>widerstand</r></p></e>
     <e><p><l>kämpfer</l><r>kämpfer</r></p></e>
     <e><p><l>kämpferin</l><r>kämpferin</r></p></e>
     <e><p><l>kämpferinnen</l><r>kämpferinnen</r></p></e>
   </section> 
 </dictionary>

Of course here we can also use paradigms, but the important thing is
that no analysis is performed.

 $ echo "widerstandkämpferinnen" | lt-proc splitter.bin  
 ^widerstand/widerstand$^kämpferinnen/kämpferinnen$

You could also use regular expressions here to perform the splits.

    <e><re>[aeiou]+[dt]+[aeiou]</re></e>

 $ echo "adaada" | lt-proc splitter.bin
 ^ada/ada$^ada/ada$

This will split the word, which can then be analysed as its constituent
parts. Of course, this only helps if the compound word is compositional.

Once we have this split, we can add a new mode to the analyser that
processes the split forms using the existing analyser.
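
The splitting stage can also be sketched outside lttoolbox. The Python below is an illustration of the idea, not the proposed lt-proc mode. The lexicon `KNOWN` and the function `split_compound` are hypothetical, and the lexicon holds only the four surface forms from the splitter dictionary above.

```python
from functools import lru_cache

# Hypothetical lexicon: the four surface forms from the splitter dictionary.
KNOWN = frozenset({"widerstand", "kämpfer", "kämpferin", "kämpferinnen"})


def split_compound(word, lexicon=KNOWN):
    """Return every way to cut `word` into pieces that are all in `lexicon`."""

    @lru_cache(maxsize=None)
    def splits(start):
        # Reaching the end of the word means one complete segmentation.
        if start == len(word):
            return ((),)
        found = []
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in lexicon:
                for rest in splits(end):
                    found.append((piece,) + rest)
        return tuple(found)

    return [list(s) for s in splits(0)]


print(split_compound("widerstandkämpferinnen"))
# [['widerstand', 'kämpferinnen']]
```

As in the splitter, no morphological analysis happens here. Each piece would still be passed to the existing analyser afterwards. A word with several valid segmentations returns all of them, so something would have to rank or filter the candidates.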

== ==

I'll give a brief note on Sanskrit morphology here.

I assume you have the diagram before you for reference.

a) Sanskrit has two types of suffixes at the inflectional level: nominal and verbal.
   To handle this simple morphology, a paradigm model is sufficient for nouns. For verbs, too, it
   should be possible to have a paradigm model, but I have not implemented it that way. I do not
   see any difficulty in implementing it with the paradigm approach.

b) The second type of morphology involves prefixes attached to verbs.
   Here the main problem is that there are euphonic changes when one or more prefixes are added.

   So for example,
    gam -> agacCaw
    A-gam -> AgacCaw

c) The third is derivational morphology. It is of the following types:

    i) verbal roots -> nominal roots (known as kridanta)
    ii) verbal roots -> verbal roots (known as sanaadi)
    iii) nominal roots -> verbal roots (also known as sanaadi)
    iv) nominal roots -> nominal roots (known as taddhita)
    v) compounds

Compounds, again, are of 6 different types, of which two are very productive:

    a) noun+noun
    b) noun+verb

The example which you have quoted from Wikipedia is of the noun-noun compound type.
Here we have complexity at two levels.
All the words before the final noun are in a special form.
For example, the very famous compound 'rAjapuruReNa' (I am using WX notation)
has 'rAja-' as its first part, which cannot appear in a text as an isolated form. This is a pre-compound form.
Further, the second word undergoes inflection as usual.
As of now I have handled this by writing a separate pardef for pre-compound forms.
This will be invoked only if another word follows it.
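
The behaviour of the pre-compound pardef can be illustrated with a small Python sketch. This is not the actual pardef. The table `PRE_COMPOUND`, its single entry mapping 'rAja' to the stem 'rAjan', and the function `split_pre_compound` are assumptions made for the example.

```python
# Hypothetical table: pre-compound form -> dictionary stem (WX notation).
# The single entry is an assumption made for this example.
PRE_COMPOUND = {"rAja": "rAjan"}


def split_pre_compound(word):
    """Return (stem, remainder) pairs for pre-compound forms with material after them."""
    results = []
    for form, stem in PRE_COMPOUND.items():
        # Accept the pre-compound form only when something follows it,
        # since it cannot stand in a text as an isolated word.
        if word.startswith(form) and len(word) > len(form):
            results.append((stem, word[len(form):]))
    return results


print(split_pre_compound("rAjapuruReNa"))  # [('rAjan', 'puruReNa')]
print(split_pre_compound("rAja"))          # []
```

The remainder ('puruReNa' here) would then go through the ordinary nominal paradigms, because the last member of the compound inflects as usual.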

There is another severe problem. It has to do with compounds which behave as adjectives.

A typical example is 'ambara', which is neuter in gender.
The compound 'pIwa-ambara' means a yellow cloth.
But if this word refers to a lady wearing a yellow cloth, then its form changes to 'pIwAmbarA', and it behaves
like an 'A'-ending feminine paradigm.
However, we do not get these words in the normal dictionary.
We have developed a special module to handle these cases. Here care should be taken to avoid over-recognition / over-generation.

Coming back to the 4 types of derived roots, new roots can be generated by applying the above rules
recursively; the derivation ends once an inflectional suffix is added.

Re-compounding words isn't a special problem, per se: it can easily be done using <mlu>; knowing when to do it isn't any more significant than the same problem in other languages: noun+noun.pl in English is treated as noun.pl de noun in Spanish, without heed to whether it should actually be noun.pl de noun.pl, or whether another preposition should be used, etc. -- Jimregan 00:12, 18 February 2009 (UTC)

: It _can_ be done with mlu's, but it is probably nicer to do it outside of transfer... - Francis Tyers 06:51, 18 February 2009 (UTC)
:: Can it be done with mlu's even if the compound does not exist as one word in the target dix? (echo ^prosjekt<n><nt><sg><ind>+plan<n><m><sg><ind>$ | lt-proc -g nb-nn.autogen.bin => #prosjekt, want prosjektplan) --unhammer 08:28, 15 September 2009 (UTC)
::: Ah, you just don't add a space in the first place; no need for <mlu>. --unhammer 18:14, 15 September 2009 (UTC)