Subreadings in Constraint Grammar

Latest revision as of 21:20, 10 December 2015

This is now implemented in vislcg3: http://beta.visl.sdu.dk/cg3/chunked/subreadings.html


Why we need sub-readings

Typical input with sub-readings:

^foobar/foo+bar/fubar/flue+barge$

By default, only the last sub-reading is used: in the above example, vislcg3 treats the cohort as if it were

^foobar/bar/fubar/barge$

This works well for compounds, where the material before the + is mostly inconsequential, but not for other multiword expressions. (Also, by default mapping tags are only put on the last sub-reading.)

Can't we just split on the + with pretransfer before sending this to cg-proc?
No, because we first have to disambiguate between the readings of e.g. ^foobar/foo+bar/fubar/flue+barge$, and it is not clear what such an ambiguous cohort would even look like if split.
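
The default "last sub-reading only" view described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how vislcg3 or cg-proc is implemented; the helper names parse_cohort and main_only are made up for this example, and the parser assumes that "/" never occurs inside a tag.

```python
import re

def parse_cohort(cohort):
    """Split '^surface/reading/...$' into (surface, readings).

    Each reading is returned as a list of sub-readings in written order.
    A '+' inside a tag such as <@+FMAINV> is not treated as a separator.
    """
    body = cohort.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a cohort: %r" % cohort)
    surface, *readings = body[1:-1].split("/")
    return surface, [re.split(r"\+(?![^<>]*>)", r) for r in readings]

def main_only(cohort):
    """Keep only the last sub-reading of every reading (default RTL view)."""
    surface, readings = parse_cohort(cohort)
    return "^" + "/".join([surface] + [subs[-1] for subs in readings]) + "$"

print(main_only("^foobar/foo+bar/fubar/flue+barge$"))
```

Note that the ambiguity between the readings is still there after this step, which is why the cohort cannot simply be split into separate units before disambiguation.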

What we need

  • We may need to refer to a non-main sub-reading in order to disambiguate
  • We may want to put a mapping tag on a non-main sub-reading
  • And of course we want to be able to refer to the main sub-reading

Referring to the final sub-reading

Northern Sámi postpositions take genitive.

Input fragment:

^soahtefámu/soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Acc>/soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Gen>$ 
^vuostá/vuostá<Po>/vuostá<Pr>/vuostá<N><Sg><Nom>$

Correct output:

^soahtefámu/soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Gen><@→P>$        # war.power.GEN
^vuostá/vuostá<Po><@←ADVL>$                                        # against.PO

If the input noun were unambiguously nominative, the Po reading should not be selected, so we might have a rule somewhere with

REMOVE Po IF (-1 (Nom)) ;

but if this matched non-final sub-readings, we would get the wrong tagging here. By default, non-final sub-readings are ignored, so the sme-nob CGs work fine (as do the nn-nb ones, for compounding there).
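
To make the difference concrete, here is a small Python sketch of the (-1 (Nom)) test applied to the noun cohort above, once looking only at final sub-readings and once at all sub-readings. The function names are hypothetical and the code only models the matching, not CG-3 itself.

```python
import re

def readings_of(cohort):
    """Return each reading as a list of tag sets, one set per sub-reading."""
    _surface, *readings = cohort.strip()[1:-1].split("/")
    return [
        [set(re.findall(r"<([^<>]+)>", sub))
         for sub in re.split(r"\+(?![^<>]*>)", reading)]
        for reading in readings
    ]

def has_tag(cohort, tag, main_only=True):
    """True if some reading carries `tag`, on its final sub-reading only
    (main_only=True, the default behaviour) or on any sub-reading."""
    for subs in readings_of(cohort):
        candidates = subs[-1:] if main_only else subs
        if any(tag in tags for tags in candidates):
            return True
    return False

noun = ("^soahtefámu/soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Acc>"
        "/soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Gen>$")

print(has_tag(noun, "Nom", main_only=True))   # final sub-readings are Acc/Gen
print(has_tag(noun, "Nom", main_only=False))  # the <Cmp> part is Nom
```

With main_only=True the Nom test fails, so the Po reading of vuostá survives; with main_only=False the Nom on soahti would wrongly trigger the REMOVE.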

Referring to non-final sub-readings

Input:

^D'an/Da<pr>+an<det><def><sp>$
^emgann/emgann<n><m><sg>$ 
^ez/e<vpart><obj>/ael<n><m><pl>/mont<vblex><pri><p2><sg>/monet<vblex><pri><p2><sg>/e<pr>+da<det><pos><mf><sp>$
^an/an<det><def><sp>/mont<vblex><pri><p1><sg>/monet<vblex><pri><p1><sg>$

Correct output:

^D'an/Da<pr><@ADVL→>+an<det><def><sp><@→N>$       # to.the
^emgann/emgann<n><m><sg><@P←>$                    # battle
^ez/e<vpart><obj><@Pcle>$                         # PART
^an/mont<vblex><pri><p1><sg><@+FMAINV>$           # I.go
  • We want to refer to the <pr> sub-reading when mapping emgann as @P← (possibly also in disambiguation).
  • We want to MAP an @ADVL→ tag on the <pr> sub-reading (also a @→N tag on the determiner). These sub-readings are split into two units by pretransfer.

VISL CG-3 syntax

VISL CG-3 keeps the default behaviour: rules refer only to the last sub-reading unless sub-readings are mentioned explicitly. For some languages, however, you might prefer the first sub-reading to be the main one by default. VISL CG-3 caters to both preferences. From the manual:

The order of which is the primary reading vs. sub-readings depends on the grammar SUBREADINGS setting:

     SUBREADINGS = RTL ; # Default, right-to-left
     SUBREADINGS = LTR ; # Alternate, left-to-right


Then, to refer to a non-final sub-reading in the default RTL mode, we could say

 ADD (@ADV←) TARGET (n) IF (-2/1 (pr)) (-2 (n)) ;

This requires that the cohort two positions to the left has a reading whose main (final) sub-reading is n, and a reading whose next-to-final sub-reading is pr. The context would match (counting from the third cohort) if the input were e.g.

 ^forsooth/for<pr>+sooth<n>/forsooth<adv>$ ^he/prpers<prn>$ ^be/be<vblex>$

Since we only have two sub-readings here, we could also ask that the deepest sub-reading (the one furthest from the main one) be pr, with the same effect:

 ADD (@ADV←) TARGET (n) IF (-2/-1 (pr)) (-2 (n)) ;


Parallel to regular CG word indexes, 0 is the "head". In RTL mode, this is the last sub-reading, and positive numbers count away from it, so 1 is one sub-reading to the left of the head. Negative numbers count from the opposite end, so -1 is the first sub-reading from the left. For three sub-readings, that gives us the following indexing:

   ^foo<tags>+bar<tags>+fie<tags>$
      2        1         0
     -1       -2        -3

For LTR mode, the left sub-reading is the head with index 0, and counts go the other way:

   ^foo<tags>+bar<tags>+fie<tags>$
      0        1         2
     -3       -2        -1
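
The two tables can be reproduced with a short Python function. This is a sketch of the indexing scheme as laid out above, with a made-up name (resolve_sub_index); returning None for an out-of-range index is a choice of this sketch, not a statement about what CG-3 does.

```python
def resolve_sub_index(subreadings, n, mode="RTL"):
    """Return the sub-reading that index n refers to, or None if out of range.

    `subreadings` is given in written (left-to-right) order.
    Index 0 is the head; positive n counts away from the head;
    negative n counts back from the sub-reading furthest from the head.
    """
    if mode not in ("RTL", "LTR"):
        raise ValueError("mode must be 'RTL' or 'LTR'")
    # Order the sub-readings from the head (depth 0) to the deepest one.
    by_depth = list(reversed(subreadings)) if mode == "RTL" else list(subreadings)
    if -len(by_depth) <= n < len(by_depth):
        return by_depth[n]
    return None

subs = ["foo", "bar", "fie"]
for mode in ("RTL", "LTR"):
    for n in (0, 1, 2, -1, -2, -3):
        print(mode, n, resolve_sub_index(subs, n, mode))
```

In RTL mode this maps 0 and -3 to fie, 1 and -2 to bar, and 2 and -1 to foo; in LTR mode the mapping is mirrored.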


To ADD the tag to the non-final sub-reading itself, use the SUB:N keyword after ADD:

 ADD SUB:-1 (@→V) TARGET (pr) IF (*1 (v)) ;


We might also want to say "require any main- or sub-reading to be tagged pr":

 ADD (@P←) TARGET (n) IF (-1/* (pr)) ;

or to require that every reading of the previous word has pr on one of its sub-readings:

 ADD (@P←) TARGET (n) IF (-1C/* (pr)) ;


You can now do

REMOVE SUB:* (pr) IF (1 (vblex));

to remove any reading which has a pr on some sub-reading if there's a following verb.

The old workaround was to do

REMOVE (pr) IF (1 (vblex));
REMOVE SUB:1 (pr) IF (1 (vblex));
REMOVE SUB:2 (pr) IF (1 (vblex));
REMOVE SUB:3 (pr) IF (1 (vblex));

etc., for as many sub-readings as your analyser allowed.
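
The effect of targeting any sub-reading, as opposed to only the main one, can be modelled in Python as follows. The sketch uses the ez cohort from the Breton example; remove_pr is a hypothetical helper, and it deliberately leaves out the context condition (1 (vblex)) and CG's rule that the last remaining reading is never removed.

```python
import re

def split_subreadings(reading):
    """Split a reading on '+', ignoring any '+' inside a <tag>."""
    return re.split(r"\+(?![^<>]*>)", reading)

def remove_pr(cohort, any_subreading):
    """Drop readings tagged <pr> on the main sub-reading only
    (any_subreading=False) or on any sub-reading (any_subreading=True)."""
    surface, *readings = cohort.strip()[1:-1].split("/")
    kept = []
    for reading in readings:
        subs = split_subreadings(reading)
        checked = subs if any_subreading else subs[-1:]
        if any("<pr>" in sub for sub in checked):
            continue  # this reading is removed
        kept.append(reading)
    return "^" + "/".join([surface] + kept) + "$"

ez = ("^ez/e<vpart><obj>/ael<n><m><pl>/mont<vblex><pri><p2><sg>"
      "/monet<vblex><pri><p2><sg>/e<pr>+da<det><pos><mf><sp>$")

print(remove_pr(ez, any_subreading=False))  # nothing removed: main is <det>
print(remove_pr(ez, any_subreading=True))   # e<pr>+da<det>... is removed
```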

Wishlist

A special set for "has-subreading"

The same way that you can do

LIST match-any = (*)

it would be nice to be able to do e.g.

LIST compound = (*/1)

to match on any (sub)reading that has at least one subreading.

Note: This is typically meant to be used on the main reading, so that you can have rules like

REMOVE compound IF …

or

MAP (@FOO) IF (-1 compound + Pl LINK …)

(note that the (*/1) is a "tag" of the main reading here, so the + Pl means that the main reading is plural, and there is at least one subreading below the main reading).

(a)+SUB:1(b) – requirements on both main and sub-readings at once

There is currently no way for SELECT/REMOVE to target tags in both the sub-reading and the reading. E.g. say you use the default mode where final readings are main, and your input is

^foobar/foo<vblex>+bar<n>/foo<n>+bar<n>/foo<vblex>+bar<vblex>/foo<n>+bar<vblex>$

Now, given that a verb-verb compound is less likely than an anything-noun compound, you want to remove any reading where both sub-readings are verbs. If you try

REMOVE SUB:1 (vblex) (0 (vblex)) ;

it will wrongly remove the verb-noun reading as well, since the (0 (vblex)) matches on another reading's sub-reading. If you try

REMOVE (vblex) (0/1 (vblex)) ;

it will wrongly remove the noun-verb as well, since the (0/1 (vblex)) matches on another reading's sub-reading.

Similarly, there is no way to MAP/ADD tags to the main reading of a reading that has a certain sub-reading. Using the above example, there is no way to map @v to only the verb-verb compound without either hitting the noun-verb or the verb-noun as well.


Possible syntax, similar to set intersection:

REMOVE (vblex) + SUB:1 (vblex);


This might make sense inside a context condition as well:

REMOVE (vblex) IF (0 (n) + 0/1 (n));

Or even as a variable, assuming people don't name their sets "SUB:1":

SET verb-verb-compound = (vblex) + SUB:1 (vblex);
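
The problem and the wished-for behaviour can be compared in a small Python model of the foobar cohort. It assumes, as described in this section, that the rule target is checked per reading while a context test is satisfied by any reading of the cohort; the variable names are invented for the example and nothing here is actual CG-3 code.

```python
import re

def parse(cohort):
    """Return the readings of a cohort, each as a list of sub-readings."""
    _surface, *readings = cohort.strip()[1:-1].split("/")
    return [re.split(r"\+(?![^<>]*>)", r) for r in readings]

def tagged(sub, tag):
    return "<%s>" % tag in sub

cohort = ("^foobar/foo<vblex>+bar<n>/foo<n>+bar<n>"
          "/foo<vblex>+bar<vblex>/foo<n>+bar<vblex>$")
readings = parse(cohort)  # default RTL: subs[-1] is main, subs[-2] is SUB:1

# Model of: REMOVE SUB:1 (vblex) (0 (vblex)) ;
# The context is true as soon as ANY reading has a vblex main reading.
ctx_main_vblex = any(tagged(s[-1], "vblex") for s in readings)
removed_a = ["+".join(s) for s in readings
             if len(s) > 1 and tagged(s[-2], "vblex") and ctx_main_vblex]

# Model of: REMOVE (vblex) (0/1 (vblex)) ;
ctx_sub_vblex = any(len(s) > 1 and tagged(s[-2], "vblex") for s in readings)
removed_b = ["+".join(s) for s in readings
             if tagged(s[-1], "vblex") and ctx_sub_vblex]

# Wished-for: both requirements must hold on the SAME reading.
removed_wish = ["+".join(s) for s in readings
                if len(s) > 1 and tagged(s[-1], "vblex") and tagged(s[-2], "vblex")]

print(removed_a)     # also hits the verb-noun reading
print(removed_b)     # also hits the noun-verb reading
print(removed_wish)  # only the verb-verb reading
```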