Difference between revisions of "Subreadings in Constraint Grammar"

Latest revision as of 21:20, 10 December 2015

This is now implemented in vislcg3: http://beta.visl.sdu.dk/cg3/chunked/subreadings.html


Why we need sub-readings

Typical input with sub-readings:

^foobar/foo+bar/fubar/flue+barge$

By default, only the last sub-reading is used: in the above example, vislcg3 treats the input as if it were

^foobar/bar/fubar/barge$

This works well for compounds, where the material before the + is mostly inconsequential, but not so well for other multiword expressions. (Also, by default mapping tags are only put on the last sub-reading.)
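
The default reduction described above can be sketched in a few lines of Python. This is only an illustration, not how vislcg3 is implemented: the helper names `parse_cohort` and `main_only` are made up for this example, and the parser assumes a simplified stream format with no escaped characters and no `/` or `+` inside tags.

```python
def parse_cohort(cohort):
    """Split '^surface/reading/...$' into (surface, list of sub-reading lists)."""
    body = cohort.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a cohort: " + cohort)
    surface, *readings = body[1:-1].split("/")
    # Assumption: '+' only ever separates sub-readings.
    return surface, [reading.split("+") for reading in readings]


def main_only(cohort):
    """Keep only the final sub-reading of every reading (the default RTL view)."""
    surface, readings = parse_cohort(cohort)
    mains = [subs[-1] for subs in readings]
    return "^" + "/".join([surface] + mains) + "$"


print(main_only("^foobar/foo+bar/fubar/flue+barge$"))
```

Run on the example cohort, this prints the reduced form shown above, ^foobar/bar/fubar/barge$.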

Can't we just split on the + with pretransfer before sending this to cg-proc?
No, because we first have to disambiguate between the readings of e.g. ^foobar/foo+bar/fubar/flue+barge$, and it is not clear what that cohort would even look like if split.

What we need

  • We may need to refer to a non-main sub-reading in order to disambiguate
  • We may want to put a mapping tag on a non-main sub-reading
  • And of course we want to be able to refer to the main sub-reading

Referring to the final sub-reading

Northern Sámi postpositions take genitive.

Input fragment:

^soahtefámu/soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Acc>/soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Gen>$ 
^vuostá/vuostá<Po>/vuostá<Pr>/vuostá<N><Sg><Nom>$

Correct output:

^soahtefámu/soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Gen><@→P>$        # war.power.GEN
^vuostá/vuostá<Po><@←ADVL>$                                        # against.PO

If the input noun were unambiguously nominative, the Po reading should not be selected, so we might have a rule somewhere with

REMOVE Po IF (-1 (Nom)) ;

but if this matched non-final sub-readings, we would get the wrong tagging here. By default, non-final sub-readings are ignored, so the sme-nob CGs work fine (as do the nn-nb ones, for compounding).
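
To make the difference concrete, here is a rough Python sketch that checks the (Nom) condition against the two readings of soahtefámu, once looking only at the final sub-reading and once looking at every sub-reading. The function names are hypothetical, and the tag extraction is a simplification that assumes no + inside tags; it only models the matching, not CG itself.

```python
import re

READINGS = [
    "soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Acc>",
    "soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Gen>",
]


def sub_readings(reading):
    """Return one tag set per sub-reading, in left-to-right order."""
    return [set(re.findall(r"<([^>]*)>", sub)) for sub in reading.split("+")]


def matches_main(readings, tag):
    """True if some reading has the tag on its final (main) sub-reading."""
    return any(tag in sub_readings(r)[-1] for r in readings)


def matches_any_sub(readings, tag):
    """True if some reading has the tag on any of its sub-readings."""
    return any(tag in tags for r in readings for tags in sub_readings(r))


print(matches_main(READINGS, "Nom"))     # condition on main sub-readings only
print(matches_any_sub(READINGS, "Nom"))  # condition if all sub-readings counted
```

The first check is False and the second True: only when non-final sub-readings are consulted does the Nom on soahti trigger the rule and wrongly remove the Po reading.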

Referring to non-final sub-readings

Input:

^D'an/Da<pr>+an<det><def><sp>$
^emgann/emgann<n><m><sg>$ 
^ez/e<vpart><obj>/ael<n><m><pl>/mont<vblex><pri><p2><sg>/monet<vblex><pri><p2><sg>/e<pr>+da<det><pos><mf><sp>$
^an/an<det><def><sp>/mont<vblex><pri><p1><sg>/monet<vblex><pri><p1><sg>$

Correct output:

^D'an/Da<pr><@ADVL→>+an<det><def><sp><@→N>$       # to.the
^emgann/emgann<n><m><sg><@P←>$                    # battle
^ez/e<vpart><obj><@Pcle>$                         # PART
^an/mont<vblex><pri><p1><sg><@+FMAINV>$           # I.go
  • We want to refer to the <pr> sub-reading when mapping emgann as @P← (possibly also in disambiguation).
  • We want to MAP an @ADVL→ tag on the <pr> sub-reading (also a @→N tag on the determiner). These sub-readings are split into two units by pretransfer.

VISL CG-3 syntax

VISL CG-3 keeps the default behaviour that we always refer to only the last sub-reading unless explicitly mentioning sub-readings. But for some languages, you might want to prefer the first sub-reading to be main by default. VISL CG-3 caters to both preferences. From the manual:

The order of which is the primary reading vs. sub-readings depends on the grammar SUBREADINGS setting:

     SUBREADINGS = RTL ; # Default, right-to-left
     SUBREADINGS = LTR ; # Alternate, left-to-right


Then, to refer to a non-final sub-reading in the default RTL mode, we could say

 ADD (@ADV←) TARGET (n) IF (-2/1 (pr)) (-2 (n)) ;

This requires the cohort two positions to the left to have n as its main reading and pr as the next sub-reading (index 1, one step in from the main reading). It would match if the input were e.g.

 ^forsooth/for<pr>+sooth<n>/forsooth<adv>$ ^he/prpers<prn>$ ^be/be<vblex>$

Since we only have two sub-readings here, we could also ask that the sub-reading furthest from the main reading (index -1) be pr, with the same effect:

 ADD (@ADV←) TARGET (n) IF (-2/-1 (pr)) (-2 (n)) ;


Parallel to regular CG word indexes, 0 is the "head". In RTL mode, this is the last sub-reading, and 1 is one sub-reading to the left of that. Negative numbers count from the opposite end, so -1 is the sub-reading furthest from the head (the first from the left). For three sub-readings, that gives us the following indexing:

   ^foo<tags>+bar<tags>+fie<tags>$
      2        1         0
     -1       -2        -3

For LTR mode, the left sub-reading is the head with index 0, and counts go the other way:

   ^foo<tags>+bar<tags>+fie<tags>$
      0        1         2
     -3       -2        -1
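
The two index tables can be reproduced with a small Python sketch. The function `resolve` is a hypothetical helper written for this page, not part of any CG tool; it simply orders the sub-readings from the head outwards and then uses ordinary list indexing, which happens to give the same positive and negative numbering as the tables.

```python
def resolve(subs, index, mode="RTL"):
    """Return the sub-reading at the given index, or None if out of range.

    subs is the list of sub-readings in left-to-right order, as written
    in the stream. mode is "RTL" (head is the last one) or "LTR".
    """
    if mode not in ("RTL", "LTR"):
        raise ValueError("mode must be RTL or LTR")
    # Depth 0 is the head; higher depths move away from it.
    by_depth = list(reversed(subs)) if mode == "RTL" else list(subs)
    if not -len(by_depth) <= index < len(by_depth):
        return None
    return by_depth[index]


subs = ["foo", "bar", "fie"]
for mode in ("RTL", "LTR"):
    for index in (0, 1, 2, -1, -2, -3):
        print(mode, index, resolve(subs, index, mode))
```

For the two sub-readings of forsooth, `resolve(["for<pr>", "sooth<n>"], 1)` and `resolve(["for<pr>", "sooth<n>"], -1)` both give `for<pr>`, which is why the /1 and /-1 rules above have the same effect.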


To ADD the tag to the non-final sub-reading itself, use the SUB:N keyword after ADD:

 ADD SUB:-1 (@→V) TARGET (pr) IF (*1 (v)) ;


We might also want to say "require any main- or sub-reading to be tagged pr":

 ADD (@P←) TARGET (n) IF (-1/* (pr)) ;

or, to require that every reading of the previous word has pr on one of its sub-readings:

 ADD (@P←) TARGET (n) IF (-1C/* (pr)) ;


You can now do

REMOVE SUB:* (pr) IF (1 (vblex));

to remove any reading which has a pr on some sub-reading if there's a following verb.

The old workaround was to do

REMOVE (pr) IF (1 (vblex));
REMOVE SUB:1 (pr) IF (1 (vblex));
REMOVE SUB:2 (pr) IF (1 (vblex));
REMOVE SUB:3 (pr) IF (1 (vblex));

etc. as high as your analyser allowed.
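
As a rough illustration of why SUB:* is equivalent to the enumerated workaround, the following Python sketch models only the target matching of the rule on the readings of ez from the Breton example above; the context condition (1 (vblex)) is left out. The helper names are invented for this sketch, depth is counted in the default RTL mode, and the tag parsing assumes no + inside tags.

```python
import re

EZ = [
    "e<vpart><obj>",
    "ael<n><m><pl>",
    "mont<vblex><pri><p2><sg>",
    "monet<vblex><pri><p2><sg>",
    "e<pr>+da<det><pos><mf><sp>",
]


def tags_by_depth(reading):
    """Tag sets ordered from the main reading (depth 0) outwards, RTL mode."""
    subs = reading.split("+")[::-1]
    return [set(re.findall(r"<([^>]*)>", sub)) for sub in subs]


def remove_sub(readings, tag, depth):
    """Drop readings with the tag at the given depth; depth "*" means any."""
    def hit(reading):
        levels = tags_by_depth(reading)
        if depth == "*":
            return any(tag in tags for tags in levels)
        return depth < len(levels) and tag in levels[depth]
    return [r for r in readings if not hit(r)]


# Old workaround: one rule per depth, as deep as the analyser goes.
workaround = EZ
for depth in range(4):
    workaround = remove_sub(workaround, "pr", depth)

print(remove_sub(EZ, "pr", 0) == EZ)            # plain (pr) misses e<pr>+da<det>
print(remove_sub(EZ, "pr", "*") == workaround)  # SUB:* covers every depth
```

Both comparisons come out True: the plain rule leaves the e<pr>+da<det> reading alone because its main sub-reading is det, while SUB:* and the depth-by-depth workaround remove it.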

Wishlist

A special set for "has-subreading"

The same way that you can do

LIST match-any = (*)

it would be nice to be able to do e.g.

LIST compound = (*/1)

to match on any (sub)reading that has at least one subreading.

Note: This is typically meant to be used on the main reading, so that you can have rules like

REMOVE compound IF …

or

MAP (@FOO) IF (-1 compound + Pl LINK …)

(note that the (*/1) is a "tag" of the main reading here, so the + Pl means that the main reading is plural, and there is at least one subreading below the main reading).

(a)+SUB:1(b) – requirements on both main and sub-readings at once

There is currently no way for SELECT/REMOVE to target tags in both the sub-reading and the reading. E.g. say you use the default mode, where the final sub-reading is the main reading, and your input is

^foobar/foo<vblex>+bar<n>/foo<n>+bar<n>/foo<vblex>+bar<vblex>/foo<n>+bar<vblex>$

Now, given that a verb-verb compound is less likely than an anything-noun compound, you want to remove any reading where both sub-readings are verbs. If you try

REMOVE SUB:1 (vblex) (0 (vblex)) ;

it will wrongly remove the verb-noun reading as well, since the (0 (vblex)) matches on another reading's main reading. If you try

REMOVE (vblex) (0/1 (vblex)) ;

it will wrongly remove the noun-verb as well, since the (0/1 (vblex)) matches on another reading's sub-reading.

Similarly, there is no way to MAP/ADD tags to the main reading of a reading that has a certain sub-reading. Using the above example, there is no way to map @v to only the verb-verb compound without either hitting the noun-verb or the verb-noun as well.
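
The over-removal can be simulated with a small Python model. It rests on a simplifying assumption about CG semantics, namely that the target is tested per reading while a position-0 context test is satisfied by any reading in the cohort; the function names are hypothetical, and `remove_same_reading` stands in for the wished-for behaviour rather than for anything CG-3 currently offers.

```python
READINGS = [
    "foo<vblex>+bar<n>",
    "foo<n>+bar<n>",
    "foo<vblex>+bar<vblex>",
    "foo<n>+bar<vblex>",
]


def tag_at(reading, depth):
    """Single tag of the sub-reading at this depth (RTL: 0 is the final one)."""
    sub = reading.split("+")[::-1][depth]
    return sub.split("<")[1].rstrip(">")


def remove(readings, target_depth, context_depth, tag="vblex"):
    """Target tested per reading; context satisfied by ANY reading in the cohort."""
    context_ok = any(tag_at(r, context_depth) == tag for r in readings)
    return [r for r in readings
            if not (context_ok and tag_at(r, target_depth) == tag)]


def remove_same_reading(readings, tag="vblex"):
    """Wished-for behaviour: both tests must hold on the same reading."""
    return [r for r in readings
            if not (tag_at(r, 0) == tag and tag_at(r, 1) == tag)]


print(remove(READINGS, 1, 0))         # REMOVE SUB:1 (vblex) (0 (vblex))
print(remove(READINGS, 0, 1))         # REMOVE (vblex) (0/1 (vblex))
print(remove_same_reading(READINGS))  # only the verb-verb reading goes
```

Under this model the first attempt keeps only the noun-noun and noun-verb readings, the second keeps only the verb-noun and noun-noun readings, and the same-reading test removes just foo<vblex>+bar<vblex>.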


Possible syntax, similar to set intersection:

REMOVE (vblex) + SUB:1 (vblex);


This might make sense inside a context condition as well:

REMOVE (vblex) IF (0 (n) + 0/1 (n));

Or even as a variable, assuming people don't name their sets "SUB:1":

SET verb-verb-compound = (vblex) + SUB:1 (vblex);