Subreadings in Constraint Grammar
Latest revision as of 21:20, 10 December 2015
This is now implemented in vislcg3: http://beta.visl.sdu.dk/cg3/chunked/subreadings.html
Why we need sub-readings
Typical input with sub-readings:
^foobar/foo+bar/fubar/flue+barge$
Right now, only the last sub-reading is used; in the above example, vislcg3 treats it as if it were
^foobar/bar/fubar/barge$
This works great for compounds where the stuff before the + is mostly inconsequential, while for other multiword expressions it is not so good... (Also, mapping tags are only put on the last sub-reading now.)
- Wait, can't we just split on the + with pretransfer before sending this to cg-proc?
- No, because we first have to disambiguate between e.g. the readings of ^foobar/foo+bar/fubar/flue+barge$ (what would that even look like if split? It wouldn't work).
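As a rough illustration of the behaviour described above, the following Python sketch splits a cohort in the Apertium stream format into readings and sub-readings and then keeps only the last sub-reading of each reading. The function names are made up for this example, and the parser is deliberately naive (it ignores escaping and assumes no / or + inside tags); it is not how vislcg3 is implemented.

```python
def parse_cohort(cohort):
    """Split '^surface/r1/r2$' into (surface, [[sub, ...], ...])."""
    if not (cohort.startswith("^") and cohort.endswith("$")):
        raise ValueError("not a cohort: %r" % cohort)
    surface, *readings = cohort[1:-1].split("/")
    # Each reading is a '+'-joined chain of sub-readings.
    return surface, [reading.split("+") for reading in readings]


def last_subreading_view(cohort):
    """Rebuild the cohort keeping only the final sub-reading of each reading."""
    surface, readings = parse_cohort(cohort)
    return "^" + "/".join([surface] + [subs[-1] for subs in readings]) + "$"


print(last_subreading_view("^foobar/foo+bar/fubar/flue+barge$"))
```

The readings in this cohort have different numbers of sub-readings (two, one and two), which is why the cohort cannot simply be split on + before disambiguation.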
What we need
- We may need to refer to a non-main sub-reading in order to disambiguate
- We may want to put a mapping tag on a non-main sub-reading
- And of course we want to be able to refer to the main sub-reading
Referring to the final sub-reading
Northern Sámi postpositions take genitive.
Input fragment:
^soahtefámu/soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Acc>/soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Gen>$
^vuostá/vuostá<Po>/vuostá<Pr>/vuostá<N><Sg><Nom>$
Correct output:
^soahtefámu/soahti<N><Sg><Nom><Cmp>+fápmu<N><Sg><Gen><@→P>$ # war.power.GEN
^vuostá/vuostá<Po><@←ADVL>$ # against.PO
If the input noun were unambiguously nominative, the Po reading should not be selected, so we might have a rule somewhere with
REMOVE Po IF (-1 (Nom)) ;
but if this matched non-final sub-readings, we would get the wrong tagging here. By default, non-final sub-readings are ignored, so the sme-nob CGs work fine (as do the nn-nb ones for compounding there).
Referring to non-final sub-readings
Input:
^D'an/Da<pr>+an<det><def><sp>$
^emgann/emgann<n><m><sg>$
^ez/e<vpart><obj>/ael<n><m><pl>/mont<vblex><pri><p2><sg>/monet<vblex><pri><p2><sg>/e<pr>+da<det><pos><mf><sp>$
^an/an<det><def><sp>/mont<vblex><pri><p1><sg>/monet<vblex><pri><p1><sg>$
Correct output:
^D'an/Da<pr><@ADVL→>+an<det><def><sp><@→N>$ # to.the
^emgann/emgann<n><m><sg><@P←>$ # battle
^ez/e<vpart><obj><@Pcle>$ # PART
^an/mont<vblex><pri><p1><sg><@+FMAINV>$ # I.go
- We want to refer to the <pr> sub-reading when mapping emgann as @P← (possibly also in disambiguation).
- We want to MAP an @ADVL→ tag on the <pr> sub-reading (also a @→N tag on the determiner). These sub-readings are split into two units by pretransfer.
VISL CG-3 syntax
VISL CG-3 keeps the default behaviour that we always refer to only the last sub-reading unless explicitly mentioning sub-readings. But for some languages, you might want to prefer the first sub-reading to be main by default. VISL CG-3 caters to both preferences. From the manual:
The order of which is the primary reading vs. sub-readings depends on the grammar SUBREADINGS setting:
SUBREADINGS = RTL ; # Default, right-to-left
SUBREADINGS = LTR ; # Alternate, left-to-right
Then, to refer to a non-final sub-reading in the default RTL mode, we could say
ADD (@ADV←) TARGET (n) IF (-2/1 (pr)) (-2 (n)) ;
to require that the cohort two positions to the left has pr as its next-to-final sub-reading and n as its main reading. This would match if the input were e.g.
^forsooth/for<pr>+sooth<n>/forsooth<adv>$ ^he/prpers<prn>$ ^be/be<vblex>$
Since we only have two sub-readings here, we could also ask that the sub-reading farthest from the main one (index -1) be pr, with the same effect:
ADD (@ADV←) TARGET (n) IF (-2/-1 (pr)) (-2 (n)) ;
Parallel to regular CG word indexes, 0 is the "head". In RTL mode, this is the last sub-reading. Positive numbers count away from the head, so 1 is the sub-reading immediately to its left; negative numbers count from the opposite end, so -1 is the leftmost sub-reading. For three sub-readings, that gives us the following indexing:
^foo<tags>+bar<tags>+fie<tags>$
  2         1         0
 -1        -2        -3
For LTR mode, the left sub-reading is the head with index 0, and counts go the other way:
^foo<tags>+bar<tags>+fie<tags>$
  0         1         2
 -3        -2        -1
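The two index tables can be reproduced with a small Python sketch. This is only a model of the numbering shown in the tables, not code from VISL CG-3; the function name subreading_at and its mode argument are invented for the illustration.

```python
def subreading_at(subreadings, n, mode="RTL"):
    """Return the sub-reading with index n.

    `subreadings` is given in surface order (left to right),
    e.g. ["foo", "bar", "fie"].
    """
    if mode not in ("RTL", "LTR"):
        raise ValueError("mode must be 'RTL' or 'LTR'")
    # Order the chain so that the head (index 0) comes first.
    chain = list(reversed(subreadings)) if mode == "RTL" else list(subreadings)
    if not -len(chain) <= n < len(chain):
        raise IndexError("no sub-reading %d" % n)
    # With the head first, Python's own list indexing yields the same numbers
    # as the tables: 0, 1, 2 count away from the head, while -1, -2, -3 count
    # back from the far end.
    return chain[n]


subs = ["foo", "bar", "fie"]
for mode in ("RTL", "LTR"):
    print(mode, [(n, subreading_at(subs, n, mode)) for n in (0, 1, 2, -1, -2, -3)])
```

With only two sub-readings, indexes 1 and -1 select the same sub-reading, which is why the two forsooth rules above have the same effect.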
To ADD the tag to the non-final sub-reading itself, use the SUB:N keyword after ADD:
ADD SUB:-1 (@→V) TARGET (pr) IF (*1 (v)) ;
We might also want to say "require any main- or sub-reading to be tagged pr":
ADD (@P←) TARGET (n) IF (-1/* (pr)) ;
or to say that all readings of the previous word are unambiguously pr (on one of the sub-readings):
ADD (@P←) TARGET (n) IF (-1C/* (pr)) ;
You can now do
REMOVE SUB:* (pr) IF (1 (vblex));
to remove any reading which has a pr on some sub-reading if there's a following verb.
The old workaround was to do
REMOVE (pr) IF (1 (vblex));
REMOVE SUB:1 (pr) IF (1 (vblex));
REMOVE SUB:2 (pr) IF (1 (vblex));
REMOVE SUB:3 (pr) IF (1 (vblex));
etc. as high as your analyser allowed.
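To make the intended effect of SUB:* concrete, here is a simplified Python model of the description above: a reading is dropped if any of its sub-readings carries the target tag. The helpers tags and remove_sub_any are hypothetical, the context condition is reduced to a boolean passed in by the caller, and keeping the cohort unchanged when every reading would be removed is an assumption about the usual safe REMOVE behaviour rather than something stated on this page.

```python
import re


def tags(subreading):
    """Tags of one sub-reading, e.g. 'e<pr>' -> {'pr'}."""
    return set(re.findall(r"<([^<>]+)>", subreading))


def remove_sub_any(readings, tag, context_holds):
    """Simplified model of: REMOVE SUB:* (tag) IF <context> ;

    `readings` is a list of readings, each a list of sub-reading strings.
    """
    if not context_holds:
        return readings
    kept = [r for r in readings if not any(tag in tags(s) for s in r)]
    # Assumption: a REMOVE never deletes the last remaining reading.
    return kept if kept else readings


# Readings of the Breton cohort "ez" from the example further up.
ez = [["e<vpart><obj>"], ["ael<n><m><pl>"], ["mont<vblex><pri><p2><sg>"],
      ["monet<vblex><pri><p2><sg>"], ["e<pr>", "da<det><pos><mf><sp>"]]
print(remove_sub_any(ez, "pr", context_holds=True))
```

Here only the e<pr>+da<det> reading is dropped, although pr sits on a non-main sub-reading that a plain REMOVE (pr) would not look at.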
Wishlist
A special set for "has-subreading"
The same way that you can do
LIST match-any = (*)
it would be nice to be able to do e.g.
LIST compound = (*/1)
to match on any (sub)reading that has at least one subreading.
Note: This is typically meant to be used on the main reading, so that you can have rules like
REMOVE compound IF …
or
MAP (@FOO) IF (-1 compound + Pl LINK …)
(note that the (*/1) is a "tag" of the main reading here, so the + Pl means that the main reading is plural, and there is at least one subreading below the main reading).
(a)+SUB:1(b) – requirements on both main and sub-readings at once
There is currently no way for SELECT/REMOVE to target tags in both the sub-reading and the reading. E.g. say you use the default mode where final readings are main, and your input is
^foobar/foo<vblex>+bar<n>/foo<n>+bar<n>/foo<vblex>+bar<vblex>/foo<n>+bar<vblex>$
Now, given that a verb-verb compound is less likely than an anything-noun compound, you want to remove any reading where both sub-readings are verbs. If you try
REMOVE SUB:1 (vblex) (0 (vblex)) ;
it will wrongly remove the verb-noun reading as well, since the (0 (vblex)) matches on another reading's main sub-reading. If you try
REMOVE (vblex) (0/1 (vblex)) ;
it will wrongly remove the noun-verb as well, since the (0/1 (vblex)) matches on another reading's sub-reading.
Similarly, there is no way to MAP/ADD tags to the main reading of a reading that has a certain sub-reading. Using the above example, there is no way to map @v to only the verb-verb compound without either hitting the noun-verb or the verb-noun as well.
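The problem can be sketched in Python under a simplified model of rule matching: the target is tested per reading, whereas a context condition on position 0 is satisfied as soon as any reading of the cohort matches. The functions below are invented for this illustration and assume RTL mode (the main sub-reading is the last one); they do not reproduce the actual VISL CG-3 matching code.

```python
import re


def tags(subreading):
    """Tags of one sub-reading, e.g. 'foo<vblex>' -> {'vblex'}."""
    return set(re.findall(r"<([^<>]+)>", subreading))


def parse_readings(cohort):
    """'^surface/a+b/c$' -> [['a', 'b'], ['c']] (naive, no escaping)."""
    return [r.split("+") for r in cohort[1:-1].split("/")[1:]]


def remove_current(readings):
    """Model of REMOVE SUB:1 (vblex) (0 (vblex)): target and context are independent."""
    # The context holds if ANY reading has a verbal main sub-reading.
    context = any("vblex" in tags(r[-1]) for r in readings)
    return [r for r in readings
            if not (context and len(r) > 1 and "vblex" in tags(r[-2]))]


def remove_wished(readings):
    """Model of the wished-for (vblex) + SUB:1 (vblex): both tests on one reading."""
    return [r for r in readings
            if not (len(r) > 1 and "vblex" in tags(r[-1]) and "vblex" in tags(r[-2]))]


cohort = "^foobar/foo<vblex>+bar<n>/foo<n>+bar<n>/foo<vblex>+bar<vblex>/foo<n>+bar<vblex>$"
readings = parse_readings(cohort)
print(["+".join(r) for r in remove_current(readings)])
print(["+".join(r) for r in remove_wished(readings)])
```

In this model the existing rule drops both the verb-verb and the verb-noun reading, while the combined test drops only the verb-verb one.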
Possible syntax, similar to set intersection:
REMOVE (vblex) + SUB:1 (vblex);
This might make sense inside a context condition as well:
REMOVE (vblex) IF (0 (n) + 0/1 (n));
Or even as a variable, assuming people don't name their sets "SUB:1":
SET verb-verb-compound = (vblex) + SUB:1 (vblex);