Difference between revisions of "Apertium-recursive/Formalism"

(Interpolation)
(update a bit)
Revision as of 23:25, 25 May 2019

A proposal for a recursive transfer rule formalism.

Basic Rule Syntax

Rules consist of a node type, a weight (optional?), a pattern, and an output.

NP -> 2: @det @adj @n {3 _2 2 _1 1} ;

This gathers a det node, an adj node, and an n node and produces an NP node. Once all rules have been applied, the nodes they have gathered will be output according to their patterns, in this case in the order n adj det (the 3rd, the 2nd, the 1st).

The weight of a parse is the sum (?) of the weights of all the rules involved in producing it, and the parse with the lowest weight is output. There should probably be an additional factor for how many unconsolidated pieces a parse has, so that we prefer more complete parses (that is, "NP cnj NP" as 3 separate nodes has a lower weight than the consolidated version, but we want the consolidated one).
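The weighting idea above can be sketched in Python. This is only an illustration: the Parse class, the FRAGMENT_PENALTY constant and its value, and the function names are all assumptions made for the example, and the proposal itself has not settled whether weights are summed.

```python
from dataclasses import dataclass


@dataclass
class Parse:
    rule_weights: list[float]  # weights of the rules applied in this parse
    top_level_nodes: int       # nodes left unconsolidated at the top level


# Hypothetical penalty per extra unconsolidated node; the proposal fixes no value.
FRAGMENT_PENALTY = 10.0


def parse_weight(parse: Parse) -> float:
    """Sum of the rule weights plus a penalty for every extra top-level node."""
    return sum(parse.rule_weights) + FRAGMENT_PENALTY * (parse.top_level_nodes - 1)


def best_parse(parses: list[Parse]) -> Parse:
    """Return the parse with the lowest total weight."""
    return min(parses, key=parse_weight)


# "NP cnj NP" left as 3 separate nodes versus consolidated into one node.
separate = Parse(rule_weights=[1.0, 1.0], top_level_nodes=3)
merged = Parse(rule_weights=[1.0, 1.0, 2.0], top_level_nodes=1)
chosen = best_parse([separate, merged])
```

Without the penalty term the separate parse would win on rule weights alone, which is the situation the extra factor is meant to avoid.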

Multiple rules which produce the same node type can be joined with pipes:

NP -> 1: @det @n {2 _1 1} |
      2: @num @n {1 _1 2} ;


Comments
  • When you say "output" do you mean immediately? or do you mean the AST will be built with that order in mind? - Francis Tyers (talk) 05:48, 13 March 2019 (CET)
    • It uses the patterns to build the tree bottom-up and then when that's done it applies the output sections top-down (that way the verb phrase can set case on the noun phrase which can then set case on the noun). Popcorndude (talk) 14:59, 13 March 2019 (CET)
  • I guess the weights should also be lexicalised, but a priori rule weights are probably also a good idea(?) - Francis Tyers (talk) 05:48, 13 March 2019 (CET)
    • Would something like this be a reasonable way of lexicalising the weights? (from the ambiguous rules example) Popcorndude (talk) 19:17, 14 March 2019 (CET)
de_nn1 = memoría ;
de_nn2 = traducción ;
de_nsn1 = hermana madre ;
DE-S -> @det.pos @n {1 2} ;
de_nofn1 = constitución guerra ;
NP -> 1: $de_nn1@n de@pr $de_nn2@n {3 1} |
      1: $de_nsn1@n de@pr DE-S {3 's@gen 1} |
      1: $de_nofn1@n de@pr @num {1 2 3} |
      3: @n de@pr @n {1 2 3} ;
  • Why isn't it @det @adj @n etc. (per below)? —Firespeaker (talk) 05:53, 13 March 2019 (CET)
    • Because I changed it partway through writing this page and forgot to fix this part. Popcorndude (talk) 14:59, 13 March 2019 (CET)

Attribute Lists

A list of attributes can be defined like this:

gender = m f GD ;
number = sg pl ND ;

Lexical Units

Lexical units are matched like this:

potato@n.sg ! matches "potato" with tags <n> and <sg>, possibly with others
@n          ! matches any noun

Any of these literals can be replaced with an attribute category using $:

number = sg pl ;
potato@n.$number ! matches potato<n><sg> and potato<n><pl>
vegetable = potato carrot radish ;
$vegetable@n.sg  ! matches potato<n><sg>, carrot<n><sg>, and radish<n><sg>

The tags and attributes are neither ordered nor exhaustive.

potato@n.sg
! This matches all of the following:
! potato<n><sg>
! potato<n><m><sg>
! potato<sg><n>
! potato<sg><imp><n><o3pl>
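
One way to read "neither ordered nor exhaustive" is as a subset test on the tags of the lexical unit. The Python below is a rough sketch under that assumption; parse_lu and matches are names invented for this example, not part of any existing Apertium tool.

```python
import re


def parse_lu(text):
    """Split a string like 'potato<n><sg>' into ('potato', {'n', 'sg'})."""
    lemma = text.split("<", 1)[0]
    tags = set(re.findall(r"<([^<>]+)>", text))
    return lemma, tags


def matches(pattern, lu_text, categories=None):
    """Check a pattern such as 'potato@n.sg' or '$vegetable@n.$number'.

    categories maps attribute category names to their lists of values.
    """
    categories = categories or {}
    lemma_pat, _, tag_pat = pattern.partition("@")
    lemma, tags = parse_lu(lu_text)
    if lemma_pat.startswith("$"):
        # Lemma given as a category: the lemma must be one of its members.
        if lemma not in categories[lemma_pat[1:]]:
            return False
    elif lemma_pat and lemma_pat != lemma:
        return False
    for item in filter(None, tag_pat.split(".")):
        if item.startswith("$"):
            # Tag given as a category: some member must be among the tags.
            if not tags & set(categories[item[1:]]):
                return False
        elif item not in tags:
            return False
    return True


cats = {"number": ["sg", "pl"], "vegetable": ["potato", "carrot", "radish"]}
print(matches("potato@n.sg", "potato<sg><imp><n><o3pl>"))
print(matches("$vegetable@n.$number", "carrot<n><pl>", cats))
```

Because the tags are held in a set, extra tags and a different tag order make no difference to the result.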

Lexical Unit Output

The following rules specify how to output tags for particular parts of speech:

n: _.gender.number ;
! output: lemma<n><[gender]><[number]>

prn.pers: _.person.number ;
! output: lemma<prn><pers><[person]><[number]>

det.dem: _.<sp> ;
! output: lemma<det><dem><sp>

Here _ represents the set of tags that were matched in choosing this rule, and everything else is the name of a category. Literal tags can be inserted using angle brackets.
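
As an illustration of how such an output rule might be applied, here is a small Python sketch. The function render_lu and its arguments are hypothetical, and skipping a category that has no value is an assumption the proposal does not state.

```python
def render_lu(lemma, matched_tags, template, variables):
    """Build the output string for one lexical unit.

    template is the right-hand side of an output rule, e.g. '_.gender.number'.
    matched_tags are the tags that selected the rule, e.g. ['prn', 'pers'].
    variables maps category names to the values held by the node.
    """
    parts = []
    for item in template.split("."):
        if item == "_":
            parts.extend(matched_tags)          # the tags matched by the rule
        elif item.startswith("<") and item.endswith(">"):
            parts.append(item[1:-1])            # literal tag in angle brackets
        else:
            value = variables.get(item, "")     # category looked up by name
            if value:                           # assumption: empty values are dropped
                parts.append(value)
    return lemma + "".join(f"<{tag}>" for tag in parts)


print(render_lu("casa", ["n"], "_.gender.number", {"gender": "f", "number": "pl"}))
print(render_lu("yo", ["prn", "pers"], "_.person.number", {"person": "p1", "number": "sg"}))
```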

Blanks

Blanks between lexical units are handled analogously to the chunking system. A _ in the output inserts a blank and _n inserts the blank originally after node n. If two nodes do not have a blank between them, they will be output as conjoined.

If an input needs to distinguish between conjoined and non-conjoined lexical units, that can be done with + and _. Ordinarily, the parser will ignore the difference.

@n @v   ! matches "^a<n>$ ^b<v>$" or "^a<n>+b<v>$"
@n _ @v ! matches only "^a<n>$ ^b<v>$"
@n + @v ! matches only "^a<n>+b<v>$"

On an input like "^a<n>+b<v>$", _1 should probably be equivalent to _, but I'm not completely certain of this.

Variables

A node can have variables attached to it. Lexical units have variables corresponding to any attribute categories that match their tags.

NP.number.case.gender -> @adj.$number @n.$number.$gender {2(case=$case) 1(case=$case)} ;

This rule specifies that NP has 3 variables associated with it. The NP will initially have a value for $number, which will be the number marking of the adjective and noun (which must match), and for $gender, which will be the gender tag of the noun. $case will initially be empty. If the value of $case is set by some other rule further up the tree, then the case tag will be set on both lexical units in the output phase; otherwise they will keep their default marking.

Values can also be transferred between nodes in the output phase:

VP -> NP @v {2(number=1.number, gender=1.gender) _1 1(case=nom)} ;

This makes the verb agree with the subject in number and gender and sets the subject's case to <nom>.

The 3 possible assignments are "attr=literal", "attr=index.attr", and "attr=$var".

Similar patterns can be used if the output is a literal lexical unit with agreement:

el@det.def.[1.gender].sg
el@det.def.$gender.sg

If a rule needs to specify which side of the translation a value comes from, that can be done like so:

1.number/sl ! the $number of the source language
1.number/tl ! the $number of the target language
1.number/an ! the $number of the anaphora

By default all values are copied to parent nodes, so /sl will work on things other than lexical units. For comparisons and output, the anaphora value will be used if it exists, otherwise the target value if it exists, otherwise the source value if it exists, otherwise the empty string.
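
The fallback order described above (anaphora, then target, then source, then the empty string) can be written as a short Python sketch. The dictionary layout with keys 'an', 'tl', and 'sl' is an assumption made for the example.

```python
def resolve(values):
    """Pick the value used for comparisons and output.

    values may hold the keys 'an', 'tl', and 'sl'; missing or empty
    entries are skipped, and the empty string is the final fallback.
    """
    for side in ("an", "tl", "sl"):
        value = values.get(side)
        if value:
            return value
    return ""


number = {"sl": "sg", "tl": "pl"}
print(resolve(number))
```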

Variable passing can also specify a particular side, if necessary:

NP.gender -> @n.$gender/tl @adj {2 _1 1} ;
! this will copy only the target value of $gender from @n to NP

Conditionals

Rule application can be further restricted with conditional statements:

NP -> @n @adj (1.gender/sl = 2.gender/sl, 1.number = 2.number) {2 _1 1} ;
! match a noun and an adjective, but only if they have the same number marking
! and the source language gender is the same

Output Conditionals

If the output of a rule is conditioned on what happens further up the tree, rather than just on the input, conditionals can be added to the output statements:

mood = ind opt nec inf ;
VP.mood.person.number -> @v.$person.$number NP.acc {should@vaux _ 1(mood=inf) _1 2} ($mood = opt)
                                                   {could@vaux _ 1(mood=inf) _1 2} ($mood = nec)
                                                   {1 _1 2} ; ! no conditional, so functions as an elsewhere case

An elsewhere case is required, since otherwise there might be no output.
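
A sketch of how the output alternatives might be selected, in Python. It assumes the alternatives are tried in the order written and the first one whose condition holds is used; the proposal does not say this explicitly, and choose_output is a name invented here.

```python
def choose_output(alternatives, variables):
    """Return the first output whose condition holds.

    alternatives is a list of (condition, output) pairs, where a condition
    of None marks the elsewhere case.
    """
    for condition, output in alternatives:
        if condition is None or condition(variables):
            return output
    raise ValueError("no output alternative applies; an elsewhere case is required")


# The three outputs of the VP rule above, kept as plain strings.
vp_outputs = [
    (lambda v: v.get("mood") == "opt", "should@vaux _ 1(mood=inf) _1 2"),
    (lambda v: v.get("mood") == "nec", "could@vaux _ 1(mood=inf) _1 2"),
    (None, "1 _1 2"),
]

print(choose_output(vp_outputs, {"mood": "nec"}))
print(choose_output(vp_outputs, {"mood": "ind"}))
```

Dropping the last pair makes the second call raise an error, which is the reason for requiring an elsewhere case.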

Attribute Maps

This would be a way to convert certain sets of tags, either between two languages that have different sets of tenses, or between something like object agreement and number marking. (The following syntax is entirely provisional.)

object_agr = o1sg o1pl o2sg o2pl o3sg o3pl ;
number = sg pl ;
person = p1 p2 p3 ;

object_agr > person.number: o1sg p1.sg, o1pl p1.pl, o2sg p2.sg, o2pl p2.pl, o3sg p3.sg, o3pl p3.pl ;

VP -> @v NP {2(object_agr=1.object_agr) _1 1} ;

In this example, if the verb had <o2sg>, the noun would get object_agr=o2sg, person=p2, number=sg, with the first two tags probably being discarded on output.

tense = farpst nearpst pst prs fut nonpst ;

tense > tense: farpst pst, nearpst pst, prs nonpst, fut nonpst ;

In this example, no explicit assignment needs to take place and the 4 tenses of the source language (farpst, nearpst, prs, fut) would be automatically converted to the 2 of the target language (pst, nonpst).

Converting from 4 to 3 with something like

tense > tense: farpst pst, nearpst pst ;

should also work, the unchanged tags not needing to be explicitly mentioned.
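
Both kinds of map can be modelled with plain dictionaries, as in the Python sketch below. The names TENSE_MAP, OBJECT_AGR_MAP, map_tag, and expand are made up for this illustration, and since the syntax above is provisional the sketch only mirrors the intended behaviour.

```python
# tense > tense: farpst pst, nearpst pst ;
TENSE_MAP = {"farpst": "pst", "nearpst": "pst"}

# object_agr > person.number (first few entries only)
OBJECT_AGR_MAP = {
    "o1sg": {"person": "p1", "number": "sg"},
    "o1pl": {"person": "p1", "number": "pl"},
    "o2sg": {"person": "p2", "number": "sg"},
}


def map_tag(tag, mapping):
    """Convert a tag within one category; unlisted tags pass through unchanged."""
    return mapping.get(tag, tag)


def expand(variables, source_attr, mapping):
    """Add the attributes derived from source_attr to a copy of variables."""
    result = dict(variables)
    result.update(mapping.get(variables.get(source_attr), {}))
    return result


print(map_tag("farpst", TENSE_MAP))
print(map_tag("prs", TENSE_MAP))
print(expand({"object_agr": "o2sg"}, "object_agr", OBJECT_AGR_MAP))
```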

Interpolation

Parsing clitics, such as the Serbo-Croatian clitics described at User_talk:Popcorndude/Recursive_Transfer#Serbo-Croatian_clitics, can be done using multiple output units:

vbser n -> @n @vbser {2} {1} ;
NP -> @n @det {2 _1 1} ;
! should be able to handle "noun clitic determiner"

Outputting them, however, is more difficult. My current idea is to do something like this:

NP -> @det @n {2 _1 1} ;
VP -> NP @vbser {(_1 2)>1} ;

Here (_1 2)>1 means "put the space between the elements and element 2 after the first word of element 1". The corresponding syntax for a right-aligned clitic would be 1<(2 _1). New lexical units could also be put in the parentheses (even if there's only one thing being inserted, the parentheses should, I think, be mandatory for clarity).

I'm not sure whether this will cover all cases, but it should at least cover a lot of them.
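
To make the intended effect of (_1 2)>1 and 1<(2 _1) concrete, here is a small Python sketch. It assumes the output of an element is a flat list in which words and blanks alternate; both function names are hypothetical.

```python
def insert_after_first_word(host, inserted):
    """(inserted)>host: place inserted right after the first word of host."""
    if not host:
        return list(inserted)
    return host[:1] + list(inserted) + host[1:]


def insert_before_last_word(host, inserted):
    """host<(inserted): place inserted right before the last word of host."""
    if not host:
        return list(inserted)
    return host[:-1] + list(inserted) + host[-1:]


# Element 1 is an NP output as "noun det"; element 2 is the clitic.
np_output = ["noun", " ", "det"]
left = insert_after_first_word(np_output, [" ", "clitic"])    # (_1 2)>1
right = insert_before_last_word(np_output, ["clitic", " "])   # 1<(2 _1)
print("".join(left))
print("".join(right))
```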