
Apertium-recursive/Formalism

Revision as of 20:56, 13 June 2019

A proposal for a recursive transfer rule formalism.

Basic Rule Syntax

Rules consist of a node type, an optional weight, a pattern, and an output.

NP -> 2.7: det adj.m n.*.sg {3 _2 2 _1 1} ;
VP -> 1.0: NP vblex {2 _1 1} |
      NP vblex adv {2 _2 3 _1 1} ;

The arrow can be written as either -> or →.

The first rule matches lexical units or chunks with the sets of tags beginning with <det>, <adj><m>, and <n><*><sg>, respectively. It will then produce a chunk with the part-of-speech tag <NP> which contains those three nodes and the blanks between them in reverse order (n, adj, det).

The second and third rules both produce <VP> chunks, so that output is written only once and the other components are separated by a pipe character.

Patterns are matched left-to-right, longest-match (LRLM) in stages. This is equivalent to applying every rule that can apply to the current input before applying any rule to the resulting outputs (see User:Popcorndude/Recursive_Transfer/Bytecode for the actual algorithm).

NP -> n {1} |
      NP adj {1 _1 2} ;
AP -> adj {1} ;

If these rules were applied to an input of n adj, the first rule would match the noun and the third rule would match the adjective. The second rule would never apply, because by the time it could match an NP, the next available token is no longer an adj but an AP.
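The staged behaviour in this example can be modelled with a short Python sketch. It is a simplified illustration only, not the actual algorithm (that is described on the Bytecode page): tags and weights are ignored, and `RULES`, `apply_stage`, and `parse` are hypothetical names introduced here.

```python
# Simplified model of staged left-to-right, longest-match application.
# Each rule is (output node type, pattern); tags and weights are ignored.
RULES = [
    ("NP", ["n"]),
    ("NP", ["NP", "adj"]),
    ("AP", ["adj"]),
]

def apply_stage(tokens):
    """One stage: scan left to right, applying the longest matching rule."""
    out = []
    i = 0
    while i < len(tokens):
        best = None
        for result, pattern in RULES:
            if tokens[i:i + len(pattern)] == pattern:
                if best is None or len(pattern) > len(best[1]):
                    best = (result, pattern)
        if best is None:
            out.append(tokens[i])   # nothing matches here, keep the token
            i += 1
        else:
            out.append(best[0])     # replace the matched span with the new node
            i += len(best[1])
    return out

def parse(tokens):
    """Repeat stages until a stage changes nothing (assumes no cyclic rules)."""
    while True:
        new = apply_stage(tokens)
        if new == tokens:
            return tokens
        tokens = new

print(parse(["n", "adj"]))  # ['NP', 'AP']
```

In this model the second rule is never used for the input n adj, because both tokens are consumed in the first stage, before any NP exists.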

When there are multiple rules of the same length which all match, the one with the highest weight is applied. If there is a tie, the one that appears first in the rules file is applied. Due to the way the patterns are compiled, if two rules have identical patterns, the first one will always be applied, regardless of weight. If the weight is omitted, it will default to 0. User-defined weights cannot be negative.
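The selection policy in the paragraph above can be sketched in Python as follows. This is a model of the stated policy under simplifying assumptions, not the compiler's implementation; `Rule` and `choose` are hypothetical names, and the candidates are assumed to be listed in file order and to match the same span.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    pattern: tuple
    weight: float = 0.0  # an omitted weight defaults to 0

    def __post_init__(self):
        if self.weight < 0:
            raise ValueError("user-defined weights cannot be negative")

def choose(candidates):
    """Pick one rule among same-length matches, given in file order."""
    seen = set()
    distinct = []
    for rule in candidates:
        # Identical patterns: only the first such rule is ever considered.
        if rule.pattern not in seen:
            seen.add(rule.pattern)
            distinct.append(rule)
    # max() returns the first maximal item, so ties go to the earlier rule.
    return max(distinct, key=lambda rule: rule.weight)

candidates = [
    Rule("a", ("det", "n"), 1.0),
    Rule("b", ("det", "n.m"), 2.0),
    Rule("c", ("det", "n"), 5.0),       # same pattern as "a": never chosen
    Rule("d", ("det", "n.*.sg"), 2.0),  # ties with "b", which comes first
]
print(choose(candidates).name)  # b
```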

Attribute Lists

A list of attributes can be defined like this:

gender = m f GD ;
number = sg pl ND ;

An attribute list can also specify undefined and default values:

gender = (GD m) m f GD;

This defines the gender category as before, but with the addition that if any rule tries to read the gender of a node that doesn't have a gender tag, the result will be <GD> rather than the empty string. It also creates a rule which will run last (the weight is -1.0) that will replace <GD> with <m>.
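As a rough illustration of the two roles played by the parenthesised values, the following Python sketch models reading an attribute and the final replacement step. The dictionary layout and the names `GENDER`, `read_attr`, and `finalize` are assumptions made for this example.

```python
# Model of: gender = (GD m) m f GD;
GENDER = {"values": ["m", "f", "GD"], "undefined": "GD", "default": "m"}

def read_attr(tags, category):
    """Return the node's tag from this category, or the undefined value."""
    for tag in tags:
        if tag in category["values"]:
            return tag
    return category["undefined"]

def finalize(tag, category):
    """Model of the last-running rule: replace the undefined value with the default."""
    return category["default"] if tag == category["undefined"] else tag

gender = read_attr(["n", "sg"], GENDER)  # the node has no gender tag
print(gender)                    # GD
print(finalize(gender, GENDER))  # m
```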

Tag Order

The order of tags for each type of node must be defined like this:

n: _.gender.number;
adj: _.gender;
NP: _.number;

Here, _ represents the lemma and the part of speech tag. Currently these patterns can only be defined based on a single part of speech tag, though it may eventually be possible to match based on multiple tags or to specify a different pattern for particular nodes.

To specify a literal tag in a pattern, put it in angle brackets:

det: _.<def>.number;
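To make the effect of these tag-order definitions concrete, here is a minimal Python sketch that renders a node according to such a pattern. The data layout and the names `CATEGORIES`, `TAG_ORDER`, and `render` are hypothetical, and the sketch assumes that a category with no matching tag and no defined default contributes nothing to the output.

```python
CATEGORIES = {
    "gender": ["m", "f", "GD"],
    "number": ["sg", "pl", "ND"],
}

# Model of:  n: _.gender.number;   adj: _.gender;   det: _.<def>.number;
TAG_ORDER = {
    "n": ["_", "gender", "number"],
    "adj": ["_", "gender"],
    "det": ["_", "<def>", "number"],
}

def render(lemma, pos, tags):
    """Arrange a node's tags in the order defined for its part of speech."""
    out = []
    for slot in TAG_ORDER[pos]:
        if slot == "_":
            out.append(f"{lemma}<{pos}>")  # lemma plus part of speech tag
        elif slot.startswith("<"):
            out.append(slot)               # literal tag
        else:
            value = next((t for t in tags if t in CATEGORIES[slot]), "")
            if value:
                out.append(f"<{value}>")
    return "".join(out)

print(render("potato", "n", ["sg", "m"]))  # potato<n><m><sg>
print(render("the", "det", ["pl"]))        # the<det><def><pl>
```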

Patterns

An element of a pattern must match a single, literal part of speech tag. In order to match multiple part of speech tags, create a separate rule which matches each of them:

NOM -> n {1} | np {1};

To match a lemma or pseudolemma, place it before the part of speech tag, separated by @:

NP -> the@det n {2 _1 1};

It is also possible to match a category of lemmas:

days = sunday monday tuesday wednesday thursday friday saturday;
date -> $days@n the@det num.ord {2 _2 3 _1 1};

Tags besides part of speech can be matched as shown above.

Pattern elements can also specify values for the tags of the chunk being output by the rule.

number = (ND sg) sg pl sp ND;
NP: _.number;
NP -> n.$number adj {1};

This rule specifies that the number tag of the NP chunk should be copied from the noun. It will use the target language side if that is available. If not, it will proceed to the reference side, and then the source side. If all three of these are empty, it will use the default value <ND>. To require that a particular variable be taken from a particular side, put the side after a slash:

NP: _.number;
NP -> det.$number/ref n {1 _1 2};

/sl refers to the source language, /tl to the target language, and /ref to anything added by anaphora resolution.
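The lookup order described above (target, then reference, then source, then the default) can be sketched in Python. The node representation and the name `clip` are assumptions for illustration, and returning the default when an explicitly requested side is empty is a guess rather than documented behaviour.

```python
def clip(node, attr, side=None, default=""):
    """Read an attribute from a node stored as {side: {attr: value}}.

    With no side given, try tl, then ref, then sl, then the default.
    """
    sides = [side] if side else ["tl", "ref", "sl"]
    for s in sides:
        value = node.get(s, {}).get(attr, "")
        if value:
            return value
    return default

det = {"sl": {"number": "sg"}, "ref": {"number": "pl"}, "tl": {}}
print(clip(det, "number"))                # pl (tl is empty, so ref is used)
print(clip(det, "number", side="sl"))     # sg
print(clip({}, "number", default="ND"))   # ND
```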

If a pattern element is contributing several tags to the chunk, the following shortcut is available:

NP: _.number.gender;
NP -> %n adj {2 _1 1};

The % indicates the noun is the source of all chunk tags not elsewhere specified.

To specify a literal value for a chunk tag, put it in square brackets after the pattern like this:

NP: _.gender.number;
NP -> 0: NP cnjcoo NP [$gender=m, $number=pl] {1 _1 2 _2 3} |
      1: NP.f cnjcoo NP.f [$gender=f, $number=pl] {1 _1 2 _2 3} |
      2: NP.*.sg or@cnjcoo NP.*.sg [$gender=m, $number=sg] {1 _1 2 _2 3} |
      3: NP.f.sg or@cnjcoo NP.f.sg [$gender=f, $number=sg] {1 _1 2 _2 3} ;

That is, treat the gender of the phrase as masculine unless both elements are feminine, and the number as plural unless the conjunction is "or" and both elements are singular.
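The combined effect of the four weighted rules can be written as a small Python function. This is only a model of the outcome of rule selection, with a hypothetical name (`coordinate`) and a hypothetical dictionary representation of the NP chunks.

```python
def coordinate(left, conj_lemma, right):
    """Gender and number of 'NP cnjcoo NP' under the four rules above."""
    both_f = left["gender"] == "f" and right["gender"] == "f"
    both_sg = left["number"] == "sg" and right["number"] == "sg"
    gender = "f" if both_f else "m"
    number = "sg" if conj_lemma == "or" and both_sg else "pl"
    return {"gender": gender, "number": number}

f_sg = {"gender": "f", "number": "sg"}
m_sg = {"gender": "m", "number": "sg"}
print(coordinate(f_sg, "and", f_sg))  # {'gender': 'f', 'number': 'pl'}
print(coordinate(f_sg, "or", f_sg))   # {'gender': 'f', 'number': 'sg'}
print(coordinate(m_sg, "or", f_sg))   # {'gender': 'm', 'number': 'sg'}
```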

The pattern only looks at the source language, but it is possible to add constraints:

conj_list = and or;
NP: _.gender.number;
NP -> %NP cnjcoo NP ((2.lem/tl in conj_list) and ~(3.gender = 1.gender)) {1 _1 2 _2 3};

The pattern will match only if the target language lemma of the conjunction is "and" or "or" and the two NPs have different genders. The outermost parentheses are required, and operator precedence is not guaranteed to be sensible, so please put parentheses around every subexpression.
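For readers who find the constraint easier to follow as ordinary code, the following Python sketch evaluates the same condition on three matched elements. The element layout and the names `CONJ_LIST` and `constraint_holds` are hypothetical.

```python
CONJ_LIST = {"and", "or"}

def constraint_holds(elements):
    """Model of ((2.lem/tl in conj_list) and ~(3.gender = 1.gender))."""
    first, conj, third = elements
    lemma_ok = conj["lem"]["tl"] in CONJ_LIST
    genders_differ = not (third["gender"] == first["gender"])
    return lemma_ok and genders_differ

match = [
    {"gender": "m"},
    {"lem": {"sl": "y", "tl": "and"}},
    {"gender": "f"},
]
print(constraint_holds(match))  # True
```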

Outputs

Output elements are written between curly braces and may be any of the following:

Blanks

An underscore represents a single space. An underscore followed by a number represents the superblank after that position, so 1 _ 2 is elements 1 and 2 separated by a space while 1 _1 2 is elements 1 and 2 separated by whatever separated them in the input.
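The difference between _ and _1 can be illustrated with a small Python sketch that assembles an output string. The function name `render_output` and the representation of elements and blanks as lists of strings are assumptions for this example.

```python
def render_output(output, elements, blanks):
    """Assemble an output clause.

    output:   items such as "1", "_" or "_1"
    elements: matched input elements (position 1 is elements[0])
    blanks:   blanks[i] is whatever followed position i + 1 in the input
    """
    parts = []
    for item in output:
        if item == "_":
            parts.append(" ")                       # a single space
        elif item.startswith("_"):
            parts.append(blanks[int(item[1:]) - 1])  # the original blank
        else:
            parts.append(elements[int(item) - 1])
    return "".join(parts)

elements = ["^a<det>$", "^b<n>$"]
blanks = ["[<br/>]"]
print(render_output("1 _ 2".split(), elements, blanks))   # ^a<det>$ ^b<n>$
print(render_output("1 _1 2".split(), elements, blanks))  # ^a<det>$[<br/>]^b<n>$
```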

Matched Elements

A number represents the input element in that position with its tags arranged according to the defined output pattern for its part of speech tag. It can be followed by a specification of where those tags should come from.

1
! the first input element

1(gender=f)
! the first input element with the gender tag <f>

1(gender=2.gender/ref)
! the first input element with the gender tag of the reference side of the second input element

1(gender=$gender)
! the first input element with the gender tag set to a placeholder to be filled on output with the gender tag of its parent chunk

These elements can also be prefixed with % to specify that as many tags as possible should be placeholders for tags of the parent chunk.

Literal Lexical Units

A new lexical unit can be inserted like this:

the@det.def.mf.sp

Placeholders can be included using $:

the@det.def.$gender.sp

And clips from other elements can be placed in square brackets:

the@det.def.[2.gender].[3.number/sl]

Output Conditionals

It is possible to have multiple output clauses with conditions for which one to use. These conditions are written the same way as those in the patterns.

mood = ind opt nec inf;
VP -> vblex {should@vaux _ 1(mood=inf)} (1.mood = opt)
            {could@vaux _ 1(mood=inf)} (1.mood = nec)
            {1} ;

Here the third option has no condition and thus functions as an elsewhere case. If multiple conditions are satisfied, the first one will be used.
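The first-match behaviour of output clauses can be modelled in Python as an ordered list of (condition, output) pairs. The names `CLAUSES` and `choose_output` are hypothetical, and the outputs are replaced by descriptive labels rather than real output clauses.

```python
# Each clause is (condition, output label); None marks the elsewhere case.
CLAUSES = [
    (lambda verb: verb["mood"] == "opt", "should + infinitive"),
    (lambda verb: verb["mood"] == "nec", "could + infinitive"),
    (None, "verb unchanged"),
]

def choose_output(clauses, verb):
    """Return the output of the first clause whose condition is satisfied."""
    for condition, output in clauses:
        if condition is None or condition(verb):
            return output
    raise ValueError("no clause applies; an elsewhere case is needed")

print(choose_output(CLAUSES, {"mood": "opt"}))  # should + infinitive
print(choose_output(CLAUSES, {"mood": "ind"}))  # verb unchanged
```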

Attribute Maps (not yet implemented)

This would be a way to convert certain sets of tags, either between two languages that have different sets of tenses, or between something like object agreement and number marking. (The following syntax is entirely provisional.)

object_agr = o1sg o1pl o2sg o2pl o3sg o3pl ;
number = sg pl ;
person = p1 p2 p3 ;

object_agr > person.number: o1sg p1.sg, o1pl p1.pl, o2sg p2.sg, o2pl p2.pl, o3sg p3.sg, o3pl p3.pl ;

VP -> @v NP {2(object_agr=1.object_agr) _1 1} ;

In this example, if the verb had <o2sg>, the noun would get object_agr=o2sg, person=p2, number=sg, with the first two tags probably being discarded on output.

tense = farpst nearpst pst prs fut nonpst ;

tense > tense: farpst pst, nearpst pst, prs nonpst, fut nonpst ;

In this example, no explicit assignment needs to take place and the 4 tenses of the source language (farpst, nearpst, prs, fut) would be automatically converted to the 2 of the target language (pst, nonpst).

Converting from 4 to 3 with something like

tense > tense: farpst pst, nearpst pst ;

should also work, the unchanged tags not needing to be explicitly mentioned.
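Since attribute maps are not yet implemented, the following Python sketch only illustrates the intended behaviour as described in this section: a partial mapping that leaves unlisted tags unchanged, and a mapping from one tag to several attributes. All names here are hypothetical.

```python
# Model of: tense > tense: farpst pst, nearpst pst ;
TENSE_TO_TENSE = {"farpst": "pst", "nearpst": "pst"}

# Model of: object_agr > person.number: o1sg p1.sg, o1pl p1.pl, ... ;
OBJECT_AGR_TO_PERSON_NUMBER = {
    "o1sg": ("p1", "sg"), "o1pl": ("p1", "pl"),
    "o2sg": ("p2", "sg"), "o2pl": ("p2", "pl"),
    "o3sg": ("p3", "sg"), "o3pl": ("p3", "pl"),
}

def map_tense(tag):
    """Tags that are not listed in the map pass through unchanged."""
    return TENSE_TO_TENSE.get(tag, tag)

def expand_object_agr(tag):
    """Derive person and number from an object agreement tag."""
    person, number = OBJECT_AGR_TO_PERSON_NUMBER[tag]
    return {"object_agr": tag, "person": person, "number": number}

print([map_tense(t) for t in ["farpst", "nearpst", "prs", "fut"]])
# ['pst', 'pst', 'prs', 'fut']
print(expand_object_agr("o2sg"))
# {'object_agr': 'o2sg', 'person': 'p2', 'number': 'sg'}
```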

Interpolation (not yet implemented)

Parsing clitics, such as those in User_talk:Popcorndude/Recursive_Transfer#Serbo-Croatian_clitics, can be done using multiple output units:

vbser n -> @n @vbser {2} {1} ;
NP -> @n @det {2 _1 1} ;
! should be able to handle "noun clitic determiner"

Outputting them, however, is more difficult. My current idea is to do something like this:

NP -> @det @n {2 _1 1};
VP -> NP @vbser {(_1 2)>1};

Where (_1 2)>1 means "put the space between the elements and element 2 after the first word of element 1". The corresponding syntax for a right-aligned clitic would be 1<(2 _1). New lexical units could also be put in the parentheses (even if there's only one thing being inserted, the parentheses should, I think, be mandatory for clarity).

I'm not sure whether this will cover all cases, but it should at least cover a lot of them.
