Difference between revisions of "Apertium-recursive/Formalism"

From Apertium

Revision as of 20:35, 9 July 2019

A proposal for a recursive transfer rule formalism.

Basic Rule Syntax

Rules consist of a node type, an optional weight, a pattern, optional variable setting, an optional condition, and an output, in that order.

NP -> det n {2 _1 1};

This matches a determiner followed by a noun, combines them into an NP chunk, and at output time produces "noun determiner".

NP -> 1: n {1} |
      2: n.*.def {the@det.def.sg _ 1};

Here the first rule will match any noun, while the second will match a noun with a <def> tag. Since the second rule has a higher weight, the first rule will not be applied if they both match.
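
The effect of the weights can be pictured with a minimal Python sketch. It assumes a simplified model in which a node is a list of tags, a pattern matches a prefix of that list, and `*` matches any single tag. The names `matches`, `pick_rule` and `RULES` are hypothetical and are not part of apertium-recursive; the real selection is done by the parser.

```python
def matches(pattern, tags):
    """Return True if `tags` begins with `pattern`; '*' matches any one tag."""
    if len(pattern) > len(tags):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern, tags))


def pick_rule(rules, tags):
    """Among the rules whose pattern matches, return the one with the highest weight."""
    candidates = [r for r in rules if matches(r["pattern"], tags)]
    return max(candidates, key=lambda r: r["weight"]) if candidates else None


# Hypothetical encoding of the two NP rules above.
RULES = [
    {"weight": 1, "pattern": ["n"], "output": "{1}"},
    {"weight": 2, "pattern": ["n", "*", "def"], "output": "{the@det.def.sg _ 1}"},
]
```

With this model, a node tagged `n.m.def` matches both rules and the weight-2 rule is chosen, while `n.m.sg` only matches the weight-1 rule.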

NP -> NP and@cnjcoo NP [$number=pl] {1 _1 2 _2 3};

Here the rule specifies that the resulting chunk will be marked with a <pl> tag.

AP -> adj and@cnjcoo adj (1.gender/sl = 3.gender/sl) {1 _1 2 _2 3};

This rule will not apply if the two adjectives have different genders.

The arrow can be written as either -> or →.

The process by which rules are selected is described at User:Popcorndude/Recursive_Transfer/Parser.

Attribute Lists

A list of attributes can be defined like this:

gender = m f GD ;
number = sg pl ND ;

An attribute list can also specify undefined and default values:

gender = (GD m) m f GD;

This defines the gender category as before, but with the addition that if any rule tries to read the gender of a node that doesn't have a gender tag, the result will be <GD> rather than the empty string. It also states that any remaining <GD> tags will be replaced with <m> tags in the output step.
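
As a rough illustration of the undefined and default values, here is a small Python sketch. The class `AttributeList` and its methods `read` and `finalize` are hypothetical names invented for this example, and a node is modelled simply as a list of tags; this is not how the compiler represents attribute lists.

```python
from dataclasses import dataclass


@dataclass
class AttributeList:
    values: tuple
    undefined: str = ""  # returned when a node has no tag from this list
    default: str = ""    # replaces a leftover `undefined` tag at output time

    def read(self, tags):
        """Return the first tag of the node that belongs to this list."""
        for tag in tags:
            if tag in self.values:
                return tag
        return self.undefined

    def finalize(self, tag):
        """Model the output step: swap the undefined value for the default."""
        if self.undefined and tag == self.undefined:
            return self.default
        return tag


# gender = (GD m) m f GD;
gender = AttributeList(values=("m", "f", "GD"), undefined="GD", default="m")
# number = sg pl ND;   (no undefined or default value given)
number = AttributeList(values=("sg", "pl", "ND"))
```

Reading the gender of a node without a gender tag gives `GD` under this model, and `finalize` turns any remaining `GD` into `m`, whereas reading from the plain `number` list gives the empty string.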

Tag Order

The order of tags for each type of node must be defined like this:

n: _.gender.number;
adj: _.gender;
NP: _.number;

Where _ represents the lemma and the part of speech tag. Note that it is currently only possible to specify single tags as patterns. However, it is possible to specify that a different pattern should be used (see the output section below). Note also that the lemma queue is automatically appended to the pattern.

To specify a literal tag in a pattern, put it in angle brackets:

det: _.<def>.number;

Patterns

An element of a pattern must match a single, literal part of speech tag. In order to match multiple part of speech tags, create a separate rule which matches each of them:

NOM -> n {1} | np {1};

To match a lemma or pseudolemma, place it before the part of speech tag, separated by @:

NP -> the@det n {2 _1 1};

It is also possible to match a category of lemmas:

days = sunday monday tuesday wednesday thursday friday saturday;
date -> $days@n the@det num.ord {2 _2 3 _1 1};

Tags besides part of speech can be matched as shown above.

Pattern elements can also specify values for the tags of the chunk being output by the rule.

number = (ND sg) sg pl sp ND;
NP: _.number;
NP -> n.$number adj {1};

This rule specifies that the number tag of the NP chunk should be copied from the noun. It will use the target language side if that is available. If not, it will proceed to the reference side, and then the source side. If all three of these are empty, it will use the default value <ND>. To require that a particular variable be taken from a particular side, put the side after a slash:

NP: _.number;
NP -> det.$number/ref n {1 _1 2};

/sl refers to the source language, /tl to the target language, and /ref to anything added by anaphora resolution.
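
The lookup order described above (target language, then reference, then source, then the default) can be sketched in Python as follows. The function `clip` and the dictionary layout of a node are assumptions made for this example. In particular, falling back to the default when an explicitly requested side is empty is a guess, not something the formalism states.

```python
def clip(node, attr, side=None, default=""):
    """Read `attr` from a node, trying tl, then ref, then sl unless a side is given."""
    order = (side,) if side else ("tl", "ref", "sl")
    for s in order:
        value = node.get(s, {}).get(attr, "")
        if value:
            return value
    # Assumption: the default is also used when an explicit side is empty.
    return default


# Hypothetical noun whose target-language side has no number tag.
noun = {"sl": {"number": "sg"}, "ref": {"number": "pl"}, "tl": {}}
```

Here `clip(noun, "number", default="ND")` skips the empty target side and takes `pl` from the reference side, while `clip(noun, "number", side="sl")` reads `sg` directly.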

If a pattern element is contributing several tags to the chunk, the following shortcut is available:

NP: _.number.gender;
NP -> %n adj {2 _1 1};

The % indicates the noun is the source of all chunk tags not elsewhere specified.

To specify a literal value for a chunk tag, put it in square brackets after the pattern like this:

NP: _.gender.number;
NP -> 0: NP cnjcoo NP [$gender=m, $number=pl] {1 _1 2 _2 3} |
      1: NP.f cnjcoo NP.f [$gender=f, $number=pl] {1 _1 2 _2 3} |
      2: NP.*.sg or@cnjcoo NP.*.sg [$gender=m, $number=sg] {1 _1 2 _2 3} |
      3: NP.f.sg or@cnjcoo NP.f.sg [$gender=f, $number=sg] {1 _1 2 _2 3} ;

That is, treat the gender of the phrase as masculine unless both elements are feminine, and the number as plural unless the conjunction is "or" and both elements are singular.
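
The combined effect of these four weighted rules can be written as a single Python function. This is only a sketch: it assumes that the highest-weight matching rule wins, as in the weight example earlier, and the function name `coordinate` and the dictionary representation of an NP are invented for illustration.

```python
def coordinate(conj, first, second):
    """Return the (gender, number) tags of the NP built from `first conj second`."""
    # Weights 1 and 3: feminine only if both conjuncts are feminine.
    both_f = first["gender"] == "f" and second["gender"] == "f"
    gender = "f" if both_f else "m"
    # Weights 2 and 3: singular only for "or" joining two singular NPs.
    both_sg = first["number"] == "sg" and second["number"] == "sg"
    number = "sg" if conj == "or" and both_sg else "pl"
    return gender, number
```

For example, two feminine singular NPs joined by "or" give `("f", "sg")`, while the same NPs joined by "and" give `("f", "pl")`.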

The pattern only looks at the source language, but it is possible to add constraints:

conj_list = and or;
NP: _.gender.number;
NP -> %NP cnjcoo NP ((2.lem/tl in conj_list) and ~(3.gender = 1.gender)) {1 _1 2 _2 3};

This will only match the pattern if it is also the case that the target language lemma of the conjunction is "and" or "or" and the two NPs have different genders. The outermost parentheses are required and order of operations is not guaranteed to make any sense, so please put parentheses around everything.

Outputs

Output elements are written between curly braces and may be any of the following:

Blanks

An underscore represents a single space. An underscore followed by a number represents the superblank after that position, so 1 _ 2 is elements 1 and 2 separated by a space while 1 _1 2 is elements 1 and 2 separated by whatever separated them in the input.
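
The difference between `_` and `_1` can be seen in a minimal Python sketch of the output step. The function `render` and its list-based arguments are hypothetical, and the blank string used in the usage note is a placeholder rather than real stream data.

```python
def render(output, elements, blanks):
    """Assemble an output string.

    output   -- list of tokens such as ["2", "_1", "1"]
    elements -- matched input elements, so elements[0] is element 1
    blanks   -- blanks[i - 1] is the blank that followed position i in the input
    """
    parts = []
    for item in output:
        if item == "_":
            parts.append(" ")                         # a single plain space
        elif item.startswith("_"):
            parts.append(blanks[int(item[1:]) - 1])   # blank copied from the input
        else:
            parts.append(elements[int(item) - 1])     # matched element
    return "".join(parts)
```

Given the elements `^the<det>$` and `^cat<n>$` separated in the input by ` [<i>] `, the output `2 _1 1` keeps that blank between the reordered words, while `1 _ 2` replaces it with a single space.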

Matched Elements

A number represents the input element in that position with its tags arranged according to the defined output pattern for its part of speech tag. It can be followed by a specification of where those tags should come from.

1
! the first input element

1(gender=f)
! the first input element with the gender tag <f>

1(gender=2.gender/ref)
! the first input element with the gender tag of the reference side of the second input element

1(gender=$gender)
! the first input element with the gender tag set to a placeholder to be filled on output with the gender tag of its parent chunk

These elements can also be prefixed with % to specify that as many tags as possible should be placeholders for tags of the parent chunk.

These elements can be conjoined using +:

1(gender=f) + 2

This will generate something like ^blah<n><f>+bloop<adj>$.

By default, the order of the output tags is based on the output pattern corresponding to the part of speech tag in the pattern. However, it is possible to override this using square brackets:

vblex: _.tense.person.number;
vbinf: _.<inf>;

V -> vblex.inf {1};
  ! result: ^whatever<vblex><inf><{person}><{number}>$

V -> vblex.inf {1[vbinf]};
  ! result: ^whatever<vblex><inf>$

Note that the part of speech tag of the output is in all cases the part of speech tag of the input. To avoid this behavior (for example, if you want to change the part of speech tag), write an output rule like the following:

adj: lemh.<adj>.number;

Literal Lexical Units

A new lexical unit can be inserted like this:

the@det.def.mf.sp

Placeholders can be included using $:

the@det.def.$gender.sp

And clips from other elements can be placed in square brackets:

the@det.def.[2.gender].[3.number/sl]

Output Conditionals

It is possible to have multiple output clauses with conditions for which one to use. These conditions are written the same way as those in the patterns.

mood = ind opt nec inf;
VP -> vblex {should@vaux _ 1(mood=inf)} (1.mood = opt)
            {could@vaux _ 1(mood=inf)} (1.mood = nec)
            {1} ;

Here the third option has no condition and thus functions as an elsewhere case. If multiple conditions are satisfied, the first one will be used.
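
The first-match behaviour can be sketched in Python like this. The function `choose_output` and the list `VP_OUTPUTS` are hypothetical, and the outputs are reduced to descriptive labels rather than real output clauses.

```python
def choose_output(clauses, node):
    """Return the output of the first clause whose condition holds.

    A condition of None marks the unconditional "elsewhere" clause.
    """
    for condition, output in clauses:
        if condition is None or condition(node):
            return output
    return None


# Hypothetical stand-in for the three output clauses of the VP rule above.
VP_OUTPUTS = [
    (lambda n: n.get("mood") == "opt", "should + infinitive"),
    (lambda n: n.get("mood") == "nec", "could + infinitive"),
    (None, "verb unchanged"),
]
```

A verb with mood `ind` falls through both conditions and reaches the elsewhere clause. If two conditions were true at once, the earlier clause in the list would be the one used.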

Tag Rewrite Rules

This is a way to convert certain sets of tags, either between two languages that have different sets of tenses, or between something like object agreement and number marking.

object_agr = o1sg o1pl o2sg o2pl o3sg o3pl ;
number = sg pl ;
person = p1 p2 p3 ;

object_agr > person: o1sg p1, o1pl p1, o2sg p2, o2pl p2, o3sg p3, o3pl p3 ;
object_agr > number: o1sg sg, o1pl pl, o2sg sg, o2pl pl, o3sg sg, o3pl pl ;

VP -> @v NP {2(number=1.object_agr) _1 1} ;

In this example, if the verb had <o2sg>, it would be converted to <sg> when it was set as the number attribute of the noun.

tense = farpst nearpst pst prs fut nonpst ;

tense > tense: farpst pst, nearpst pst, prs nonpst, fut nonpst ;

In this example, no explicit assignment needs to take place and the 4 tenses of the source language (farpst, nearpst, prs, fut) would be automatically converted to the 2 of the target language (pst, nonpst).

Converting from 4 to 3 with something like

tense > tense: farpst pst, nearpst pst ;

will also work, the unchanged tags not needing to be explicitly mentioned.
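
A same-category rewrite behaves like a lookup table in which missing entries map to themselves. The following Python sketch shows that idea; `make_rewrite` and the two resulting functions are hypothetical names used only for this example.

```python
def make_rewrite(pairs):
    """Build a tag-rewriting function from (old, new) pairs.

    Tags that are not listed are returned unchanged.
    """
    table = dict(pairs)
    return lambda tag: table.get(tag, tag)


# tense > tense: farpst pst, nearpst pst, prs nonpst, fut nonpst ;
tense_4_to_2 = make_rewrite(
    [("farpst", "pst"), ("nearpst", "pst"), ("prs", "nonpst"), ("fut", "nonpst")]
)
# tense > tense: farpst pst, nearpst pst ;
tense_4_to_3 = make_rewrite([("farpst", "pst"), ("nearpst", "pst")])
```

With the shorter rule, `prs` and `fut` pass through untouched, which is why they do not need to be listed.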

When an attribute category is being mapped to itself, such as in the tense example above, the replacement is always performed. As a result, if a tag appears on the left side of a change and the right side of another, the results may be incorrect. For example:

tense > tense: midpst pst, pst pri;

This rule might convert <midpst> to either <pst> or <pri> in different situations.

However, when a rule maps between different categories, as in the object agreement example, the transformation will not happen invisibly. That is, if the output contains a bare 1, a tense > tense conversion will happen, but an object_agr > number one won't. This is because the compiler does not have enough information to know which clippable attributes that node has, and thus does not know what it is converting from.

In order for this to be fully automatic, the number element in the relevant output pattern would have to compile to something which checked number and then every attribute that could map to number until it found one. While this behavior could be added if desired, I initially deemed it too complicated and simply required that in such situations the rule author has to write 1(number=1.object_agr) to trigger the object_agr > number conversion.

It is also possible to explicitly convert a value, for example when doing comparisons:

1.object_agr>number
1.object_agr>person
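
An explicit conversion such as 1.object_agr>number can be thought of as a lookup keyed by the source and target categories. The Python sketch below uses the object agreement rules from the start of this section; the table `REWRITES` and the function `convert` are hypothetical, and returning the tag unchanged when no rule applies is an assumption.

```python
# Rewrite tables keyed by (source category, target category).
REWRITES = {
    ("object_agr", "person"): {
        "o1sg": "p1", "o1pl": "p1",
        "o2sg": "p2", "o2pl": "p2",
        "o3sg": "p3", "o3pl": "p3",
    },
    ("object_agr", "number"): {
        "o1sg": "sg", "o1pl": "pl",
        "o2sg": "sg", "o2pl": "pl",
        "o3sg": "sg", "o3pl": "pl",
    },
}


def convert(tag, source, target):
    """Convert `tag` from category `source` to category `target`.

    Assumption: a tag with no applicable rule is returned unchanged.
    """
    return REWRITES.get((source, target), {}).get(tag, tag)
```

Because the caller names both categories, there is no need to guess which attribute the value came from: the same `o2sg` tag yields `sg` for number and `p2` for person.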

Interpolation (not yet implemented)

Parsing clitics, such as those discussed at User_talk:Popcorndude/Recursive_Transfer#Serbo-Croatian_clitics, can be done using multiple output units:

vbser n -> @n @vbser {2} {1} ;
NP -> @n @det {2 _1 1} ;
! should be able to handle "noun clitic determiner"

Outputting them, however, is more difficult. My current idea is to do something like this:

NP -> @det @n {2 _1 1};
VP -> NP @vbser {(_1 2)>1};

Where (_1 2)>1 means "put the space between the elements and element 2 after the first word of element 1". The corresponding syntax for a right-aligned clitic would be 1<(2 _1). New lexical units could also be put in the parentheses (even if there's only one thing being inserted, the parentheses should, I think, be mandatory for clarity).

I'm not sure whether this will cover all cases, but it should at least cover a lot of them.