Apertium-recursive

From Apertium
Latest revision as of 06:13, 1 June 2023

Apertium-recursive is an alternative to apertium-transfer, apertium-interchunk, and apertium-postchunk. It uses a GLR parser rather than chunking and so can apply rules recursively. Rules can be written in a format almost identical to that of apertium-transfer or in a somewhat Yacc-like format created for this purpose.

Installing

Download from https://github.com/apertium/apertium-recursive

./autogen.sh
make
make install

Incorporating Into a Pair

The following applies regardless of whether the Yacc-like syntax or the XML format is used.

Note that all of these changes can be made automatically with the --rebuild option of Apertium-init.

Makefile.am

Add $(PREFIX1).rtx.bin and $(PREFIX2).rtx.bin to TARGETS_COMMON.

If you have the latest version of Apertium installed, the following recipes will be included by @ap_include@ and you don't need to add them.

$(PREFIX1).rtx.bin: $(BASENAME).$(PREFIX1).rtx
	rtx-comp $< $@

$(PREFIX2).rtx.bin: $(BASENAME).$(PREFIX2).rtx
	rtx-comp $< $@

modes.xml

Replace

     <program name="apertium-transfer -b">
       <file name="apertium-eng-kir.eng-kir.t1x"/>
       <file name="eng-kir.t1x.bin"/>
     </program>
     <program name="apertium-interchunk">
       <file name="apertium-eng-kir.eng-kir.t2x"/>
       <file name="eng-kir.t2x.bin"/>
     </program>
     <program name="apertium-postchunk">
       <file name="apertium-eng-kir.eng-kir.t3x"/>
       <file name="eng-kir.t3x.bin"/>
     </program>

with

     <program name="rtx-proc">
       <file name="eng-kir.rtx.bin"/>
     </program>

If the pair uses apertium-anaphora, use rtx-proc -a rather than rtx-proc.

configure.ac

Add a dependency check for apertium-recursive:

PKG_CHECK_MODULES(APERTIUM_RECURSIVE, apertium-recursive >= 1.0.0)

Differences From .t*x

weight and firstChunk

Rules can have a weight attribute which corresponds to the weights in .rtx files.

In order to generate lookahead paths for the pattern transducer, the compiler needs to be able to determine what type of chunk each rule will output. It determines this by looking at the firstChunk attribute, which contains the part of speech tag of the first chunk output by the rule. If the rule can output more than one type of chunk, separate the tags with spaces:

<rule c="noun phrase" firstChunk="NP">
<rule c="sentence" firstChunk="S">
<rule c="sentence or relative clause" firstChunk="S DP">

If firstChunk is not specified, the rule will be treated as if it can generate any chunk, which may lead to significant slowdowns in rtx-proc.
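Since a missing firstChunk only shows up as a slowdown, it can help to list the rules that lack it. The following is a minimal sketch, not part of apertium-recursive: it assumes the XML file uses `<rule>` elements with the `c` and `firstChunk` attributes shown above, and the function name `rules_missing_first_chunk` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def rules_missing_first_chunk(xml_text):
    """Return the comment (c attribute) of every rule without a firstChunk value."""
    root = ET.fromstring(xml_text)
    missing = []
    for index, rule in enumerate(root.iter("rule"), start=1):
        # An absent or empty firstChunk means the compiler must assume
        # the rule can output any chunk.
        if not rule.get("firstChunk", "").strip():
            missing.append(rule.get("c") or f"rule #{index}")
    return missing


# Hypothetical sample; real rules have non-empty patterns and actions.
SAMPLE_RULES = """
<transfer>
  <section-rules>
    <rule c="noun phrase" firstChunk="NP"><pattern/><action/></rule>
    <rule c="sentence or relative clause" firstChunk="S DP"><pattern/><action/></rule>
    <rule c="punctuation"><pattern/><action/></rule>
  </section-rules>
</transfer>
"""

if __name__ == "__main__":
    for name in rules_missing_first_chunk(SAMPLE_RULES):
        print("no firstChunk:", name)
```

To check a real file, read its contents and pass the text to the function.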

Chunk syntax

Chunks have the following structure:

<chunk>
  <source> [lemma] [tags] </source>
  <target> [lemma] [tags] </target>
  <reference> [lemma] [tags] </reference>
  <contents> [chunks, LUs, and blanks] </contents>
</chunk>

<source> and <reference> are optional and can be used to pass extra information to subsequent rules (which can access them with <clip side="sl" .../> as usual).

Postchunk rules

Rather than being conditioned on chunk names, output-time (postchunk-like) rules are written in <output-action> blocks following the <action> blocks in the rules they correspond to.

The following aspects of standard transfer files are unsupported or have potentially unpredictable results and should be avoided:

Chunks With the Same Tags as LUs

A number of pairs have rules that match punctuation and reset variables. Unfortunately, these rules often produce chunks with the same part of speech tag (e.g. match <sent> and output <sent>), which can lead to infinite recursion since the rule can match its own output.

With trx-comp this is possible but should be avoided.
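A rough way to spot rules at risk is to compare what a rule matches with what it declares as output. The sketch below is a heuristic under several assumptions: categories are declared with `<def-cat>` and `<cat-item tags="...">` as in standard transfer files, the output tag is taken from the firstChunk attribute, only the first tag of each cat-item is compared, and only single-item patterns (the simplest case) are examined. The function name and sample are hypothetical.

```python
import xml.etree.ElementTree as ET


def self_matching_rules(xml_text):
    """List single-item rules whose declared output tag is also a tag they match."""
    root = ET.fromstring(xml_text)

    # Map each category name to the set of first tags its cat-items accept.
    categories = {}
    for def_cat in root.iter("def-cat"):
        first_tags = set()
        for item in def_cat.iter("cat-item"):
            first = item.get("tags", "").split(".")[0]
            if first:
                first_tags.add(first)
        categories[def_cat.get("n")] = first_tags

    flagged = []
    for rule in root.iter("rule"):
        items = rule.findall("./pattern/pattern-item")
        if len(items) != 1:
            continue  # this sketch only checks the single-item case
        matched = categories.get(items[0].get("n"), set())
        produced = set(rule.get("firstChunk", "").split())
        if matched & produced:
            flagged.append(rule.get("c", "(unnamed rule)"))
    return flagged


# Hypothetical sample with one punctuation rule that outputs what it matches.
SAMPLE_TRANSFER = """
<transfer>
  <section-def-cats>
    <def-cat n="sent"><cat-item tags="sent"/></def-cat>
    <def-cat n="nom"><cat-item tags="n.*"/></def-cat>
  </section-def-cats>
  <section-rules>
    <rule c="reset at punctuation" firstChunk="sent">
      <pattern><pattern-item n="sent"/></pattern><action/>
    </rule>
    <rule c="noun phrase" firstChunk="NP">
      <pattern><pattern-item n="nom"/></pattern><action/>
    </rule>
  </section-rules>
</transfer>
"""

if __name__ == "__main__":
    print(self_matching_rules(SAMPLE_TRANSFER))
```

A flagged rule is only a candidate for review; giving its output chunk a distinct tag is one way to keep it from matching itself.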

Chunks Not Containing Blanks

If the contents of a chunk are not alternating LUs/chunks and blanks, postchunk rules may not be able to handle them properly. So if there is a postchunk rule involved, always put

<chunk> <lu>...</lu> <b/> <lu>...</lu> </chunk>

rather than

<chunk> <lu>...</lu> <lu>...</lu> </chunk>

even if you don't intend to output the blank.

Similarly, if the order of elements inside an <out> isn't "LU/chunk blank LU/chunk ... blank LU/chunk", later rules may fail to match properly.
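The alternation requirement can be checked mechanically. This is a minimal sketch that assumes blanks are written as `<b/>` elements, as in standard transfer files, and that LUs and chunks appear as `<lu>` and `<chunk>` children of the element being checked. The function name `alternates_with_blanks` is hypothetical.

```python
import xml.etree.ElementTree as ET

UNIT_TAGS = {"lu", "chunk"}  # elements that count as "LU/chunk"
BLANK_TAGS = {"b"}           # assumption: blanks are written as <b/>


def alternates_with_blanks(element):
    """True if the children are exactly: unit, blank, unit, ..., blank, unit.

    An element with no children is reported as False.
    """
    kinds = []
    for child in element:
        if child.tag in UNIT_TAGS:
            kinds.append("unit")
        elif child.tag in BLANK_TAGS:
            kinds.append("blank")
        else:
            kinds.append("other")
    # A valid sequence starts and ends with a unit, so its length is odd.
    if len(kinds) % 2 == 0:
        return False
    return all(
        kind == ("unit" if position % 2 == 0 else "blank")
        for position, kind in enumerate(kinds)
    )


if __name__ == "__main__":
    good = ET.fromstring("<chunk><lu/><b/><lu/></chunk>")
    bad = ET.fromstring("<chunk><lu/><lu/></chunk>")
    print(alternates_with_blanks(good), alternates_with_blanks(bad))
```

The same function can be applied to an `<out>` element, since it only inspects the direct children of whatever element it is given.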

Lexicalized Weights

rtx-comp and trx-comp can both accept files of lexically specific weights for rules in the following format:

rule_name	weight	pattern
rule_name	weight	pattern
...

The three columns are separated by tabs. The rule name matches one of the rules in the .rtx file or the id attribute of one of the rules in the XML file, the weight is a positive floating point number, and the pattern is the same length as the pattern of the corresponding rule. Higher weights are preferred and the default weight is 0.

The patterns are of the form "lemma@tags lemma@tags ..." where the terms are separated by spaces and the syntax of the components follows the same rules as elsewhere. For example, "ba*@n.*" will match "bag<n><m><sg>" and "bagel<n>". (Note that rtx-comp allows * to match 0 tags.)
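To make the matching behaviour concrete, here is a simplified sketch of how a "lemma@tags" term could be compared against a lexical unit. It is an illustration under stated assumptions, not the matcher used by rtx-comp: `*` in the lemma is treated as "any run of characters", `*` in the tag list as "zero or more tags" (the rtx-comp behaviour noted above), and every function name is hypothetical.

```python
import re

# A lexical unit is a lemma followed by zero or more <tag> groups.
LU_RE = re.compile(r"^([^<]*)((?:<[^<>]+>)*)$")


def parse_lu(lu):
    """Split 'bag<n><m><sg>' into ('bag', ['n', 'm', 'sg'])."""
    match = LU_RE.match(lu)
    if match is None:
        raise ValueError(f"not a lexical unit: {lu!r}")
    return match.group(1), re.findall(r"<([^<>]+)>", match.group(2))


def tags_match(pattern, tags):
    """Match a list of tag patterns against a list of tags; '*' spans 0+ tags."""
    if not pattern:
        return not tags
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        return any(tags_match(rest, tags[i:]) for i in range(len(tags) + 1))
    return bool(tags) and tags[0] == head and tags_match(rest, tags[1:])


def term_matches(term, lu):
    """Match one 'lemma@tags' term against one lexical unit."""
    lemma_pattern, _, tag_pattern = term.partition("@")
    lemma, tags = parse_lu(lu)
    # Turn the lemma pattern into a regex where '*' matches any characters.
    lemma_regex = ".*".join(re.escape(part) for part in lemma_pattern.split("*"))
    if re.fullmatch(lemma_regex, lemma) is None:
        return False
    return tags_match(tag_pattern.split(".") if tag_pattern else [], tags)


def pattern_matches(pattern, lus):
    """Match a space-separated pattern against a list of lexical units."""
    terms = pattern.split()
    return len(terms) == len(lus) and all(
        term_matches(term, lu) for term, lu in zip(terms, lus)
    )


if __name__ == "__main__":
    print(term_matches("ba*@n.*", "bag<n><m><sg>"))
    print(term_matches("ba*@n.*", "bagel<n>"))
```

Both calls in the usage section correspond to the two matches described above; a unit such as "cat<n>" would be rejected because its lemma does not start with "ba".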

Note that the patterns given are added to the transducer unmodified, so if they are less specific than the original rules, the results may be incorrect.
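Because the patterns are used as given, it may be worth validating a lexicalization file before compiling. The sketch below checks only what is described above: three tab-separated columns, a positive weight, and (optionally) a pattern of the same length as the rule's pattern. The function name, the `rule_lengths` argument, and the decision to skip blank lines are assumptions made for this illustration.

```python
def parse_lexicalizations(text, rule_lengths=None):
    """Parse 'rule_name<TAB>weight<TAB>pattern' lines into tuples.

    rule_lengths, if given, maps rule names to the number of items in the
    rule's pattern so that mismatched lexicalizations can be rejected.
    """
    entries = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        columns = line.split("\t")
        if len(columns) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 tab-separated columns, "
                f"got {len(columns)}"
            )
        rule_name, weight_text, pattern = columns
        weight = float(weight_text)  # raises ValueError if not a number
        if weight <= 0:
            raise ValueError(f"line {line_number}: weight must be positive")
        terms = pattern.split()
        if rule_lengths is not None:
            expected = rule_lengths.get(rule_name)
            if expected is None:
                raise ValueError(f"line {line_number}: unknown rule {rule_name!r}")
            if len(terms) != expected:
                raise ValueError(
                    f"line {line_number}: pattern has {len(terms)} terms, "
                    f"rule {rule_name!r} expects {expected}"
                )
        entries.append((rule_name, weight, terms))
    return entries


SAMPLE_LEX = "gen\t1.0\ther*@n.* de@pr vec*@n.*\nof\t1.0\tconstitución@n.* de@pr 1812@n.*\n"

if __name__ == "__main__":
    for entry in parse_lexicalizations(SAMPLE_LEX, {"gen": 3, "of": 3}):
        print(entry)
```

The check cannot tell whether a pattern is less specific than the original rule; that still has to be verified by hand.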

To incorporate lexicalizations, compile the ruleset with one of:

rtx-comp -l lex_file rtx_file bin_file
trx-comp -l lex_file xml_file xml_post_file bin_file

Example

Rules

NP -> "gen" n de@pr n { 3 + 's@gen _ 1 } |
      "of"  n de@pr n { 1 _ 2 _ 3 } |
      "n-n" n de@pr n { 3 _ 1 } ;

Lexicalizations

n-n	1.0	memoria@n.* de@pr traducción@n.*
gen	1.0	her*@n.* de@pr vec*@n.*
of	1.0	constitución@n.* de@pr 1812@n.*

Further Documentation

See Also