Apertium-recursive

Revision as of 20:46, 31 July 2019 by Popcorndude (talk | contribs) (add lexicalization)

Apertium-recursive is an alternative to apertium-transfer, apertium-interchunk, and apertium-postchunk. It uses a GLR parser rather than chunking and so can apply rules recursively. Rules can be written in a format almost identical to that of apertium-transfer or in a somewhat Yacc-like format created for this purpose.

Installing

Download from https://github.com/apertium/apertium-recursive

./autogen.sh
make
make install

Incorporating Into a Pair

The following instructions are for the Yacc-like syntax. To use XML, replace references to rtx-comp with trx-comp.

Makefile.am

Add $(PREFIX1).rtx.bin and $(PREFIX2).rtx.bin to TARGETS_COMMON.

$(PREFIX1).rtx.bin: $(BASENAME).$(PREFIX1).rtx
	rtx-comp $< $@

$(PREFIX2).rtx.bin: $(BASENAME).$(PREFIX2).rtx
	rtx-comp $< $@

modes.xml

Replace

     <program name="apertium-transfer -b">
       <file name="apertium-eng-kir.eng-kir.t1x"/>
       <file name="eng-kir.t1x.bin"/>
     </program>
     <program name="apertium-interchunk">
       <file name="apertium-eng-kir.eng-kir.t2x"/>
       <file name="eng-kir.t2x.bin"/>
     </program>
     <program name="apertium-postchunk">
       <file name="apertium-eng-kir.eng-kir.t3x"/>
       <file name="eng-kir.t3x.bin"/>
     </program>

with

     <program name="rtx-proc">
       <file name="eng-kir.rtx.bin"/>
     </program>

If the pair uses apertium-anaphora, use rtx-proc -a rather than rtx-proc.
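
The replacement above can also be scripted. The following is a minimal Python sketch and not part of apertium-recursive: the function name use_rtx_proc and its parameters are hypothetical, and it assumes the <program> elements are direct children of the element passed in (normally the <pipeline> of a mode).

```python
import xml.etree.ElementTree as ET

# The three chunking stages that rtx-proc replaces.
OLD_PROGRAMS = ("apertium-transfer", "apertium-interchunk", "apertium-postchunk")


def use_rtx_proc(pipeline: ET.Element, rtx_bin: str, anaphora: bool = False) -> None:
    """Replace the chunking transfer stages under `pipeline`, in place.

    Hypothetical helper: `pipeline` is any element whose direct children
    are <program> elements, `rtx_bin` is the compiled rule file name.
    """
    insert_at = None
    for index, program in enumerate(list(pipeline)):
        # The name attribute may carry options, e.g. "apertium-transfer -b".
        words = program.get("name", "").split()
        if program.tag == "program" and words and words[0] in OLD_PROGRAMS:
            if insert_at is None:
                insert_at = index  # position of the first stage removed
            pipeline.remove(program)
    if insert_at is None:
        return  # nothing to replace
    name = "rtx-proc -a" if anaphora else "rtx-proc"
    new_program = ET.Element("program", {"name": name})
    ET.SubElement(new_program, "file", {"name": rtx_bin})
    pipeline.insert(insert_at, new_program)
```

Pass anaphora=True for pairs that use apertium-anaphora.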

configure.ac

AC_PATH_PROG([RTXCOMP], [rtx-comp], [false], [$PATH$PATH_SEPARATOR$with_rtx_comp/bin])
AS_IF([test x$RTXCOMP = xfalse], [AC_MSG_ERROR([You don't have rtx-comp installed])])

AC_PATH_PROG([RTXPROC], [rtx-proc], [false], [$PATH$PATH_SEPARATOR$with_rtx_proc/bin])
AS_IF([test x$RTXPROC = xfalse], [AC_MSG_ERROR([You don't have rtx-proc installed])])

Differences From .t*x

The following aspects of standard transfer files are unsupported or have potentially unpredictable results and should be avoided.

Chunks With the Same Tags as LUs

A number of pairs have rules that match punctuation and reset variables. Unfortunately, these rules often produce chunks with the same part of speech tag (e.g. match <sent> and output <sent>), which can lead to infinite recursion since the rule can match its own output.

With trx-comp this is possible but should be avoided.
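
To see why such a rule never stops applying, consider the toy model below. It is only an illustration under strong assumptions (a single rule, a single node) and says nothing about how the GLR parser is actually implemented; the function name and the distinct output tag SENT are hypothetical.

```python
def count_applications(out_tag: str, max_steps: int = 10) -> int:
    """Count how often a rule matching <sent> applies in a toy rewrite loop.

    The rule matches a node tagged "sent" and replaces it with a chunk
    tagged `out_tag`. `max_steps` is a safety cap for the demonstration.
    """
    top_tag = "sent"  # the punctuation LU read from the input
    steps = 0
    while top_tag == "sent" and steps < max_steps:
        top_tag = out_tag  # the rule's output replaces what it matched
        steps += 1
    return steps
```

With out_tag set to "sent" the loop only ends because of the cap; with a different tag the rule applies once and can no longer match its own output.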

Literal Chunks

In interchunk, new chunks are sometimes inserted like this:

<chunk>
 <lit v="det"/>
 <lit-tag v="DET.def"/>
 <lit v="{^the"/>
 <lit-tag v="det.def.mf.sp"/>
 <lit v="$}"/>
</chunk>

trx-comp will make some effort to deal with this, but results are not guaranteed and the curly braces may show up in the output. Instead write the above as:

<chunk name="det">
 <tags>
  <tag><lit-tag v="DET"/></tag>
  <tag><lit-tag v="def"/></tag>
 </tags>
 <lit v="the"/>
 <lit-tag v="det.def.mf.sp"/>
</chunk>

or even

<lu>
 <lit v="the"/>
 <lit-tag v="det.def.mf.sp"/>
</lu>

depending on what you're doing with it. (Note that trx-comp doesn't check the syntactic distinctions between .t1x and .t2x files.)

Chunks Not Containing Blanks

If the contents of a chunk are not alternating LUs/chunks and blanks, postchunk rules may not be able to handle them properly. So if there is a postchunk rule involved, always put

<chunk> <lu>...</lu> <b/> <lu>...</lu> </chunk>

rather than

<chunk> <lu>...</lu> <lu>...</lu> </chunk>

even if you don't intend to output the blank.

Similarly, if the order of elements inside an <out> isn't "LU/chunk blank LU/chunk ... blank LU/chunk", later rules may fail to match properly.
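
The alternation requirement can be checked mechanically. The following Python sketch is a hypothetical helper, not part of trx-comp: it assumes that <b> elements are the blanks, that a leading <tags> element does not count, and that every other child (<lu>, <mlu>, <chunk>, ...) counts as an LU/chunk.

```python
import xml.etree.ElementTree as ET


def blanks_alternate(container: ET.Element) -> bool:
    """Return True if the children read item, blank, item, ..., item."""
    kinds = [
        "blank" if child.tag == "b" else "item"
        for child in container
        if child.tag != "tags"  # chunk tags are not part of the contents
    ]
    if len(kinds) % 2 == 0:
        # An even count (including zero) cannot start and end with an item.
        return False
    return all(
        kind == ("item" if position % 2 == 0 else "blank")
        for position, kind in enumerate(kinds)
    )
```

For example, blanks_alternate(ET.fromstring("<chunk><lu/><b/><lu/></chunk>")) is true, while the same chunk without the <b/> fails the check.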

Lexicalized Weights

rtx-comp can accept a file of lexically specific weights for rules in the following format:

rule_name	weight	pattern
rule_name	weight	pattern
...

The three columns are separated by tabs. The rule name must match one of the rules in the .rtx file, the weight must be a positive floating point number, and the pattern must be the same length as the pattern of the corresponding rule.

The patterns are of the form "lemma@tags lemma@tags ..." where the terms are separated by spaces and the syntax of the components follows the same rules as elsewhere; that is, "ba*@n.*" will match "bag<n><m><sg>" and "bagel<n>". (Note that rtx-comp allows * to match 0 tags.)
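
The matching behaviour described above can be approximated in a few lines of Python. This is a sketch for experimenting with patterns, not the matcher rtx-comp builds: it assumes * in a lemma matches any run of characters and * in the tag list matches zero or more whole tags, and the function names are hypothetical.

```python
import re


def term_to_regex(term: str) -> "re.Pattern[str]":
    """Translate one "lemma@tags" term into a regex over "lemma<tag><tag>"."""
    lemma, separator, tags = term.partition("@")
    if not separator:
        raise ValueError(f"expected lemma@tags, got {term!r}")
    # "*" in the lemma matches any characters up to the first tag.
    lemma_regex = "[^<]*".join(re.escape(part) for part in lemma.split("*"))
    tag_regex = "".join(
        "(?:<[^<>]+>)*" if tag == "*" else "<" + re.escape(tag) + ">"
        for tag in tags.split(".")
    )
    return re.compile(lemma_regex + tag_regex)


def pattern_matches(pattern: str, units: list) -> bool:
    """Check a space-separated pattern against a list of lexical units."""
    terms = pattern.split()
    return len(terms) == len(units) and all(
        term_to_regex(term).fullmatch(unit) for term, unit in zip(terms, units)
    )
```

Under these assumptions, "ba*@n.*" accepts both "bag<n><m><sg>" and "bagel<n>" but rejects "bag<vblex>".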

Note that the patterns given are added to the transducer unmodified, so if they are less specific than the original rules, the results may be incorrect.
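
Because the patterns are not checked against the rules, it can help to validate a weights file before compiling. The following Python sketch applies the format constraints listed above; it is a hypothetical helper, not something rtx-comp provides. The rule_lengths argument (rule name to pattern length) is assumed to be supplied by hand, and skipping blank lines is also an assumption.

```python
import math


def parse_lexicalizations(lines, rule_lengths):
    """Parse "rule<TAB>weight<TAB>pattern" lines into (rule, weight, terms)."""
    entries = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        columns = line.split("\t")
        if len(columns) != 3:
            raise ValueError(f"line {number}: expected 3 tab-separated columns")
        rule, weight_text, pattern = columns
        if rule not in rule_lengths:
            raise ValueError(f"line {number}: unknown rule {rule!r}")
        weight = float(weight_text)  # raises ValueError if not a number
        if not (math.isfinite(weight) and weight > 0):
            raise ValueError(f"line {number}: weight must be positive")
        terms = pattern.split()
        if len(terms) != rule_lengths[rule]:
            raise ValueError(
                f"line {number}: pattern has {len(terms)} terms, "
                f"rule {rule!r} has {rule_lengths[rule]}"
            )
        entries.append((rule, weight, terms))
    return entries
```

A typical call would be parse_lexicalizations(open(path, encoding="utf-8"), {"gen": 3, "of": 3, "n-n": 3}), where path is the weights file.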

To incorporate lexicalizations, compile the ruleset with:

rtx-comp -l lex_file rtx_file bin_file

Example

Rules

NP -> "gen" n de@pr n { 3 + 's@gen _ 1 } |
      "of"  n de@pr n { 1 _ 2 _ 3 } |
      "n-n" n de@pr n { 3 _ 1 } ;

Lexicalizations

n-n	1.0	memoria@n.* de@pr traducción@n.*
gen	1.0	her*@n.* de@pr vec*@n.*
of	1.0	constitución@n.* de@pr 1812@n.*

See Also