Apertium-recursive

From Apertium
Revision as of 20:20, 31 July 2019 by Popcorndude (add links)

Apertium-recursive is an alternative to apertium-transfer, apertium-interchunk, and apertium-postchunk. It uses a GLR parser rather than chunking and so can apply rules recursively. Rules can be written in a format almost identical to that of apertium-transfer or in a somewhat Yacc-like format created for this purpose.

Installing

Download the source from https://github.com/apertium/apertium-recursive and build it:

./autogen.sh
make
make install

Incorporating Into a Pair

The following instructions are for the Yacc-like syntax. To use XML, replace references to rtx-comp with trx-comp.

Makefile.am

Add $(PREFIX1).rtx.bin and $(PREFIX2).rtx.bin to TARGETS_COMMON, then add the following rules:

$(PREFIX1).rtx.bin: $(BASENAME).$(PREFIX1).rtx
	rtx-comp $< $@

$(PREFIX2).rtx.bin: $(BASENAME).$(PREFIX2).rtx
	rtx-comp $< $@

modes.xml

Replace

     <program name="apertium-transfer -b">
       <file name="apertium-eng-kir.eng-kir.t1x"/>
       <file name="eng-kir.t1x.bin"/>
     </program>
     <program name="apertium-interchunk">
       <file name="apertium-eng-kir.eng-kir.t2x"/>
       <file name="eng-kir.t2x.bin"/>
     </program>
     <program name="apertium-postchunk">
       <file name="apertium-eng-kir.eng-kir.t3x"/>
       <file name="eng-kir.t3x.bin"/>
     </program>

with

     <program name="rtx-proc">
       <file name="eng-kir.rtx.bin"/>
     </program>

If the pair uses apertium-anaphora, use rtx-proc -a rather than rtx-proc.
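
As a sketch of the anaphora case, the replacement entry might look like the following. It assumes the same eng-kir file name as the example above and shows only the rtx-proc program element; where apertium-anaphora itself sits in the pipeline is not shown.

```xml
<!-- Assumed: same compiled rules file as above; -a is the only change. -->
<program name="rtx-proc -a">
  <file name="eng-kir.rtx.bin"/>
</program>
```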

configure.ac

Add checks for rtx-comp and rtx-proc:

AC_PATH_PROG([RTXCOMP], [rtx-comp], [false], [$PATH$PATH_SEPARATOR$with_rtx_comp/bin])
AS_IF([test x$RTXCOMP = xfalse], [AC_MSG_ERROR([You don't have rtx-comp installed])])

AC_PATH_PROG([RTXPROC], [rtx-proc], [false], [$PATH$PATH_SEPARATOR$with_rtx_proc/bin])
AS_IF([test x$RTXPROC = xfalse], [AC_MSG_ERROR([You don't have rtx-proc installed])])

Differences From .t*x

The following aspects of standard transfer files are unsupported or have potentially unpredictable results and should be avoided.

Chunks With the Same Tags as LUs

A number of pairs have rules that match punctuation and reset variables. Unfortunately, these rules often produce chunks with the same part-of-speech tag as the input (e.g. they match <sent> and output <sent>). This can lead to infinite recursion, since the rule can match its own output.

trx-comp allows such rules, but they should be avoided.
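
For illustration, a rule of the problematic kind might look roughly like the sketch below. The category name sent, the variable name number, and the chunk name punct are hypothetical and would need matching definitions elsewhere in the file; the point is only that the chunk carries the same tag that the pattern matches.

```xml
<!-- Hypothetical rule: resets a variable at sentence-final punctuation. -->
<rule comment="reset variables at end of sentence">
  <pattern>
    <pattern-item n="sent"/>   <!-- assumed category matching <sent> -->
  </pattern>
  <action>
    <let><var n="number"/><lit v=""/></let>   <!-- assumed variable -->
    <out>
      <chunk name="punct">
        <tags>
          <!-- Same tag as the input LU, so this rule can match its own output. -->
          <tag><lit-tag v="sent"/></tag>
        </tags>
        <lu><clip pos="1" side="tl" part="whole"/></lu>
      </chunk>
    </out>
  </action>
</rule>
```

Giving the output chunk a tag that the rule's own pattern does not match should avoid the loop.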

Literal Chunks

In interchunk, new chunks are sometimes inserted like this:

<chunk>
 <lit v="det"/>
 <lit-tag v="DET.def"/>
 <lit v="{^the"/>
 <lit-tag v="det.def.mf.sp"/>
 <lit v="$}"/>
</chunk>

trx-comp attempts to handle this, but results are not guaranteed and the curly braces may show up in the output. Instead, write the above as:

<chunk name="det">
 <tags>
  <tag><lit-tag v="DET"/></tag>
  <tag><lit-tag v="def"/></tag>
 </tags>
 <lit v="the"/>
 <lit-tag v="det.def.mf.sp"/>
</chunk>

or even

<lu>
 <lit v="the"/>
 <lit-tag v="det.def.mf.sp"/>
</lu>

depending on what you're doing with it. (Note that trx-comp doesn't check the syntactic distinctions between .t1x and .t2x files.)

Chunks Not Containing Blanks

If the contents of a chunk are not alternating LUs/chunks and blanks, postchunk rules may not be able to handle them properly. So if there is a postchunk rule involved, always put

<chunk> <lu>...</lu> <b/> <lu>...</lu> </chunk>

rather than

<chunk> <lu>...</lu> <lu>...</lu> </chunk>

even if you don't intend to output the blank.

Similarly, if the order of elements inside an <out> isn't "LU/chunk blank LU/chunk ... blank LU/chunk", later rules may fail to match properly.
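
A minimal sketch of an <out> that follows this ordering is shown below. The positions 1 and 2 assume a two-item pattern, and outputting bare LUs rather than chunks is only for brevity.

```xml
<out>
  <lu><clip pos="1" side="tl" part="whole"/></lu>
  <b pos="1"/>   <!-- blank between the first and second items -->
  <lu><clip pos="2" side="tl" part="whole"/></lu>
</out>
```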

See Also