N-Stage transfer
The idea of n-stage transfer is to extend apertium-interchunk so that it can "merge" chunks, for example NP CC NP → NP.
This is similar to the idea of cascaded finite-state chunking, as described by Abney (1996).
Examples
- 1. The girl with the telescope shouted at the boy who saw the dog in the field.
The current chunk-based transfer would normally chunk this into:
[The girl] [with] [the telescope] [shouted] [at] [the boy] [who] [saw] [the dog] [in] [the field]
NP         PREP   NP              V         PREP NP        REL   V     NP        PREP NP
This is quite a shallow analysis. With more stages of chunking, we could unify some of those chunks into more coherent phrases. For example, the next stage might unify PREP NP → PP, then NP PP → NP, then V NP → VP, and then NP REL VP → NP. We would end up with a more coherent, "deeper" analysis, which might look something like:
                    The girl with the telescope shouted at the boy who saw the dog in the field
                    DET NOM PREP DET NOM V PREP DET NOM REL V DET NOM PREP DET NOM
*                   NP PREP NP V PREP NP REL V NP PREP NP
(PREP NP → PP)      NP PP V PP REL V NP PP
(NP PP → NP)        NP V PP REL V NP PP
(V NP → VP)         NP V PP REL VP
(NP REL VP → NP)    NP V NP
This would not give us any more "transfer power", as the rules would still be finite-state, and non-recursive, but it would make certain tasks easier. Probably we wouldn't reach 5 levels of interchunk, but even having one more level could help a lot.
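The cascade described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how apertium-interchunk is implemented: each stage applies a single merge rule left to right, and the output of one stage is the input of the next. The names `apply_stage` and `cascade`, and the short input sequence (chunk tags for something like "[the girl] [with] [the telescope] [saw] [the dog]"), are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# A rule is (pattern of chunk tags, tag of the merged chunk).
Rule = Tuple[Tuple[str, ...], str]


def apply_stage(tags: Sequence[str], rule: Rule) -> List[str]:
    """Apply one merge rule left to right; merged output is not re-scanned."""
    pattern, merged = rule
    out: List[str] = []
    i = 0
    while i < len(tags):
        if tuple(tags[i:i + len(pattern)]) == pattern:
            out.append(merged)       # replace the matched chunks by one chunk
            i += len(pattern)
        else:
            out.append(tags[i])      # no match here: copy the chunk unchanged
            i += 1
    return out


def cascade(tags: Sequence[str], rules: Sequence[Rule]) -> List[List[str]]:
    """Run the rules as successive stages and return every intermediate level."""
    stages = [list(tags)]
    for rule in rules:
        stages.append(apply_stage(stages[-1], rule))
    return stages


# Rules taken from the text; the input tag sequence is a hypothetical example.
RULES: List[Rule] = [
    (("PREP", "NP"), "PP"),
    (("NP", "PP"), "NP"),
    (("V", "NP"), "VP"),
]

stages = cascade(["NP", "PREP", "NP", "V", "NP"], RULES)
for level in stages:
    print(" ".join(level))
```

Because every stage is a plain pattern match over a flat sequence, the cascade as a whole stays finite-state and non-recursive, which is the point made above: the extra levels add convenience, not power.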
- 2. My country's largest shopping centres
The current transfer chunks this into:
[My country] ['s] [largest shopping centres]
NP           GEN  NP
An intermediate stage of the transfer could have a rule that joins NP + GEN + NP into a single NP chunk. This avoids the considerable work of specifying, in the first stage, the many different word combinations that can form an NP.
In this example, the head of the new NP would be the second original NP. This means that the morphological information of the new chunk would be that of "largest shopping centres" (plural) and not that of "my country" (singular). This information is important so that the next stage of the transfer (the current interchunk module) can perform agreement operations:
[My country] ['s] [largest shopping centres] [will prepare] (...)
1st stage: NP<sg> GEN NP<pl> V
2nd stage: NP<pl> V
3rd stage: NP<pl> V<pl>
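The three stages above can be mimicked with a small Python sketch. It is not Apertium code: the `Chunk` class and the functions `merge_genitive` and `agree_verbs` are hypothetical names, and the agreement rule (a V copies the number of the NP immediately before it) is a simplifying assumption chosen to match the example.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Chunk:
    tag: str     # chunk category, e.g. "NP", "GEN", "V"
    number: str  # "sg", "pl", or "" when not yet determined
    text: str    # surface words, kept only for readability


def merge_genitive(chunks: List[Chunk]) -> List[Chunk]:
    """Intermediate stage: NP GEN NP -> NP, headed by the second NP."""
    out: List[Chunk] = []
    i = 0
    while i < len(chunks):
        window = chunks[i:i + 3]
        if [c.tag for c in window] == ["NP", "GEN", "NP"]:
            possessor, gen, head = window
            # The merged chunk inherits the number of the head (second NP).
            text = f"{possessor.text}{gen.text} {head.text}"
            out.append(Chunk("NP", head.number, text))
            i += 3
        else:
            out.append(chunks[i])
            i += 1
    return out


def agree_verbs(chunks: List[Chunk]) -> List[Chunk]:
    """Final stage: a V copies the number of the NP immediately before it."""
    out: List[Chunk] = []
    for chunk in chunks:
        if chunk.tag == "V" and out and out[-1].tag == "NP":
            chunk = replace(chunk, number=out[-1].number)
        out.append(chunk)
    return out


stage1 = [
    Chunk("NP", "sg", "My country"),
    Chunk("GEN", "", "'s"),
    Chunk("NP", "pl", "largest shopping centres"),
    Chunk("V", "", "will prepare"),
]
stage2 = merge_genitive(stage1)
stage3 = agree_verbs(stage2)
for name, stage in (("1st", stage1), ("2nd", stage2), ("3rd", stage3)):
    print(name, "stage:", " ".join(f"{c.tag}<{c.number}>" for c in stage))
```

If the merged chunk took its number from the first NP instead, the verb would wrongly end up singular, which is why the choice of head matters for the later stage.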
Test implementation
From: Jacob Nordfalk <jacob.nordfalk@gmail.com>
To: Apertium-stuff <apertium-stuff@lists.sourceforge.net>
Date: 12 Apr 2009, 02:51
Subject: Experimental 4-stage transfer introduced in Apertium!

Dear all,

I took the liberty of adding support for simple n-stage transfer, by adding 5 simple lines of code to Apertium today. The lines I added to Apertium can be seen here:
http://apertium.svn.sourceforge.net/viewvc/apertium?view=rev&revision=9616

Basically I added support for a new part:
<clip pos="3" part="x_pgcontent"/>
which gives the CONTENT INSIDE a chunk. So from
^adj_nom<SN><nom>{^granda<adj><sg><2>$ ^hundo<n><m><sg><2>$}$
it gets
^granda<adj><sg><2>$ ^hundo<n><m><sg><2>$

This can be used to merge chunks. I have put an example here:
http://apertium.svn.sourceforge.net/viewvc/apertium?view=rev&revision=9613

Try
big black cat's nice blue eyes
belaj bluaj okuloj de granda nigra kato

<SN>{granda nigra kato} <GEN>{de} <SN>{belaj bluaj okuloj} -> <SN>{belaj bluaj okuloj de granda nigra kato}

Of course the name part="x_pgcontent" is temporary (eXperimental) and shouldn't be counted on, but I will ask you to please leave it in Apertium until someone makes some kind of improved ('proper') n-stage transfer support. As Esperanto's grammar is simple, this very simple n-stage transfer support is satisfactory for most tasks (and can really make a big difference in this language pair).

In the meantime I will use this to improve the English-Esperanto pair, and gain experience with n-stage transfer to share with the Apertium community (to try it you must svn up apertium and install with the patch, of course).

I have already identified the following problems:
- Case handling (Big black cat's nice blue eyes -> belaj bluaj okuloj de Granda nigra kato)
- Tag reference handling (an option to unpack <2>'s and <3>'s). This should not always happen, as it's sometimes good to keep the <2>'s and <3>'s and sometimes it's not.

I can give some examples of this on request.

Jacob
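To make the effect of part="x_pgcontent" concrete, here is a small Python illustration of what the e-mail describes: taking the content inside a chunk and reusing it to build a merged chunk. It is not the C++ change made in Apertium. The function names `chunk_content` and `merge_chunks` and the chunks used for the merge are hypothetical, and the sketch assumes the chunk text contains no escaped braces.

```python
from typing import List


def chunk_content(chunk: str) -> str:
    """Return the text between the outermost '{' and '}' of a chunk."""
    start = chunk.index("{")
    end = chunk.rindex("}")
    return chunk[start + 1:end]


def merge_chunks(head: str, chunks: List[str]) -> str:
    """Build one chunk with the given name and tags around the joined contents."""
    body = " ".join(chunk_content(c) for c in chunks)
    return "^" + head + "{" + body + "}$"


# The chunk quoted in the e-mail.
example = "^adj_nom<SN><nom>{^granda<adj><sg><2>$ ^hundo<n><m><sg><2>$}$"
print(chunk_content(example))

# Hypothetical chunks, only to show the merge step.
merged = merge_chunks("nom<SN>", ["^nom<SN>{^kato<n><sg>$}$", "^gen<GEN>{^de<pr>$}$"])
print(merged)
```

The sketch copies the inner lexical units verbatim, so tag references such as <2> are carried over unchanged, which is exactly the "tag reference handling" problem listed in the e-mail.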
References
- Steven Abney. (1996) "Partial Parsing via Finite-State Cascades". J. of Natural Language Engineering, 2(4): 337-344.