Cascaded Interchunk

From Apertium
Revision as of 18:44, 15 January 2012 by BrendenD14 (talk | contribs)

Under Construction

Cascaded interchunk is a variant of interchunk transfer (see "advanced transfer" in the official documentation). Ordinary interchunk transfer runs in 3 stages; cascaded interchunk runs in more than 3.

Chunking

Chunking is based on source language patterns. It is used in all language pairs.

  • First, words are grouped into chunks.
  • The chunks are formed by reading lexical forms left to right and building the longest sequence that matches one of the patterns in the rule file.
  • Each chunk is then given a ‘pseudo lemma’ and a tag naming its type, which can be any kind of phrase, such as ‘SN’ (noun phrase) or ‘SV’ (verb phrase).

Chunks for an English phrase may look like:

SN (The dog)    SV (liked)    SN (the boy)

"The dog" and "the boy" are both noun phrases, so each is chunked as SN: there is a rule whose pattern matches the corresponding sequence of lexical forms.

"liked" is a verb, so it is chunked as a verb phrase (SV) rather than as a noun phrase.
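
The left-to-right, longest-match grouping described above can be sketched in a few lines of Python. This is only an illustration, not Apertium's chunker: the `PATTERNS` table, the `chunk` function and the simplified part-of-speech tags are all hypothetical names chosen for the example.

```python
# Hypothetical rule patterns: a sequence of part-of-speech tags -> chunk type.
PATTERNS = {
    ("det", "n"): "SN",
    ("det", "adj", "n"): "SN",
    ("n",): "SN",
    ("vblex",): "SV",
}


def chunk(words):
    """Group (surface, pos) pairs into chunks, left to right, longest match first."""
    max_len = max(len(pattern) for pattern in PATTERNS)
    chunks = []
    i = 0
    while i < len(words):
        # Try the longest window that fits first, then shrink it.
        for size in range(min(max_len, len(words) - i), 0, -1):
            tags = tuple(pos for _, pos in words[i:i + size])
            if tags in PATTERNS:
                surface = " ".join(word for word, _ in words[i:i + size])
                chunks.append((PATTERNS[tags], surface))
                i += size
                break
        else:
            # No pattern matched here: pass the word through unchunked.
            chunks.append((None, words[i][0]))
            i += 1
    return chunks


sentence = [("The", "det"), ("dog", "n"), ("liked", "vblex"),
            ("the", "det"), ("boy", "n")]
print(chunk(sentence))
# [('SN', 'The dog'), ('SV', 'liked'), ('SN', 'the boy')]
```

Trying the longest window first is what makes "the big dog" one SN instead of a shorter match followed by leftovers.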


This method is used in shallow-transfer translation engines such as Apertium because it does not need parse trees (which are normally used in "deep transfer"). See Parse tree on Wikipedia. 1-stage transfer, 3-stage transfer, and n-stage cascaded interchunk are all shallow.

Chunking: An Explanation

When a sentence is chunked, it is divided into groups of related words, each of which is an "SN" (noun phrase), an "SV" (verb phrase), or some other type of phrase.

Two rules are needed to make those chunks. Further chunking rules can match 'the big dog', 'my favorite friend', 'the best hotel', etc. as SN, and 'was leaving', 'had been leaving', 'must have been leaving' as SV. At first, consider these patterns separately, but tag the chunks with whatever information will be useful later. These longer patterns also produce chunks, just longer ones.

Chunks are normally given a 'pseudo lemma' that describes the pattern that matched them (for example, 'the dog' and 'my friend' are put in a chunk with the lemma 'det_nom'). The first tag added is the phrase type; after it come the tags needed by the next set of rules. Essentially, the interchunk stage treats phrase chunks the same way the chunker treats lexical forms.

So, taking 'big cat', you would get:

^adj_nom<SN><sg><CD>{^granda<adj><2><3>$ ^kato<n><2><3>$}$

The numbers in the tags of the lexical forms inside the chunk (here <2><3>) mean 'take the information from chunk tag number #': <2> refers to the chunk's second tag (<sg>) and <3> to its third (<CD>). CD means 'Case to be Determined'. The name is not as well established as GD and ND, but it is the logical one to use, since it parallels the way GD and ND are used in other language pairs for 'gender to be determined' and 'number to be determined'.
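
As a rough illustration of how such numbered references are resolved, the following Python sketch expands a single chunk into its lexical forms. It is not Apertium's postchunk module: the `unchunk` function and its regular expressions are assumptions made for this example, and they only handle one well-formed chunk without nested braces.

```python
import re


def unchunk(chunk):
    """Expand one chunk ^lemma<t1><t2>...{^lu$ ^lu$}$ into its lexical forms."""
    match = re.fullmatch(r"\^([^<{]+)((?:<[^>]+>)*)\{(.*)\}\$", chunk)
    if match is None:
        raise ValueError("not a chunk: " + chunk)
    # Tags on the chunk itself, in order: tag number 1 is the phrase type.
    chunk_tags = re.findall(r"<([^>]+)>", match.group(2))

    def resolve(ref):
        # <N> means "copy chunk tag number N" (counting from 1).
        index = int(ref.group(1))
        return "<" + chunk_tags[index - 1] + ">"

    return re.sub(r"<(\d+)>", resolve, match.group(3))


# Here the case has already been decided, so the third chunk tag is <acc>.
print(unchunk("^adj_nom<SN><sg><acc>{^granda<adj><2><3>$ ^kato<n><2><3>$}$"))
# ^granda<adj><sg><acc>$ ^kato<n><sg><acc>$
```

Because the words only point at the chunk's tags, changing one tag on the chunk changes every word inside it at once.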

So, with a simple SN SV SN, you can have a rule that outputs the same things in the same order, but changes the 'CD' of SN number 1 to 'nom', and of SN number 2 to 'acc'. This can also be done in regular interchunk.
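
A rule of this kind can be pictured with the Python sketch below. Real interchunk rules are written in Apertium's XML rule files; the `assign_case` function and the (lemma, tags) representation here are hypothetical and only show the idea of keeping the chunks in order while replacing 'CD'.

```python
def assign_case(chunks):
    """Hypothetical SN SV SN rule: first SN gets 'nom', second SN gets 'acc'."""
    types = [tags[0] for _, tags in chunks]
    if types != ["SN", "SV", "SN"]:
        return chunks  # the pattern does not match; leave the input untouched
    cases = {0: "nom", 2: "acc"}
    result = []
    for position, (lemma, tags) in enumerate(chunks):
        if position in cases:
            # Replace the placeholder CD with the case decided by position.
            tags = [cases[position] if tag == "CD" else tag for tag in tags]
        result.append((lemma, list(tags)))
    return result


chunks = [
    ("det_nom", ["SN", "sg", "CD"]),
    ("verb", ["SV", "past"]),
    ("det_nom", ["SN", "sg", "CD"]),
]
for lemma, tags in assign_case(chunks):
    print(lemma, tags)
# det_nom ['SN', 'sg', 'nom']
# verb ['SV', 'past']
# det_nom ['SN', 'sg', 'acc']
```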

See Chunking: A full example for more.

Chunking: An Example

I saw a signal

becomes after morphological analysis and part-of-speech disambiguation by the tagger

^prpers<prn><subj><p1><mf><sg>$ 
^see<vblex><past>$ 
^a<det><ind><sg>$ 
^signal<n><sg>$.

which is chunked into

^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$ 
^verb<SV><past>{^vidi<vblex><past>$}$ 
^nom<SN><sg><nom>{^signalo<n><2><3>$}$.

and is transformed by the rule SN SV SN<nom> -> SN SV SN<acc> into

^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$
^verb<SV><past>{^vidi<vblex><past>$}$
^nom<SN><sg><acc>{^signalo<n><2><3>$}$.

Note that the last chunk now carries the tags <SN><sg><acc>, so ^signalo<n><2><3>$ picks up <sg> and <acc> when it is unchunked:

^prpers<prn><subj><p1><mf><sg>$ 
^vidi<vblex><past>$ 
^signalo<n><sg><acc>$.