Cascaded Interchunk

Chunking

Chunking is based on source language patterns. It is used in all language pairs.

Words are grouped into chunks by reading lexical forms left to right and building the longest sequence that matches one of the patterns in the rule file.

Each chunk is then given a 'pseudo lemma' and a tag containing its type, which can be any kind of phrase, such as 'SN' (noun phrase) or 'SV' (verb phrase).

Chunks for an English phrase may look like:

SN (The dog)    SV (liked)    SN (the boy)

"The dog" and "the boy" are both noun phrases, and each is chunked as SN because there is a rule with a pattern matching the corresponding sequence of lexical forms.

"liked" is a verb and so is chunked as a verb phrase and not as a noun phrase.
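The left-to-right, longest-match grouping described above can be sketched in Python. This is only an illustration: the `PATTERNS` table, the `LexicalForm` type, and the tag names on the input words are assumptions made for the example, and real Apertium patterns live in the rule file rather than in a Python dictionary.

```python
from typing import NamedTuple


class LexicalForm(NamedTuple):
    lemma: str
    tags: tuple  # the first tag is treated as the part of speech


# Hypothetical pattern table: a sequence of part-of-speech tags
# mapped to the phrase type of the resulting chunk.
PATTERNS = {
    ("det", "n"): "SN",
    ("n",): "SN",
    ("vblex",): "SV",
}


def chunk(forms):
    """Group lexical forms into (phrase_type, forms) pairs, left to right,
    always taking the longest pattern that matches at the current position."""
    max_len = max(len(p) for p in PATTERNS)
    chunks = []
    i = 0
    while i < len(forms):
        # Try the longest candidate window first, then shrink it.
        for length in range(min(max_len, len(forms) - i), 0, -1):
            key = tuple(f.tags[0] for f in forms[i:i + length])
            if key in PATTERNS:
                chunks.append((PATTERNS[key], forms[i:i + length]))
                i += length
                break
        else:
            # No pattern matched: pass the form through unlabelled.
            chunks.append((None, forms[i:i + 1]))
            i += 1
    return chunks


sentence = [
    LexicalForm("the", ("det", "def", "sp")),
    LexicalForm("dog", ("n", "sg")),
    LexicalForm("like", ("vblex", "past")),
    LexicalForm("the", ("det", "def", "sp")),
    LexicalForm("boy", ("n", "sg")),
]

for phrase_type, members in chunk(sentence):
    print(phrase_type, [m.lemma for m in members])
```

Because the two-word pattern `("det", "n")` is tried before the one-word pattern `("n",)`, "the dog" ends up in a single SN chunk instead of leaving the determiner unmatched.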

Chunking is used in shallow-transfer translation engines such as Apertium because it does not require parse trees, which are normally used in "deep transfer". See Parse tree on Wikipedia. 1-stage transfer, 3-stage transfer, and n-stage cascaded interchunk are all shallow.

Chunking: An Explanation

When a phrase is chunked, it is divided into related portions, each labelled "SN" (noun phrase), "SV" (verb phrase), or any other type of phrase.

Two rules are needed to make those chunks; further chunking rules can match 'the big dog', 'my favorite friend', 'the best hotel', etc. as SN, and 'was leaving', 'had been leaving', 'must have been leaving' as SV. First, consider these patterns separately, but tag the chunks with whatever information will be useful later. The patterns themselves are also chunks, just longer.

Chunks are normally given a 'pseudo lemma' that describes the pattern that matched them ('the dog' and 'my friend' will be put in a chunk with lemma 'det_nom', etc.). The first tag added is the phrase type; after that, any other tags that the developer feels are necessary are added. Essentially, phrase chunks are treated in the interchunk in the same way that the chunker treats lexical forms.

So, taking 'big cat', you would get:

^adj_nom<SN><sg><CD>{^granda<ad><2><3>$ ^kato<n><2><3>$}$

from the chunker.

The numbers in the lemma tags (here <2><3>) mean 'take the information from chunk tag number #'. CD means 'Case to be Determined', analogously to the way in which GD and ND are used in other language pairs to refer to gender to be determined or number to be determined.
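A minimal Python sketch of how such numbered references could be resolved against the chunk tags. The function name `resolve_references` and the list-based representation are assumptions for illustration, not part of Apertium.

```python
def resolve_references(chunk_tags, inner_tags):
    """Replace numeric tags such as "2" with the chunk tag at that
    1-based position; leave every other tag unchanged."""
    resolved = []
    for tag in inner_tags:
        if tag.isdigit():
            resolved.append(chunk_tags[int(tag) - 1])
        else:
            resolved.append(tag)
    return resolved


# Chunk tags of ^adj_nom<SN><sg><CD>{...}$: position 1 is SN, 2 is sg, 3 is CD.
chunk_tags = ["SN", "sg", "CD"]

print(resolve_references(chunk_tags, ["ad", "2", "3"]))  # ['ad', 'sg', 'CD']
print(resolve_references(chunk_tags, ["n", "2", "3"]))   # ['n', 'sg', 'CD']
```

Since the words only point at the chunk tags, changing a tag once on the chunk (for example replacing CD with a real case) changes it for every word inside that refers to it.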

So, with a simple SN SV SN, you can have a rule that outputs the same things in the same order, but changes the 'CD' of SN number 1 to 'nom', and of SN number 2 to 'acc'. This can also be done in regular interchunk.
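The SN SV SN rule just described might be sketched as follows. The `Chunk` class and the `assign_case` function are hypothetical names, and the sketch only matches a sequence of exactly three chunks, whereas a real interchunk rule is written in the rule file and matched inside a longer stream.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    lemma: str
    tags: list  # the first tag is the phrase type, e.g. "SN"


def assign_case(chunks):
    """If the chunks form the pattern SN SV SN, replace the CD tag of the
    first SN with 'nom' and of the second SN with 'acc'. The chunks are
    output in the same order; only their tags change."""
    phrase_types = [c.tags[0] for c in chunks]
    if phrase_types != ["SN", "SV", "SN"]:
        return chunks  # pattern does not match: leave everything as is
    for chunk, case in ((chunks[0], "nom"), (chunks[2], "acc")):
        chunk.tags = [case if tag == "CD" else tag for tag in chunk.tags]
    return chunks


sentence = [
    Chunk("det_nom", ["SN", "sg", "CD"]),
    Chunk("verb", ["SV", "past"]),
    Chunk("det_nom", ["SN", "sg", "CD"]),
]

for chunk in assign_case(sentence):
    print(chunk.lemma, chunk.tags)
```

The rule never touches the words inside the chunks; they pick up the new case later through their numbered references to the chunk tags.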

See Chunking: A full example for more.

Chunking: An Example

I saw a signal

becomes after morphological analysis and part-of-speech disambiguation by the tagger

^prpers<prn><subj><p1><mf><sg>$ 
^see<vblex><past>$ 
^a<det><ind><sg>$ 
^signal<n><sg>$.

which is chunked into

^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$ 
^verb<SV><past>{^vidi<vblex><past>$}$ 
^nom<SN><sg><nom>{^signalo<n><2><3><4>$}$.

by the chunking module

and transformed by rule SN SV SN<nom> -> SN SV SN<acc>

^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$
^verb<SV><past>{^vidi<vblex><past>$}$
^nom<SN><sg><acc>{^signalo<n><2><3><4>$}$.

Note that the chunk is now nom<SN><sg><acc>, and therefore ^signalo<n><2><3><4>$ gets these tags when the postchunk module is used for unchunking:

^prpers<prn><subj><p1><mf><sg>$ 
^vidi<vblex><past>$ 
^signalo<n><sg><acc>$.

The translation is generated from this. In this case, the translation is "Mi vidis signalon".
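The unchunking step in this example can be reproduced with a short Python sketch that parses the chunk notation with regular expressions. It is not Apertium's postchunk implementation: the `unchunk` function and its regular expressions are illustrative, and dropping a reference that points past the last chunk tag is an assumption made here so that the output matches the listing above.

```python
import re

# ^lemma<tag>...{ inner words }$
CHUNK_RE = re.compile(r"\^([^<{]+)((?:<[^>]+>)*)\{(.*?)\}\$")
# ^lemma<tag>...$
WORD_RE = re.compile(r"\^([^<$]+)((?:<[^>]+>)*)\$")
TAG_RE = re.compile(r"<([^>]+)>")


def unchunk(stream):
    """Return the words inside each chunk, with numeric tags replaced by
    the chunk tag at that 1-based position."""
    words = []
    for _lemma, tag_str, body in CHUNK_RE.findall(stream):
        chunk_tags = TAG_RE.findall(tag_str)
        for word_lemma, word_tag_str in WORD_RE.findall(body):
            out = []
            for tag in TAG_RE.findall(word_tag_str):
                if tag.isdigit():
                    index = int(tag) - 1
                    # Assumption: a reference beyond the last chunk tag is dropped.
                    if index < len(chunk_tags):
                        out.append(chunk_tags[index])
                else:
                    out.append(tag)
            words.append("^" + word_lemma + "".join(f"<{t}>" for t in out) + "$")
    return words


stream = (
    "^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$ "
    "^verb<SV><past>{^vidi<vblex><past>$}$ "
    "^nom<SN><sg><acc>{^signalo<n><2><3><4>$}$"
)

for word in unchunk(stream):
    print(word)
```

Run on the interchunk output above, this prints the same three lexical forms as the unchunked listing, ending in `^signalo<n><sg><acc>$`.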
