Difference between revisions of "Cascaded Interchunk"

From Apertium

Revision as of 19:57, 13 January 2012

Under Construction


Chunking is based on source language patterns. It is used in language pairs such as English-Esperanto.

  • First, words are grouped into chunks by matching patterns such as adj+noun or adj+adj+noun; words may be reordered within a chunk.
  • Then, the chunks themselves are reordered.
  • Each chunk gets a ‘pseudo lemma’ with a tag containing the phrase type, normally ‘SN’ (noun phrase) or ‘SV’ (verb phrase).
  • After this, translation operates on these pseudo words instead of on the individual words.

Chunks for an English phrase may look like:

SN (The dog)    SV (played with)    SN (the boy)

"The dog" and "the boy" are noun phrases, so each is chunked as SN.

"played with" is a verb phrase, so it is chunked as SV.
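
The grouping step above can be sketched in Python. This is only an illustration of pattern-based chunking, not how Apertium implements it: the tag names (det, adj, n, vblex, pr), the pattern table, and the function chunk are assumptions made for this example.

```python
# Hypothetical pattern table: a sequence of part-of-speech tags and the
# phrase type it produces. Longer patterns are listed first so they win.
PATTERNS = [
    (("det", "adj", "n"), "SN"),
    (("det", "n"), "SN"),
    (("vblex", "pr"), "SV"),
    (("vblex",), "SV"),
]

def chunk(tagged):
    """Greedily group (word, tag) pairs into (phrase_type, words) chunks."""
    chunks = []
    i = 0
    while i < len(tagged):
        for pattern, phrase_type in PATTERNS:
            window = tagged[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                chunks.append((phrase_type, [word for word, _ in window]))
                i += len(pattern)
                break
        else:
            # No pattern matched: pass the word through as its own chunk.
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks

sentence = [("The", "det"), ("dog", "n"), ("played", "vblex"),
            ("with", "pr"), ("the", "det"), ("boy", "n")]
print(chunk(sentence))
# [('SN', ['The', 'dog']), ('SV', ['played', 'with']), ('SN', ['the', 'boy'])]
```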

This method is used in shallow-transfer translation engines such as Apertium, which do not use the parse trees normally used in "deep transfer". See Parse tree on Wikipedia.

Chunking: An Explanation

When a phrase is chunked, it is divided into related portions, each of which is either an "SN" (noun phrase) or an "SV" (verb phrase).

Two rules are needed to make those chunks. Further chunking rules can match 'the big dog', 'my favorite friend', 'the best hotel', etc. as SN, and 'was leaving', 'had been leaving', 'must have been leaving' as SV. First, consider these patterns separately, but tag the chunks with whatever information will be useful later.

Chunks are normally given a 'pseudo lemma' named after the pattern that matched them (for example, 'the dog' and 'my friend' are put in a chunk called 'det_nom'). The first tag added is the phrase type; after that come the tags needed by the next set of rules. Essentially, phrase chunks are treated in the same way that the morphological analyser treats lexemes ('surface forms').

So, taking 'big cat', you would get:

^adj_nom<SN><sg><CD>{^granda<ad><2><3>$ ^kato<n><2><3>$}$

The numbers in the lemma tags (here <2><3>) mean 'take the information from the chunk tag with that number'. CD means 'Case to be Determined' (it is not as firmly established as GD and ND are, but it is the logical one to use).
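
As a rough illustration of how the numbered tags resolve, the following Python sketch substitutes each numeric tag with the chunk tag at that position, counting from 1 so that <SN> is tag 1. The function unchunk is a hypothetical helper written for this page and is a simplification; it is not Apertium's own unchunking code.

```python
import re

def unchunk(chunk):
    """Replace numeric tags such as <2> with the chunk tag at that position."""
    head, body = chunk.split("{", 1)
    chunk_tags = re.findall(r"<([^>]+)>", head)  # e.g. ['SN', 'sg', 'CD']

    def resolve(match):
        index = int(match.group(1))
        return "<" + chunk_tags[index - 1] + ">"  # references count from 1

    # Drop the closing "}$" of the chunk, then substitute the references.
    return re.sub(r"<(\d+)>", resolve, body.rsplit("}", 1)[0])

print(unchunk("^adj_nom<SN><sg><CD>{^granda<ad><2><3>$ ^kato<n><2><3>$}$"))
# ^granda<ad><sg><CD>$ ^kato<n><sg><CD>$
```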

So, with a simple SN SV SN, you can have a rule that outputs the same things in the same order, but changes the 'CD' of SN number 1 to 'nom', and of SN number 2 to 'acc'.
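
Such a rule can be sketched in Python as follows. The chunk representation (a lemma paired with a list of tags), the function assign_case, and the 'verb' chunk with its 'past' tag are assumptions made for illustration; real rules are written in Apertium's transfer rule files.

```python
def assign_case(chunks):
    """For an SN SV SN sequence, set 'CD' to 'nom' on the first SN and 'acc' on the second."""
    phrase_types = [tags[0] for _, tags in chunks]
    if phrase_types != ["SN", "SV", "SN"]:
        return chunks  # the rule does not apply; leave the chunks unchanged
    cases = {0: "nom", 2: "acc"}
    result = []
    for position, (lemma, tags) in enumerate(chunks):
        if position in cases:
            tags = [cases[position] if tag == "CD" else tag for tag in tags]
        result.append((lemma, list(tags)))
    return result

chunks = [
    ("det_nom", ["SN", "sg", "CD"]),
    ("verb", ["SV", "past"]),  # hypothetical SV chunk
    ("det_nom", ["SN", "sg", "CD"]),
]
for lemma, tags in assign_case(chunks):
    print(lemma, tags)
# det_nom ['SN', 'sg', 'nom']
# verb ['SV', 'past']
# det_nom ['SN', 'sg', 'acc']
```

The output keeps the chunks in the same order and changes only the case tags, which is all the rule described above is meant to do.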

See Chunking: A full example for more.

Chunking: An Example

I saw a signal

becomes after tagger disambiguation


which is transferred and chunked into


and transformed by rule SN SV SN<nom> -> SN SV SN<acc>


Note how the chunk now has the tags nom<SN><sg><acc>, and therefore ^signalo<n><2><3><4>$ gets these tags when unchunking: