Difference between revisions of "Cascaded Interchunk"

Revision as of 18:44, 15 January 2012

Under Construction

Cascaded interchunk is a class of interchunk transfer (see "advanced transfer" in the official documentation). Standard interchunk transfer uses three stages (chunker, interchunk and postchunk), while cascaded interchunk uses more than three.

Chunking

Chunking is based on source language patterns. It is used in all language pairs.

  • First, words are grouped into chunks.
  • The chunks are formed by reading lexical forms left to right and taking the longest sequence that matches one of the patterns in the rule file.
  • Each chunk is then given a ‘pseudo lemma’ and a tag containing its type, which can be any kind of phrase, such as ‘SN’ (noun phrase) or ‘SV’ (verb phrase).
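The left-to-right, longest-match grouping described above can be sketched in Python. This is only an illustration, not Apertium's implementation: the `PATTERNS` table, the pseudo lemmas other than 'det_nom', and the `chunk` function are hypothetical names, and real patterns are defined in the language pair's rule file.

```python
# Hypothetical pattern table: a sequence of part-of-speech tags maps to a
# (pseudo lemma, phrase type) pair. Real patterns come from the rule file.
PATTERNS = {
    ("det", "n"): ("det_nom", "SN"),
    ("det", "adj", "n"): ("det_adj_nom", "SN"),
    ("n",): ("nom", "SN"),
    ("vblex",): ("verb", "SV"),
}


def chunk(lexical_forms):
    """Group (lemma, pos) pairs into chunks, left to right, longest match first."""
    max_len = max(len(pattern) for pattern in PATTERNS)
    chunks = []
    i = 0
    while i < len(lexical_forms):
        # Try the longest candidate window first, then shrink it.
        for size in range(min(max_len, len(lexical_forms) - i), 0, -1):
            window = lexical_forms[i:i + size]
            key = tuple(pos for _, pos in window)
            if key in PATTERNS:
                pseudo_lemma, phrase_type = PATTERNS[key]
                words = [lemma for lemma, _ in window]
                chunks.append((pseudo_lemma, phrase_type, words))
                i += size
                break
        else:
            # Assumption: a word that matches no pattern is passed through
            # as its own chunk with no phrase type.
            lemma, _ = lexical_forms[i]
            chunks.append((lemma, None, [lemma]))
            i += 1
    return chunks


sentence = [("the", "det"), ("dog", "n"), ("like", "vblex"),
            ("the", "det"), ("boy", "n")]
result = chunk(sentence)
```

Here `result` holds three chunks with the phrase types SN, SV and SN, in that order.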

Chunks for an English phrase may look like:

SN (The dog)    SV (liked)    SN (the boy)

"The dog" and "the boy" are both noun phrases. They are chunked as such because a rule has a pattern that matches the corresponding sequences of lexical forms.

"liked" is a verb and so is chunked as a verb phrase and not as a noun phrase.


This method is used in shallow-transfer translation engines such as Apertium because it does not need parse trees (which are normally used in "deep transfer"). See Parse tree on Wikipedia. 1-stage transfer, 3-stage transfer, and n-stage cascaded interchunk are all shallow.

Chunking: An Explanation

When a phrase is chunked, it is divided into groups of related words, each of which is an "SN" (noun phrase), an "SV" (verb phrase), or any other type of phrase.

Two rules are needed to make those chunks. Further chunking rules can match 'the big dog', 'my favorite friend', 'the best hotel', etc. as SN, and 'was leaving', 'had been leaving', 'must have been leaving' as SV. First, consider these patterns separately, but tag the chunks with whatever information will be useful later. The phrases matched by these patterns are also chunks, just longer ones.

Chunks are normally given a 'pseudo lemma' that describes the pattern that matched them (for example, 'the dog' and 'my friend' will be put in a chunk with the lemma 'det_nom'). The first tag added is the phrase type; after that come the tags that are needed in the next set of rules. Essentially, the interchunk treats phrase chunks in the same way that the chunker treats lexical forms.

So, taking 'big cat', you would get:

^adj_nom<SN><sg><CD>{^granda<adj><2><3>$ ^kato<n><2><3>$}$

The numbered tags inside the chunk (here <2><3>) mean 'take the information from chunk tag number #'. CD means 'Case to be Determined'. It is not as fully established as GD and ND are, but it is the logical name to use, by analogy with the way GD and ND are used in other language pairs to refer to gender to be determined and number to be determined.

So, with a simple SN SV SN, you can have a rule that outputs the same things in the same order, but changes the 'CD' of SN number 1 to 'nom', and of SN number 2 to 'acc'. This can also be done in regular interchunk.
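As a rough sketch of such a rule, the Python below walks a list of chunks and, wherever it finds SN SV SN, replaces 'CD' with 'nom' in the first SN and with 'acc' in the second. The dictionary representation of a chunk and the names `phrase_type` and `assign_case` are assumptions made for this illustration; real rules are written in the interchunk rule file, not in Python.

```python
def phrase_type(chunk):
    """Return the first tag of a chunk (its phrase type), or None if it has no tags."""
    return chunk["tags"][0] if chunk["tags"] else None


def assign_case(chunks):
    """Apply a hypothetical 'SN SV SN' rule: same chunks, same order, CD resolved."""
    # Work on copies so the input chunks are left untouched.
    out = [{"lemma": c["lemma"], "tags": list(c["tags"])} for c in chunks]
    i = 0
    while i < len(out):
        window = out[i:i + 3]
        if [phrase_type(c) for c in window] == ["SN", "SV", "SN"]:
            subject, _, obj = window
            subject["tags"] = ["nom" if t == "CD" else t for t in subject["tags"]]
            obj["tags"] = ["acc" if t == "CD" else t for t in obj["tags"]]
            i += 3  # the matched chunks are consumed by the rule
        else:
            i += 1
    return out


chunks = [
    {"lemma": "det_nom", "tags": ["SN", "sg", "CD"]},
    {"lemma": "verb", "tags": ["SV", "past"]},
    {"lemma": "det_nom", "tags": ["SN", "sg", "CD"]},
]
result = assign_case(chunks)
```

After the call, the first chunk in `result` carries the tags SN, sg, nom and the last one SN, sg, acc, while the verb chunk is unchanged.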

See Chunking: A full example for more.

Chunking: An Example

I saw a signal

becomes after morphological analysis and part-of-speech disambiguation by the tagger

^prpers<prn><subj><p1><mf><sg>$ 
^see<vblex><past>$ 
^a<det><ind><sg>$ 
^signal<n><sg>$.

which is chunked into

^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$ 
^verb<SV><past>{^vidi<vblex><past>$}$ 
^nom<SN><sg><nom>{^signalo<n><2><3><4>$}$.

and transformed by rule SN SV SN<nom> -> SN SV SN<acc>

^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$
^verb<SV><past>{^vidi<vblex><past>$}$
^nom<SN><sg><acc>{^signalo<n><2><3><4>$}$.

Note how the chunk now has the tags nom<SN><sg><acc>, so ^signalo<n><2><3><4>$ gets these tags when unchunking:

^prpers<prn><subj><p1><mf><sg>$ 
^vidi<vblex><past>$ 
^signalo<n><sg><acc>$.
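The unchunking step shown above can be imitated with a short Python sketch that replaces each numbered tag with the chunk tag at that position. The function name `unchunk` and the regular expressions are hypothetical, and the handling of a number that points past the last chunk tag (it is simply dropped here) is an assumption chosen so that the output matches the example; it is not taken from the Apertium documentation.

```python
import re

# Simplified patterns for the stream format used in the examples above.
CHUNK_RE = re.compile(r"\^([^<]*)((?:<[^>]+>)*)\{(.*)\}\$")
WORD_RE = re.compile(r"\^([^<$]*)((?:<[^>]+>)*)\$")
TAG_RE = re.compile(r"<([^>]+)>")


def unchunk(chunk):
    """Replace numbered tags inside a chunk with the chunk's own tags."""
    match = CHUNK_RE.fullmatch(chunk)
    if match is None:
        raise ValueError("not a chunk: " + chunk)
    chunk_tags = TAG_RE.findall(match.group(2))

    def resolve(word):
        tags = []
        for tag in TAG_RE.findall(word.group(2)):
            if tag.isdigit():
                index = int(tag) - 1  # <1> is the phrase type, <2> the next tag
                if 0 <= index < len(chunk_tags):
                    tags.append(chunk_tags[index])
                # Assumption: a reference past the last chunk tag is dropped.
            else:
                tags.append(tag)
        return "^" + word.group(1) + "".join(f"<{t}>" for t in tags) + "$"

    return " ".join(resolve(w) for w in WORD_RE.finditer(match.group(3)))


noun = unchunk("^nom<SN><sg><acc>{^signalo<n><2><3><4>$}$")
# noun == "^signalo<n><sg><acc>$"
pronoun = unchunk("^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$")
# pronoun == "^prpers<prn><subj><p1><mf><sg>$"
```

The sketch treats one chunk at a time and ignores text outside the chunk, such as the final full stop.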