Difference between revisions of "Cascaded Interchunk"

Latest revision as of 14:12, 6 July 2012

Chunking

Chunking is based on source language patterns. It is used in all language pairs.

First, words are grouped into chunks.

The chunks are formed by reading lexical forms left to right and taking the longest sequence that matches one of the patterns in the rule file.

Each chunk is then given a ‘pseudo lemma’ with a tag containing the phrase type, which can be any kind of phrase, such as ‘SN’ (Noun Phrase) or ‘SV’ (Verb Phrase).

Chunks for an English phrase may look like:

SN (The dog)    SV (liked)    SN (the boy)

"The dog" and "the boy" are both noun phrases, and they are chunked as such because there is a rule with a pattern matching the corresponding sequences of lexical forms.

"liked" is a verb, so it is chunked as a verb phrase rather than as a noun phrase.
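
The left-to-right, longest-match grouping described above can be sketched in Python. This is only an illustration, not Apertium's implementation: the `LexicalForm` type, the `PATTERNS` table, and the simplified part-of-speech tags are assumptions made for the example, and real chunking rules live in the pair's rule file.

```python
from typing import NamedTuple

class LexicalForm(NamedTuple):
    surface: str
    pos: str  # simplified part-of-speech tag, e.g. "det", "n", "vblex"

# Hypothetical pattern table: tag sequence -> (pseudo lemma, phrase type).
PATTERNS = {
    ("det", "n"): ("det_nom", "SN"),
    ("det", "adj", "n"): ("det_adj_nom", "SN"),
    ("n",): ("nom", "SN"),
    ("vblex",): ("verb", "SV"),
}
MAX_LEN = max(len(pattern) for pattern in PATTERNS)

def chunk(forms):
    """Group lexical forms into chunks, left to right, longest match first."""
    chunks = []
    i = 0
    while i < len(forms):
        # Try the widest window first so that the longest pattern wins.
        for size in range(min(MAX_LEN, len(forms) - i), 0, -1):
            window = forms[i:i + size]
            key = tuple(form.pos for form in window)
            if key in PATTERNS:
                lemma, phrase = PATTERNS[key]
                chunks.append((lemma, phrase, [form.surface for form in window]))
                i += size
                break
        else:
            # No pattern matched: pass the word through as an untyped chunk.
            chunks.append((forms[i].pos, None, [forms[i].surface]))
            i += 1
    return chunks

sentence = [
    LexicalForm("The", "det"), LexicalForm("dog", "n"),
    LexicalForm("liked", "vblex"),
    LexicalForm("the", "det"), LexicalForm("boy", "n"),
]
result = chunk(sentence)
print("    ".join(f"{phrase} ({' '.join(words)})" for _, phrase, words in result))
```

With the table above, "The dog" and "the boy" both match the `det n` pattern before the shorter `n` pattern is ever tried, which is the point of preferring the longest match.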

This method is used in shallow transfer translation engines such as Apertium because it doesn't use parse trees (which are normally used in "deep transfer"). See Parse tree on Wikipedia. 1-stage transfer, 3-stage transfer, and n-stage cascaded interchunk are all shallow.

Chunking: An Explanation

When a phrase is chunked, it is divided into related portions, each of which is an "SN" (noun phrase), an "SV" (verb phrase), or any other type of phrase.

Two rules are needed to make those chunks; further chunking rules can match 'the big dog', 'my favorite friend', 'the best hotel', etc. as SN, and 'was leaving', 'had been leaving', 'must have been leaving' as SV. First, consider these patterns separately, but tag the chunks with whatever information will be useful later. The sequences matched by these longer patterns are also chunks, just longer ones.

Chunks are normally given a 'pseudo lemma' that describes the pattern that matched them ('the dog' and 'my friend' will be put in a chunk with the lemma 'det_nom', etc.). The first tag added is the phrase type; after that, any other tags that the developer feels are necessary are added. Essentially, the interchunk treats phrase chunks in the same way that the chunker treats lexical forms.

So, taking 'big cat', you would get:

^adj_nom<SN><sg><CD>{^granda<ad><2><3>$ ^kato<n><2><3>$}$

from the chunker.

The numbers in the lemma tags (here <2><3>) mean 'take the information from chunk tag number #'. CD means 'Case to be Determined', analogously to the way in which GD and ND are used in other language pairs to refer to gender to be determined or number to be determined.

So, with a simple SN SV SN, you can have a rule that outputs the same things in the same order, but changes the 'CD' of SN number 1 to 'nom', and of SN number 2 to 'acc'. This can also be done in regular interchunk.
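
As a rough model of such a rule, the Python sketch below parses three chunks, checks that their phrase types are SN SV SN, and replaces 'CD' with 'nom' on the first SN and 'acc' on the second. Real Apertium rules are written in XML rule files rather than Python; the helper names (`parse_chunk`, `format_chunk`, `assign_case`), the regular expression, and the second SN chunk in the sample input are assumptions made for illustration.

```python
import re

# Assumed shape of a chunk: ^lemma<tag>...{inner lexical forms}$
CHUNK_RE = re.compile(r"\^([^<{]+)((?:<[^>]+>)*)\{(.*)\}\$")

def parse_chunk(text):
    """Split a chunk into pseudo lemma, chunk tags, and its untouched body."""
    match = CHUNK_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a chunk: {text!r}")
    lemma, tags, body = match.groups()
    return {"lemma": lemma, "tags": re.findall(r"<([^>]+)>", tags), "body": body}

def format_chunk(chunk):
    """Serialise a parsed chunk back to the stream format."""
    tags = "".join(f"<{tag}>" for tag in chunk["tags"])
    return f"^{chunk['lemma']}{tags}{{{chunk['body']}}}$"

def assign_case(chunks):
    """For an SN SV SN sequence, set CD to nom on the first SN and acc on the second."""
    if [c["tags"][0] for c in chunks] != ["SN", "SV", "SN"]:
        return chunks  # pattern does not match: leave the chunks unchanged
    for index, case in ((0, "nom"), (2, "acc")):
        chunks[index]["tags"] = [case if t == "CD" else t for t in chunks[index]["tags"]]
    return chunks

source = [
    "^adj_nom<SN><sg><CD>{^granda<ad><2><3>$ ^kato<n><2><3>$}$",
    "^verb<SV><past>{^vidi<vblex><past>$}$",
    "^nom<SN><sg><CD>{^signalo<n><2><3>$}$",  # hypothetical second SN
]
output = [format_chunk(c) for c in assign_case([parse_chunk(s) for s in source])]
print("\n".join(output))
```

Only the chunk-level tags change; the lexical forms inside the braces keep their numeric references and pick up the new case later, when the chunk is taken apart.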

See Chunking: A full example for more.

Chunking: An Example

I saw a signal

becomes after morphological analysis and part-of-speech disambiguation by the tagger

^prpers<prn><subj><p1><mf><sg>$ 
^see<vblex><past>$ 
^a<det><ind><sg>$ 
^signal<n><sg>$.

which is chunked into

^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$ 
^verb<SV><past>{^vidi<vblex><past>$}$ 
^nom<SN><sg><nom>{^signalo<n><2><3><4>$}$.

by the chunking module, and then transformed by the rule SN SV SN<nom> -> SN SV SN<acc> into

^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$
^verb<SV><past>{^vidi<vblex><past>$}$
^nom<SN><sg><acc>{^signalo<n><2><3><4>$}$.

Note how the chunk now has the tags nom<SN><sg><acc>, and therefore ^signalo<n><2><3><4>$ gets these tags when the postchunk module is used for unchunking:

^prpers<prn><subj><p1><mf><sg>$ 
^vidi<vblex><past>$ 
^signalo<n><sg><acc>$.

The translation is generated from this. In this case, the translation is "Mi vidis signalon".
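
The unchunking step in this example can be approximated with a short Python sketch that replaces each numeric reference inside the braces with the chunk tag at that position (tag 1 being the phrase type). The function name `unchunk` and the regular expressions are assumptions for illustration, and so is the handling of `<4>` in the `nom` chunk: that chunk has only three tags, so the sketch drops the reference, which reproduces the output shown above but is not taken from the postchunk module's documentation.

```python
import re

def unchunk(chunk_text):
    """Return the lexical forms inside a chunk with numeric references resolved."""
    match = re.fullmatch(r"\^([^<{]+)((?:<[^>]+>)*)\{(.*)\}\$", chunk_text)
    if match is None:
        raise ValueError(f"not a chunk: {chunk_text!r}")
    chunk_tags = re.findall(r"<([^>]+)>", match.group(2))

    def resolve(ref):
        n = int(ref.group(1))
        # Assumption: a reference past the last chunk tag produces nothing.
        return f"<{chunk_tags[n - 1]}>" if 1 <= n <= len(chunk_tags) else ""

    return re.sub(r"<(\d+)>", resolve, match.group(3))

chunks = [
    "^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$",
    "^verb<SV><past>{^vidi<vblex><past>$}$",
    "^nom<SN><sg><acc>{^signalo<n><2><3><4>$}$",
]
for text in chunks:
    print(unchunk(text))
```

Because the interchunk rule changed the chunk tag from nom to acc, the reference `<3>` in the last chunk now resolves to `<acc>` without the lexical form itself having been edited.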

@@ Line 1: / Line 1: @@
 {{TOCD}}
+Cascaded interchunk is a class of interchunk (see "advanced transfer" in the official documentation). Interchunk transfer uses 3 steps and cascaded interchunk uses more than 3.
-'''Under Construction'''
 ==Chunking==
-Chunking is based on source language patterns. It is used in language pairs such as English-Esperanto.
+Chunking is based on source language patterns. It is used in all language pairs.
-*First, words are reordered into chunks.
+*First, words are grouped into chunks.
-*Then, the chunks are reordered by matching patterns like adj+noun or adj+adj+noun.
+*Then, the chunks are formed by reading lexical forms left-to-right and building the longest-matching sequence that matches one of the patterns in the rule file.
-*From this, a ‘pseudo lemma’ is made with a tag containing the type – normally ‘SN’ (Noun Phrase) or ‘SV’ (Verb Phrase).
+*From this, a ‘pseudo lemma’ is made with a tag containing the type – can be any kind of phrase such as ‘SN’ (Noun Phrase) or ‘SV’ (Verb Phrase).
-*Basically after this, the translation is done with these pseudo words breaking the language down to its roots.
 Chunks for an English phrase may look like:
 <pre>
-SN (The dog)    SV (played with)    SN (the boy)
+SN (The dog)    SV (liked)    SN (the boy)
 </pre>
-<nowiki>"The dog" is a noun phrase and so is "the boy" so they are chunked as such.</nowiki>
+<nowiki>"The dog" is a noun phrase and so is "the boy" so they are chunked as such because there is a rule with a pattern matching the corresponding sequences of lexical forms.</nowiki>
-"played with" is a verb phrase and so is chunked as such and not as a noun phrase.
+"liked" is a verb and so is chunked as a verb phrase and not as a noun phrase.
-This method is used in shallow transfer translation engines such as Apertium because it doesn't use parse trees (which are normally used in "deep transfer"). See [http://en.wikipedia.org/wiki/Parse_tree Parse tree on Wikipedia].
+This method is used in shallow transfer translation engines such as Apertium because it doesn't use parse trees (which are normally used in "deep transfer"). See [http://en.wikipedia.org/wiki/Parse_tree Parse tree on Wikipedia]. 1-stage transfer, 3-stage transfer, and n-stage cascaded interchunk are all shallow.
-== Chunking: An Expanation ==
+== Chunking: An Explanation ==
-When a phrase is chunked, it is divided into portions that are related that are either "SN" (noun phrases) or "SV" (verb phrases).
+When a phrase is chunked, it is divided into portions that are related that are "SN" (noun phrases), "SV" (verb phrases), or any other type of phrase.
-Two rules are needed to make those chunks: further chunking rules can match 'the big dog' 'my favorite friend' 'the best hotel' etc. as SN; 'was leaving', 'had been leaving', 'must have been leaving' as SV. First, consider these patterns separately, but tag the chunks with whatever information will be useful later.
+Two rules are needed to make those chunks: further chunking rules can match 'the big dog' 'my favorite friend' 'the best hotel' etc. as SN; 'was leaving', 'had been leaving', 'must have been leaving' as SV. First, consider these patterns separately, but tag the chunks with whatever information will be useful later. The patterns themselves are also chunks, just longer.
-Chunks are normally given a 'pseudo lemma' that matches the pattern that matched them ('the dog', 'my friend' will be put in a chunk called 'det_nom', etc.), the first tag added is the phrase type; after that, tags that are needed in the next set of rules. Essentially, phrase chunks are treated in the same way that the morphological analyser treats lexemes ('surface forms').
+Chunks are normally given a 'pseudo lemma' that describes the pattern that matched them ('the dog', 'my friend' will be put in a chunk with lemma 'det_nom', etc.), the first tag added is the phrase type; after that, other tags that the developer feels are neccessary are addded. Essentially, phrase chunks are treated in the interchunk in the same way that the chunker treats lexical forms.
 So, taking 'big cat', you would get:
@@ Line 38: / Line 37: @@
 ^adj_nom<SN><sg><CD>{^granda<ad><2><3>$ ^kato<n><2><3>$}$
 </pre>
+from the chunker.
 The numbers in the lemma tags (here <2><3>) mean 'take the information from chunk
-tag number #'. CD means 'Case to be Determined (it's not fully
+tag number #'. CD means 'Case to be Determined', analogously to the way in which GD and ND are used in other language pairs to refer to gender to be determined or number to be determined.
-established, as GD and ND are, but it's the logical one to use).
+pairs to refer to gender to be determined or number to be determined"
 So, with a simple SN SV SN, you can have a rule that outputs the same
 things in the same order, but changes the 'CD' of SN number 1 to
-'nom', and of SN number 2 to 'acc'.
+'nom', and of SN number 2 to 'acc'. This can also be done in regular interchunk.
 See [[Chunking: A full example]] for more.
@@ Line 55: / Line 55: @@
 </pre>
-becomes after tagger disambiguation
+becomes after morphological analysis and part-of-speech disambiguation by the tagger
 <pre>
 ^prpers<prn><subj><p1><mf><sg>$
@@ Line 63: / Line 63: @@
 </pre>
-which is transfered and chunked into
+which is chunked into
 <pre>
 ^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$
@@ Line 69: / Line 69: @@
 ^nom<SN><sg><nom>{^signalo<n><2><3><4>$}$.
 </pre>
+by the chunking module
 and transformed by rule SN SV SN<nom> -> SN SV SN<acc>
@@ Line 76: / Line 77: @@
 ^nom<SN><sg><acc>{^signalo<n><2><3><4>$}$.
 </pre>
-Note how the chunk has now tags nom<SN><sg><acc> and therefore ^signalo<n><2><3><4>$ gets these tags when unchunking:
+Note how the chunk has now tags nom<SN><sg><acc> and therefore ^signalo<n><2><3><4>$ gets these tags when using the postchunk module for unchunking:
 <pre>
 ^prpers<prn><subj><p1><mf><sg>$
@@ Line 82: / Line 83: @@
 ^signalo<n><sg><acc>$.
 </pre>
+The translation is generated from this. In this case the translation is, "Mi vidis signalon".
+[[Category:Documentation in English]]
