Difference between revisions of "Cascaded Interchunk"

Revision as of 19:43, 12 January 2012

Chunking

Chunking is based on source language patterns. It is used in language pairs such as English-Esperanto.

First, words are grouped into chunks by matching patterns like adj+noun or adj+adj+noun; words can be reordered within a chunk.

Then, the chunks themselves are reordered.

Each chunk is given a 'pseudo lemma' and a tag containing its phrase type, normally 'SN' (noun phrase) or 'SV' (verb phrase).

After this, translation works on these pseudo words rather than on individual words, so the sentence is handled at the level of its basic phrase structure.

Chunks for an English phrase may look like:

SN (The dog)    SV (played with)    SN (the boy)

"The dog" is a noun phrase and so is "the boy" so they are chunked as such.

"played with" is a verb phrase and so is chunked as such and not as a noun phrase.

This method is used in shallow-transfer translation engines such as Apertium because it does not require parse trees (which are normally used in "deep transfer"). See Parse tree on Wikipedia (http://en.wikipedia.org/wiki/Parse_tree).
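The grouping step described in this section can be sketched in Python. This is only an illustration, not Apertium's implementation: the pattern table, the function name `chunk`, and every pseudo lemma other than 'det_nom' are hypothetical.

```python
# Hypothetical pattern table: part-of-speech sequence -> (pseudo lemma, phrase type).
PATTERNS = {
    ("det", "adj", "n"): ("det_adj_nom", "SN"),
    ("det", "n"): ("det_nom", "SN"),
    ("adj", "n"): ("adj_nom", "SN"),
    ("vblex", "pr"): ("verb_pr", "SV"),
    ("vblex",): ("verb", "SV"),
}


def chunk(words):
    """Group (surface, pos) pairs into chunks, trying longer patterns first."""
    longest_first = sorted(PATTERNS, key=len, reverse=True)
    chunks, i = [], 0
    while i < len(words):
        for pattern in longest_first:
            window = words[i:i + len(pattern)]
            if tuple(pos for _, pos in window) == pattern:
                name, phrase_type = PATTERNS[pattern]
                chunks.append((phrase_type, name, [w for w, _ in window]))
                i += len(pattern)
                break
        else:
            # No pattern matched: pass the word through as its own chunk.
            surface, pos = words[i]
            chunks.append((None, pos, [surface]))
            i += 1
    return chunks


sentence = [("The", "det"), ("dog", "n"), ("played", "vblex"),
            ("with", "pr"), ("the", "det"), ("boy", "n")]
result = chunk(sentence)
for phrase_type, name, surfaces in result:
    print(phrase_type, name, " ".join(surfaces))
```

Trying longer patterns first is a design choice of this sketch, so that 'the big dog' would become one chunk rather than a determiner followed by an adj+noun chunk.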

Chunking: An Explanation

When a phrase is chunked, it is divided into groups of related words, each of which is either an "SN" (noun phrase) or an "SV" (verb phrase).

Two rules are needed to make those chunks. Further chunking rules can match 'the big dog', 'my favorite friend', 'the best hotel', etc. as SN, and 'was leaving', 'had been leaving', 'must have been leaving' as SV. At first, consider these patterns separately, but tag the chunks with whatever information will be useful later.

Chunks are normally given a 'pseudo lemma' that matches the pattern that matched them ('the dog' and 'my friend' will be put in a chunk called 'det_nom', etc.). The first tag added is the phrase type; after that come the tags needed by the next set of rules. Essentially, phrase chunks are treated in the same way that the morphological analyser treats lexemes ('surface forms').

So, taking 'big cat', you would get:

^adj_nom<SN><sg><CD>{^granda<adj><2><3>$ ^kato<n><2><3>$}$

The numbers in the lemma tags (here <2><3>) mean 'take the information from chunk tag number #'. CD means 'Case to be Determined' (the name is not as well established as GD and ND are, but it is the logical one to use).
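To make the numbering concrete, here is a small Python sketch that builds the 'big cat' chunk from its parts. The function `make_chunk` and its argument layout are hypothetical; the sketch assumes chunk tags are counted from 1 with the phrase type first, as the example suggests, and it writes the adjective tag as `adj`.

```python
def make_chunk(pseudo_lemma, chunk_tags, words):
    """Format a chunk in the stream notation shown above.

    chunk_tags: tags of the chunk itself, phrase type first.
    words: list of (lemma, own_tags, linked), where `linked` names the
           chunk tags whose values the word should take from the chunk.
    """
    head = pseudo_lemma + "".join(f"<{t}>" for t in chunk_tags)
    units = []
    for lemma, own_tags, linked in words:
        own = "".join(f"<{t}>" for t in own_tags)
        # Chunk tags are numbered from 1, so <SN> is 1, <sg> is 2, <CD> is 3.
        refs = "".join(f"<{chunk_tags.index(t) + 1}>" for t in linked)
        units.append(f"^{lemma}{own}{refs}$")
    return "^" + head + "{" + " ".join(units) + "}$"


chunk_str = make_chunk(
    "adj_nom", ["SN", "sg", "CD"],
    [("granda", ["adj"], ["sg", "CD"]), ("kato", ["n"], ["sg", "CD"])],
)
print(chunk_str)
```

Because the words only hold references, changing a tag on the chunk later changes every word that points at it.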

So, with a simple SN SV SN, you can have a rule that outputs the same things in the same order, but changes the 'CD' of SN number 1 to 'nom', and of SN number 2 to 'acc'.
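The SN SV SN rule just described can be sketched in Python as follows. The dictionary layout for a chunk and the name `assign_case` are hypothetical, chosen only for illustration; real rules are written in Apertium's own transfer rule files.

```python
def assign_case(chunks):
    """Set the case of the noun phrases in an SN SV SN sequence.

    chunks: list of dicts with 'name' and 'tags' (phrase type first).
    Returns a new list; the input is left unchanged.
    """
    types = [c["tags"][0] for c in chunks]
    if types != ["SN", "SV", "SN"]:
        return chunks  # the rule does not apply
    out = [dict(c, tags=list(c["tags"])) for c in chunks]
    # First SN is the subject (nom), second SN is the object (acc).
    for chunk, case in ((out[0], "nom"), (out[2], "acc")):
        chunk["tags"] = [case if t == "CD" else t for t in chunk["tags"]]
    return out


before = [
    {"name": "det_nom", "tags": ["SN", "sg", "CD"]},
    {"name": "verb", "tags": ["SV", "past"]},
    {"name": "det_nom", "tags": ["SN", "sg", "CD"]},
]
after = assign_case(before)
for c in after:
    print(c["name"], c["tags"])
```

The chunks come out in the same order; only the 'CD' placeholder is replaced.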

See Chunking: A full example for more.

Chunking: An Example

I saw a signal

becomes after tagger disambiguation

^prpers<prn><subj><p1><mf><sg>$ 
^see<vblex><past>$ 
^a<det><ind><sg>$ 
^signal<n><sg>$.

which is transferred and chunked into

^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$ 
^verb<SV><past>{^vidi<vblex><past>$}$ 
^nom<SN><sg><nom>{^signalo<n><2><3><4>$}$.

and is transformed by the rule SN SV SN<nom> -> SN SV SN<acc> into

^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$
^verb<SV><past>{^vidi<vblex><past>$}$
^nom<SN><sg><acc>{^signalo<n><2><3><4>$}$.

Note that the chunk nom now has the tags <SN><sg><acc>, so ^signalo<n><2><3><4>$ takes <sg> and <acc> from it when unchunking:

^prpers<prn><subj><p1><mf><sg>$ 
^vidi<vblex><past>$ 
^signalo<n><sg><acc>$.
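The unchunking step can be sketched with Python's `re` module. This is a simplified model rather than Apertium's actual postchunk code: the function name `unchunk` is hypothetical, and dropping a reference that points past the last chunk tag (such as <4> in the nom chunk) is an assumption made so that the sketch reproduces the output shown above.

```python
import re

# ^name<tag>...{inner}$ : a chunk with its tags and its inner lexical units.
CHUNK = re.compile(r"\^([^<{]+)((?:<[^>]+>)+)\{(.*?)\}\$")
TAG = re.compile(r"<([^>]+)>")


def unchunk(stream):
    """Replace each chunk by its inner words, resolving numeric references."""
    def expand(chunk_match):
        chunk_tags = TAG.findall(chunk_match.group(2))

        def resolve(tag_match):
            name = tag_match.group(1)
            if not name.isdigit():
                return tag_match.group(0)  # ordinary tag, keep as is
            index = int(name)
            if 1 <= index <= len(chunk_tags):
                return f"<{chunk_tags[index - 1]}>"
            return ""  # assumption: out-of-range references are dropped

        return TAG.sub(resolve, chunk_match.group(3))

    return CHUNK.sub(expand, stream)


chunked = ("^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$ "
           "^verb<SV><past>{^vidi<vblex><past>$}$ "
           "^nom<SN><sg><acc>{^signalo<n><2><3><4>$}$.")
unchunked = unchunk(chunked)
print(unchunked)
```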
