Difference between revisions of "Chunking"

From Apertium
(Link to French page)
 
(10 intermediate revisions by 6 users not shown)
Line 1: Line 1:
  +
[[Fragmentation|En français]]
==Short intro==
 
<pre>
 
jacobn> But really I have a big problem about all this "shallow transfer".
 
   
  +
{{TOCD}}
<spectie> shallow transfer = no parse trees
 
  +
==Shallow transfer ==
<spectie> basically
 
  +
Shallow transfer means there is no parse trees (which are used in "deep transfer").
<jimregan2> yep
 
 
But then how is reordering of the phrase then going to happen?
 
<jacobn> HOW is reordering of the phrase then going to happen!!
 
jimregan2> we use chunking
 
   
<jimregan2> first we reorder words in the chunk, then we reorder chunks
+
By chunking (in three-stages): First we reorder words in the chunk, then we reorder chunks.
 
* first, we match phrase patterns, like adj+noun or adj+adj+noun
 
 
* from these, we make a 'pseudo lemma', with a tag containing the type - normally 'SN' (noun phrase) or SV (verb phrase)
<jacobn> Pls tell me 'bout it or point to a web page
 
 
* then, we translate based on these pseudo words breaking the language down to its bare essentials, basically
<jimregan2> it's easy enough
 
<jimregan2> first, we match phrase patterns
 
<jimregan2> adj+noun
 
<jimregan2> adj+adj+noun
 
<jimregan2> from these, we make a 'pseudo lemma', with a tag containing the type - normally 'SN' (noun phrase) or SV (verb phrase)
 
<jimregan2> then, we translate based on these pseudo words
 
<jimregan2> breaking the language down to its bare essentials, basically
 
<jimregan2> at the moment, I'm taking the 'hard wired' parts of the english to spanish chunker, and adapting it for french
 
<jimregan2> changing 'más' to 'plus' in a macro, etc.
 
 
<spectie> but the chunks cannot be recursive
 
</pre>
 
   
==Longer intro==
+
==Chunking explained==
 
Our rules are based on the source language patterns; we need to use
 
Our rules are based on the source language patterns; we need to use
 
chunking for f.ex. English-Esperanto, so the first task is to identify those
 
chunking for f.ex. English-Esperanto, so the first task is to identify those
 
patterns.
 
patterns.
 
<pre>
 
 
the man sees the girl
 
the man sees the girl
 
</pre>
 
 
Chunking:
 
Chunking:
   
  +
<pre>
 
SN(the man) SV(sees) SN(the girl)
 
SN(the man) SV(sees) SN(the girl)
  +
</pre>
   
 
(Normally, in English those are 'NP' and 'VP' for 'noun phrase' and
 
(Normally, in English those are 'NP' and 'VP' for 'noun phrase' and
Line 47: Line 34:
 
tag the chunks with whatever information will be useful later.
 
tag the chunks with whatever information will be useful later.
   
 
So the chunks are normally given a 'pseudo lemma' that matches the
So
 
 
The chunks are normally given a 'pseudo lemma' that matches the
 
 
pattern that matched them ('the man', 'my friend' will be put in a
 
pattern that matched them ('the man', 'my friend' will be put in a
 
chunk called 'det_nom', etc.), the first tag added is the phrase type;
 
chunk called 'det_nom', etc.), the first tag added is the phrase type;
Line 57: Line 42:
   
 
So, taking 'big cat', we would get:
 
So, taking 'big cat', we would get:
  +
<pre>
 
^adj_nom<SN><sg><CD>{^granda<ad><2><3>$ ^kato<n><2><3>$}$
 
^adj_nom<SN><sg><CD>{^granda<ad><2><3>$ ^kato<n><2><3>$}$
  +
</pre>
   
(the numbers in the lemma tags mean 'take the information from chunk
+
The numbers in the lemma tags (here <2><3>) mean 'take the information from chunk
tag number #', CD means 'Case to be Determined - it's not fully
+
tag number #'. CD means 'Case to be Determined (it's not fully
 
established, as GD and ND are, but it's the logical one to use).
 
established, as GD and ND are, but it's the logical one to use).
   
so, with a simple SN SV SN, we can have a rule that outputs the same
+
So, with a simple SN SV SN, we can have a rule that outputs the same
 
things in the same order, but changes the 'CD' of SN number 1 to
 
things in the same order, but changes the 'CD' of SN number 1 to
'nom', and of SN number 2 to 'acc'. All very simple.
+
'nom', and of SN number 2 to 'acc'.
   
===Now, a note.===
 
   
The next kind of thing we should think about is the type of sentence
 
part that goes like this:
 
   
  +
==Example==
'the man you saw'
 
  +
<pre>
'the man the girl saw'
 
  +
I saw a signal
  +
</pre>
   
  +
becomes after tagger disambiguation
I don't know if we have to change word order here - probably not - but
 
  +
<pre>
the nominative and accusative are SNs 1 and 2 respectively.
 
  +
^prpers<prn><subj><p1><mf><sg>$
  +
^see<vblex><past>$
  +
^a<det><ind><sg>$
  +
^signal<n><sg>$.
  +
</pre>
   
  +
which is transfered and chunked into
But think about this:
 
  +
<pre>
 
  +
^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$
'the man my brother became'
 
  +
^verb<SV><past>{^vidi<vblex><past>$}$
 
  +
^nom<SN><sg><nom>{^signalo<n><2><3><4>$}$.
Adding accusative here is wrong, so what can we do about it? Not much.
 
  +
</pre>
Maybe in this specific instance, sure, but generally, we can only take
 
the common cases and hope for the best. There's been plenty of work
 
into statistical parsing, subject identification, etc., but it's still
 
not much better than picking the common cases, and hoping for the
 
best.
 
 
This is why we always tell people to have their translations checked
 
by a native speaker :)
 
   
  +
and transformed by rule SN SV SN<nom> -> SN SV SN<acc>
  +
<pre>
  +
^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$
  +
^verb<SV><past>{^vidi<vblex><past>$}$
  +
^nom<SN><sg><acc>{^signalo<n><2><3><4>$}$.
  +
</pre>
  +
Note how the chunk has now tags nom<SN><sg><acc> and therefore ^signalo<n><2><3><4>$ gets these tags when unchunking:
  +
<pre>
  +
^prpers<prn><subj><p1><mf><sg>$
  +
^vidi<vblex><past>$
  +
^signalo<n><sg><acc>$.
  +
</pre>
   
 
==See also==
 
==See also==
  +
* [[Chunking: A full example]]
 
* [[Apertium stream format#Chunks]]
 
* [[Apertium stream format#Chunks]]
 
* [[Preparing to use apertium-transfer-tools]]
 
* [[Preparing to use apertium-transfer-tools]]
  +
* [[English and Esperanto]]
   
 
==External links==
 
==External links==
 
* [http://en.wikipedia.org/wiki/Chunking_(computational_linguistics) wikipedia]
 
* [http://en.wikipedia.org/wiki/Chunking_(computational_linguistics) wikipedia]
* [http://nltk.org/doc/en/chunk.html Chunking] (Natural Language Toolkit)
+
* [http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html Chunking] (Natural Language Toolkit)
 
* [http://crfchunker.sourceforge.net/ CRFChunker] (Conditional Random Fields English Phrase Chunker)
 
* [http://crfchunker.sourceforge.net/ CRFChunker] (Conditional Random Fields English Phrase Chunker)
 
* [http://jtextpro.sourceforge.net/ JTextPro] (A Java-based Text Processing Toolkit)
 
* [http://jtextpro.sourceforge.net/ JTextPro] (A Java-based Text Processing Toolkit)
   
 
[[Category:Documentation]]
 
[[Category:Documentation]]
  +
[[Category:Writing transfer rules]]
 
  +
[[Category:Documentation in English]]
== Headline text ==
 

Latest revision as of 06:55, 8 October 2014

En français

Shallow transfer

Shallow transfer means there are no parse trees (which are used in "deep transfer"). How, then, are the words of a phrase reordered?

By chunking, which works in three stages. Words are first reordered within each chunk, and then the chunks themselves are reordered:

  • first, we match phrase patterns, such as adj+noun or adj+adj+noun
  • from these, we make a 'pseudo lemma' with a tag containing the phrase type, normally 'SN' (noun phrase) or 'SV' (verb phrase)
  • then, we translate based on these pseudo words, in effect breaking the language down to its bare essentials
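
The first two steps above can be sketched in a few lines of Python. This is only an illustration, not how Apertium's transfer modules are implemented: the pattern table, the `chunk` function, the "UNKNOWN" fallback and the pseudo lemma 'adj_adj_nom' are all hypothetical names, and left-to-right longest match is assumed as the matching strategy.

```python
# Hypothetical pattern table: a sequence of part-of-speech tags maps to
# (pseudo lemma, phrase type). Only 'SN' and 'SV' come from the text above.
PATTERNS = {
    ("adj", "n"): ("adj_nom", "SN"),
    ("adj", "adj", "n"): ("adj_adj_nom", "SN"),
    ("n",): ("nom", "SN"),
    ("vblex",): ("verb", "SV"),
}
MAX_LEN = max(len(pattern) for pattern in PATTERNS)


def chunk(words):
    """Group (lemma, pos) pairs into chunks, longest match first, left to right."""
    chunks = []
    i = 0
    while i < len(words):
        # Try the longest candidate window first, then shrink it.
        for size in range(min(MAX_LEN, len(words) - i), 0, -1):
            window = words[i:i + size]
            pos_seq = tuple(pos for _, pos in window)
            if pos_seq in PATTERNS:
                name, phrase_type = PATTERNS[pos_seq]
                chunks.append((name, phrase_type, [lemma for lemma, _ in window]))
                i += size
                break
        else:
            # No pattern matched: pass the word through as its own chunk.
            lemma, pos = words[i]
            chunks.append((pos, "UNKNOWN", [lemma]))
            i += 1
    return chunks


sentence = [("big", "adj"), ("black", "adj"), ("cat", "n"), ("sleep", "vblex")]
for name, phrase_type, lemmas in chunk(sentence):
    print(f"{phrase_type}({' '.join(lemmas)}) pseudo lemma: {name}")
```

Trying the longest window first means that adj+adj+noun is preferred over adj+noun whenever both could match at the same position.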

Chunking explained

Our rules are based on source-language patterns. Chunking is needed for language pairs such as English-Esperanto, so the first task is to identify those patterns.

the man sees the girl

Chunking:

SN(the man) SV(sees) SN(the girl)

(Normally, in English those are 'NP' and 'VP' for 'noun phrase' and 'verb phrase' respectively, but we'll stick to the established convention in Apertium.)

Two rules are needed to make those chunks. Further chunking rules can match 'the tall man', 'my favourite Spanish friend', 'the prettiest Polish girl', etc. as SN, and 'was going', 'had been going', 'must have been going' as SV. We first consider these patterns separately, but tag the chunks with whatever information will be useful later.

The chunks are normally given a 'pseudo lemma' that matches the pattern that matched them ('the man' and 'my friend' will be put in a chunk called 'det_nom', etc.). The first tag added is the phrase type; after that come the tags that are needed in the next set of rules. Essentially, we're treating phrase chunks in the same way that the morphological analyser treats lexemes ('surface forms').

So, taking 'big cat', we would get:

^adj_nom<SN><sg><CD>{^granda<ad><2><3>$ ^kato<n><2><3>$}$

The numbers in the lemma tags (here <2><3>) mean 'take the information from chunk tag number #', so <2> refers to the second chunk tag (<sg>) and <3> to the third (<CD>). CD means 'Case to be Determined' (it's not as fully established as GD and ND are, but it's the logical one to use).
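
As a rough sketch of how such numbered tags could be resolved, the Python below parses one chunk and substitutes each numeric tag with the chunk tag at that position. It is not Apertium's own code: the regular expressions and the `unchunk` function are illustrative, they handle only the simple chunk shape shown above, and silently dropping a reference that points past the last chunk tag is an assumption.

```python
import re

# Simplified patterns for the chunk notation used on this page only.
CHUNK_RE = re.compile(r"\^([^<{]+)((?:<[^>]+>)*)\{(.*)\}\$")
WORD_RE = re.compile(r"\^([^<$]+)((?:<[^>]+>)*)\$")
TAG_RE = re.compile(r"<([^>]+)>")


def unchunk(chunk):
    """Replace numeric tags in each inner word with the chunk's own tags."""
    match = CHUNK_RE.fullmatch(chunk)
    if match is None:
        raise ValueError(f"not a chunk: {chunk!r}")
    chunk_tags = TAG_RE.findall(match.group(2))
    words = []
    for lemma, tags in WORD_RE.findall(match.group(3)):
        resolved = []
        for tag in TAG_RE.findall(tags):
            if tag.isdigit():
                index = int(tag) - 1  # references are 1-based
                if 0 <= index < len(chunk_tags):
                    resolved.append(chunk_tags[index])
                # Assumption: a reference past the last chunk tag is dropped.
            else:
                resolved.append(tag)
        words.append("^" + lemma + "".join(f"<{t}>" for t in resolved) + "$")
    return " ".join(words)


print(unchunk("^adj_nom<SN><sg><CD>{^granda<ad><2><3>$ ^kato<n><2><3>$}$"))
# prints: ^granda<ad><sg><CD>$ ^kato<n><sg><CD>$
```

Because the words only point at the chunk's tags, a later rule can change a single chunk tag (for example the case) and every word inside the chunk picks up the new value.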

So, for a simple SN SV SN sequence, we can have a rule that outputs the same chunks in the same order, but changes the 'CD' of SN number 1 to 'nom' and that of SN number 2 to 'acc'.
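
A minimal Python sketch of such a rule follows. Real Apertium rules are written in the transfer rule files, not in Python; the dictionary representation of a chunk and the names `set_case` and `sn_sv_sn` are hypothetical and serve only to show the logic.

```python
def set_case(chunk, case):
    """Return a copy of the chunk with its 'CD' tag replaced by `case`."""
    new_tags = [case if tag == "CD" else tag for tag in chunk["tags"]]
    return {**chunk, "tags": new_tags}


def sn_sv_sn(chunks):
    """If the phrase types are exactly SN SV SN, mark subject and object case."""
    phrase_types = [chunk["tags"][0] for chunk in chunks]
    if phrase_types != ["SN", "SV", "SN"]:
        return chunks  # the rule does not apply; leave the input unchanged
    subject, verb, obj = chunks
    # Same chunks, same order; only the case tags change.
    return [set_case(subject, "nom"), verb, set_case(obj, "acc")]


# 'the man sees the girl' as three chunks; the first tag is the phrase type.
sentence = [
    {"lemma": "det_nom", "tags": ["SN", "sg", "CD"]},
    {"lemma": "verb", "tags": ["SV"]},
    {"lemma": "det_nom", "tags": ["SN", "sg", "CD"]},
]
for chunk in sn_sv_sn(sentence):
    print(chunk["lemma"], chunk["tags"])
```

The rule looks only at the phrase types, which is what keeps it independent of the words inside each chunk.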


Example

I saw a signal

becomes after tagger disambiguation

^prpers<prn><subj><p1><mf><sg>$ 
^see<vblex><past>$ 
^a<det><ind><sg>$ 
^signal<n><sg>$.

which is transferred and chunked into

^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$ 
^verb<SV><past>{^vidi<vblex><past>$}$ 
^nom<SN><sg><nom>{^signalo<n><2><3><4>$}$.

and transformed by the rule SN SV SN<nom> -> SN SV SN<acc> into

^prnpers<SN><p1><mf><sg>{^prpers<prn><subj><2><3><4>$}$
^verb<SV><past>{^vidi<vblex><past>$}$
^nom<SN><sg><acc>{^signalo<n><2><3><4>$}$.

Note that the chunk now has the tags nom<SN><sg><acc>, and therefore ^signalo<n><2><3><4>$ picks up <sg> and <acc> when unchunking:

^prpers<prn><subj><p1><mf><sg>$ 
^vidi<vblex><past>$ 
^signalo<n><sg><acc>$. 

See also

External links