Difference between revisions of "Constructing a TSX file with a Constraint Grammar"

From Apertium
#redirect[[Apertium and Constraint Grammar]]
{{TOCD}}
 
Constraint Grammar (CG) is a method of POS-tagging ambiguous text. The apertium-tagger has a basic form of this in addition to probabilistic tagging. It should be possible to use an existing CG to write a new [[TSX]] (tagger definition file), or to improve an existing one, by deriving sets of forbid/enforce rules from the CG constraints.
 
 
==Terminology==
 
 
* ''cohort'' &mdash; a [[surface form]] of a word, along with its analyses (possible [[lexical unit]]s).
 
::Apertium equivalent: <code>^words/word<n><pl>/word<vblex><pres><p3><sg>$</code>
 
* ''baseform'' &mdash; the [[lemma]] of a word.
 
* ''reading'' &mdash; a single analysis of a word.
 
::Apertium equivalent: <code>^word<n><pl>$</code>
 
* ''wordform'' &mdash; a [[surface form]] of a word.
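
The terms above can be related to the Apertium stream format with a small Python sketch. It is only an illustration: the names <code>Reading</code>, <code>Cohort</code> and <code>parse_cohort</code> are hypothetical, and the parser assumes a single <code>^...$</code> unit with no escaped <code>/</code>, <code>^</code> or <code>$</code> characters.

```python
import re
from typing import List, NamedTuple, Tuple


class Reading(NamedTuple):
    baseform: str            # the lemma of this analysis
    tags: Tuple[str, ...]    # e.g. ("n", "pl")


class Cohort(NamedTuple):
    wordform: str            # the surface form
    readings: List[Reading]  # one entry per analysis


def parse_cohort(unit: str) -> Cohort:
    """Split one '^surface/analysis/...$' unit into a wordform and its readings."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a unit of the form ^...$")
    # The first field is the surface form, the rest are analyses.
    wordform, *analyses = unit[1:-1].split("/")
    readings = []
    for analysis in analyses:
        baseform = analysis.split("<", 1)[0]
        tags = tuple(re.findall(r"<([^>]+)>", analysis))
        readings.append(Reading(baseform, tags))
    return Cohort(wordform, readings)


cohort = parse_cohort("^words/word<n><pl>/word<vblex><pres><p3><sg>$")
```

Here <code>cohort.wordform</code> is the wordform, each element of <code>cohort.readings</code> is a reading, and the <code>baseform</code> field of a reading is its lemma.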
 
 
==Labels==
 
 
Coarse tag "labels" in Constraint Grammar (CG) are specified either as a {{sc|list}} or as a {{sc|set}}. Sometimes, however, a single list or set does not cover a complete coarse tag, so several may need to be combined.
 
 
For example:
 
 
<pre>
 
LIST A-N-CC = A N CC ;
 
LIST A-pos = (A Pos) ;
 
LIST %etter/fram/opp% = ("etter" Pr) ("fram" Pr) ("frem" Pr) ("opp" Pr) ;
 
</pre>
 
 
These are three lists, which can be expressed in TSX format as below:
 
 
<pre>
 
<def-label name="A-N-CC">
 
<tags-item tags="adj.*"/>
 
<tags-item tags="n.*"/>
 
<tags-item tags="cnjcoo"/>
 
</def-label>
 
<def-label name="A-pos">
 
<tags-item tags="adj.pos.*"/>
 
</def-label>
 
<def-label name="%etter/fram/opp%">
 
<tags-item lemma="etter" tags="pr"/>
 
<tags-item lemma="fram" tags="pr"/>
 
<tags-item lemma="frem" tags="pr"/>
 
<tags-item lemma="opp" tags="pr"/>
 
</def-label>
 
</pre>
 
 
The remaining lists are converted in the same way. Note that converting every list may cause some problems, so it might be best to attempt this using only ambiguous tags to start with.
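
The conversion of the three lists above can be sketched in Python. This is a rough illustration under stated assumptions, not an existing Apertium tool: the function name <code>list_to_def_label</code> is hypothetical, the <code>TAG_MAP</code> table contains only the correspondences visible in the example, and <code>TAKES_SUBTAGS</code> (which tags receive a trailing <code>.*</code>) is a guess based on the same example.

```python
import re
from xml.sax.saxutils import quoteattr

# Assumed CG-to-Apertium tag correspondences, taken from the example above.
TAG_MAP = {"A": "adj", "N": "n", "CC": "cnjcoo", "Pos": "pos", "Pr": "pr"}
# Assumption: patterns starting with these tags are followed by further tags.
TAKES_SUBTAGS = {"adj", "n"}


def list_to_def_label(declaration: str) -> str:
    """Turn one CG 'LIST name = items ;' line into a TSX <def-label> element."""
    match = re.fullmatch(r"\s*LIST\s+(\S+)\s*=\s*(.*?)\s*;\s*", declaration)
    if match is None:
        raise ValueError("not a LIST declaration")
    name, body = match.groups()
    lines = [f"<def-label name={quoteattr(name)}>"]
    # Each item is either a parenthesised group or a single bare tag.
    for group, bare in re.findall(r"\(([^)]*)\)|(\S+)", body):
        lemma = None
        tags = []
        for part in (group or bare).split():
            if part.startswith('"'):
                lemma = part.strip('"')      # quoted items are baseforms
            else:
                tags.append(TAG_MAP[part])   # KeyError for unmapped tags
        pattern = ".".join(tags)
        if tags and tags[0] in TAKES_SUBTAGS:
            pattern += ".*"
        attrs = f"lemma={quoteattr(lemma)} " if lemma is not None else ""
        lines.append(f"  <tags-item {attrs}tags={quoteattr(pattern)}/>")
    lines.append("</def-label>")
    return "\n".join(lines)


a_n_cc = list_to_def_label("LIST A-N-CC = A N CC ;")
a_pos = list_to_def_label("LIST A-pos = (A Pos) ;")
etter = list_to_def_label('LIST %etter/fram/opp% = ("etter" Pr) ("fram" Pr) ;')
```

A tag that is missing from <code>TAG_MAP</code> raises a <code>KeyError</code>, which makes it obvious where the mapping table needs to be extended by hand.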
 
 
==Constraints==
 
 
Constraint Grammar uses a series of hand-written constraints in order to POS-tag ambiguous words.
 
 
===Forbid rules===
 
 
The operation analogous to a ''forbid rule'' is {{sc|remove}}.
 
 
<pre>
 
# 3526
 
"<bare>" REMOVE (CS) IF
 
(-1 CS)
 
;
 
</pre>
 
 
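This rule applies to the wordform "bare", which can be a subordinating conjunction, verb or adverb. It removes the subordinating-conjunction reading of "bare" when the preceding word is also a subordinating conjunction. Restricted to the string "bare bare", where both lexical units are subordinating conjunctions, this can be written in TSX format as: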
 
 
<pre>
 
<forbid>
 
<label-sequence>
 
<label-item label="bare-CS"/>

<label-item label="bare-CS"/>
 
</label-sequence>
 
</forbid>
 
</pre>
 
 
This presumes a label definition such as:
 
 
<pre>
 
<def-label name="bare-CS">
 
<tags-item lemma="bare" tags="conjsub"/>
 
</def-label>
 
</pre>
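
The mapping from a {{sc|remove}} rule with a single left-hand context to a forbid rule can be sketched as follows. The function names <code>forbid_pair</code> and <code>remove_with_left_context</code> are hypothetical, and the sketch only covers a context at position -1; it does not parse CG rules itself.

```python
from xml.sax.saxutils import quoteattr


def forbid_pair(first: str, second: str) -> str:
    """Return a <forbid> block banning `first` immediately followed by `second`."""
    return "\n".join([
        "<forbid>",
        "  <label-sequence>",
        f"    <label-item label={quoteattr(first)}/>",
        f"    <label-item label={quoteattr(second)}/>",
        "  </label-sequence>",
        "</forbid>",
    ])


def remove_with_left_context(target_label: str, context_label: str) -> str:
    """REMOVE target IF (-1 context): the context label comes first in the sequence."""
    return forbid_pair(context_label, target_label)


# The "bare bare" example: both positions use the bare-CS label.
xml = remove_with_left_context("bare-CS", "bare-CS")
```

Passing a broader context label (for instance a hypothetical label covering all subordinating conjunctions) as <code>context_label</code> would bring the forbid rule closer to the original CG rule, whose context is not limited to "bare".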
 
 
===Enforce rules===
 
 
The operation analogous to an ''enforce rule'' is {{sc|select}}, which "selects a reading, if it contains a TARGETed tag. In practice, selection is equivalent to a removal of all other readings."
 
 
<pre>
 
# 2866
 
SELECT (A Sg Neu Indef) IF
 
(0 %rundt%)
 
(1 Det-Qnt)
 
;
 
</pre>
 
 
This means: enforce <code>adj.sg.nt.indef</code> if the lemma of the word is "rundt" and the lexical unit to the right (position 1) is a quantifier (<code>det.qnt</code>).
 
 
To convert this into Apertium format, one would need to take every coarse tag that is not <code>det.qnt</code> and turn it into a forbidden label sequence, as below:
 
 
<pre>
 
<forbid>
 
<label-sequence>
 
<label-item label="%rundt%"/>

<label-item label="A-pos"/>
 
</label-sequence>
 
 
...
 
 
</forbid>
 
</pre>
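
Since selection is equivalent to removing all other readings, one possible way to generate such forbid rules is sketched below. This is an interpretation, not a confirmed procedure, and it differs from the hand-written fragment above: it forbids every competing label of the target word when it is followed by the context label. The function name <code>select_as_forbids</code> and the labels <code>rundt-A</code>, <code>rundt-Adv</code> and <code>rundt-Pr</code> are hypothetical and would need matching <code>def-label</code> definitions.

```python
from xml.sax.saxutils import quoteattr


def select_as_forbids(candidates, selected, following):
    """Emulate SELECT selected IF (1 following) by forbidding every other candidate."""
    lines = ["<forbid>"]
    for label in candidates:
        if label == selected:
            continue  # the selected reading stays allowed
        lines += [
            "  <label-sequence>",
            f"    <label-item label={quoteattr(label)}/>",
            f"    <label-item label={quoteattr(following)}/>",
            "  </label-sequence>",
        ]
    lines.append("</forbid>")
    return "\n".join(lines)


# Hypothetical labels for the readings of "rundt"; Det-Qnt is the context label.
xml = select_as_forbids(["rundt-A", "rundt-Adv", "rundt-Pr"], "rundt-A", "Det-Qnt")
```

The list of candidate labels has to be supplied by hand, because it depends on which readings the analyser actually produces for the word.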
 
 
===Prefer tags===
 
 
==Further reading==
 
 
* [http://beta.visl.sdu.dk/cg3.html vislcg3 documentation] ([http://beta.visl.sdu.dk/cg3/single/ single page])
 
* [http://beta.visl.sdu.dk/cg2_howto.html VISL: Basic how-to for vislcg (vislcg2)]
 
 
Note that vislcg3 is the version which is actively developed.
 
 
[[Category:Documentation]]
 

Latest revision as of 10:28, 23 March 2009