Constraint Grammar

Terminology

See also: Apertium stream format

cohort — a surface form of a word together with all of its analyses (possible lexical units); in other words, an ambiguous lexical unit.

Apertium equivalent: ^words/word<n><pl>/word<vblex><pres><p3><sg>$

baseform — the lemma of a word.
reading — a single analysis of a word.

Apertium equivalent: ^word<n><pl>$

wordform — a surface form of a word.
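
To make the terminology concrete, here is a minimal Python sketch that splits one Apertium stream-format unit into its wordform and readings. The function name `parse_cohort` is hypothetical, and the sketch assumes a single well-formed unit with no escaped characters and no multiword analyses; it is not how Apertium or vislcg3 parse the stream.

```python
import re

def parse_cohort(unit):
    """Split '^surface/analysis1/analysis2$' into (wordform, readings).

    Each reading is a (baseform, [tags]) pair. Simplifying assumption:
    no escaped '/', '<' or '$' characters appear in the unit.
    """
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a unit of the form ^...$")
    wordform, *analyses = unit[1:-1].split("/")
    readings = []
    for analysis in analyses:
        baseform = analysis.split("<", 1)[0]        # text before the first tag
        tags = re.findall(r"<([^>]+)>", analysis)   # tag names without brackets
        readings.append((baseform, tags))
    return wordform, readings

wordform, readings = parse_cohort("^words/word<n><pl>/word<vblex><pres><p3><sg>$")
print(wordform)                 # the wordform of the cohort
for baseform, tags in readings:
    print(baseform, tags)       # one line per reading
```

In this example the cohort has the wordform "words" and two readings, both with the baseform "word".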

Basic Rule Format

Sets

Sets are defined like this:

LIST VERB = vblex vbser ;     # matches <vblex> or <vbser>
LIST NSG = (n sg) ;           # matches <n><sg>
LIST TO = "to" ;              # matches the lemma "to"
LIST CASE = (n nom) (n acc) ; # matches <n><nom> or <n><acc>
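
The matching behaviour described in the comments above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: a reading is represented as a set of tag strings, the lemma is treated as a quoted tag, and the helper name `matches` is hypothetical.

```python
# Each LIST is modelled as a list of alternatives. An alternative is a set of
# tags that must ALL be present in a reading; the alternatives are OR-ed.
VERB = [{"vblex"}, {"vbser"}]          # LIST VERB = vblex vbser ;
NSG = [{"n", "sg"}]                    # LIST NSG = (n sg) ;
TO = [{'"to"'}]                        # LIST TO = "to" ;
CASE = [{"n", "nom"}, {"n", "acc"}]    # LIST CASE = (n nom) (n acc) ;

def matches(reading, tag_list):
    """True if the reading contains every tag of at least one alternative."""
    return any(alternative <= reading for alternative in tag_list)

reading = {'"house"', "n", "sg", "nom"}   # illustrative reading, not from a real analyser
print(matches(reading, NSG))    # True: has both n and sg
print(matches(reading, CASE))   # True: has n and nom
print(matches(reading, VERB))   # False: neither vblex nor vbser
```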

Context Patterns

Context patterns look like this:

([LOCATION][MODIFIERS] [PATTERN])

PATTERN can be a lemma, set of tags, or the name of a set.

Symbol	Meaning
0	The current word
1	The word following the current word
-1	The word preceding the current word
2	The word 2 words after the current word
C	Every reading at this position must match the pattern (normally only one has to)
*	In that position or further in that direction

(0 (v))      # the current word must have a verb reading
(1 VERB)     # the following word matches the set "VERB"
(-1 "to")    # the previous word must be "to"
(2C (n))     # every reading of the word after the next one must be a noun
(1* (pr))    # the current word has a preposition after it
(-2* (pron)) # there is a pronoun at least two words before the current word
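
As a rough illustration of how the location and its modifiers interact, the following Python sketch evaluates a single context pattern against a toy sentence. The function `context_holds`, its parameters and the sample sentence are all hypothetical; the code models only the offset, `C` and `*` behaviour described above, not the full semantics of vislcg3.

```python
def context_holds(sentence, current, offset, pattern, careful=False, scan=False):
    """Evaluate one context pattern relative to the cohort at index `current`.

    sentence: list of cohorts; a cohort is a list of readings (sets of tags).
    careful:  models the C modifier (every reading must match).
    scan:     models the * modifier (keep moving in the same direction).
    """
    step = 1 if offset > 0 else -1
    position = current + offset
    while 0 <= position < len(sentence):
        hits = [pattern <= reading for reading in sentence[position]]
        if (all(hits) if careful else any(hits)):
            return True
        if not scan or offset == 0:
            return False
        position += step
    return False   # ran off the edge of the sentence

sentence = [
    [{'"I"', "prn", "p1"}],
    [{'"often"', "adv"}],
    [{'"walk"', "n", "sg"}, {'"walk"', "vblex", "pres"}],   # current word
    [{'"to"', "pr"}],
]
print(context_holds(sentence, 2, 0, {"vblex"}))                # (0 (vblex))   True
print(context_holds(sentence, 2, 0, {"vblex"}, careful=True))  # (0C (vblex))  False
print(context_holds(sentence, 2, -1, {"prn"}))                 # (-1 (prn))    False
print(context_holds(sentence, 2, -1, {"prn"}, scan=True))      # (-1* (prn))   True
```

The last two calls show the effect of `*`: the pronoun is two words back, so the plain `-1` test fails while the scanning `-1*` test succeeds.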

Rules

Rules look like this:

SELECT [FORM] IF [CONTEXT] ;
REMOVE [FORM] IF [CONTEXT] ;

Here FORM is a lemma, a set of tags, or the name of a set, and CONTEXT is one or more context patterns.

SELECT VERB IF (1 (det)) ;    # prefer reading from set "VERB" if following word has <det> tag
REMOVE (n) IF (-1 (adv)) ;    # disprefer <n> reading if preceding word has <adv> tag
SELECT (p1) IF (-1* (prn p1)) # prefer 1st person reading for verbs
               (0 (v)) ;      # preceded at any distance by a 1st person pronoun
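
The effect of SELECT and REMOVE on a cohort can be sketched in Python as follows. The helper names and the sample cohorts are hypothetical, and the sketch assumes that a rule never deletes the last remaining reading and that a SELECT with no matching reading leaves the cohort unchanged.

```python
def select(cohort, target):
    """Keep only readings containing all target tags; no match means no change."""
    kept = [reading for reading in cohort if target <= reading]
    return kept if kept else cohort

def remove(cohort, target):
    """Drop readings containing all target tags, but never empty the cohort."""
    kept = [reading for reading in cohort if not target <= reading]
    return kept if kept else cohort

def has_reading(cohort, pattern):
    """True if at least one reading of the cohort contains all pattern tags."""
    return any(pattern <= reading for reading in cohort)

ambiguous = [{'"walk"', "n", "sg"}, {'"walk"', "vblex", "pres"}]
following = [{'"the"', "det", "def"}]

# Modelled on: SELECT (vblex) IF (1 (det)) ;
result = select(ambiguous, {"vblex"}) if has_reading(following, {"det"}) else ambiguous
print([sorted(reading) for reading in result])

# Modelled on: REMOVE (n) IF (-1 (adv)) ; here applied with the context assumed true
print([sorted(reading) for reading in remove(ambiguous, {"n"})])
```

Both rules leave only the verb reading in this example; SELECT states which reading to keep, while REMOVE states which one to discard.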

Note on parentheses

The use of parentheses to distinguish between tags and lists/sets seems to be the main point of confusion for people learning CG. Given the morphological tags tag1 and tag2, we can write sets and rules like this:

LIST set1 = tag1 ;
LIST set2 = (tag1 tag2) ; # matches a word with both tag1 and tag2
LIST set3 = tag1 tag2 ;   # matches a word with tag1 or tag2
LIST word = "hello" ;

SELECT:1a (tag1) (1 word) ;
SELECT:1b  set1  (1 word) ;   # equivalent to 1a

SELECT:2a (tag1 tag2) (1 word) ;
SELECT:2b  set2       (1 word) ;   # equivalent to 2a

SELECT:3a tag1 (1 word) ;
SELECT:3b tag2 (1 word) ;
SELECT:3c set3 (1 word) ;   # equivalent to 3a and 3b combined

SELECT:1c  set1  (1 ("hello")) ; # equivalent to 1a (or 1b)

Languages using CG in Apertium

Many Apertium languages use CG. As of 2014-06-27, the following had CGs of over 100 rules (rule count first):

3888 apertium-nno (based on the Oslo-Bergen tagger)
3649 apertium-sme (from Giellatekno)
2275 apertium-nob (based on the Oslo-Bergen tagger)
1552 apertium-est
1524 apertium-fin (based on Fred Karlsson's)
850 apertium-dan
594 apertium-gle (1207 in gle-eng.rlx)
453 apertium-fao (from Giellatekno)
298 apertium-spa
279 apertium-bre [1]
255 apertium-cat
205 apertium-hbs [2], also hbs-mkd.rlx with syntax rules
190 apertium-isl
131 apertium-cym [3]
76 apertium-tur
127 apertium-eng
118 apertium-mkd
123 apertium-tat [4]
308 apertium-rus [5]
150 apertium-kaz

When is CG needed?

Some of the CG rules written for the languages above could instead be written as forbid rules in the TSX format used by apertium-tagger. If all the rules your language pair needs can be expressed in the .tsx format, you can opt for a simpler design without a CG module in that pair.

Editor support

CG-3 IDE – the official vislcg3 CG IDE
Gedit syntax highlighting (also for any other editor that uses gtksourceview)
Emacs: a mode for editing and testing CG grammars (highlighting + IDE-like features)

External links

VISL CG-3 Development Information + documentation and downloads
Basic Tutorial for VISL CG-3
cg-mode for emacs, gives basic syntax highlighting and indentation
Kevin Donnelly's CG tutorial
Hulden M, Francom J (2012). Boosting Statistical Tagger Accuracy with Simple Rule-Based Grammars. Proc. LREC 2012, pp. 2114-2117. Shows how 20 hours (very little time!) of writing disambiguation rules gives substantial improvements. Some of the rules shown could also be implemented in the TSX format used by apertium-tagger.
