Constraint Grammar

From Apertium

Revision as of 15:09, 5 March 2019


Constraint Grammar (CG) is a rule-based formalism that can be used to disambiguate the part-of-speech analyses of ambiguous text. Free constraint grammars developed outside the Apertium project exist for Norwegian (the Oslo-Bergen tagger), the Sámi languages (from Giellatekno), Faroese (also from Giellatekno), and Finnish (by Fred Karlsson).

Terminology

See also: Apertium stream format

  • cohort: a word form together with all of its possible analyses.
    Apertium equivalent: ^words/word<n><pl>/word<vblex><pres><p3><sg>$
  • baseform: the lemma of a word.
  • reading: a single analysis of a word.
    Apertium equivalent: ^word<n><pl>$
  • wordform: a surface form of a word.
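
For comparison, the same ambiguous word can also be written in the native text format read by vislcg3. The sketch below is an illustration based on the Apertium example above, not output copied from a real run: the wordform sits on its own line inside "<...>", and each reading follows on a tab-indented line that starts with the baseform in quotes, followed by its tags.

```
"<words>"
	"word" n pl
	"word" vblex pres p3 sg
```

Here the whole block corresponds to one ambiguous word with two readings; a rule that selects the noun reading would leave only the first indented line.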

Basic Rule Format

Sets

Sets are defined like this:

LIST verb = vblex vbser ;     # matches <vblex> or <vbser>
LIST Nsg = (n sg) ;           # matches <n><sg>
LIST to = "to" ;              # matches the lemma "to"
LIST case = (n nom) (n acc) ; # matches <n><nom> or <n><acc>

Rules

Rules look like this:

SELECT [FORM] [CONTEXT] ;
REMOVE [FORM] [CONTEXT] ;

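FORM is a lemma, a set of tags, or the name of a set, and CONTEXT is one or more context patterns. SELECT keeps the readings that match FORM and discards the others; REMOVE discards the readings that match FORM.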
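
As a rough illustration, a minimal grammar file that combines set definitions with one rule of each kind might look like the sketch below. The set names (Det, Noun, Verb), the tag det, and the choice of delimiters are assumptions made for this example and are not taken from any particular Apertium language pair; the context patterns it uses are described in the next section.

```cg3
DELIMITERS = "<.>" "<!>" "<?>" ;   # assumed sentence boundaries

LIST Det  = det ;                  # hypothetical set: determiners
LIST Noun = n ;                    # hypothetical set: nouns
LIST Verb = vblex vbser ;          # hypothetical set: verbs

SECTION

# Keep only the noun reading(s) of a word that directly follows a determiner.
SELECT Noun (-1 Det) ;

# Discard the verb reading(s) of a word that directly follows a determiner.
REMOVE Verb (-1 Det) ;
```

Note that CG never removes the last remaining reading of a word, so a REMOVE rule has no effect on a word that has only one reading left.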

Context Patterns

Context patterns look like this:

([LOCATION][MODIFIERS] [PATTERN])

PATTERN can be a lemma, set of tags, or the name of a set.

Symbol  Meaning
0       The current word
1       The word following the current word
-1      The word preceding the current word
2       The word two positions after the current word
C       Every reading at this position must match the pattern (normally only one has to)
*       In that position or further in that direction

(0 (v))      # the current word must have a verb reading
(-1 "to")    # the previous word must be "to"
(2C (n))     # every reading of the word after the next one must be a noun
(1* (pr))    # the current word has a preposition after it
(-2* (pron)) # there is a pronoun at least two words before the current word
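
To show how such patterns combine inside complete rules, here is a small sketch. The set names (Noun, Verb, Prep, Pron) are hypothetical and built from the tags used in the examples above, and the rules are meant to illustrate the syntax rather than to be linguistically well motivated. When a rule lists several context patterns, all of them have to match for the rule to apply.

```cg3
LIST Noun = n ;      # hypothetical sets, named only for this example
LIST Verb = v ;
LIST Prep = pr ;
LIST Pron = pron ;

# Select the verb reading if the previous word has the lemma "to".
SELECT Verb (-1 ("to")) ;

# Remove the preposition reading if the next word is unambiguously a verb
# (C: every reading of that word must match).
REMOVE Prep (1C Verb) ;

# Two context patterns, both of which must hold:
# a pronoun occurs two or more words back, and the next word can be a verb.
SELECT Noun (-2* Pron) (1 Verb) ;
```

CG-3 documentation usually writes the scanning modifier before the offset, as in (*-2 Pron); the ordering used on this page is kept here for consistency.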

Note on parentheses

The use of parentheses to distinguish between tags and lists/sets seems to be the main confusing point for people learning CG. If we have the morphological tags tag1 and tag2, then we can have rules like this:

LIST set1 = tag1 ;
LIST set2 = (tag1 tag2) ; # matches a word with both tag1 and tag2
LIST set3 = tag1 tag2 ;   # matches a word with tag1 or tag2
LIST word = "hello" ;

SELECT:1a (tag1) (1 word) ;
SELECT:1b  set1  (1 word) ;   # equivalent to 1a

SELECT:2a (tag1 tag2) (1 word) ;
SELECT:2b  set2       (1 word) ;   # equivalent to 2a

SELECT:3a (tag1) (1 word) ;
SELECT:3b (tag2) (1 word) ;
SELECT:3c  set3  (1 word) ;   # equivalent to 3a and 3b combined

SELECT:1c  set1  (1 ("hello")) ; # equivalent to 1a (or 1b)

Languages using CG in Apertium

and many others. The following languages currently (as of 2014-06-27) have CGs of over 100 rules:

When is CG needed?

Currently, some of the CG rules written for the languages above could instead be written as forbid rules in the TSX format used by apertium-tagger. If all the rules for your language pair can be expressed in the .tsx format, you can opt for a simpler design without a CG module in that language pair.

Editor support

  • CG-3 IDE – the official vislcg3 CG IDE
  • Gedit syntax highlighting (also for any other editor that uses gtksourceview)
  • Emacs mode for editing and testing CG grammars (highlighting + IDE-like features)

See also

External links