Difference between revisions of "Constraint Grammar"

From Apertium
(39 intermediate revisions by 7 users not shown)
[[Contraintes grammaticales|En français]]

{{TOCD}}

'''Constraint Grammar''' (CG) is a tool that can be used to POS-tag ambiguous text. Free constraint grammars developed outside the Apertium project exist for Norwegian (the Oslo-Bergen tagger), the Sámi languages and Faroese (both from Giellatekno), and Finnish (by Fred Karlsson).
 
==Terminology==
{{see-also|Apertium stream format}}
* ''cohort'' &mdash; a [[surface form]] of a word along with its analyses (possible [[lexical unit]]s); in Apertium terms, an ''ambiguous'' lexical unit.
::Apertium equivalent: <code>^words/word<n><pl>/word<vblex><pres><p3><sg>$</code>
* ''baseform'' &mdash; the [[lemma]] of a word.
* ''reading'' &mdash; a single analysis of a word.
::Apertium equivalent: <code>^word<n><pl>$</code>
* ''wordform'' &mdash; a [[surface form]] of a word.

==Basic Rule Format==

===Sets===

Sets are defined like this:

 LIST VERB = vblex vbser ;     # matches <vblex> or <vbser>
 LIST NSG = (n sg) ;           # matches <n><sg>
 LIST TO = "to" ;              # matches the lemma "to"
 LIST CASE = (n nom) (n acc) ; # matches <n><nom> or <n><acc>

===Context Patterns===

Context patterns look like this:

 ([LOCATION][MODIFIERS] [PATTERN])

PATTERN can be a lemma, a set of tags, or the name of a set.

{|
! Symbol !! Meaning
|-
| 0 || The current word
|-
| 1 || The word following the current word
|-
| -1 || The word preceding the current word
|-
| 2 || The word two words after the current word
|-
| C || Every reading at this position must match the pattern (normally only one has to)
|-
| * || That position or any position further in that direction
|}

 (0 (v))      # the current word must have a verb reading
 (1 VERB)     # the following word matches the set "VERB"
 (-1 "to")    # the previous word must be "to"
 (2C (n))     # every reading of the word after the next one must be a noun
 (1* (pr))    # the current word has a preposition somewhere after it
 (-2* (prn))  # there is a pronoun at least two words before the current word

===Rules===

Rules look like this:

 SELECT [FORM] IF [CONTEXT] ;
 REMOVE [FORM] IF [CONTEXT] ;

Here FORM is a lemma, a set of tags, or the name of a set, and CONTEXT is one or more context patterns, all of which must match.

 SELECT VERB IF (1 (det)) ;    # keep only readings from the set "VERB" if the following word has a <det> reading
 REMOVE (n) IF (-1 (adv)) ;    # discard <n> readings if the preceding word has an <adv> reading
 SELECT (p1) IF (-1* (prn p1)) # keep only 1st person readings of verbs that are
                (0 (v)) ;      # preceded at any distance by a 1st person pronoun

==Note on parentheses==
The use of parentheses to distinguish between tags and lists/sets seems to be the main point of confusion for people learning CG. Given the morphological tags <code>tag1</code> and <code>tag2</code>, we can write rules like these:

 LIST set1 = tag1 ;
 LIST set2 = (tag1 tag2) ; # matches a reading with both tag1 and tag2
 LIST set3 = tag1 tag2 ;   # matches a reading with tag1 or tag2
 LIST word = "hello" ;

 SELECT:1a (tag1) (1 word) ;
 SELECT:1b  set1  (1 word) ;   # equivalent to 1a

 SELECT:2a (tag1 tag2) (1 word) ;
 SELECT:2b  set2       (1 word) ;   # equivalent to 2a

 SELECT:3a (tag1) (1 word) ;
 SELECT:3b (tag2) (1 word) ;
 SELECT:3c  set3  (1 word) ;   # equivalent to 3a and 3b combined

 SELECT:1c  set1  (1 ("hello")) ; # equivalent to 1a (or 1b)

==Languages using CG in Apertium==
* [[Breton]]
* [[Welsh]]
* [[Norwegian Nynorsk and Norwegian Bokmål]]
* [[Sámi languages]]
* [[Irish Gaelic]]

and many others. As of 2014-06-27, the following languages have CGs of over 100 rules:

* 3888 [[apertium-nno]] (based on the Oslo-Bergen tagger)
* 3649 [[apertium-sme]] (from Giellatekno)
* 2275 [[apertium-nob]] (based on the Oslo-Bergen tagger)
* 1552 [[apertium-est]]
* 1524 [[apertium-fin]] (based on Fred Karlsson's)
* 850 [[apertium-dan]]
* 594 [[apertium-gle]] [https://github.com/apertium/apertium-gle-eng/blob/master/apertium-gle-eng.gle-eng.rlx 1207 in gle-eng.rlx]
* 453 [[apertium-fao]] (from Giellatekno)
* 298 [[apertium-spa]]
* 279 [[apertium-bre]] [https://github.com/apertium/apertium-bre/blob/master/apertium-bre.bre.rlx]
* 255 [[apertium-cat]]
* 205 [[apertium-hbs]] [https://github.com/apertium/apertium-hbs/blob/master/apertium-hbs.hbs.rlx], also [https://github.com/apertium/apertium-hbs-mkd/blob/master/apertium-hbs-mkd.hbs-mkd.rlx hbs-mkd.rlx with syntax rules]
* 190 [[apertium-isl]]
* 131 [[apertium-cym]] [https://github.com/apertium/apertium-cym/blob/master/apertium-cym.cym.rlx]
* {{#lst:apertium-tur/stats|rlx_rules}} [[apertium-tur]]
* 127 [[apertium-eng]]
* 118 [[apertium-mkd]]
* {{#lst:apertium-tat/stats|rlx_rules}} [[apertium-tat]] [https://github.com/apertium/apertium-tat/blob/master/apertium-tat.tat.rlx]
* {{#lst:apertium-rus/stats|rlx_rules}} [[apertium-rus]] [https://github.com/apertium/apertium-rus/blob/master/apertium-rus.rus.rlx]
* {{#lst:apertium-kaz/stats|rlx_rules}} [[apertium-kaz]]

==When is CG needed?==

Some of the CG rules written for the languages above could instead be written as forbid rules in the [[TSX format]] used by apertium-tagger. If the rules your language pair needs can be expressed in the .tsx format, you can opt for a simpler design without a CG module in that language pair.

==Editor support==
* [http://beta.visl.sdu.dk/cg3ide.html CG-3 IDE] – the official vislcg3 CG IDE
* [https://github.com/goavki/syntxfile_gedit_CG/ Gedit] – syntax highlighting (also works in any other editor that uses gtksourceview)
* [[Emacs#CG|Emacs]] – a mode for editing and testing CG grammars (highlighting plus IDE-like features)

==See also==
* [[Apertium and Constraint Grammar]] -- installation and use
* [[Introduksjon til føringsgrammatikk]] -- a HOWTO, in Norwegian Bokmål
* [[Rule-based finite-state disambiguation]] -- GSoC 2012 project by [[User:Krvoje]], a "CG light" (or, a more apertiummy CG) with rules in XML compiled to an FST
* [[Constraint Grammar/Speed]] – some tips on speeding up your rules
* [[Constraint Grammar/Optimisation]] – ideas on how to optimise the vislcg3 engine

==External links==
* [http://beta.visl.sdu.dk/cg3.html VISL CG-3 Development Information], plus documentation and downloads
* [http://beta.visl.sdu.dk/cg3_howto.pdf Basic Tutorial for VISL CG-3]
* [http://github.com/unhammer/cg-mode cg-mode for emacs, gives basic syntax highlighting and indentation]
* [http://kevindonnelly.org.uk/2010/05/constraint-grammar-tutorial/ Kevin Donnelly's CG tutorial]
* [http://www.lrec-conf.org/proceedings/lrec2012/pdf/1075_Paper.pdf Hulden M, Francom J (2012) Boosting Statistical Tagger Accuracy with Simple Rule-Based Grammars, Proc. LREC 2012, pp. 2114-2117] shows how 20 hours (very little time!) of writing disambiguation rules gives substantial improvements. Some of the rules shown may also be implemented in the [[TSX format]] used by apertium-tagger.

[[Category:Constraint Grammar|*]]
[[Category:Documentation in English]]

Revision as of 13:30, 10 September 2020

En français

Constraint Grammar (CG) is a tool that can be used to POS-tag ambiguous text. Free constraint grammars developed outside the Apertium project exist for Norwegian (the Oslo-Bergen tagger), the Sámi languages and Faroese (both from Giellatekno), and Finnish (by Fred Karlsson).

Terminology

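See also: Apertium stream format

  • cohort – a surface form of a word along with its analyses (possible lexical units); in Apertium terms, an ambiguous lexical unit.
    Apertium equivalent: ^words/word<n><pl>/word<vblex><pres><p3><sg>$
  • baseform – the lemma of a word.
  • reading – a single analysis of a word.
    Apertium equivalent: ^word<n><pl>$
  • wordform – a surface form of a word.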
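
As a rough illustration of these terms, the following Python sketch splits the cohort example from this section into its wordform and its readings (each a baseform plus tags). The function name parse_cohort is hypothetical, and the sketch assumes a simplified stream format with no escaped characters, multiword units or unknown-word markers.

```python
import re


def parse_cohort(unit):
    """Split one ^wordform/analysis/analysis$ unit into (wordform, readings).

    Simplifying assumption: no escaped '/', '^' or '$' inside the unit.
    """
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a unit of the form ^...$")
    wordform, *analyses = unit[1:-1].split("/")
    readings = []
    for analysis in analyses:
        baseform = analysis.split("<", 1)[0]      # text before the first tag
        tags = re.findall(r"<([^>]+)>", analysis)  # tag names without <>
        readings.append((baseform, tags))
    return wordform, readings


wordform, readings = parse_cohort("^words/word<n><pl>/word<vblex><pres><p3><sg>$")
print(wordform)
for baseform, tags in readings:
    print(baseform, tags)
```

Here the cohort has the wordform "words" and two readings that share the baseform "word"; disambiguation with CG amounts to choosing between such readings.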

Basic Rule Format

Sets

Sets are defined like this:

LIST VERB = vblex vbser ;     # matches <vblex> or <vbser>
LIST NSG = (n sg) ;           # matches <n><sg>
LIST TO = "to" ;              # matches the lemma "to"
LIST CASE = (n nom) (n acc) ; # matches <n><nom> or <n><acc>
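
One way to read these definitions is that a LIST is a series of alternatives, and a parenthesised alternative requires all of its tags to be present on the same reading. The Python sketch below models that reading of the semantics; it is not how vislcg3 is implemented, and the helper name matches, as well as the choice to store the lemma as a quoted string among the tags, are assumptions made for illustration.

```python
# Each LIST is modelled as a list of alternatives; every alternative is a
# tuple of tags that must all occur on one reading.
VERB = [("vblex",), ("vbser",)]        # LIST VERB = vblex vbser ;
NSG = [("n", "sg")]                    # LIST NSG = (n sg) ;
TO = [('"to"',)]                       # LIST TO = "to" ;
CASE = [("n", "nom"), ("n", "acc")]    # LIST CASE = (n nom) (n acc) ;


def matches(reading, tag_list):
    """True if at least one alternative has all of its tags in the reading."""
    return any(all(tag in reading for tag in alt) for alt in tag_list)


# A hypothetical reading: lemma "house", noun, singular, nominative.
house = {'"house"', "n", "sg", "nom"}

print(matches(house, NSG))    # both n and sg are present
print(matches(house, VERB))   # neither vblex nor vbser is present
print(matches(house, CASE))   # the (n nom) alternative applies
```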

Context Patterns

Context patterns look like this:

([LOCATION][MODIFIERS] [PATTERN])

PATTERN can be a lemma, set of tags, or the name of a set.

Symbol  Meaning
0       The current word
1       The word following the current word
-1      The word preceding the current word
2       The word two words after the current word
C       Every reading at this position must match the pattern (normally only one has to)
*       That position or any position further in that direction

(0 (v))      # the current word must have a verb reading
(1 VERB)     # the following word matches the set "VERB"
(-1 "to")    # the previous word must be "to"
(2C (n))     # every reading of the word after the next one must be a noun
(1* (pr))    # the current word has a preposition somewhere after it
(-2* (prn))  # there is a pronoun at least two words before the current word
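
The position symbols and modifiers above can be sketched in Python as follows. This is a simplified model rather than the vislcg3 algorithm: the function name holds and the toy sentence are hypothetical, a pattern is reduced to a single set of tags, and features such as barriers or linked tests are left out.

```python
def holds(sentence, i, offset, pattern, careful=False, scan=False):
    """Evaluate one context pattern relative to the cohort at index i.

    sentence: list of cohorts, each a list of readings (sets of tags)
    offset:   0, 1, -1, 2, ... as in the table above
    careful:  the C modifier, every reading at the position must match
    scan:     the * modifier, keep moving in the direction of the offset
    """
    step = 1 if offset > 0 else -1
    pos = i + offset
    while 0 <= pos < len(sentence):
        hits = [pattern <= reading for reading in sentence[pos]]
        if (all(hits) if careful else any(hits)):
            return True
        if not scan or offset == 0:
            return False
        pos += step
    return False  # ran off the edge of the sentence


# Toy analysis of "I walk to town"; the tags are illustrative only.
sentence = [
    [{"prn", "p1", "sg"}],               # I
    [{"n", "sg"}, {"vblex", "pres"}],    # walk (ambiguous)
    [{"pr"}],                            # to
    [{"n", "sg"}],                       # town
]

print(holds(sentence, 1, 0, {"vblex"}))                # (0 (vblex))
print(holds(sentence, 1, 0, {"vblex"}, careful=True))  # (0C (vblex))
print(holds(sentence, 1, 1, {"n"}, scan=True))         # (1* (n))
print(holds(sentence, 1, -1, {"prn"}, scan=True))      # (-1* (prn))
```

The second call fails because "walk" still has a noun reading, which is exactly the difference the C modifier makes.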


Rules

Rules look like this:

SELECT [FORM] IF [CONTEXT] ;
REMOVE [FORM] IF [CONTEXT] ;

Here FORM is a lemma, a set of tags, or the name of a set, and CONTEXT is one or more context patterns, all of which must match.

SELECT VERB IF (1 (det)) ;    # keep only readings from the set "VERB" if the following word has a <det> reading
REMOVE (n) IF (-1 (adv)) ;    # discard <n> readings if the preceding word has an <adv> reading
SELECT (p1) IF (-1* (prn p1)) # keep only 1st person readings of verbs that are
               (0 (v)) ;      # preceded at any distance by a 1st person pronoun
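
The effect of the two rule types on a single cohort can be sketched in Python as below, assuming the context has already been checked and found to hold. The function names select and remove are hypothetical. Leaving the cohort untouched when nothing matches, and never deleting the last remaining reading, are assumptions about usual CG behaviour rather than something stated in the text above.

```python
def select(cohort, target):
    """Keep only the readings that contain every tag in target."""
    kept = [reading for reading in cohort if target <= reading]
    return kept if kept else cohort   # no match: the rule does nothing


def remove(cohort, target):
    """Discard the readings that contain every tag in target."""
    kept = [reading for reading in cohort if not target <= reading]
    return kept if kept else cohort   # never leave a cohort with no readings


# The two readings of "words" from the Terminology section.
words = [{"n", "pl"}, {"vblex", "pres", "p3", "sg"}]

print(select(words, {"vblex"}))   # only the verb reading survives
print(remove(words, {"n"}))       # same result, reached by discarding the noun
print(remove([{"n", "pl"}], {"n"}))  # a sole reading is kept
```

SELECT and REMOVE are two routes to the same goal: SELECT names the reading to keep, while REMOVE names a reading to rule out and leaves any others in place.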

Note on parentheses

The use of parentheses to distinguish between tags and lists/sets seems to be the main point of confusion for people learning CG. Given the morphological tags tag1 and tag2, we can write rules like these:

LIST set1 = tag1 ;
LIST set2 = (tag1 tag2) ; # matches a word with both tag1 and tag2
LIST set3 = tag1 tag2 ;   # matches a word with tag1 or tag2
LIST word = "hello" ;

SELECT:1a (tag1) (1 word) ;
SELECT:1b  set1  (1 word) ;   # equivalent to 1a

SELECT:2a (tag1 tag2) (1 word) ;
SELECT:2b  set2       (1 word) ;   # equivalent to 2a

SELECT:3a (tag1) (1 word) ;
SELECT:3b (tag2) (1 word) ;
SELECT:3c set3 (1 word) ;   # equivalent to 3a and 3b combined

SELECT:1c  set1  (1 ("hello")) ; # equivalent to 1a (or 1b)

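Languages using CG in Apertium

  • Breton
  • Welsh
  • Norwegian Nynorsk and Norwegian Bokmål
  • Sámi languages
  • Irish Gaelic

and many others. As of 2014-06-27, the following languages have CGs of over 100 rules:

  • 3888 apertium-nno (based on the Oslo-Bergen tagger)
  • 3649 apertium-sme (from Giellatekno)
  • 2275 apertium-nob (based on the Oslo-Bergen tagger)
  • 1552 apertium-est
  • 1524 apertium-fin (based on Fred Karlsson's)
  • 850 apertium-dan
  • 594 apertium-gle (1207 in gle-eng.rlx)
  • 453 apertium-fao (from Giellatekno)
  • 298 apertium-spa
  • 279 apertium-bre
  • 255 apertium-cat
  • 205 apertium-hbs (also hbs-mkd.rlx with syntax rules)
  • 190 apertium-isl
  • 131 apertium-cym
  • apertium-tur
  • 127 apertium-eng
  • 118 apertium-mkd
  • apertium-tat
  • apertium-rus
  • apertium-kaz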

When is CG needed?

Some of the CG rules written for the languages above could instead be written as forbid rules in the TSX format used by apertium-tagger. If the rules your language pair needs can be expressed in the .tsx format, you can opt for a simpler design without a CG module in that language pair.

Editor support

  • CG-3 IDE – the official vislcg3 CG IDE
  • Gedit – syntax highlighting (also works in any other editor that uses gtksourceview)
  • Emacs – a mode for editing and testing CG grammars (highlighting plus IDE-like features)

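See also

  • Apertium and Constraint Grammar – installation and use
  • Introduksjon til føringsgrammatikk – a HOWTO, in Norwegian Bokmål
  • Rule-based finite-state disambiguation – GSoC 2012 project by User:Krvoje, a "CG light" (or, a more apertiummy CG) with rules in XML compiled to an FST
  • Constraint Grammar/Speed – some tips on speeding up your rules
  • Constraint Grammar/Optimisation – ideas on how to optimise the vislcg3 engine

External links

  • VISL CG-3 Development Information, plus documentation and downloads (http://beta.visl.sdu.dk/cg3.html)
  • Basic Tutorial for VISL CG-3 (http://beta.visl.sdu.dk/cg3_howto.pdf)
  • cg-mode for emacs, gives basic syntax highlighting and indentation (http://github.com/unhammer/cg-mode)
  • Kevin Donnelly's CG tutorial (http://kevindonnelly.org.uk/2010/05/constraint-grammar-tutorial/)
  • Hulden M, Francom J (2012) Boosting Statistical Tagger Accuracy with Simple Rule-Based Grammars, Proc. LREC 2012, pp. 2114-2117 (http://www.lrec-conf.org/proceedings/lrec2012/pdf/1075_Paper.pdf) shows how 20 hours (very little time!) of writing disambiguation rules gives substantial improvements. Some of the rules shown may also be implemented in the TSX format used by apertium-tagger.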