Difference between revisions of "Constraint Grammar"
[[Contraintes grammaticales|En français]]

{{TOCD}}

'''Constraint Grammar''' (CG) is a rule-based tool for part-of-speech tagging, that is, for choosing among the possible analyses of ambiguous text. Free constraint grammars developed outside the Apertium project exist for Norwegian (the Oslo-Bergen tagger), the Sámi languages and Faroese (both from Giellatekno), and Finnish (by Fred Karlsson).

==Terminology==
{{see-also|Apertium stream format}}

* ''cohort'': a [[surface form]] of a word together with its analyses (possible [[lexical unit]]s), i.e. an ''ambiguous'' lexical unit.
::Apertium equivalent: <code>^words/word<n><pl>/word<vblex><pres><p3><sg>$</code>
* ''baseform'': the [[lemma]] of a word.
* ''reading'': a single analysis of a word.
::Apertium equivalent: <code>^word<n><pl>$</code>
* ''wordform'': a [[surface form]] of a word.

==Basic Rule Format==
===Sets===
Sets are defined like this:

 LIST VERB = vblex vbser ;      # matches <vblex> or <vbser>
 LIST NSG = (n sg) ;            # matches <n><sg>
 LIST TO = "to" ;               # matches the lemma "to"
 LIST CASE = (n nom) (n acc) ;  # matches <n><nom> or <n><acc>

===Context Patterns===
Context patterns look like this:

 ([LOCATION][MODIFIERS] [PATTERN])

PATTERN can be a lemma, a set of tags, or the name of a set.

{|
| Symbol || Meaning
|-
| 0 || The current word
|-
| 1 || The word following the current word
|-
| -1 || The word preceding the current word
|-
| 2 || The word 2 words after the current word
|-
| C || Every reading at this position must match the pattern (normally only one has to)
|-
| * || That position or any position further in the same direction
|}

 (0 (v))       # the current word must have a verb reading
 (1 VERB)      # the following word matches the set "VERB"
 (-1 "to")     # the previous word must be "to"
 (2C (n))      # every reading of the word after the next one must be a noun
 (1* (pr))     # there is a preposition somewhere after the current word
 (-2* (prn))   # there is a pronoun at least two words before the current word

===Rules===
Rules look like this:

 SELECT [FORM] IF [CONTEXT] ;
 REMOVE [FORM] IF [CONTEXT] ;

FORM is a lemma, a set of tags, or the name of a set; CONTEXT is one or more context patterns. SELECT keeps the matching readings and discards the others; REMOVE discards the matching readings.

 SELECT VERB IF (1 (det)) ;      # keep the readings in set "VERB" if the following word has a <det> reading
 REMOVE (n) IF (-1 (adv)) ;      # discard the <n> reading if the preceding word has an <adv> reading
 SELECT (p1) IF (-1* (prn p1))   # keep the 1st person reading of a verb
                (0 (v)) ;        # if any preceding word is a 1st person pronoun

==Note on parentheses==
The use of parentheses to distinguish tags from lists/sets seems to be the main point of confusion for people learning CG: a parenthesised item is read as one or more tags, while a bare name is read as the name of a set. If we have the morphological tags <code>tag1</code> and <code>tag2</code>, then we can have rules like this:

 LIST set1 = tag1 ;
 LIST set2 = (tag1 tag2) ;        # matches a word with both tag1 and tag2
 LIST set3 = tag1 tag2 ;          # matches a word with tag1 or tag2
 LIST word = "hello" ;
 SELECT:1a (tag1) (1 word) ;
 SELECT:1b set1 (1 word) ;        # equivalent to 1a
 SELECT:2a (tag1 tag2) (1 word) ;
 SELECT:2b set2 (1 word) ;        # equivalent to 2a
 SELECT:3a (tag1) (1 word) ;      # a tag used directly as a target needs parentheses
 SELECT:3b (tag2) (1 word) ;
 SELECT:3c set3 (1 word) ;        # equivalent to 3a and 3b combined
 SELECT:1c set1 (1 ("hello")) ;   # equivalent to 1a (or 1b)

==Subreadings==
If you're dealing with compounds, the individual parts are called subreadings.

By default the last subreading is treated as the primary one. If you want the first one to be primary, add

 SUBREADINGS = LTR ;

To select on the value of a subreading, use <code>SUB:N</code> (where <code>N</code> is the position of the subreading):

 # input ^dog's/dog<n><sg>+'s<gen>/dog<n><sg>+has<vbmod><pres>$ ^eaten/eat<vblex><pp>$
 SELECT SUB:1 (vbmod) IF (1 (vblex pp)) ;
 # for the first subreading, choose "has" if the next word is a participle

To refer to a subreading in a context pattern, use a position of the form <code>1/N</code>:

 # input ^I've/I<prn>+have<vbmod>$ ^hit/hit<vblex><pres>/hit<vblex><past>/hit<vblex><pp>$
 SELECT (vblex pp) IF (-1/1 (vbmod)) ;
 # if the first subreading of the preceding word is <vbmod>, then we're probably a participle
 # rather than a finite verb

==Languages using CG in Apertium==
* [[Breton]]
* [[Welsh]]
* [[Norwegian Nynorsk and Norwegian Bokmål]]
* [[Sámi languages]]
* [[Irish Gaelic]]

and many others. As of 2014-06-27, the following languages have CGs of over 100 rules:

* 3888 [[apertium-nno]] (based on the Oslo-Bergen tagger)
* 3649 [[apertium-sme]] (from Giellatekno)
* 2275 [[apertium-nob]] (based on the Oslo-Bergen tagger)
* 1552 [[apertium-est]]
* 1524 [[apertium-fin]] (based on Fred Karlsson's)
* 850 [[apertium-dan]]
* 594 [[apertium-gle]] [https://github.com/apertium/apertium-gle-eng/blob/master/apertium-gle-eng.gle-eng.rlx 1207 in gle-eng.rlx]
* 453 [[apertium-fao]] (from Giellatekno)
* 298 [[apertium-spa]]
* 279 [[apertium-bre]] [https://github.com/apertium/apertium-bre/blob/master/apertium-bre.bre.rlx]
* 255 [[apertium-cat]]
* 205 [[apertium-hbs]] [https://github.com/apertium/apertium-hbs/blob/master/apertium-hbs.hbs.rlx], also [https://github.com/apertium/apertium-hbs-mkd/blob/master/apertium-hbs-mkd.hbs-mkd.rlx hbs-mkd.rlx with syntax rules]
* 190 [[apertium-isl]]
* 131 [[apertium-cym]] [https://github.com/apertium/apertium-cym/blob/master/apertium-cym.cym.rlx]
* {{#lst:apertium-tur/stats|rlx_rules}} [[apertium-tur]]
* 127 [[apertium-eng]]
* 118 [[apertium-mkd]]
* {{#lst:apertium-tat/stats|rlx_rules}} [[apertium-tat]] [https://github.com/apertium/apertium-tat/blob/master/apertium-tat.tat.rlx]
* {{#lst:apertium-rus/stats|rlx_rules}} [[apertium-rus]] [https://github.com/apertium/apertium-rus/blob/master/apertium-rus.rus.rlx]
* {{#lst:apertium-kaz/stats|rlx_rules}} [[apertium-kaz]]

==When is CG needed?==
Some of the CG rules written for the languages above could instead be written as forbid rules in the [[TSX format]] used by apertium-tagger. If all the rules your language pair needs can be expressed in the .tsx format, you can opt for a simpler design without a CG module.

==Editor support==
* [http://beta.visl.sdu.dk/cg3ide.html CG-3 IDE] – the official vislcg3 CG IDE
* [https://github.com/goavki/syntxfile_gedit_CG/ Gedit] syntax highlighting (also for any other editor that uses gtksourceview)
* [https://github.com/apertium/tree-sitter-apertium/tree/master/tree-sitter-cg tree-sitter] grammar for Atom
* [[Emacs#CG|Emacs]] mode for editing and testing CG grammars (highlighting + IDE-like features)

==See also==
* [[Apertium and Constraint Grammar]] -- installation and use
* [[Introduksjon til føringsgrammatikk]] -- a HOWTO, in Norwegian Bokmål
* [[Constructing a TSX file with a Constraint Grammar]]
* [[Rule-based finite-state disambiguation]] -- GSoC 2012 project by [[User:Krvoje]], a "CG light" (or, a more apertiummy CG) with rules in XML compiled to an FST
* [[Constraint Grammar/Speed]] – some tips on speeding up your rules
* [[Constraint Grammar/Optimisation]] – ideas on how to optimise the vislcg3 engine

==External links==
* [http://beta.visl.sdu.dk/cg3.html VISL CG-3 Development Information] + documentation and downloads
* [http://beta.visl.sdu.dk/cg3_howto.pdf Basic Tutorial for VISL CG-3]
* [http://github.com/unhammer/cg-mode cg-mode for emacs, gives basic syntax highlighting and indentation]
* [http://kevindonnelly.org.uk/2010/05/constraint-grammar-tutorial/ Kevin Donnelly's CG tutorial]
* [http://www.lrec-conf.org/proceedings/lrec2012/pdf/1075_Paper.pdf Hulden M, Francom J (2012) Boosting Statistical Tagger Accuracy with Simple Rule-Based Grammars, Proc. LREC 2012, p. 2114-2117] shows how 20 hours (very little time!) writing disambiguation rules gives substantial improvements. Some of the rules shown may also be implemented in the [[TSX format]] used by apertium-tagger.
* [https://gist.github.com/unhammer/f793d118e11dbc629a55d4e85d198a3d visualise CG dependency tree] with graphviz dot

[[Category:Constraint Grammar|*]]
[[Category:Documentation in English]]
Latest revision as of 12:06, 11 September 2024
Constraint Grammar (CG) is a rule-based tool for part-of-speech tagging, that is, for choosing among the possible analyses of ambiguous text. Free constraint grammars developed outside the Apertium project exist for Norwegian (the Oslo-Bergen tagger), the Sámi languages and Faroese (both from Giellatekno), and Finnish (by Fred Karlsson).
Terminology
- See also: Apertium stream format
- cohort — a surface form of a word, along with its analyses (possible lexical units), an ambiguous lexical unit.
- Apertium equivalent: ^words/word<n><pl>/word<vblex><pres><p3><sg>$
- baseform — the lemma of a word.
- reading — a single analysis of a word.
- Apertium equivalent: ^word<n><pl>$
- wordform — a surface form of a word.
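To make the terminology concrete, here is a minimal Python sketch that splits one unit of the Apertium stream format into its wordform and readings. The function name `parse_cohort` is hypothetical, and the sketch assumes a simplified format: no escaped characters and no subreadings joined with `+`.

```python
import re

def parse_cohort(unit):
    """Split one Apertium stream unit into (wordform, readings).

    Each reading is a (baseform, tags) pair. Simplified: escaping and
    '+'-joined subreadings are not handled.
    """
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a unit of the form ^wordform/analysis/...$")
    wordform, *analyses = unit[1:-1].split("/")
    readings = []
    for analysis in analyses:
        # Everything before the first "<" is the baseform (lemma);
        # each <...> group is one morphological tag.
        baseform = analysis.split("<", 1)[0]
        tags = re.findall(r"<([^>]+)>", analysis)
        readings.append((baseform, tags))
    return wordform, readings

wordform, readings = parse_cohort("^words/word<n><pl>/word<vblex><pres><p3><sg>$")
print(wordform)   # words
print(readings)   # [('word', ['n', 'pl']), ('word', ['vblex', 'pres', 'p3', 'sg'])]
```

The whole return value corresponds to a cohort: one wordform plus every reading still attached to it.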
Basic Rule Format
Sets
Sets are defined like this:
    LIST VERB = vblex vbser ;      # matches <vblex> or <vbser>
    LIST NSG = (n sg) ;            # matches <n><sg>
    LIST TO = "to" ;               # matches the lemma "to"
    LIST CASE = (n nom) (n acc) ;  # matches <n><nom> or <n><acc>
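The matching behaviour of these sets can be modelled in a few lines of Python. This is only an illustrative sketch, not how vislcg3 is implemented: the function `matches` and the representation of a reading as a Python set of tags (with the quoted lemma as one more tag) are assumptions made for the example.

```python
def matches(cg_set, reading):
    """True if at least one entry of the set has all of its tags in the reading."""
    return any(all(tag in reading for tag in entry) for entry in cg_set)

# Each LIST becomes a list of entries; an entry is a tuple of tags that
# must all be present (one tag for a bare item, several for a (...) group).
VERB = [("vblex",), ("vbser",)]        # LIST VERB = vblex vbser ;
NSG = [("n", "sg")]                    # LIST NSG = (n sg) ;
TO = [('"to"',)]                       # LIST TO = "to" ;
CASE = [("n", "nom"), ("n", "acc")]    # LIST CASE = (n nom) (n acc) ;

print(matches(VERB, {'"be"', "vbser", "pres"}))   # True
print(matches(NSG, {'"dog"', "n", "pl"}))         # False
print(matches(CASE, {'"dog"', "n", "sg", "acc"})) # True
```

Items separated by spaces are alternatives (OR), while tags inside one pair of parentheses must all be present (AND).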
Context Patterns
Context patterns look like this:
    ([LOCATION][MODIFIERS] [PATTERN])
PATTERN can be a lemma, a set of tags, or the name of a set.
Symbol | Meaning |
0 | The current word |
1 | The word following the current word |
-1 | The word preceding the current word |
2 | The word 2 words after the current word |
C | Every reading at this position must match the pattern (normally only one has to) |
* | That position or any position further in the same direction |
    (0 (v))       # the current word must have a verb reading
    (1 VERB)      # the following word matches the set "VERB"
    (-1 "to")     # the previous word must be "to"
    (2C (n))      # every reading of the word after the next one must be a noun
    (1* (pr))     # there is a preposition somewhere after the current word
    (-2* (prn))   # there is a pronoun at least two words before the current word
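The position symbols in the table can be illustrated with a small Python model. It is a sketch under stated assumptions rather than the real engine: `context_holds` is a hypothetical helper, a sentence is a list of cohorts, a cohort is a list of readings, each reading is a set of tags, and the analyses in the example sentence are invented for illustration.

```python
def context_holds(sentence, current, offset, pattern, careful=False, scan=False):
    """Evaluate one context pattern relative to the cohort at index `current`.

    offset  -- relative position (0, 1, -1, 2, ...)
    careful -- models the C modifier: every reading must match
    scan    -- models the * modifier: keep moving in the same direction
    """
    step = 1 if offset > 0 else -1
    pos = current + offset
    while 0 <= pos < len(sentence):
        hits = [pattern <= reading for reading in sentence[pos]]
        if hits and (all(hits) if careful else any(hits)):
            return True
        if not scan or offset == 0:
            return False
        pos += step  # continue further away from the current word
    return False  # ran off the edge of the sentence

sentence = [
    [{'"I"', "prn", "p1"}],
    [{'"want"', "vblex", "pres"}],
    [{'"to"', "pr"}],
    [{'"walk"', "n", "sg"}, {'"walk"', "vblex", "inf"}],
]
print(context_holds(sentence, 3, -1, {'"to"'}))                # True:  (-1 "to")
print(context_holds(sentence, 3, 0, {"vblex"}))                # True:  (0 (vblex))
print(context_holds(sentence, 3, 0, {"vblex"}, careful=True))  # False: (0C (vblex))
print(context_holds(sentence, 3, -2, {"prn"}, scan=True))      # True:  (-2* (prn))
print(context_holds(sentence, 0, 1, {"pr"}, scan=True))        # True:  (1* (pr))
```

The careful test fails on "walk" because its noun reading is still present; once a rule has removed that reading, the same test would succeed.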
Rules
Rules look like this:
    SELECT [FORM] IF [CONTEXT] ;
    REMOVE [FORM] IF [CONTEXT] ;
FORM is a lemma, a set of tags, or the name of a set; CONTEXT is one or more context patterns. SELECT keeps the matching readings and discards the others; REMOVE discards the matching readings.
    SELECT VERB IF (1 (det)) ;      # keep the readings in set "VERB" if the following word has a <det> reading
    REMOVE (n) IF (-1 (adv)) ;      # discard the <n> reading if the preceding word has an <adv> reading
    SELECT (p1) IF (-1* (prn p1))   # keep the 1st person reading of a verb
                   (0 (v)) ;        # if any preceding word is a 1st person pronoun
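The effect of SELECT and REMOVE on a cohort can be sketched in Python as follows. The helper `apply_rule` is hypothetical and deliberately simplified: the context is passed in as an already evaluated boolean, a reading is a set of tags, and the sketch assumes that a rule never leaves a cohort without readings.

```python
def apply_rule(operation, cohort, target, context_ok):
    """Apply a simplified SELECT or REMOVE to one cohort (a list of tag sets)."""
    if not context_ok:
        return cohort  # the context did not match, so the rule does nothing
    if operation == "SELECT":
        kept = [r for r in cohort if target <= r]        # keep only matching readings
    elif operation == "REMOVE":
        kept = [r for r in cohort if not target <= r]    # drop matching readings
    else:
        raise ValueError("unknown operation: " + operation)
    # Assumption of this sketch: never delete the last remaining reading.
    return kept if kept else cohort

walk = [{'"walk"', "n", "sg"}, {'"walk"', "vblex", "inf"}]
previous = [{'"often"', "adv"}]

# REMOVE (n) IF (-1 (adv)) ;
context_ok = any({"adv"} <= reading for reading in previous)
result = apply_rule("REMOVE", walk, {"n"}, context_ok)
print([sorted(reading) for reading in result])   # [['"walk"', 'inf', 'vblex']]
```

A SELECT with the target `{"vblex"}` would give the same result here, since it keeps exactly the reading that REMOVE leaves behind.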
Note on parentheses
The use of parentheses to distinguish tags from lists/sets seems to be the main point of confusion for people learning CG: a parenthesised item is read as one or more tags, while a bare name is read as the name of a set. If we have the morphological tags tag1 and tag2, then we can have rules like this:
    LIST set1 = tag1 ;
    LIST set2 = (tag1 tag2) ;        # matches a word with both tag1 and tag2
    LIST set3 = tag1 tag2 ;          # matches a word with tag1 or tag2
    LIST word = "hello" ;
    SELECT:1a (tag1) (1 word) ;
    SELECT:1b set1 (1 word) ;        # equivalent to 1a
    SELECT:2a (tag1 tag2) (1 word) ;
    SELECT:2b set2 (1 word) ;        # equivalent to 2a
    SELECT:3a (tag1) (1 word) ;      # a tag used directly as a target needs parentheses
    SELECT:3b (tag2) (1 word) ;
    SELECT:3c set3 (1 word) ;        # equivalent to 3a and 3b combined
    SELECT:1c set1 (1 ("hello")) ;   # equivalent to 1a (or 1b)
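How the parentheses group tags inside a LIST definition can be shown with a toy parser. This is a hedged sketch, not the vislcg3 parser: `parse_list` is a hypothetical function that only understands the simple one-line LIST forms used on this page.

```python
import re

def parse_list(line):
    """Parse 'LIST name = item item ... ;' into (name, entries).

    Each entry is a tuple of tags that must all be present; the entries
    themselves are alternatives.
    """
    m = re.fullmatch(r'\s*LIST\s+(\S+)\s*=\s*(.*?)\s*;\s*(?:#.*)?', line)
    if m is None:
        raise ValueError("not a LIST definition: " + line)
    name, body = m.groups()
    # An item is a parenthesised group, a quoted lemma, or a bare tag.
    items = re.findall(r'\(([^)]*)\)|("[^"]*")|(\S+)', body)
    entries = []
    for group, lemma, tag in items:
        if group:
            entries.append(tuple(group.split()))   # (tag1 tag2): both required
        else:
            entries.append((lemma or tag,))        # single tag or lemma
    return name, entries

print(parse_list('LIST set2 = (tag1 tag2) ;'))   # ('set2', [('tag1', 'tag2')])
print(parse_list('LIST set3 = tag1 tag2 ;'))     # ('set3', [('tag1',), ('tag2',)])
print(parse_list('LIST word = "hello" ;'))       # ('word', [('"hello"',)])
```

The only difference between set2 and set3 is the parentheses, yet set2 has one entry requiring both tags and set3 has two alternative entries.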
Subreadings
If you're dealing with compounds, the individual parts are called subreadings.
By default the last subreading is treated as the primary one. If you want the first one to be primary, add
    SUBREADINGS = LTR ;
To select on the value of a subreading, use SUB:N (where N is the position of the subreading):
    # input ^dog's/dog<n><sg>+'s<gen>/dog<n><sg>+has<vbmod><pres>$ ^eaten/eat<vblex><pp>$
    SELECT SUB:1 (vbmod) IF (1 (vblex pp)) ;
    # for the first subreading, choose "has" if the next word is a participle
To refer to a subreading in a context pattern, use a position of the form 1/N:
    # input ^I've/I<prn>+have<vbmod>$ ^hit/hit<vblex><pres>/hit<vblex><past>/hit<vblex><pp>$
    SELECT (vblex pp) IF (-1/1 (vbmod)) ;
    # if the first subreading of the preceding word is <vbmod>, then we're probably a participle
    # rather than a finite verb
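As a rough illustration of what a subreading is, the following Python sketch splits one analysis from the stream format into its parts. The function `split_subreadings` and the flag `ltr` are hypothetical names, and the sketch assumes that `+` never occurs inside a lemma or tag.

```python
import re

def split_subreadings(analysis):
    """Split an analysis such as 'dog<n><sg>+has<vbmod><pres>' into its parts."""
    parts = []
    for part in analysis.split("+"):
        baseform = part.split("<", 1)[0]
        tags = re.findall(r"<([^>]+)>", part)
        parts.append((baseform, tags))
    return parts

def primary(parts, ltr=False):
    """Pick the primary subreading: the last one by default, the first with LTR."""
    return parts[0] if ltr else parts[-1]

parts = split_subreadings("dog<n><sg>+has<vbmod><pres>")
print(parts)                     # [('dog', ['n', 'sg']), ('has', ['vbmod', 'pres'])]
print(primary(parts))            # ('has', ['vbmod', 'pres'])
print(primary(parts, ltr=True))  # ('dog', ['n', 'sg'])
```

The `ltr` flag mirrors the effect described above for SUBREADINGS = LTR; how vislcg3 numbers the remaining subreadings is not modelled here.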
Languages using CG in Apertium
- Breton
- Welsh
- Norwegian Nynorsk and Norwegian Bokmål
- Sámi languages
- Irish Gaelic
and many others. As of 2014-06-27, the following languages have CGs of over 100 rules:
- 3888 apertium-nno (based on the Oslo-Bergen tagger)
- 3649 apertium-sme (from Giellatekno)
- 2275 apertium-nob (based on the Oslo-Bergen tagger)
- 1552 apertium-est
- 1524 apertium-fin (based on Fred Karlsson's)
- 850 apertium-dan
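- 594 apertium-gle (1207 in gle-eng.rlx: https://github.com/apertium/apertium-gle-eng/blob/master/apertium-gle-eng.gle-eng.rlx)
- 453 apertium-fao (from Giellatekno)
- 298 apertium-spa
- 279 apertium-bre (https://github.com/apertium/apertium-bre/blob/master/apertium-bre.bre.rlx)
- 255 apertium-cat
- 205 apertium-hbs (https://github.com/apertium/apertium-hbs/blob/master/apertium-hbs.hbs.rlx), also hbs-mkd.rlx with syntax rules (https://github.com/apertium/apertium-hbs-mkd/blob/master/apertium-hbs-mkd.hbs-mkd.rlx)
- 190 apertium-isl
- 131 apertium-cym (https://github.com/apertium/apertium-cym/blob/master/apertium-cym.cym.rlx)
- 76 apertium-tur
- 127 apertium-eng
- 118 apertium-mkd
- 123 apertium-tat (https://github.com/apertium/apertium-tat/blob/master/apertium-tat.tat.rlx)
- 308 apertium-rus (https://github.com/apertium/apertium-rus/blob/master/apertium-rus.rus.rlx)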
- 150 apertium-kaz
When is CG needed?
Some of the CG rules written for the languages above could instead be written as forbid rules in the TSX format used by apertium-tagger. If all the rules your language pair needs can be expressed in the .tsx format, you can opt for a simpler design without a CG module.
Editor support
- CG-3 IDE – the official vislcg3 CG IDE
- Gedit syntax highlighting (also for any other editor that uses gtksourceview)
- tree-sitter grammar for Atom
- Emacs emacs mode for editing and testing CG grammars (highlighting + IDE-like features)
See also
- Apertium and Constraint Grammar -- installation and use
- Introduksjon til føringsgrammatikk -- a HOWTO, in Norwegian Bokmål
- Constructing a TSX file with a Constraint Grammar
- Rule-based finite-state disambiguation -- GSoC 2012 project by User:Krvoje, a "CG light" (or, a more apertiummy CG) with rules in XML compiled to an FST
- Constraint Grammar/Speed – some tips on speeding up your rules
- Constraint Grammar/Optimisation – ideas on how to optimise the vislcg3 engine
External links
- VISL CG-3 Development Information + documentation and downloads
- Basic Tutorial for VISL CG-3
- cg-mode for emacs, gives basic syntax highlighting and indentation
- Kevin Donnelly's CG tutorial
- Hulden M, Francom J (2012) Boosting Statistical Tagger Accuracy with Simple Rule-Based Grammars, Proc. LREC 2012, p. 2114-2117 shows how 20 hours (very little time!) writing disambiguation rules gives substantial improvements. Some of the rules shown may also be implemented in the TSX format used by apertium-tagger.
- visualise CG dependency tree with graphviz dot