Difference between revisions of "Constraint Grammar"
Revision as of 13:30, 10 September 2020
Constraint Grammar (CG) is a rule-based formalism, with accompanying tools, for disambiguating ambiguous text, for example in POS tagging. Free constraint grammars developed outside the Apertium project exist for Norwegian (the Oslo-Bergen tagger), Sámi languages (from Giellatekno), Faroese (also from Giellatekno) and Finnish (by Fred Karlsson).
Terminology
- See also: Apertium stream format
- cohort — a surface form of a word, along with its analyses (possible lexical units), an ambiguous lexical unit.
  - Apertium equivalent: ^words/word<n><pl>/word<vblex><pres><p3><sg>$
- baseform — the lemma of a word.
- reading — a single analysis of a word.
  - Apertium equivalent: ^word<n><pl>$
- wordform — a surface form of a word.
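To make the terminology concrete, here is a minimal Python sketch that splits one cohort in Apertium stream format into its wordform and its readings. The function name `parse_cohort` and the dictionary layout are illustrative assumptions, not part of Apertium or vislcg3, and the sketch ignores escaping and other details of the real stream format.

```python
import re

def parse_cohort(unit):
    """Split '^wordform/reading1/reading2$' into a wordform and its readings (toy parser)."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: %r" % unit)
    # The first field is the wordform; every further field is one reading.
    wordform, *analyses = body[1:-1].split("/")
    readings = []
    for analysis in analyses:
        baseform = analysis.split("<", 1)[0]      # lemma before the first tag
        tags = re.findall(r"<([^>]+)>", analysis)  # tag names without brackets
        readings.append({"baseform": baseform, "tags": tags})
    return {"wordform": wordform, "readings": readings}

cohort = parse_cohort("^words/word<n><pl>/word<vblex><pres><p3><sg>$")
print(cohort["wordform"])
for reading in cohort["readings"]:
    print(reading["baseform"], reading["tags"])
```

The cohort above is ambiguous because it carries two readings; the task of a constraint grammar is to reduce such a list, ideally to a single reading.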
Basic Rule Format
Sets
Sets are defined like this:
    LIST VERB = vblex vbser ;      # matches <vblex> or <vbser>
    LIST NSG = (n sg) ;            # matches <n><sg>
    LIST TO = "to" ;               # matches the lemma "to"
    LIST CASE = (n nom) (n acc) ;  # matches <n><nom> or <n><acc>
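The matching behaviour of these sets can be sketched in Python as a toy model: a set is a list of alternatives, and each alternative is either a lemma or a group of tags that must all be present. The `matches` helper and the reading dictionaries are assumptions made for illustration; this is not how vislcg3 represents sets internally.

```python
def matches(reading, items):
    """Return True if the reading matches at least one item of a LIST.

    A string item stands for a quoted lemma; a tuple item is a group of
    tags that must all be present on the reading.
    """
    for item in items:
        if isinstance(item, str):
            if reading["baseform"] == item:
                return True
        elif all(tag in reading["tags"] for tag in item):
            return True
    return False

# Python counterparts of the LIST definitions above.
VERB = [("vblex",), ("vbser",)]     # <vblex> or <vbser>
NSG = [("n", "sg")]                 # <n> and <sg> together
TO = ["to"]                         # the lemma "to"
CASE = [("n", "nom"), ("n", "acc")] # <n><nom> or <n><acc>

house = {"baseform": "house", "tags": ["n", "sg", "nom"]}
be = {"baseform": "be", "tags": ["vbser", "pres"]}
to = {"baseform": "to", "tags": ["pr"]}

print(matches(house, NSG), matches(house, VERB), matches(be, VERB), matches(to, TO))
```

Items separated by spaces act as alternatives, while tags inside one pair of parentheses must all hold for the same reading.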
Context Patterns
Context patterns look like this:
([LOCATION][MODIFIERS] [PATTERN])
PATTERN can be a lemma, set of tags, or the name of a set.
Symbol | Meaning |
0 | The current word |
1 | The word following the current word |
-1 | The word preceding the current word |
2 | The word 2 words after the current word |
C | Every reading at this position must match the pattern (normally only one has to) |
* | In that position or further in that direction |
    (0 (v))      # the current word must have a verb reading
    (1 VERB)     # the following word matches the set "VERB"
    (-1 "to")    # the previous word must be "to"
    (2C (n))     # every reading of the word after the next one must be a noun
    (1* (pr))    # there is a preposition somewhere after the current word
    (-2* (prn))  # there is a pronoun two or more words before the current word
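The positions and modifiers in the table can be modelled with a short Python sketch. It treats a sentence as a list of cohorts and each cohort as a list of readings; `context_holds` and its `careful` and `scan` parameters are hypothetical names standing in for the `C` and `*` modifiers, and the sketch covers only the cases described here.

```python
def reading_has(reading, tags):
    """True if the reading carries every tag in `tags`."""
    return all(tag in reading["tags"] for tag in tags)

def cohort_matches(cohort, tags, careful=False):
    """Plain test: some reading matches. Careful test (C): every reading matches."""
    quantifier = all if careful else any
    return quantifier(reading_has(r, tags) for r in cohort)

def context_holds(sentence, i, offset, tags, careful=False, scan=False):
    """Evaluate a test such as (1 (det)), (2C (n)) or (-1* (prn)) for the word at index i."""
    step = 1 if offset > 0 else -1
    j = i + offset
    while 0 <= j < len(sentence):
        if cohort_matches(sentence[j], tags, careful):
            return True
        if not scan:          # without *, only the exact position is checked
            return False
        j += step             # with *, keep looking further in the same direction
    return False

# "I saw the saw", with made-up analyses for illustration.
sentence = [
    [{"baseform": "prpers", "tags": ["prn", "p1", "sg"]}],
    [{"baseform": "saw", "tags": ["n", "sg"]}, {"baseform": "see", "tags": ["vblex", "past"]}],
    [{"baseform": "the", "tags": ["det", "def"]}],
    [{"baseform": "saw", "tags": ["n", "sg"]}, {"baseform": "saw", "tags": ["vblex", "inf"]}],
]

print(context_holds(sentence, 1, 1, ["det"]))              # (1 (det)) seen from the first "saw"
print(context_holds(sentence, 1, 2, ["n"], careful=True))  # (2C (n))
print(context_holds(sentence, 3, -1, ["prn"], scan=True))  # (-1* (prn)) seen from the last word
```

A test that points outside the sentence simply fails in this model.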
Rules
Rules look like this:
    SELECT [FORM] IF [CONTEXT] ;
    REMOVE [FORM] IF [CONTEXT] ;
Here FORM is a lemma, a set of tags, or the name of a set, and CONTEXT is one or more context patterns.
    SELECT VERB IF (1 (det)) ;     # keep only readings from the set "VERB" if the following word has a <det> tag
    REMOVE (n) IF (-1 (adv)) ;     # discard the <n> reading if the preceding word has an <adv> tag
    SELECT (p1) IF (-1* (prn p1))  # keep the 1st person reading of a verb that is
                   (0 (v)) ;       # preceded at any distance by a 1st person pronoun
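What SELECT and REMOVE do to a cohort can be sketched in Python as follows. This is a simplified model under stated assumptions: the helper names are invented for illustration, only a single "next word" context is supported, and the real vislcg3 engine handles rule sections, ordering and many more cases. The one behaviour modelled deliberately is that a rule never deletes the last remaining reading of a cohort.

```python
def has_tags(reading, tags):
    """True if the reading carries every tag in `tags`."""
    return all(tag in reading["tags"] for tag in tags)

def select(cohort, tags):
    """SELECT: keep only the matching readings, if there are any."""
    kept = [r for r in cohort if has_tags(r, tags)]
    return kept if kept else cohort

def remove(cohort, tags):
    """REMOVE: drop the matching readings, unless nothing would be left."""
    kept = [r for r in cohort if not has_tags(r, tags)]
    return kept if kept else cohort

def next_has(sentence, i, tags):
    """Context test (1 (tags)): some reading of the following word matches."""
    return i + 1 < len(sentence) and any(has_tags(r, tags) for r in sentence[i + 1])

# "saw the dog", with made-up analyses for illustration.
sentence = [
    [{"baseform": "saw", "tags": ["n", "sg"]}, {"baseform": "see", "tags": ["vblex", "past"]}],
    [{"baseform": "the", "tags": ["det", "def"]}],
    [{"baseform": "dog", "tags": ["n", "sg"]}],
]

# Toy equivalent of: SELECT (vblex) IF (1 (det)) ;
for i, cohort in enumerate(sentence):
    if next_has(sentence, i, ["det"]):
        sentence[i] = select(cohort, ["vblex"])

print([r["baseform"] for r in sentence[0]])

# REMOVE on a cohort with a single reading leaves it untouched.
unchanged = remove(sentence[2], ["n"])
print(len(unchanged))
```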
Note on parentheses
The use of parentheses to distinguish between tags and lists/sets seems to be the main point of confusion for people learning CG. If we have the morphological tags tag1 and tag2, then we can have rules like this:
    LIST set1 = tag1 ;
    LIST set2 = (tag1 tag2) ;  # matches a word with both tag1 and tag2
    LIST set3 = tag1 tag2 ;    # matches a word with tag1 or tag2
    LIST word = "hello" ;

    SELECT:1a (tag1) (1 word) ;
    SELECT:1b set1 (1 word) ;  # equivalent to 1a

    SELECT:2a (tag1 tag2) (1 word) ;
    SELECT:2b set2 (1 word) ;  # equivalent to 2a

    SELECT:3a tag1 (1 word) ;
    SELECT:3b tag2 (1 word) ;
    SELECT:3c set3 (1 word) ;  # equivalent to 3a and 3b combined

    SELECT:1c set1 (1 ("hello")) ;  # equivalent to 1a (or 1b)
Languages using CG in Apertium
Many languages in Apertium use CG. The following languages currently (2014-06-27) have CGs of over 100 rules:
- 3888 apertium-nno (based on the Oslo-Bergen tagger)
- 3649 apertium-sme (from Giellatekno)
- 2275 apertium-nob (based on the Oslo-Bergen tagger)
- 1552 apertium-est
- 1524 apertium-fin (based on Fred Karlsson's)
- 850 apertium-dan
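- 594 apertium-gle (1207 in gle-eng.rlx: https://github.com/apertium/apertium-gle-eng/blob/master/apertium-gle-eng.gle-eng.rlx)
- 453 apertium-fao (from Giellatekno)
- 298 apertium-spa
- 279 apertium-bre (https://github.com/apertium/apertium-bre/blob/master/apertium-bre.bre.rlx)
- 255 apertium-cat
- 205 apertium-hbs (https://github.com/apertium/apertium-hbs/blob/master/apertium-hbs.hbs.rlx), also hbs-mkd.rlx with syntax rules (https://github.com/apertium/apertium-hbs-mkd/blob/master/apertium-hbs-mkd.hbs-mkd.rlx)
- 190 apertium-isl
- 131 apertium-cym (https://github.com/apertium/apertium-cym/blob/master/apertium-cym.cym.rlx)
- 76 apertium-tur
- 127 apertium-eng
- 118 apertium-mkd
- 123 apertium-tat (https://github.com/apertium/apertium-tat/blob/master/apertium-tat.tat.rlx)
- 308 apertium-rus (https://github.com/apertium/apertium-rus/blob/master/apertium-rus.rus.rlx)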
- 150 apertium-kaz
When is CG needed?
Some of the CG rules written for the languages above could instead be written as forbid rules in the TSX format used by apertium-tagger. If all the rules your language pair needs can be expressed in the .tsx format, you can choose a simpler design with no CG module in that language pair.
Editor support
- CG-3 IDE – the official vislcg3 CG IDE
- Gedit syntax highlighting (also for any other editor that uses gtksourceview)
- Emacs: a mode for editing and testing CG grammars (highlighting and IDE-like features)
See also
- Apertium and Constraint Grammar -- installation and use
- Introduksjon til føringsgrammatikk -- a HOWTO, in Norwegian bokmål
- Rule-based finite-state disambiguation -- GSoC 2012 project by User:Krvoje, a "CG light" (or a more Apertium-like CG) with rules in XML compiled to an FST
- Constraint Grammar/Speed – some tips on speeding up your rules
- Constraint Grammar/Optimisation – ideas on how to optimise the vislcg3 engine
External links
- VISL CG-3 Development Information + documentation and downloads
- Basic Tutorial for VISL CG-3
- cg-mode for emacs, gives basic syntax highlighting and indentation
- Kevin Donnelly's CG tutorial
- Hulden M, Francom J (2012) Boosting Statistical Tagger Accuracy with Simple Rule-Based Grammars, Proc. LREC 2012, p. 2114-2117. The paper shows how 20 hours (very little time!) of writing disambiguation rules gives substantial improvements. Some of the rules shown may also be implemented in the TSX format used by apertium-tagger.