Difference between revisions of "Constraint Grammar"
Firespeaker (talk | contribs) |
Revision as of 09:21, 4 December 2016
Constraint Grammar is a tool that can be used to POS-tag ambiguous text. There are free constraint grammars developed outside the Apertium project for Norwegian (the Oslo-Bergen tagger), the Sámi languages (from Giellatekno), Faroese (also from Giellatekno), and Finnish (by Fred Karlsson).
Terminology
- See also: Apertium stream format
- cohort: a surface form of a word together with all of its analyses (possible lexical units); that is, an ambiguous lexical unit.
  - Apertium equivalent:
    ^words/word<n><pl>/word<vblex><pres><p3><sg>$
- baseform: the lemma of a word.
- reading: a single analysis of a word.
  - Apertium equivalent:
    ^word<n><pl>$
- wordform: a surface form of a word.
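To make the terms concrete, here is a minimal Python sketch that splits one cohort in the Apertium stream format into its wordform and its readings, each with a baseform and a list of tags. The function name `parse_cohort` is hypothetical, and the sketch assumes a simple cohort with no escaped characters and no multiword or joined analyses.

```python
import re

def parse_cohort(cohort):
    """Split one Apertium stream-format cohort into (wordform, readings).

    Assumes the simple shape ^wordform/analysis/analysis$ with no escapes.
    """
    if not (cohort.startswith("^") and cohort.endswith("$")):
        raise ValueError("a cohort is delimited by ^ and $")
    # The first field is the wordform; every later field is one reading.
    wordform, *analyses = cohort[1:-1].split("/")
    readings = []
    for analysis in analyses:
        baseform = analysis.split("<", 1)[0]        # text before the first tag
        tags = re.findall(r"<([^>]+)>", analysis)   # tag names without < >
        readings.append({"baseform": baseform, "tags": tags})
    return wordform, readings

wordform, readings = parse_cohort("^words/word<n><pl>/word<vblex><pres><p3><sg>$")
for reading in readings:
    print(wordform, reading["baseform"], reading["tags"])
```

The example cohort has one wordform and two readings that share the same baseform, which is exactly the kind of ambiguity a CG is meant to resolve.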
Note on parentheses
The use of parentheses to distinguish between tags and lists/sets seems to be the main confusing point for people learning CG. If we have the morphological tags tag1 and tag2, then we can have rules like this:
LIST set1 = tag1 ;
LIST set2 = (tag1 tag2) ; # matches a word with both tag1 and tag2
LIST set3 = tag1 tag2 ; # matches a word with tag1 or tag2
LIST word = "hello" ;

SELECT:1a (tag1) (1 word) ;
SELECT:1b set1 (1 word) ; # equivalent to 1a

SELECT:2a (tag1 tag2) (1 word) ;
SELECT:2b set2 (1 word) ; # equivalent to 2a

SELECT:3a tag1 (1 word) ;
SELECT:3b tag2 (1 word) ;
SELECT:3c set3 (1 word) ; # equivalent to 3a and 3b combined

SELECT:1c set1 (1 ("hello")) ; # equivalent to 1a (or 1b)
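The difference between the three lists can also be modelled outside CG. The following Python sketch is only an illustration of the matching logic, not how vislcg3 is implemented: each LIST is a list of entries, each entry is a tuple of tags that must all be present in a reading, and a list matches when any one entry matches. The helper names `matches` and `select` are hypothetical, and the context condition `(1 word)` is left out for brevity.

```python
# Each entry is a tuple of tags that must ALL occur in a reading (the
# parenthesised case); a set matches if ANY of its entries matches.
set1 = [("tag1",)]              # LIST set1 = tag1 ;
set2 = [("tag1", "tag2")]       # LIST set2 = (tag1 tag2) ;
set3 = [("tag1",), ("tag2",)]   # LIST set3 = tag1 tag2 ;

def matches(cg_set, reading):
    """True if some entry of cg_set has all of its tags in the reading."""
    return any(all(tag in reading for tag in entry) for entry in cg_set)

def select(cohort, cg_set):
    """Model of SELECT without context: keep only the matching readings.

    If no reading matches, the rule does not apply and the cohort is unchanged.
    """
    kept = [reading for reading in cohort if matches(cg_set, reading)]
    return kept if kept else cohort

# Readings are modelled as plain sets of tag names.
only_tag1 = {"tag1"}
only_tag2 = {"tag2"}
both = {"tag1", "tag2"}
neither = {"tag3"}

cohort = [only_tag1, only_tag2, both]
print(select(cohort, set1))  # readings containing tag1
print(select(cohort, set2))  # readings containing tag1 and tag2
print(select(cohort, set3))  # readings containing tag1 or tag2
```

Under this model, `set2` keeps only the reading that carries both tags, while `set3` keeps every reading that carries either one, which mirrors the comments on the LIST definitions.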
Languages using CG in Apertium
and many others. As of 2014-06-27, the following languages had CGs of over 100 rules:
- 3888 apertium-nno (based on the Oslo-Bergen tagger)
- 3649 apertium-sme (from Giellatekno)
- 2275 apertium-nob (based on the Oslo-Bergen tagger)
- 1552 apertium-est
- 1524 apertium-fin (based on Fred Karlsson's)
- 850 apertium-dan
- 594 apertium-gle
- 453 apertium-fao (from Giellatekno)
- 298 apertium-spa
- 279 apertium-bre
- 255 apertium-cat
- 205 apertium-hbs
- 190 apertium-isl
- 131 apertium-cym
- 76 apertium-tur
- 127 apertium-eng
- 118 apertium-mkd
- 150 apertium-kaz
When is CG needed?
Some of the CG rules written for the languages above could instead be written as forbid rules in the TSX format used by apertium-tagger. If all the rules for your language pair can be expressed in the .tsx format, you can opt for a simpler design with no CG module in that language pair.
Editor support
- CG-3 IDE – the official vislcg3 CG IDE
- Gedit syntax highlighting (also for any other editor that uses gtksourceview)
- Emacs mode for editing and testing CG grammars (highlighting + IDE-like features)
See also
- Apertium and Constraint Grammar -- installation and use
- Introduksjon til føringsgrammatikk -- a HOWTO, in Norwegian bokmål
- Rule-based finite-state disambiguation -- GSoC 2012 project by User:Krvoje, a "CG light" (or, a more apertiummy CG) with rules in XML compiled to an FST
- Constraint Grammar/Speed – some tips on speeding up your rules
- Constraint Grammar/Optimisation – ideas on how to optimise the vislcg3 engine
External links
- VISL CG-3 Development Information + documentation and downloads
- Basic Tutorial for VISL CG-3
- cg-mode for emacs, gives basic syntax highlighting and indentation
- Kevin Donnelly's CG tutorial
- Hulden M, Francom J (2012) Boosting Statistical Tagger Accuracy with Simple Rule-Based Grammars, Proc. LREC 2012, pp. 2114-2117. Shows how 20 hours (very little time!) of writing disambiguation rules gives substantial improvements. Some of the rules shown may also be implemented in the TSX format used by apertium-tagger.