Difference between revisions of "Constraint Grammar"
Line 53: | Line 53: | ||
* [http://github.com/unhammer/cg-mode cg-mode for emacs, gives basic syntax highlighting and indentation] |
* [http://github.com/unhammer/cg-mode cg-mode for emacs, gives basic syntax highlighting and indentation] |
||
* [http://kevindonnelly.org.uk/2010/05/constraint-grammar-tutorial/ Kevin Donnelly's CG tutorial] |
* [http://kevindonnelly.org.uk/2010/05/constraint-grammar-tutorial/ Kevin Donnelly's CG tutorial] |
||
* [http://www.lrec-conf.org/proceedings/lrec2012/pdf/1075_Paper.pdf Hulden M, Francom J (2012) Boosting Statistical Tagger Accuracy with Simple Rule-Based Grammars, Proc. LREC 2012, p. 2114-2117] shows how 20 hours (very little time!) writing disambiguation rules gives substantial improvements. |
* [http://www.lrec-conf.org/proceedings/lrec2012/pdf/1075_Paper.pdf Hulden M, Francom J (2012) Boosting Statistical Tagger Accuracy with Simple Rule-Based Grammars, Proc. LREC 2012, p. 2114-2117] shows how 20 hours (very little time!) writing disambiguation rules gives substantial improvements. Some of the rules shown may also be implemented in the [[TSX format]] used by apertium-tagger. |
||
[[Category:Constraint Grammar|*]] |
[[Category:Constraint Grammar|*]] |
Revision as of 14:47, 16 June 2012
Constraint Grammar is a tool that can be used to POS-tag ambiguous text. There are free constraint grammars developed outside the Apertium project for: Norwegian (the Oslo-Bergen tagger), Sámi languages (from Giellatekno) and Faroese (also from Giellatekno).
Terminology
- See also: Apertium stream format
- cohort — a surface form of a word, along with its analyses (possible lexical units), an ambiguous lexical unit.
- Apertium equivalent:
^words/word<n><pl>/word<vblex><pres><p3><sg>$
- Apertium equivalent:
- baseform — the lemma of a word.
- reading — a single analysis of a word.
- Apertium equivalent:
^word<n><pl>$
- Apertium equivalent:
- wordform — a surface form of a word.
Note on parenthesis
The use of parentheses to distinguish between tags and lists/sets seems to be the main confusing point for people learning CG. If we have the morphological tags tag1
and tag2
, then we can have rules like this:
LIST set1 = tag1 ; LIST set2 = (tag1 tag2) ; # matches a word with both tag1 and tag2 LIST set3 = tag1 tag2 ; # matches a word with tag1 or tag2 LIST word = "hello" ; SELECT:1a (tag1) (1 word) ; SELECT:1b set1 (1 word) ; # equivalent to 1a SELECT:2a (tag1 tag2) (1 word) ; SELECT:2b set2 (1 word) ; # equivalent to 2a SELECT:3a tag1 (1 word) ; SELECT:3b tag2 (1 word) ; SELECT:3c set3 (1 word) ; # equivalent to 3a and 3b combined SELECT:1c set1 (1 ("hello")) ; # equivalent to 1a (or 1b)
Languages using CG in Apertium
When is CG needed?
Currently some of the CG rules written in the above language pairs may be written as forbid rules in the TSX format used by apertium-tagger. If the rules for your language pair can be written in the .tsx format, you can go for an easier design without a CG module in that language pair.
See also
- Apertium and Constraint Grammar -- installation and use
- Introduksjon til føringsgrammatikk -- a HOWTO, in Norwegian bokmål
External links
- VISL CG-3 Development Information + documentation and downloads
- Basic Tutorial for VISL CG-3
- cg-mode for emacs, gives basic syntax highlighting and indentation
- Kevin Donnelly's CG tutorial
- Hulden M, Francom J (2012) Boosting Statistical Tagger Accuracy with Simple Rule-Based Grammars, Proc. LREC 2012, p. 2114-2117 shows how 20 hours (very little time!) writing disambiguation rules gives substantial improvements. Some of the rules shown may also be implemented in the TSX format used by apertium-tagger.