Beginner's Constraint Grammar HOWTO

From Apertium
Revision as of 17:08, 29 November 2010 by Vaskata (talk | contribs)

General for CG

Constraint Grammar (CG) is a methodological paradigm for natural language processing (NLP). Linguist-written, context-dependent rules are compiled into a grammar that assigns grammatical tags ("readings") to words or other tokens in running text. Typical tags address lemmatisation (lexeme or base form), inflexion, derivation, syntactic function, dependency, valency, case roles, semantic type, etc. Each rule either adds, removes, selects or replaces a tag or a set of grammatical tags in a given sentence context. Context conditions can be linked to any tag or tag set of any word anywhere in the sentence, either locally (defined distances) or globally (undefined distances). Context conditions in the same rule may be linked, i.e. conditioned upon each other, negated, or blocked by interfering words or tags. Typical CGs consist of thousands of rules that are applied set-wise in progressive steps, covering ever more advanced levels of analysis. Within each level, safe rules are used before heuristic rules, and no rule is allowed to remove the last reading of a given kind, thus providing a high degree of robustness.
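
The difference between local and global context conditions can be illustrated with a small Python sketch. This is not how any CG compiler is implemented; the function names local_match and global_match, the single-tag readings, and the toy sentence are all assumptions made for illustration.

```python
from typing import List, Optional, Set

# A cohort is one token with its candidate readings. Each reading is
# simplified here to a single tag string (an assumption of this sketch).
Cohort = Set[str]


def local_match(sentence: List[Cohort], pos: int, offset: int, tag: str) -> bool:
    """Local context: test the cohort at a fixed distance from `pos`."""
    target = pos + offset
    if target < 0 or target >= len(sentence):
        return False
    return tag in sentence[target]


def global_match(sentence: List[Cohort], pos: int, step: int, tag: str,
                 barrier: Optional[str] = None) -> bool:
    """Global context: scan left (step=-1) or right (step=+1) with no
    distance limit, giving up early if a barrier tag blocks the search."""
    if step not in (-1, 1):
        raise ValueError("step must be -1 or +1")
    target = pos + step
    while 0 <= target < len(sentence):
        if tag in sentence[target]:
            return True
        if barrier is not None and barrier in sentence[target]:
            return False
        target += step
    return False


# "the old man sleeps": "man" (position 2) is ambiguous between N and V.
sentence = [{"DET"}, {"ADJ"}, {"N", "V"}, {"V"}]

print(local_match(sentence, 2, -1, "ADJ"))                  # True
print(global_match(sentence, 2, -1, "DET"))                 # True
print(global_match(sentence, 2, -1, "DET", barrier="ADJ"))  # False
print(not local_match(sentence, 2, 1, "N"))                 # True (negated test)
```

A rule would combine one or more such tests and fire only if all of them hold; negation is simply the logical inverse of a test.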

The Constraint Grammar concept was launched by Fred Karlsson in 1990 (Karlsson 1990; Karlsson et al., eds, 1995), and CG taggers and parsers have since been written for a large variety of languages, routinely achieving F-scores for PoS (word class) of over 99%. A number of syntactic CG systems have reported F-scores of around 95% for syntactic function labels. CG systems can be used to create full syntactic trees in other formalisms by adding small, non-terminal based phrase structure grammars or dependency grammars, and a number of corpus/treebank projects have used Constraint Grammar for automatic annotation. CG methodology has also been used in a number of language technology applications, such as spell checkers and machine translation systems.


List of CG systems sorted by language

Free software

VISL CG-3 Constraint Grammar compiler/parser


Non-free software


  • Swahili


Method of annotation

Both the morphological and syntactic analysers use rule-based linguistic descriptions. The system works in the following way:


1. Tokenisation;

2. Lookup of morphological tags;

  • Lexical component;
  • Guesser;

3. Resolution of morphological ambiguities;

4. Lookup of syntactic tags;

5. Resolution of syntactic ambiguities.


Tokenisation

The tokeniser identifies punctuation and multiword units, and splits enclitic forms into grammatical words.
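
The three tasks named above can be sketched in Python as follows. The multiword-unit list, the enclitic list, and the underscore used to join multiword units are hypothetical choices for this example, not the resources or conventions of any actual CG tokeniser.

```python
import re
from typing import List

# Hypothetical resources for illustration only; a real tokeniser would
# take these from its lexicon.
MULTIWORD_UNITS = {("in", "spite", "of"), ("New", "York")}
ENCLITICS = ("n't", "'s", "'ll")


def tokenise(text: str) -> List[str]:
    # Step 1: separate punctuation from words. Apostrophes stay inside
    # words so that enclitics can be split off in step 2.
    raw = re.findall(r"[\w']+|[^\w\s]", text)

    # Step 2: split enclitic forms into grammatical words.
    split: List[str] = []
    for tok in raw:
        for clitic in ENCLITICS:
            if tok.endswith(clitic) and len(tok) > len(clitic):
                split.extend([tok[:-len(clitic)], clitic])
                break
        else:
            split.append(tok)

    # Step 3: join known multiword units into single tokens, longest first.
    longest = max(len(unit) for unit in MULTIWORD_UNITS)
    out: List[str] = []
    i = 0
    while i < len(split):
        for n in range(longest, 1, -1):
            if tuple(split[i:i + n]) in MULTIWORD_UNITS:
                out.append("_".join(split[i:i + n]))
                i += n
                break
        else:
            out.append(split[i])
            i += 1
    return out


print(tokenise("She didn't stay in spite of the rain."))
# ['She', 'did', "n't", 'stay', 'in_spite_of', 'the', 'rain', '.']
```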


Morphological lookup

This process begins with a lexical analysis based on a large lexicon including all inflected and central derived word forms. The lexical analyser assigns all possible morphological analyses to each word that is in the lexicon, and the remaining words are assigned an analysis by means of the guesser (a heuristic rule-based module). These rules are mainly governed by word shape, and if none of them apply, then a nominal analysis is given.
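
A minimal Python sketch of this lookup-then-guess strategy is given below. The lexicon entries and suffix rules are toy data invented for the example, and using the word form itself as the lemma of a guessed analysis is a simplification.

```python
from typing import Dict, List

# Toy lexicon (hypothetical entries): word form -> all possible analyses.
LEXICON: Dict[str, List[str]] = {
    "work": ['"work" N NOM SG', '"work" V PRES', '"work" V INF'],
    "the": ['"the" DET'],
}

# Toy guesser rules keyed on word shape (here: suffixes); illustrative only.
SUFFIX_RULES = [
    ("ly", "ADV"),
    ("ing", "V PCP1"),
    ("ed", "V PAST"),
]


def guess(word: str) -> List[str]:
    """Heuristic analysis for words that are missing from the lexicon."""
    analyses = [f'"{word}" {tag}' for suffix, tag in SUFFIX_RULES
                if word.endswith(suffix)]
    # If no shape rule applies, fall back to a nominal analysis.
    return analyses or [f'"{word}" N NOM SG']


def lookup(word: str) -> List[str]:
    """Return every analysis from the lexicon, or guessed ones otherwise."""
    key = word.lower()
    if key in LEXICON:
        return list(LEXICON[key])
    return guess(key)


print(lookup("work"))      # three analyses from the lexicon
print(lookup("blorfing"))  # ['"blorfing" V PCP1'] via the guesser
print(lookup("zorp"))      # ['"zorp" N NOM SG'] via the nominal fallback
```

Note that the lexicon lookup deliberately returns all analyses; choosing among them is left to the disambiguation stage.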


Resolution of morphological ambiguities

The rule-based Constraint Grammar parser is used to resolve some of the ambiguities at this stage. The constraints are partial paraphrases of form definitions of syntactic constructs such as the noun phrase. The English grammar, for example, contains about 1,200 grammar-based constraints, plus 200 heuristic constraints.
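
As a rough illustration of how such constraints discard readings, the Python sketch below applies a section of safe rules before a section of heuristic rules and never removes the last remaining reading of a word. The two rules shown (after_determiner and the unconditional noun preference) are hypothetical examples, not constraints from the English grammar.

```python
from typing import Callable, FrozenSet, List

Reading = FrozenSet[str]   # one reading = a set of tags
Cohort = List[Reading]     # all readings still possible for one word
Sentence = List[Cohort]
Condition = Callable[[Sentence, int], bool]


def remove(sentence: Sentence, pos: int, tag: str, condition: Condition) -> bool:
    """Discard readings carrying `tag`, unless that would empty the cohort."""
    kept = [r for r in sentence[pos] if tag not in r]
    if not kept or len(kept) == len(sentence[pos]):
        return False       # last reading is protected, or nothing to remove
    if not condition(sentence, pos):
        return False
    sentence[pos] = kept
    return True


def select(sentence: Sentence, pos: int, tag: str, condition: Condition) -> bool:
    """Keep only readings carrying `tag`, if at least one such reading exists."""
    kept = [r for r in sentence[pos] if tag in r]
    if not kept or len(kept) == len(sentence[pos]):
        return False
    if not condition(sentence, pos):
        return False
    sentence[pos] = kept
    return True


def after_determiner(sentence: Sentence, pos: int) -> bool:
    """Context condition: the previous word is unambiguously a determiner."""
    return pos > 0 and all("DET" in r for r in sentence[pos - 1])


def always(sentence: Sentence, pos: int) -> bool:
    """Trivial condition used by the hypothetical heuristic rule below."""
    return True


SAFE_RULES = [(remove, "V", after_determiner)]
HEURISTIC_RULES = [(select, "N", always)]  # hypothetical: prefer nouns


def disambiguate(sentence: Sentence) -> Sentence:
    """Apply the safe section first, then the heuristic section."""
    for section in (SAFE_RULES, HEURISTIC_RULES):
        for operation, tag, condition in section:
            for pos in range(len(sentence)):
                operation(sentence, pos, tag, condition)
    return sentence


the_work = [
    [frozenset({"DET"})],
    [frozenset({"N", "NOM", "SG"}), frozenset({"V", "INF"})],
]
disambiguate(the_work)
print(len(the_work[1]))  # 1: only the noun reading of "work" survives
```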


Syntactic lookup

All possible syntactic tags are introduced for each word. This could, in some cases, mean that more than ten alternatives are given for one morphological reading.
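
The idea can be sketched in Python as a simple mapping from word class to candidate function labels. The tag names are taken from the ENGCG tag list, but the grouping by word class below is an assumption for illustration, not the actual ENGCG mapping, and it is shorter than the real candidate lists.

```python
from typing import Dict, List

# Hypothetical mapping: word class -> every syntactic function it might fill.
SYNTACTIC_CANDIDATES: Dict[str, List[str]] = {
    "N": ["@SUBJ", "@OBJ", "@I-OBJ", "@PCOMPL-S", "@PCOMPL-O",
          "@APP", "@NN>", "@<P"],
    "DET": ["@DN>"],
    "PREP": ["@<NOM", "@ADVL"],
    "A": ["@AN>", "@PCOMPL-S", "@PCOMPL-O"],
}


def syntactic_lookup(morph_tags: List[str]) -> List[str]:
    """Collect all candidate syntactic tags for one morphological reading."""
    candidates: List[str] = []
    for tag in morph_tags:
        for label in SYNTACTIC_CANDIDATES.get(tag, []):
            if label not in candidates:
                candidates.append(label)
    return candidates


print(syntactic_lookup(["N", "NOM", "SG"]))
# ['@SUBJ', '@OBJ', '@I-OBJ', '@PCOMPL-S', '@PCOMPL-O', '@APP', '@NN>', '@<P']
print(syntactic_lookup(["DET", "CENTRAL", "ART", "SG"]))
# ['@DN>']
```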


Resolution of syntactic ambiguities

The parser finally consults a syntactic disambiguation grammar. The English version of the Constraint Grammar contains 800 syntactic constraints, of a similar form to the rules at the morphological resolution stage.


Syntactic tags

The English version of the Constraint Grammar marks the syntactic functions shown in the table below.


@+FAUXV finite auxiliary verb

@-FAUXV nonfinite auxiliary verb

@+FMAINV finite main verb

@-FMAINV nonfinite main verb

@SUBJ subject

@F-SUBJ formal subject

@OBJ object

@I-OBJ indirect object

@PCOMPL-S subject complement

@PCOMPL-O object complement

@APP apposition

@NPHR stray nominal

@N title

@O-ADVL object adverbial

@ADVL adverbial

@DN> determiner

@NN> premodifying noun

@AN> premodifying adjective

@QN> premodifying quantifier

@GN> premodifying genitive

@AD-A> premodifying ad-adjective

@<AD-A postmodifying ad-adjective

@<NOM-FMAINV postmodifying nonfinite verb

@<NOM other postmodifier

@<P-FMAINV nonfinite verb as complement of preposition

@<P other complement of preposition

@CC coordinator

@CS subordinator

@INFMARK infinitive marker

(ENGCG tags)


Example

As mentioned above, the syntactic tags are distinguished by the use of the '@' sign. The analysis is dependency-based, but only partially. As can be seen in the list of syntactic tags above, dependency relations are shown by left and right angle brackets, which indicate that a word depends on another word to either its right or its left. In the example below, Karlsson is marked as '@<P', meaning that it is the complement of a preposition found earlier in the sentence.


"<*i>"

  • "i" <*> <NonMod> PRON PERS NOM SG1 SUBJ @SUBJ

"<started>"

  • "start" <SV> <SVO> V PAST VFIN @+FMAINV

"<work>"

  • "work" N NOM SG @OBJ

"<on>"

  • "on" PREP @ADVL

"<an>"

  • "an" <Indef> DET CENTRAL ART SG @DN>

"<*english>"

  • "english" <*> <Nominal> A ABS @AN>

"<description>"

  • "description" N NOM SG @<P

"<within>"

  • "within" PREP @<NOM @ADVL

"<the>"

  • "the" <Def> DET CENTRAL ART SG/PL @DN>

"<*constraint>"

  • "constraint" <*> N NOM SG @NN>

"<*grammar>"

  • "grammar" <*> N NOM SG @NN>

"<framework>"

  • "framework" N NOM SG @<P

"<proposed>"

  • "propose" <Vcog> <SVO> <SV> PCP2 @<NOM-FMAINV

"<by>"

  • "by" PREP @ADVL

"<*karlsson>"

  • "karlsson" <*> <Proper> N NOM SG @<P

"<$[>"

"<1990>"

  • "1990" <1900> NUM CARD @ADVL

"<$;>"

"<1994a>"

  • "1994a" <1994a> NUM CARD @ADVL

(ENGCG output)
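
The angle-bracket convention used in the output above can be read off mechanically. The Python sketch below is only an illustration: the helper names are invented, and picking the nearest preceding preposition as the head of an '@<P' word is an assumption of this example, since the tag itself only gives the direction and the kind of head.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, List[str]]  # (word form, tags of the chosen reading)


def head_direction(tag: str) -> str:
    """Say on which side the head of a syntactic tag is to be found."""
    if not tag.startswith("@"):
        raise ValueError("not a syntactic tag: " + tag)
    if tag.endswith(">"):
        return "right"        # e.g. @DN>, @AN>, @NN>
    if tag.startswith("@<"):
        return "left"         # e.g. @<P, @<NOM
    return "unspecified"      # e.g. @SUBJ, @OBJ, @ADVL


def nearest_preposition_to_left(tokens: List[Token], index: int) -> Optional[str]:
    """Assumed heuristic: the closest PREP before `index` heads an @<P word."""
    for i in range(index - 1, -1, -1):
        if "PREP" in tokens[i][1]:
            return tokens[i][0]
    return None


# The last three words of the example, with tags copied from the output.
tokens: List[Token] = [
    ("proposed", ["PCP2", "@<NOM-FMAINV"]),
    ("by", ["PREP", "@ADVL"]),
    ("Karlsson", ["N", "NOM", "SG", "@<P"]),
]

print(head_direction("@<P"))                   # left
print(head_direction("@DN>"))                  # right
print(nearest_preposition_to_left(tokens, 2))  # by
```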



Publications

Early general Constraint Grammar publications:

  • Karlsson, Fred (1990). "Constraint grammar as a framework for parsing running text". In: Karlgren, Hans (ed.), Proceedings of 13th International Conference on Computational Linguistics, volume 3, pp. 168-173, Helsinki, Finland.
  • Karlsson et al. (1995), "Constraint Grammar - A Language-Independent System for Parsing Unrestricted Text". Mouton de Gruyter
  • Tapanainen, Pasi (1996). "The Constraint Grammar Parser CG-2". No 27, Publications of the Department of General Linguistics, University of Helsinki.


Some publications concerning VISL Constraint Grammar systems:


  • Valverde, Pilar & Bick, Eckhard (2010). "A Web Corpus of Spanish Automatically Annotated with Semantic Roles". In: Sánchez, A. & M. Almela. 2010. A Mosaic of Corpus Linguistics. Selected Approaches. Berlin/Frankfurt: Peter Lang. [Oral presentation at: 1st International Conference on Corpus Linguistics (CILC-09), Murcia May 7-9 2009]
  • Bick, Eckhard (2009). A Dependency Constraint Grammar for Esperanto. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8,
  • Bick, Eckhard (2009). Introducing probabilistic information in Constraint Grammar parsing. Proceedings of Corpus Linguistics 2009, Liverpool, UK. Electronically published at ... (forthcoming)
  • Bick, Eckhard & Valverde, Pilar (2009). Automatic Semantic Role Annotation for Spanish. Proceedings of NODALIDA 2009. NEALT Proceedings Series Vol. 4.
  • Bick, Eckhard (2007). Automatic Semantic Role Annotation for Portuguese. In: Proceedings of TIL 2007 - 5th Workshop on Information and Human Language Technology / Anais do XXVII Congresso da SBC (Rio de Janeiro, July 5-6, 2007).
  • Bick, Eckhard (2007), "Functional Aspects in Portuguese NER". In: Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área..
  • Bick, Eckhard (2007), Dan2eng: Wide-Coverage Danish-English Machine Translation, In: Bente Maegaard (ed.), Proceedings of Machine Translation Summit XI, 10-14. Sept. 2007, Copenhagen, Denmark.
  • Bick, Eckhard (2007), Tagging and Parsing an Artificial Language: An Annotated Web-Corpus of Esperanto, In: Proceedings of Corpus Linguistics 2007, Birmingham, UK. Electronically published at (http://ucrel.lancs.ac.uk/publications/CL2007/, Nov. 2007)
  • Bick, Eckhard & Nygaard, Lars (2007). Using Danish as a CG Interlingua. A Wide-Coverage Norwegian-English Machine Translation System. In: Proceedings of the 16th Nordic Conference of Computational Linguistics. Tartu, Estonia. ISBN 978-9985-4-0514-7
  • Bick, Eckhard (2006), Noun Sense Tagging: Semantic Prototype Annotation of a Portuguese Treebank, In: Hajic, Jan & Nivre, Joakim (red.), Proceedings of the Fifth Workshop on Treebanks and Linguistic Theories (December 1-2, 2006, Prague, Czech Republic),
  • Bick, Eckhard (2006), A Constraint Grammar-Based Parser for Spanish. In: Proceedings of TIL 2006 - 4th Workshop on Information and Human Language Technology (Ribeirão Preto, October 27-28, 2006).
  • Bick, Eckhard (2006), "Functional Aspects in Portuguese NER", in: Renata Vieira et al. (eds.) Computational Processing of the Portuguese Language (Proceedings of PROPOR 2006, Itatiaia, May 15th-17th, 2006),
  • Bick, Eckhard (2006), "A Constraint Grammar Based Spellchecker for Danish with a Special Focus on Dyslexics". In: Suominen, Mickael et al. (ed.) A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday. Special Supplement to SKY Journal of Linguistics, Vol. 19 (ISSN 1796-279X),
  • Bick, Eckhard (2005), Turning Constraint Grammar Data into Running Dependency Treebanks, In: Civit, Montserrat & Kübler, Sandra & Martí, Ma. Antònia (red.), Proceedings of TLT 2005 (4th Workshop on Treebanks and Linguistic Theory, Barcelona, December 9th - 10th, 2005),
  • Bick, Eckhard (2005), Gramática Constritiva na Análise Automática de Sintaxe Portuguesa. In: Berber Sardinha, Tony (ed.), A Língua Portuguesa no Computador [The Portuguese Language on the Computer]. Campinas: Mercado de Letras, São Paulo:
  • Bick, Eckhard (2004), PaNoLa: Integrating Constraint Grammar and CALL, In: Henrik Holmboe (red.), Nordic Language Technology, Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000-2004 (Yearbook 2003).
  • Bick, Eckhard (2004), Parsing and evaluating the French Europarl corpus, In: Patrick Paroubek, Isabelle Robba & Anne Vilnat (red.): Méthodes et outils pour l'évaluation des analyseurs syntaxiques (Journée ATALA, May 15, 2004).
  • Bick, Eckhard (2003). "A Constraint Grammar Based Question-Answering System for Portuguese". In: Fernando Moura Pires & Salvador (eds.) Progress in Artificial Intelligence (Proceedings of EPIA'2003, Beja, Dec. 2003)
  • Bick, Eckhard (2003), A CG & PSG Hybrid Approach to Automatic Corpus Annotation, in Kiril Simow & Petya Osenova: Proceedings of SProLaC2003 (at Corpus Linguistics 2003, Lancaster),
  • Bick, Eckhard (2001), En Constraint Grammar Parser for Dansk, in Peter Widell & Mette Kunøe (eds.) 8. Møde om Udforskningen af Dansk Sprog, 12.-13. oktober 2000, pp. 40-50, Århus University
  • Bick, Eckhard (2000), The Parsing System Palavras - Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework, Aarhus: Aarhus University Press (preprint version) -- dr.phil. thesis (cf. the Disputatio for an introduction)
  • Bick, Eckhard (1998), Tagging Speech Data - Constraint Grammar Analysis of Spoken Portuguese, in: Proceedings of the 17th Scandinavian Conference of Linguistics, (Odense 1998)
  • Bick, Eckhard (1996), Automatic Parsing of Portuguese. In García, Laura Sánchez (ed.), Anais / II Encontro para o Processamento Computacional de Português Escrito e Falado. Curitiba: CEFET-PR.


Other publications concerning Constraint Grammar


  • Antonsen, Lene & Huhmarniemi, Saara & Trosterud, Trond (2009). Constraint Grammar in Dialogue systems. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp.13-21. Tartu: Tartu University Library.
  • Antonsen, Lene & Huhmarniemi, Saara & Trosterud, Trond (2009). Interactive pedagogical programs based on Constraint Grammar. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp.10-17. Tartu: Tartu University Library.
  • Lindström, Liina & Müürisep, Kaili (2009). Parsing Corpus of Estonian Dialects. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp. 22-29. Tartu: Tartu University Library.
  • Trosterud, Trond (2009). A Constraint Grammar for Faroese. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp.1-7. Tartu: Tartu University Library.
  • Dhonnchadha, E. Uí (2006). "A Part-of-speech tagger for Irish using Finite-State Morphology and Constraint Grammar Disambiguation". In: Proceedings of LREC'06. Genoa, Italy.
  • Atserias, J. et al. (2006). "FreeLing 1.3: Syntactic and semantic services in an open-source NLP library". In: Proceedings of LREC'06. Genoa, Italy.
  • Hurskainen, Arvi (2006), Constraint Grammar in Unconventional Use: Handling complex Swahili idioms and proverbs. In: Suominen, Mickael et al. (ed.) A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday. Special Supplement to SKY Journal of Linguistics, Vol. 19, pp. 397-406. Turku: The Linguistic Association of Finland
  • Müürisep, Kaili and Uibo, Heli. "Shallow Parsing of Spoken Estonian Using Constraint Grammar". In: P.J.Henriksen & P.R.Skadhauge, Proceedings of NODALIDA-2005 special session on treebanking. Copenhagen Studies in Language #33/2006.
  • Müürisep, Kaili et al. (2003). A New Language for Constraint Grammar: Estonian. In: International Conference Recent Advances in Natural Language Processing. Proceedings. Borovets, Bulgaria, 10-12 September 2003,
  • Hagen, Kristin & Lane, Pia. & Trosterud, Trond (2001). "En grammatikkontrol for bokmål". In: Kjell Ivar Vannebo & Helge Sandøy (eds.): Språkknyt 3-2001.
  • Hagen, K., Johannessen, J. B., Nøklestad, A. (2000). "A Constraint-Based Tagger for Norwegian". In: Lindberg, C.-E. og Lund, S.N. (red.): 17th Scandinavian Conference of Linguistics, Odense. Odense Working Papers in Language and Communication, No. 19, vol I.
  • Arppe, Antti (2000). "Developing a grammar checker for Swedish". In: Nordgård, T. (ed.) Nodalida'99 Proceedings. Department of Linguistics, University of Trondheim.
  • Birn, Jussi (2000). "Detecting grammar errors with Lingsoft's Swedish grammar checker". In: Nordgård, T. (ed.) Nodalida'99 Proceedings. Department of Linguistics, University of Trondheim.
  • Lager, Torbjörn (1999). "The µ-TBL System: Logic Programming Tools for Transformation-Based Learning". In: Proceedings of CoNLL'99, Bergen.
  • Padró, L.(1996). "POS Tagging Using Relaxation Labelling". In: Proceedings of COLING '96. Copenhagen, Denmark.
  • Hurskainen, Arvi (1996). "Disambiguation of morphological analysis in Bantu languages". In: Proceedings of the 16th conference on Computational Linguistics. Copenhagen:ACL. Vol.1,
  • Chanod, Jean-Pierre & Tapanainen, Pasi, "Tagging French - comparing a statistical and a constraint- based method", adapted from: Statistical and Constraint- based Taggers for French, Technical report MLTT-016, Rank Xerox Research Centre, Grenoble, 1994
  • Voutilainen, Atro, Juha Heikkilä, and Arto Anttila (1992). "Constraint Grammar of English - A Performance-Oriented Introduction". No. 21, Publications of the Department of General Linguistics, University of Helsinki.