Beginner's Constraint Grammar HOWTO

From Apertium
Revision as of 13:02, 30 November 2010 by Vaskata (talk | contribs)

General for CG

Constraint Grammar (CG) is a methodological paradigm for natural language processing (NLP). Linguist-written, context-dependent rules are compiled into a grammar that assigns grammatical tags ("readings") to words or other tokens in running text. Typical tags address lemmatisation (lexeme or base form), inflexion, derivation, syntactic function, dependency, valency, case roles, semantic type, etc. Each rule adds, removes, selects or replaces a tag or a set of grammatical tags in a given sentence context. Context conditions can be linked to any tag or tag set of any word anywhere in the sentence, either locally (defined distances) or globally (undefined distances). Context conditions in the same rule may be linked, i.e. conditioned upon each other, negated, or blocked by interfering words or tags. Typical CGs consist of thousands of rules that are applied set-wise in progressive steps, covering ever more advanced levels of analysis. Within each level, safe rules are used before heuristic rules, and no rule is allowed to remove the last reading of a given kind, which provides a high degree of robustness.
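To make the idea of a context-dependent rule concrete, here is a minimal sketch in Python. It is not vislcg3 code and does not use CG rule syntax; the Cohort class and the remove_reading function are invented for illustration. It models a single rule that removes a reading in a given context and refuses to remove the last remaining reading.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    """One word form together with its competing readings (lists of tags)."""
    wordform: str
    readings: list = field(default_factory=list)


def remove_reading(cohorts, position, target_tag, context_offset, context_tag):
    """Toy REMOVE rule (hypothetical helper, not part of any CG tool).

    Drops every reading at `position` that contains `target_tag`, provided
    that some reading of the cohort at `position + context_offset` contains
    `context_tag`. The last remaining reading is never removed.
    Returns True if the cohort was changed.
    """
    ctx = position + context_offset
    if not 0 <= ctx < len(cohorts):
        return False  # the context position lies outside the sentence
    if not any(context_tag in reading for reading in cohorts[ctx].readings):
        return False  # the context condition is not met
    cohort = cohorts[position]
    kept = [reading for reading in cohort.readings if target_tag not in reading]
    if not kept or len(kept) == len(cohort.readings):
        return False  # would remove the last reading, or nothing matched
    cohort.readings = kept
    return True


sentence = [
    Cohort("the", [["the", "DET"]]),
    Cohort("work", [["work", "N", "SG"], ["work", "V", "INF"]]),
]

# "Remove a verb reading if the word immediately to the left is a determiner."
changed = remove_reading(sentence, 1, "V", -1, "DET")
print(changed, sentence[1].readings)
```

A real grammar applies thousands of such rules in sections; this sketch only shows the local context test and the safeguard on the last reading.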

The Constraint Grammar concept was launched by Fred Karlsson in 1990 (Karlsson 1990; Karlsson et al., eds, 1995), and CG taggers and parsers have since been written for a large variety of languages, routinely achieving F-scores for PoS (word class) of over 99%. A number of syntactic CG systems have reported F-scores of around 95% for syntactic function labels. CG systems can be used to create full syntactic trees in other formalisms by adding small, non-terminal based phrase structure grammars or dependency grammars, and a number of corpus/treebank projects have used Constraint Grammar for automatic annotation. CG methodology has also been used in a number of language technology applications, such as spell checkers and machine translation systems.


VISLCG3

What is vislcg3


Vislcg3 is the newest parser generation from Odense. Like its predecessor, vislcg, it is open source. Vislcg3 is licensed under the GPL.

On March 5th 2008 we migrated to vislcg3. Rule files for vislcg are still available in older revisions. For vislcg3 documentation we recommend the online documentation.


Preparations before you install vislcg3


Mac OS X needs certain libraries to be able to run vislcg3. They can be installed by downloading the latest version of ICU here. Save the folder in your home directory and build it with the following commands:


cd ~/icu/source

./runConfigureICU MacOSX

gnumake # (or if the machine protests, try with make or gmake)

gnumake check

sudo gnumake install


After installation, the icu folder may be deleted.


Commands to check out, install and update the vislcg3 program


vislcg3 may be checked out, and later on updated, from Odense via svn, or it may be downloaded from sourceforge. Here, we assume you download it from Odense. Run the following commands:


Commands to check out and install vislcg3


svn co --username anonymous --password anonymous http://beta.visl.sdu.dk/svn/visl/tools/vislcg3/trunk vislcg3

cd vislcg3/

./autogen.sh

make

test/runall.pl

sudo make install


Now vislcg3 is installed in /usr/local/bin/, and is ready to be used.


Note: If you are logged in as a non-admin user, you need to switch to an admin user before you run the last command (the sudo command). To do so, run su [admin-username], replacing [admin-username] with a username that has administrative privileges. Then type in the corresponding password and continue with the final step above.


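Alternatively, you can compile vislcg3 with the platform-specific script and install it in your home directory:

cd vislcg3/trunk/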

./compile-mac.sh # (or: ./compile-linux.sh)

test/runall.pl

mv vislcg3 ~/bin/


Using this method, vislcg3 is installed in your home directory, in ~/bin/.

Note: If you are using this method, there is no need to do the su and sudo steps outlined for the first method.


Commands to update


If you have already checked out vislcg3, you can simply do the following:


cd vislcg3/

svn up

./autogen.sh

make

test/runall.pl

sudo make install


Tip: vislcg3 is downloaded automatically to victorio every night. If you have access to the svn, you can check whether you have the latest version: compare the output of vislcg3 --version on your machine and on victorio, and repeat the steps above if your version is older.
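The version comparison can be done by eye, but a small script avoids mistakes with multi-part version numbers. The following Python sketch assumes that the version line looks like the banner printed in the help output ("VISL CG-3 Disambiguator version 0.9.2.3279"); the function names and the second banner in the usage example are made up for illustration.

```python
import re


def parse_version(banner):
    """Extract a dotted version number from a banner line as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", banner)
    if match is None:
        raise ValueError(f"no version number found in {banner!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_older(local_banner, remote_banner):
    """Return True if the local version is lower than the remote one."""
    return parse_version(local_banner) < parse_version(remote_banner)


local = "VISL CG-3 Disambiguator version 0.9.2.3279"
remote = "VISL CG-3 Disambiguator version 0.9.2.3300"  # invented example value
print(is_older(local, remote))
```

Comparing tuples of integers rather than strings matters here: as strings, "0.9.10" would sort before "0.9.2".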


Compilation and usage of CG files


The CG .rle files can be run as text files, or compiled. They are compiled with the make TARGET=$LANG command.

Vislcg3 can be run with this command:

... | vislcg3 -g src/sme-dis.rle | ...
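If you want to call vislcg3 from a script rather than from a shell pipeline, the same invocation can be wrapped as follows. This is a minimal Python sketch, assuming vislcg3 is on your PATH and that the input has already been morphologically analysed; the function names are invented, and only the -g and --trace options from the vislcg3 help output are used.

```python
import subprocess


def build_command(grammar_path, trace=False):
    """Build the vislcg3 argument list for a given grammar file."""
    command = ["vislcg3", "-g", grammar_path]
    if trace:
        command.append("--trace")  # print debug output alongside normal output
    return command


def disambiguate(analysed_text, grammar_path, trace=False):
    """Pipe analysed text through vislcg3 and return what it writes to stdout.

    Raises subprocess.CalledProcessError if vislcg3 exits with an error.
    """
    result = subprocess.run(
        build_command(grammar_path, trace),
        input=analysed_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(build_command("src/sme-dis.rle", trace=True))
```

Keeping the command construction in its own function makes it easy to inspect or log the exact command before running it.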


Flags

The list of flags can be obtained by vislcg3 --help. That command prints something like this (use the newest version rather than this list):


-bash-3.00$ vislcg3 -h

VISL CG-3 Disambiguator version 0.9.2.3279

Usage: vislcg3 [OPTIONS]

Options:

-h or -? or --help Displays this list.

-V or --version Prints version number.

-g or --grammar Specifies the grammar file to use for disambiguation.

-p or --vislcg-compat Tells the grammar compiler to be compatible with older VISLCG syntax.

--grammar-out Writes the compiled grammar back out in textual form to a file.

--grammar-bin Writes the compiled grammar back out in binary form to a file.

--grammar-info Writes the compiled grammar back out in textual form to a file, with lots of statistics and information.

--grammar-only Compiles the grammar only.

--trace Prints debug output alongside with normal output.

--prefix Sets the prefix for mapping. Defaults to @.

--sections Number of sections to run. Defaults to running all sections.

--single-run Only runs each section once.

--no-mappings Disables running any MAP, ADD, or REPLACE rules.

--no-corrections Disables running any SUBSTITUTE or APPEND rules.

--no-before-sections Disables running rules from BEFORE-SECTIONS.

--no-sections Disables running rules from any SECTION.

--no-after-sections Disables running rules from AFTER-SECTIONS.


--num-windows Number of windows to keep in before/ahead buffers. Defaults to 2.

--always-span Forces all scanning tests to always span across window boundaries.

--soft-limit Number of cohorts after which the SOFT-DELIMITERS kick in. Defaults to 300.

--hard-limit Number of cohorts after which the window is delimited forcefully. Defaults to 500.

--no-magic-readings Prevents running rules on magic readings.

--dep-allow-loops Allows the creation of circular dependencies.


-O or --stdout A file to print output to instead of stdout.

-I or --stdin A file to read input from instead of stdin.

-E or --stderr A file to print errors to instead of stderr.


-C or --codepage-all The codepage to use for grammar, input, and output streams. Auto-detects default from environment.

--codepage-grammar Codepage to use for grammar. Overrides --codepage-all.

--codepage-input Codepage to use for input. Overrides --codepage-all.

--codepage-output Codepage to use for output and errors. Overrides --codepage-all.


-L or --locale-all The locale to use for grammar, input, and output streams. Defaults to en_US_POSIX.

--locale-grammar Locale to use for grammar. Overrides --locale-all.

--locale-input Locale to use for input. Overrides --locale-all.

--locale-output Locale to use for output and errors. Overrides --locale-all.


List of CG systems sorted by language

Free software


VISL CG-3 Constraint Grammar compiler/parser


Non-free software


  • Swahili


Method of annotation

Both the morphological and syntactic analysers use rule-based linguistic descriptions. The system works in the following way:


1. Tokenisation;

2. Lookup of morphological tags;

  • Lexical component;
  • Guesser;

3. Resolution of morphological ambiguities;

4. Lookup of syntactic tags;

5. Resolution of syntactic ambiguities.


Tokenisation

The tokeniser identifies punctuation and multiword units, and splits enclitic forms into grammatical words.
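The three tasks of the tokeniser can be illustrated with a toy example. The following Python sketch is not the ENGCG tokeniser; the multiword list, the enclitic list and the function name are all hypothetical and chosen only to show the idea.

```python
import re

# Hypothetical resources; a real system would use much larger lists.
MULTIWORD_UNITS = [("in", "spite", "of")]
ENCLITICS = ("n't", "'ll", "'s")


def tokenise(text):
    """Split text into tokens: punctuation, multiword units, split enclitics."""
    # Step 1: separate punctuation marks from words.
    rough = re.findall(r"[\w']+|[^\w\s]", text)
    tokens = []
    i = 0
    while i < len(rough):
        # Step 2: join a known multiword unit into a single token.
        for unit in MULTIWORD_UNITS:
            window = rough[i:i + len(unit)]
            if tuple(word.lower() for word in window) == unit:
                tokens.append(" ".join(window))
                i += len(unit)
                break
        else:
            # Step 3: split an enclitic form into two grammatical words.
            word = rough[i]
            for clitic in ENCLITICS:
                if word.lower().endswith(clitic) and len(word) > len(clitic):
                    tokens.extend([word[:-len(clitic)], word[-len(clitic):]])
                    break
            else:
                tokens.append(word)
            i += 1
    return tokens


print(tokenise("She didn't stop, in spite of John's advice."))
```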


Morphological lookup

This process begins with a lexical analysis based on a large lexicon including all inflected and central derived word forms. The lexical analyser assigns all possible morphological analyses to each word that is in the lexicon, and the remaining words are assigned an analysis by means of the guesser (a heuristic rule-based module). These rules are mainly governed by word shape, and if none of them apply, then a nominal analysis is given.
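The division of labour between the lexicon and the guesser can be sketched as follows. This is an illustrative Python example only: the tiny lexicon, the two word-shape heuristics and the function names are invented, and the tags are merely modelled on those in the ENGCG example output.

```python
# Hypothetical mini-lexicon: word form -> list of (lemma, tags) analyses.
LEXICON = {
    "work": [("work", "N NOM SG"), ("work", "V INF")],
    "started": [("start", "V PAST VFIN"), ("start", "PCP2")],
}


def guess(wordform):
    """Heuristic fallback driven by word shape (invented rules).

    If no heuristic applies, a nominal analysis is returned.
    """
    if wordform.endswith("ly"):
        return [(wordform, "ADV")]
    if wordform.endswith("ing"):
        return [(wordform, "PCP1")]
    return [(wordform, "N NOM SG")]


def analyse(wordform):
    """Return all lexicon analyses, or the guesser's analysis if unknown."""
    key = wordform.lower()
    return LEXICON.get(key) or guess(key)


for word in ["work", "quickly", "blorf"]:
    print(word, analyse(word))
```

Note that the lexicon returns every possible analysis for a known word; choosing among them is left to the disambiguation stage.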


Resolution of morphological ambiguities

The rule-based Constraint Grammar parser is used to resolve some of the ambiguities at this stage. The constraints are partial paraphrases of form definitions of syntactic constructs such as the noun phrase. The English grammar, for example, contains about 1,200 grammar-based constraints plus 200 heuristic constraints.


Syntactic lookup

All possible syntactic tags are introduced for each word. This could, in some cases, mean that more than ten alternatives are given for one morphological reading.


Resolution of syntactic ambiguities

The parser finally consults a syntactic disambiguation grammar. The English version of the Constraint Grammar contains 800 syntactic constraints, of a similar form to the rules at the morphological resolution stage.


Syntactic tags

The English version of the Constraint Grammar marks the syntactic functions shown in the table below.


@+FAUXV finite auxiliary verb

@-FAUXV nonfinite auxiliary verb

@+FMAINV finite main verb

@-FMAINV nonfinite main verb

@SUBJ subject

@F-SUBJ formal subject

@OBJ object

@I-OBJ indirect object

@PCOMPL-S subject complement

@PCOMPL-O object complement

@APP apposition

@NPHR stray nominal

@N title

@O-ADVL object adverbial

@ADVL adverbial

@DN> determiner

@NN> premodifying noun

@AN> premodifying adjective

@QN> premodifying quantifier

@GN> premodifying genitive

@AD-A> premodifying ad-adjective

@<AD-A postmodifying ad-adjective

@<NOM-FMAINV postmodifying nonfinite verb

@<NOM other postmodifier

@<P-FMAINV nonfinite verb as complement of preposition

@<P other complement of preposition

@CC coordinator

@CS subordinator

@INFMARK infinitive marker

(ENGCG tags)


Example

As mentioned above, the syntactic tags are distinguished by the use of the `@' sign. The analysis is dependency-based, but only partially. As can be seen in the table of ENGCG tags above, dependency relations are shown by left and right angle brackets, which indicate that a word depends on another word either to its right or to its left. In the example below, Karlsson is marked as `@<P', meaning that it is the complement of a preposition found earlier in the sentence.
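The bracket convention can be read off a tag mechanically. The following Python sketch only encodes the convention as described here; the function name head_direction is made up for illustration and is not part of any CG tool.

```python
def head_direction(tag):
    """Return where the head of a word lies, judging by its syntactic tag.

    '@DN>' points to a head on the right, '@<P' to a head on the left,
    and tags without angle brackets (e.g. '@SUBJ') give no direction.
    """
    if not tag.startswith("@"):
        raise ValueError(f"not a syntactic tag: {tag!r}")
    if tag.endswith(">"):
        return "right"
    if tag.startswith("@<"):
        return "left"
    return None


for tag in ["@DN>", "@<P", "@SUBJ"]:
    print(tag, head_direction(tag))
```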


"<*i>"

  • "i" <*> <NonMod> PRON PERS NOM SG1 SUBJ @SUBJ

"<started>"

  • "start" <SV> <SVO> V PAST VFIN @+FMAINV

"<work>"

  • "work" N NOM SG @OBJ

"<on>"

  • "on" PREP @ADVL

"<an>"

  • "an" <Indef> DET CENTRAL ART SG @DN>

"<*english>"

  • "english" <*> <Nominal> A ABS @AN>

"<description>"

  • "description" N NOM SG @<P

"<within>"

  • "within" PREP @<NOM @ADVL

"<the>"

  • "the" <Def> DET CENTRAL ART SG/PL @DN>

"<*constraint>"

  • "constraint" <*> N NOM SG @NN>

"<*grammar>"

  • "grammar" <*> N NOM SG @NN>

"<framework>"

  • "framework" N NOM SG @<P

"<proposed>"

  • "propose" <Vcog> <SVO> <SV> PCP2 @<NOM-FMAINV

"<by>"

  • "by" PREP @ADVL

"<*karlsson>"

  • "karlsson" <*> <Proper> N NOM SG @<P

"<$[>"

"<1990>"

  • "1990" <1900> NUM CARD @ADVL

"<$;>"

"<1994a>"

  • "1994a" <1994a> NUM CARD @ADVL

(ENGCG output)
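Output in this format is easy to process further, since each word form is enclosed in "< >" and each reading line starts with the quoted base form. The following Python sketch is a minimal reader for the layout shown above, not an official parser; the function names are invented, and the sample consists of two cohorts copied from the example.

```python
SAMPLE = '''
"<by>"
    "by" PREP @ADVL
"<*karlsson>"
    "karlsson" <*> <Proper> N NOM SG @<P
'''


def parse_cg(text):
    """Parse CG-style output into a list of (wordform, readings) pairs.

    Each reading is a list of whitespace-separated items, starting with
    the quoted base form. Leading bullets and blank lines are ignored.
    """
    cohorts = []
    for raw in text.splitlines():
        line = raw.strip().lstrip("•").strip()
        if not line:
            continue
        if line.startswith('"<'):
            # A new cohort: keep what is between "< and >".
            wordform = line[2:line.index('>"')]
            cohorts.append((wordform, []))
        elif cohorts:
            cohorts[-1][1].append(line.split())
    return cohorts


def syntactic_tags(reading):
    """Return the @-prefixed syntactic tags of one reading."""
    return [item for item in reading if item.startswith("@")]


for wordform, readings in parse_cg(SAMPLE):
    for reading in readings:
        print(wordform, syntactic_tags(reading))
```

A cohort that still carries more than one syntactic tag after disambiguation, such as "within" in the example, simply yields a list with several entries.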



Publications

Early general Constraint Grammar publications:

  • Karlsson, Fred (1990). "Constraint grammar as a framework for parsing running text". In: Karlgren, Hans (ed.), Proceedings of 13th International Conference on Computational Linguistics, volume 3, pp. 168-173, Helsinki, Finland.
  • Karlsson et al. (1995), "Constraint Grammar - A Language-Independent System for Parsing Unrestricted Text". Mouton de Gruyter
  • Tapanainen, Pasi (1996). "The Constraint Grammar Parser CG-2". No 27, Publications of the Department of General Linguistics, University of Helsinki.


Some publications concerning VISL Constraint Grammar systems:


  • Valverde, Pilar & Bick, Eckhard (2010). "A Web Corpus of Spanish Automatically Annotated with Semantic Roles". In: Sánchez, A. & M. Almela. 2010. A Mosaic of Corpus Linguistics. Selected Approaches. Berlin/Frankfurt: Peter Lang. [Oral presentation at: 1st International Conference on Corpus Linguistics (CILC-09), Murcia May 7-9 2009]
  • Bick, Eckhard (2009). A Dependency Constraint Grammar for Esperanto. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8,
  • Bick, Eckhard (2009). Introducing probabilistic information in Constraint Grammar parsing. Proceedings of Corpus Linguistics 2009, Liverpool, UK. Electronically published at ... (forthcoming)
  • Bick, Eckhard & Valverde, Pilar (2009). Automatic Semantic Role Annotation for Spanish. Proceedings of NODALIDA 2009. NEALT Proceedings Series Vol. 4.
  • Bick, Eckhard (2007). Automatic Semantic Role Annotation for Portuguese. In: Proceedings of TIL 2007 - 5th Workshop on Information and Human Language Technology / Anais do XXVII Congresso da SBC (Rio de Janeiro, July 5-6, 2007).
  • Bick, Eckhard (2007), "Functional Aspects in Portuguese NER". In: Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área..
  • Bick, Eckhard (2007), Dan2eng: Wide-Coverage Danish-English Machine Translation, In: Bente Maegaard (ed.), Proceedings of Machine Translation Summit XI, 10-14. Sept. 2007, Copenhagen, Denmark.
  • Bick, Eckhard (2007), Tagging and Parsing an Artificial Language: An Annotated Web-Corpus of Esperanto, In: Proceedings of Corpus Linguistics 2007, Birmingham, UK. Electronically published at (http://ucrel.lancs.ac.uk/publications/CL2007/, Nov. 2007)
  • Bick, Eckhard & Nygaard, Lars (2007). Using Danish as a CG Interlingua. A Wide-Coverage Norwegian-English Machine Translation System. In: Proceedings of the 16th Nordic Conference of Computational Linguistics. Tartu, Estonia. ISBN 978-9985-4-0514-7
  • Bick, Eckhard (2006), Noun Sense Tagging: Semantic Prototype Annotation of a Portuguese Treebank, In: Hajic, Jan & Nivre, Joakim (red.), Proceedings of the Fifth Workshop on Treebanks and Linguistic Theories (December 1-2, 2006, Prague, Czech Republic),
  • Bick, Eckhard (2006), A Constraint Grammar-Based Parser for Spanish. In: Proceedings of TIL 2006 - 4th Workshop on Information and Human Language Technology (Ribeirão Preto, October 27-28, 2006).
  • Bick, Eckhard (2006), "Functional Aspects in Portuguese NER", in: Renata Vieira et al. (eds.) Computational Processing of the Portuguese Language (Proceedings of PROPOR 2006, Itatiaia, May 15th-17th, 2006),
  • Bick, Eckhard (2006), "A Constraint Grammar Based Spellchecker for Danish with a Special Focus on Dyslexics". In: Suominen, Mickael et al. (ed.) A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday. Special Supplement to SKY Journal of Linguistics, Vol. 19 (ISSN 1796-279X),
  • Bick, Eckhard (2005), Turning Constraint Grammar Data into Running Dependency Treebanks, In: Civit, Montserrat & Kübler, Sandra & Martí, Ma. Antònia (red.), Proceedings of TLT 2005 (4th Workshop on Treebanks and Linguistic Theory, Barcelona, December 9th - 10th, 2005),
  • Bick, Eckhard (2005), Gramática Constritiva na Análise Automática de Sintaxe Portuguesa. In: Berber Sardinha, Tony (ed.), A Língua Portuguesa no Computador [The Portuguese Language on the Computer]. Campinas: Mercado de Letras, São Paulo:
  • Bick, Eckhard (2004), PaNoLa: Integrating Constraint Grammar and CALL, In: Henrik Holmboe (red.), Nordic Language Technology, Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000-2004 (Yearbook 2003).
  • Bick, Eckhard (2004), Parsing and evaluating the French Europarl corpus, In: Patrick Paroubek, Isabelle Robba & Anne Vilnat (red.): Méthodes et outils pour l'évaluation des analyseurs syntaxiques (Journée ATALA, May 15, 2004).
  • Bick, Eckhard (2003). "A Constraint Grammar Based Question-Answering System for Portuguese". In: Fernando Moura Pires & Salvador (eds.) Progress in Artificial Intelligence (Proceedings of EPIA'2003, Beja, Dec. 2003)
  • Bick, Eckhard (2003), A CG & PSG Hybrid Approach to Automatic Corpus Annotation, in Kiril Simow & Petya Osenova: Proceedings of SProLaC2003 (at Corpus Linguistics 2003, Lancaster),
  • Bick, Eckhard (2001), En Constraint Grammar Parser for Dansk, in Peter Widell & Mette Kunøe (eds.) 8. Møde om Udforskningen af Dansk Sprog, 12.-13. oktober 2000, pp. 40-50, Århus University
  • Bick, Eckhard (2000), The Parsing System Palavras - Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework, Aarhus: Aarhus University Press (preprint version) -- dr.phil. thesis (cf. the Disputatio for an introduction)
  • Bick, Eckhard (1998), Tagging Speech Data - Constraint Grammar Analysis of Spoken Portuguese, in: Proceedings of the 17th Scandinavian Conference of Linguistics, (Odense 1998)
  • Bick, Eckhard (1996), Automatic Parsing of Portuguese. In García, Laura Sánchez (ed.), Anais / II Encontro para o Processamento Computacional de Português Escrito e Falado. Curitiba: CEFET-PR.


Other publications concerning Constraint Grammar


  • Antonsen, Lene & Huhmarniemi, Saara & Trosterud, Trond (2009). Constraint Grammar in Dialogue systems. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp.13-21. Tartu: Tartu University Library.
  • Antonsen, Lene & Huhmarniemi, Saara & Trosterud, Trond (2009). Interactive pedagogical programs based on Constraint Grammar. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp.10-17. Tartu: Tartu University Library.
  • Lindström, Liina & Müürisep, Kaili (2009). Parsing Corpus of Estonian Dialects. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp. 22-29. Tartu: Tartu University Library.
  • Trosterud, Trond (2009). A Constraint Grammar for Faroese. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp.1-7. Tartu: Tartu University Library.
  • Dhonnchadha, E. Uí (2006). "A Part-of-speech tagger for Irish using Finite-State Morphology and Constraint Grammar Disambiguation". In: Proceedings of LREC'06. Genoa, Italy.
  • Atserias, J. et al. (2006). "FreeLing 1.3: Syntactic and semantic services in an open-source NLP library". In: Proceedings of LREC'06. Genoa, Italy.
  • Hurskainen, Arvi (2006), Constraint Grammar in Unconventional Use: Handling complex Swahili idioms and proverbs. In: Suominen, Mickael et al. (ed.) A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday. Special Supplement to SKY Journal of Linguistics, Vol. 19, pp. 397-406. Turku: The Linguistic Association of Finland
  • Müürisep, Kaili and Uibo, Heli. "Shallow Parsing of Spoken Estonian Using Constraint Grammar". In: P.J.Henriksen & P.R.Skadhauge, Proceedings of NODALIDA-2005 special session on treebanking. Copenhagen Studies in Language #33/2006.
  • Müürisep, Kaili et al. (2003). A New Language for Constraint Grammar: Estonian. In: International Conference Recent Advances in Natural Language Processing. Proceedings. Borovets, Bulgaria, 10-12 September 2003,
  • Hagen, Kristin & Lane, Pia. & Trosterud, Trond (2001). "En grammatikkontrol for bokmål". In: Kjell Ivar Vannebo & Helge Sandøy (eds.): Språkknyt 3-2001.
  • Hagen, K., Johannessen, J. B., Nøklestad, A. (2000). "A Constraint-Based Tagger for Norwegian". In: Lindberg, C.-E. og Lund, S.N. (red.): 17th Scandinavian Conference of Linguistics, Odense. Odense Working Papers in Language and Communication, No. 19, vol I.
  • Arppe, Antti (2000). "Developing a grammar checker for Swedish". In: Nordgård, T. (ed.) Nodalida'99 Proceedings. Department of Linguistics, University of Trondheim.
  • Birn, Jussi (2000). "Detecting grammar errors with Lingsoft's Swedish grammar checker". In: Nordgård, T. (ed.) Nodalida'99 Proceedings. Department of Linguistics, University of Trondheim.
  • Lager, Torbjörn (1999). "The µ-TBL System: Logic Programming Tools for Transformation-Based Learning". In: Proceedings of CoNLL'99, Bergen.
  • Padró, L.(1996). "POS Tagging Using Relaxation Labelling". In: Proceedings of COLING '96. Copenhagen, Denmark.
  • Hurskainen, Arvi (1996). "Disambiguation of morphological analysis in Bantu languages". In: Proceedings of the 16th conference on Computational Linguistics. Copenhagen:ACL. Vol.1,
  • Chanod, Jean-Pierre & Tapanainen, Pasi, "Tagging French - comparing a statistical and a constraint- based method", adapted from: Statistical and Constraint- based Taggers for French, Technical report MLTT-016, Rank Xerox Research Centre, Grenoble, 1994
  • Voutilainen, Atro, Juha Heikkilä, and Arto Anttila (1992). "Constraint Grammar of English - A Performance-Oriented Introduction". No. 21, Publications of the Department of General Linguistics, University of Helsinki.