Difference between revisions of "Beginner's Constraint Grammar HOWTO"

From Apertium
(wget -> curl)
 
(17 intermediate revisions by 6 users not shown)
Line 1: Line 1:
[[Installation et fonctionnement de Constraint Grammar|En français]]
=General for CG=


''The installation instructions for Apertium and language pairs below refer to the Ubuntu distribution. For other Linux distributions or other operating systems, see the general [[Installation]] page''.
Constraint Grammar (CG) is a methodological paradigm for [http://en.wikipedia.org/wiki/Natural_language_processing Natural language processing] (NLP). Linguist-written, context dependent rules are compiled into a grammar that assigns grammatical tags ("readings") to words or other tokens in running text. Typical tags address [http://en.wikipedia.org/wiki/Lemmatisation lemmatisation] (lexeme or base form), [http://en.wikipedia.org/wiki/Inflexion inflexion], [http://en.wikipedia.org/wiki/Derivation_%28linguistics%29 derivation], [http://en.wikipedia.org/wiki/Syntactic_function syntactic function], dependency, [http://en.wikipedia.org/wiki/Valency_%28linguistics%29 valency], [http://en.wikipedia.org/wiki/Case_role case roles], [http://en.wikipedia.org/wiki/Semantic semantic] type etc. Each rule either adds, removes, selects or replaces a tag or a set of grammatical tags in a given sentence context. Context conditions can be linked to any tag or tag set of any word anywhere in the sentence, either locally (defined distances) or globally (undefined distances). Context conditions in the same rule may be linked, i.e. conditioned upon each other, negated, or blocked by interfering words or tags. Typical CGs consist of thousands of rules, that are applied set-wise in progressive steps, covering ever more advanced levels of analysis. Within each level, safe rules are used before heuristic rules, and no rule is allowed to remove the last reading of a given kind, thus providing a high degree of robustness.


==Download==
The Constraint Grammar concept was launched by [http://en.wikipedia.org/wiki/Fred_Karlsson Fred Karlsson] in 1990 (Karlsson 1990; Karlsson et al., eds, 1995), and CG taggers and parsers have since been written for a large variety of languages, routinely achieving accuracy F-scores for PoS (word class) of over 99%. A number of syntactic CG systems have reported F-scores of around 95% for syntactic function labels. CG systems can be used to create full syntactic trees in other formalisms by adding small, non-terminal based [http://en.wikipedia.org/wiki/Phrase_structure_grammar phrase structure grammars] or [http://en.wikipedia.org/wiki/Dependency_grammar dependency grammars], and a number of corpus/treebank projects have used Constraint Grammar for automatic annotation. CG methodology has also been used in a number of language technology applications, such as [http://en.wikipedia.org/wiki/Spell_checker spell checkers] and [http://en.wikipedia.org/wiki/Machine_translation machine translation] systems.


;Apertium


Sourced from [[Install Apertium core using packaging]]
First, remove any Apertium packages you have installed from operating system repositories. They will be out-of-date, sometimes by years.


Add the repository,
=VISLCG3=


<pre>
# Pick one:


# Nightly, unstable, new, almost always use this:
'''What is vislcg3'''
curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | sudo bash


# Release, stable, old:
curl -sS https://apertium.projectjj.com/apt/install-release.sh | sudo bash
</pre>


Vislcg3 is the newest parser generation from Odense. As its predecessor, vislcg, it is open source. Vislcg3 is licensed under GPL.


You should see messages.
Starting on March 5th 2008, we have migrated to vislcg3. Rule files for vislcg are still available in older revisions. For vislcg3 documentation we recommend the online [http://beta.visl.sdu.dk/cg3.html documentation].


Install dev tools,


<pre>
'''Preparations before you install vislcg3'''
sudo apt-get -f install apertium-all-dev
</pre>


====About the Debian repository install====
Check that the script installed the Apertium repository details,


<pre>
The MacOS needs certain libraries to be able to run vislcg3. They can be found by downloading the latest version of ICU [http://site.icu-project.org/download here]. The folder should be saved in the home catalogue and run with the commands:
apt-cache policy | grep apertium
</pre>


Unfortunately, due to the seamless upgrading of Debian packaging, it is difficult to see which packages the new repository has added, and where. Even Synaptic, the wonder GUI, offers no way to find out. You could try this brute-force command line,


<pre>
find /var/lib/apt/lists/ | grep 'projectjj.*Packages' | xargs grep -h Package
</pre>


This will, if nothing else, tell you a lot about the byways of the Apertium project.
cd ~/icu/source


./runConfigureICU MacOSX


;Constraint grammar
gnumake # (or if the machine protests, try with make or gmake)


To use CG we must have lttoolbox (we have it), apertium (we have it too) and ICU (we have to install it now).
gnumake check


How to install ICU for Ubuntu. Open terminal and copy/paste this code:
sudo gnumake install


sudo apt-get install libicu-dev


Now we can install apertium, lttoolbox and CG.


==Install==
After installation, the icu folder may be deleted.


;Apertium


Before installing apertium we have to install lttoolbox (which was downloaded together with apertium). To do that, run:


'''cd apertium'''
'''Commands to check out, install and update the vislcg3 program'''


'''cd lttoolbox/'''


'''PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh'''
vislcg3 may be checked out, and later on updated, from Odense via svn, or it may be downloaded from sourceforge. Here, we assume you download it from Odense. Run the following commands:


'''make'''


'''Commands to check out and install vislcg3'''
'''sudo make install'''


'''sudo ldconfig'''


svn co --username anonymous --password anonymous http://beta.visl.sdu.dk/svn/visl/tools/vislcg3/trunk vislcg3


The terminal will ask for your password again ('''[sudo] password for user:'''). Type it and press '''Enter'''.
cd vislcg3/
Wait until the prompt user@ubuntu:~/apertium/lttoolbox$ reappears, then run:


'''cd ..'''
./autogen.sh


'''cd apertium/'''
make


'''PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh'''
test/runall.pl


sudo make install
'''make'''


'''sudo make install'''


'''sudo ldconfig'''


This starts installing apertium, which takes a few minutes. When the terminal shows
Now vislcg3 is installed in /usr/local/bin/, and is ready to be used.


'''vasil@ubuntu:~/apertium/apertium$ sudo ldconfig'''


'''vasil@ubuntu:~/apertium/apertium$ '''
'''''Note:''' If you are logged in as a non-admin user, you need to switch to an admin user before you run the last command (the sudo command): su [admin-username] Replace [admin-username] with a username with administrative privileges. Then type in the corresponding password, and continue with the final step above.''


the installation is complete.




cd vislcg3/trunk/


;Constraint grammar
./compile-mac.sh # (or: ./compile-linux.sh)


test/runall.pl


To install CG, open a terminal and run:
mv vislcg3 ~/bin/


'''$ svn co --username anonymous --password anonymous http://beta.visl.sdu.dk/svn/visl/tools/vislcg3/trunk vislcg3'''


'''$ cd vislcg3'''


'''$ sh autogen.sh --prefix=<prefix>'''
Using this method, vislcg3 is installed in your home dir, in ~/bin/.


'''$ make'''
'''''Note:''' If you are using this method, there is no need to do the su + sudo steps outlined in the first case.
''


'''$ make install'''


If you run the install step with sudo, it will ask for your password ('''[sudo] password for user:'''). Type it and press '''Enter.'''


We are ready.
'''Commands to update'''


=Usage=


For the examples below, we use the language pair apertium-es-ca, but the principles should be applicable to any language pair. First we have to compile this pair. Go to the directory where you installed Apertium, then run:
If you already have checked out vislcg3, then you can simply do the following:


cd apertium/apertium-es-ca
sh autogen.sh
make


Let's check that what we installed is working. First run:
cd vislcg3/


echo "vino a la playa" | lt-proc es-ca.automorf.bin
svn up


This should give you:
./autogen.sh


^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ ^playa/playa<n><f><sg>$
make


Here we have two ambiguities: one between a noun and a verb, and another between a determiner and a pronoun. We can write rules that choose between the competing readings. First we define our categories; these can be tags, wordforms or lemmas. It might help to think of them as "coarse tags", which may involve a set of fine tags or lemmas. So, create a file grammar.txt, and add the following text:
test/runall.pl


DELIMITERS = "<$.>" ;
sudo make install
LIST NOUN = n;
LIST VERB = vblex;
LIST DET = det;
LIST PRN = prn;
LIST PREP = pr;
SECTION


The first rule states: "When the current lexical unit can be a pronoun or a determiner, and it is followed on the right by a lexical unit which could be a noun, choose the determiner". We have to add this rule to the file, and compile using cg-comp:


rule:
'''Tips:''' The vislcg3 is downloaded automatically to victorio every night. If you have access to the svn you can check whether you have the latest version (compare vislcg3 --version on your machine and on victorio, and repeat the steps above if your version is older).




# 1
'''Compilation and usage of CG files'''
SELECT DET IF
(0 DET)
(0 PRN)
(1 NOUN) ;


compile with:


$ ./cg-comp grammar.txt grammar.bin
The CG .rle files can be run as text files, or compiled. They are compiled with the make TARGET=$LANG command:
Sections: 1, Rules: 1, Sets: 6, Tags: 7


To try out what we have done, run:
... | vislcg3 -g src/sme-dis.rle | ...


$ echo "vino a la playa" | lt-proc es-ca.automorf.bin | cg-proc grammar.bin
Vislcg3 can be run with this command:
^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$


... | vislcg3 -g src/sme-dis.rle | ...


The second rule states: "When the current lexical unit can be a noun or a verb, if the subsequent two units to the right are preposition and determiner, remove the noun reading." Now we have to add this rule:


'''Flags'''


rule:
The list of flags can be obtained by vislcg3 --help. That command prints something like this (use the newest version rather than this list):


# 2
REMOVE NOUN IF
(0 NOUN)
(0 VERB)
(1 PREP)
(2 DET) ;


re-compile the grammar and test:
-bash-3.00$ vislcg3 -h


$ echo "vino a la playa" | lt-proc es-ca.automorf.bin | cg-proc grammar.bin
VISL CG-3 Disambiguator version 0.9.2.3279
^vino/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$


The third rule states "Remove interjection if the preceding word is a modal verb."
Usage: vislcg3 [OPTIONS]


Options:


[[Category:Documentation in English]]
'''-h or -? or --help ''' Displays this list.

'''-V or --version''' Prints version number.

'''-g or --grammar''' Specifies the grammar file to use for disambiguation.

'''-p or --vislcg-compat''' Tells the grammar compiler to be compatible with older VISLCG syntax.

'''--grammar-out''' Writes the compiled grammar back out in textual form to a file.

'''--grammar-bin ''' Writes the compiled grammar back out in binary form to a file.

'''--grammar-info ''' Writes the compiled grammar back out in textual form to a file, with lots of statistics and information.

'''--grammar-only''' Compiles the grammar only.

'''--trace''' Prints debug output alongside with normal output.

'''--prefix''' Sets the prefix for mapping. Defaults to @.

'''--sections''' Number of sections to run. Defaults to running all sections.

'''--single-run''' Only runs each section once.

'''--no-mappings''' Disables running any MAP, ADD, or REPLACE rules.

'''--no-corrections ''' Disables running any SUBSTITUTE or APPEND rules.

'''--no-before-sections''' Disables running rules from BEFORE-SECTIONS.

'''--no-sections''' Disables running rules from any SECTION.

'''--no-after-sections ''' Disables running rules from AFTER-SECTIONS.


'''--num-windows''' Number of windows to keep in before/ahead buffers. Defaults to 2.

'''--always-span''' Forces all scanning tests to always span across window boundaries.

'''--soft-limit''' Number of cohorts after which the SOFT-DELIMITERS kick in. Defaults to 300.

'''--hard-limit''' Number of cohorts after which the window is delimited forcefully. Defaults to 500.

'''--no-magic-readings''' Prevents running rules on magic readings.

'''--dep-allow-loops''' Allows the creation of circular dependencies.


'''-O or --stdout''' A file to print output to instead of stdout.

'''-I or --stdin''' A file to read input from instead of stdin.

'''-E or --stderr''' A file to print errors to instead of stderr.


'''-C or --codepage-all''' The codepage to use for grammar, input, and output streams. Auto-detects default from environment.

'''--codepage-grammar''' Codepage to use for grammar. Overrides --codepage-all.

'''--codepage-input''' Codepage to use for input. Overrides --codepage-all.

'''--codepage-output''' Codepage to use for output and errors. Overrides --codepage-all.


'''-L or --locale-all''' The locale to use for grammar, input, and output streams. Defaults to en_US_POSIX.

'''--locale-grammar''' Locale to use for grammar. Overrides --locale-all.

'''--locale-input''' Locale to use for input. Overrides --locale-all.

'''--locale-output''' Locale to use for output and errors. Overrides --locale-all.



=List of CG systems sorted by language=

'''Free software'''


[http://beta.visl.sdu.dk/cg3.html VISL CG-3] Constraint Grammar compiler/parser

*[http://en.wikipedia.org/wiki/Northern_Sami_language North] and [http://en.wikipedia.org/wiki/Lule_Sami_language Lule] Sami, [http://en.wikipedia.org/wiki/Faroese_language Faroese], [http://en.wikipedia.org/wiki/Komi_language Komi] and [http://en.wikipedia.org/wiki/Greenlandic_language Greenlandic] from the [http://en.wikipedia.org/wiki/University_of_Troms%C3%B8 University of Tromsø]
** Fred Karlsson's original Finnish FinCG is also available from the University of Tromsø as GPL.
*[http://en.wikipedia.org/wiki/Norwegian_language Norwegian] Nynorsk and Bokmål online, Oslo-Bergen tagger
*[http://en.wikipedia.org/wiki/Breton_language Breton], Welsh, Irish Gaelic and [http://en.wikipedia.org/wiki/Norwegian_language Norwegian] (converted from the above) in Apertium (see CG in Apertium)



'''Non-free software'''


*Basque [http://paginaspersonales.deusto.es/abaitua/konzeptu/nlp/MGnag.html Basque]

*Catalan [http://mutis.upf.es/cgi-bin/catcg/demo.pl CATCG]

*Danish [http://beta.visl.sdu.dk/constraint_grammar.html/ DanGram]

*English [http://www2.lingsoft.fi/cgi-bin/engcg ENGCG], ENGCG-2, [http://beta.visl.sdu.dk/constraint_grammar.html/ VISL-ENGCG]

*Esperanto [http://beta.visl.sdu.dk/constraint_grammar.html/ EspGram]

*French [http://beta.visl.sdu.dk/constraint_grammar.html/ FrAG]

*German [http://beta.visl.sdu.dk/constraint_grammar.html/ GerGram]

*Irish [https://www.cs.tcd.ie/Elaine.UiDhonnchadha/irish.htm online]

*Italian [http://beta.visl.sdu.dk/visl/it/parsing/automatic/parse.php ItaGram]

*Spanish [http://beta.visl.sdu.dk/constraint_grammar.html/ HISPAL]

*Swedish [http://www2.lingsoft.fi/doc/swecg/intro/ SWECG]

*Swahili

*Portuguese [http://beta.visl.sdu.dk/constraint_grammar.html/ PALAVRAS]



=Method of annotation=

Both the morphological and syntactic analysers use rule-based linguistic descriptions. The system works in the following way:


1. Tokenisation;

2. Lookup of morphological tags;

* Lexical component;

* Guesser;

3. Resolution of morphological ambiguities;

4. Lookup of syntactic tags;

5. Resolution of syntactic ambiguities



=Tokenisation=

The tokeniser identifies punctuation and multiword units, and splits enclitic forms into grammatical words.


=Morphological lookup=


This process begins with a lexical analysis based on a large lexicon including all inflected and central derived word forms. The lexical analyser assigns all possible morphological analyses to each word that is in the lexicon, and the remaining words are assigned an analysis by means of the guesser (a heuristic rule-based module). These rules are mainly governed by word shape, and if none of them apply, then a nominal analysis is given.


=Resolution of morphological ambiguities=


The rule-based Constraint Grammar parser is used to resolve some of the ambiguities at this stage. The constraints are partial paraphrases of form definitions of syntactic constructs such as the noun phrase. The English grammar for example, contains about 1,200 grammar-based constraints, plus 200 heuristic constraints.


=Syntactic lookup=

All possible syntactic tags are introduced for each word. This could, in some cases, mean that more than ten alternatives are given for one morphological reading.


=Resolution of syntactic ambiguities=

The parser finally consults a syntactic disambiguation grammar. The English version of the Constraint Grammar contains 800 syntactic constraints, of a similar form to the rules at the morphological resolution stage.


=Syntactic tags=

The English version of the Constraint Grammar marks the syntactic functions shown in the table below.


'''@+FAUXV''' finite auxiliary verb

'''@-FAUXV''' nonfinite auxiliary verb

'''@+FMAINV''' finite main verb

'''@-FMAINV''' nonfinite main verb

'''@SUBJ''' subject

'''@F-SUBJ''' formal subject

'''@OBJ''' object

'''@I-OBJ''' indirect object

'''@PCOMPL-S''' subject complement

'''@PCOMPL-O''' object complement

'''@APP''' apposition

'''@NPHR''' stray nominal

'''@N''' title

'''@O-ADVL''' object adverbial

'''@ADVL''' adverbial

'''@DN>''' determiner

'''@NN>''' premodifying noun

'''@AN>''' premodifying adjective

'''@QN>''' premodifying quantifier

'''@GN>''' premodifying genitive

'''@AD-A>''' premodifying ad-adjective

'''@<AD-A''' postmodifying ad-adjective

'''@<NOM-FMAINV''' postmodifying nonfinite verb

'''@<NOM''' other postmodifier

'''@<P-FMAINV''' nonfinite verb as complement of preposition

'''@<P''' other complement of preposition

'''@CC''' coordinator

'''@CS''' subordinator

'''@INFMARK''' infinitive marker

(ENGCG tags)


=Example=


As mentioned above, the syntactic tags are distinguished by the use of the `@' sign. The analysis is dependency based, but only partially. As can be seen in table 3.5, dependency relations are shown by the use of the left and right angle brackets, showing that a word is dependent on another to either the right or the left. In the example below, Karlsson is marked as `@<P', meaning that it is the complement of a preposition found before Karlsson.


"<*i>"
*"i" <*> <NonMod> PRON PERS NOM SG1 SUBJ @SUBJ
"<started>"
*"start" <SV> <SVO> <P/on>V PAST VFIN @+FMAINV
"<work>"
*"work" N NOM SG @OBJ
"<on>"
*"on" PREP @ADVL
"<an>"
*"an" <Indef> DET CENTRAL ART SG @DN>
"<*english>"
*"english" <*> <Nominal> A ABS @AN>
"<description>"
*"description" N NOM SG @<P
"<within>"
*"within" PREP @<NOM @ADVL
"<the>"
*"the" <Def> DET CENTRAL ART SG/PL @DN>
"<*constraint>"
*"constraint" <*> N NOM SG @NN>
"<*grammar>"
*"grammar" <*> N NOM SG @NN>
"<framework>"
*"framework" N NOM SG @<P
"<proposed>"
*"propose" <Vcog> <SVO> <SV> PCP2 @<NOM-FMAINV
"<by>"
*"by" PREP @ADVL
"<*karlsson>"
*"karlsson" <*> <Proper> N NOM SG @<P
"<$[>"
"<1990>"
*"1990" <1900> NUM CARD @ADVL
"<$;>"
"<1994a>"
*"1994a" <1994a> NUM CARD @ADVL

(ENGCG output)




=Publications=




'''Early general Constraint Grammar publications:'''

*Karlsson, Fred (1990). "Constraint grammar as a framework for parsing running text". In: Karlgren, Hans (ed.), Proceedings of 13th International Conference on Computational Linguistics, volume 3, pp. 168-173, Helsinki, Finland.
*Karlsson et al. (1995), "Constraint Grammar - A Language-Independent System for Parsing Unrestricted Text". Mouton de Gruyter
*Tapanainen, Pasi (1996). "The Constraint Grammar Parser CG-2". No 27, Publications of the Department of General Linguistics, University of Helsinki.


'''Some publications concerning VISL Constraint Grammar systems:'''


*Valverde, Pilar & Bick, Eckhard (2010). "A Web Corpus of Spanish Automatically Annotated with Semantic Roles". In: Sánchez, A. & M. Almela. 2010. A Mosaic of Corpus Linguistics. Selected Approaches. Berlin/Frankfurt: Peter Lang. [Oral presentation at: 1st International Conference on Corpus Linguistics (CILC-09), Murcia May 7-9 2009]
*Bick, Eckhard (2009). A Dependency Constraint Grammar for Esperanto. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8,
*Bick, Eckhard (2009). Introducing probabilistic information in Constraint Grammar parsing. Proceedings of Corpus Linguistics 2009, Liverpool, UK. Electronically published at ... (forthcoming)
*Bick, Eckhard & Valverde, Pilar (2009). Automatic Semantic Role Annotation for Spanish. Proceedings of NODALIDA 2009. NEALT Proceedings Series Vol. 4.
*Bick, Eckhard (2007). Automatic Semantic Role Annotation for Portuguese. In: Proceedings of TIL 2007 - 5th Workshop on Information and Human Language Technology / Anais do XXVII Congresso da SBC (Rio de Janeiro, July 5-6, 2007).
*Bick, Eckhard (2007), "Functional Aspects in Portuguese NER". In: Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área..
*Bick, Eckhard (2007), Dan2eng: Wide-Coverage Danish-English Machine Translation, In: Bente Maegaard (ed.), Proceedings of Machine Translation Summit XI, 10-14. Sept. 2007, Copenhagen, Denmark.
*Bick, Eckhard (2007), Tagging and Parsing an Artificial Language: An Annotated Web-Corpus of Esperanto, In: Proceedings of Corpus Linguistics 2007, Birmingham, UK. Electronically published at (http://ucrel.lancs.ac.uk/publications/CL2007/, Nov. 2007)
*Bick, Eckhard & Nygaard, Lars (2007). Using Danish as a CG Interlingua. A Wide-Coverage Norwegian-English Machine Translation System. In: Proceedings of the 16th Nordic Conference of Computational Linguistics. Tartu, Estonia. ISBN978-9985-4-0514-7
*Bick, Eckhard (2006), Noun Sense Tagging: Semantic Prototype Annotation of a Portuguese Treebank, In: Hajic, Jan & Nivre, Joakim (red.), Proceedings of the Fifth Workshop on Treebanks and Linguistic Theories (December 1-2, 2006, Prague, Czech Republic),
*Bick, Eckhard (2006), A Constraint Grammar-Based Parser for Spanish. In: Proceedings of TIL 2006 - 4th Workshop on Information and Human Language Technology (Ribeirão Preto, October 27-28, 2006).
*Bick, Eckhard (2006), "Functional Aspects in Portuguese NER", in: Renata Vieira et al. (eds.) Computational Processing of the Portuguese Language (Proceedings of PROPOR 2006, Itatiaia, May 15th-17th, 2006),
*Bick, Eckhard (2006), "A Constraint Grammar Based Spellchecker for Danish with a Special Focus on Dyslexics". In: Suominen, Mickael et.al. (ed.) A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday. Special Supplement to SKY Journal of Linguistics, Vol. 19 (ISSN 1796-279X),
*Bick, Eckhard (2005), Turning Constraint Grammar Data into Running Dependency Treebanks, In: Civit, Montserrat & Kübler, Sandra & Martí, Ma. Antònia (red.), Proceedings of TLT 2005 (4th Workshop on Treebanks and Linguistic Theory, Barcelona, December 9th - 10th, 2005),
*Bick, Eckhard (2005), Gramática Constritiva na Análise Automática de Sintaxe Portuguesa. In: Berber Sardinha, Tony (ed.), A Língua Portuguesa no Computador [The Portuguese Language on the Computer]. Campinas: Mercado de Letras, São Paulo:
*Bick, Eckhard (2004), PaNoLa: Integrating Constraint Grammar and CALL, In: Henrik Holmboe (red.), Nordic Language Technology, Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000-2004 (Yearbook 2003).
*Bick, Eckhard (2004), Parsing and evaluating the French Europarl corpus, In: Patrick Paroubek, Isabelle Robba & Anne Vilnat (red.): Méthodes et outils pour l'évaluation des analyseurs syntaxiques (Journée ATALA, May 15, 2004).
*Bick, Eckhard (2003). "A Constraint Grammar Based Question-Answering System for Portuguese". In: Fernando Moura Pires & Salvador (eds.) Progress in Artificial Intelligence (Proceedings of EPIA'2003, Beja, Dec. 2003)
*Bick, Eckhard (2003), A CG & PSG Hybrid Approach to Automatic Corpus Annotation, in Kiril Simow & Petya Osenova: Proceedings of SProLaC2003 (at Corpus Linguistics 2003, Lancaster),
*Bick, Eckhard (2001), En Constraint Grammar Parser for Dansk, in Peter Widell & Mette Kunøe (eds.) 8. Møde om Udforskningen af Dansk Sprog, 12.-13. oktober 2000, pp. 40-50, Århus University
*Bick, Eckhard (2000), The Parsing System Palavras - Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework, Aarhus: Aarhus University Press (preprint version) -- dr.phil. thesis (cf. the Disputatio for an introduction)
*Bick, Eckhard (1998), Tagging Speech Data - Constraint Grammar Analysis of Spoken Portuguese, in: Proceedings of the 17th Scandinavian Conference of Linguistics, (Odense 1998)
*Bick, Eckhard (1996), Automatic Parsing of Portuguese. In García, Laura Sánchez (ed.), Anais / II Encontro para o Processamento Computacional de Português Escrito e Falado. Curitiba: CEFET-PR.


'''Other publications concerning Constraint Grammar'''


*Antonsen, Lene & Huhmarniemi, Saara & Trosterud, Trond (2009). Constraint Grammar in Dialogue systems. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp.13-21. Tartu: Tartu University Library.
*Antonsen, Lene & Huhmarniemi, Saara & Trosterud, Trond (2009). Interactive pedagogical programs based on Constraint Grammar. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp.10-17. Tartu: Tartu University Library.
*Lindström, Liina & Müürisep, Kaili (2009). Parsing Corpus of Estonian Dialects. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp. 22-29. Tartu: Tartu University Library.
*Trosterud, Trond (2009). A Constraint Grammar for Faroese. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp.1-7. Tartu: Tartu University Library.
*Dhonnchadha, E. Uí (2006). "A Part-of-speech tagger for Irish using Finite-State Morphology and Constraint Grammar Disambiguation". In: Proceedings of LREC'06. Genova, Italy.
*Atserias, J. et al. (2006). "FreeLing 1.3: Syntactic and semantic services in an open-source NLP library". In: Proceedings of LREC'06. Genoa, Italy (2006)
*Hurskainen, Arvi (2006), Constraint Grammar in Unconventional Use: Handling complex Swahili idioms and proverbs. In: Suominen, Mickael et.al. (ed.) A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday. Special Supplement to SKY Journal of Linguistics, Vol. 19, pp. 397-406. Turku: The Linguistic Association of Finland
*Müürisep, Kaili and Uibo, Heli. "Shallow Parsing of Spoken Estonian Using Constraint Grammar". In: P.J.Henriksen & P.R.Skadhauge, Proceedings of NODALIDA-2005 special session on treebanking. Copenhagen Studies in Language #33/2006.
*Müürisep, Kaili et al. (2003). A New Language for Constraint Grammar: Estonian. In: International Conference Recent Advances in Natural Language Processing. Proceedings. Borovets, Bulgaria, 10-12 September 2003,
*Hagen, Kristin & Lane, Pia. & Trosterud, Trond (2001). "En grammatikkontrol for bokmål". In: Kjell Ivar Vannebo & Helge Sandøy (eds.): Språkknyt 3-2001.
*Hagen, K., Johannessen, J. B., Nøklestad, A.(2000). "A Constraint-Based Tagger for Norwegian". In: Lindberg, C.-E. og Lund, S.N. (red.): 17th Scandinavian Conference of Linguistics, Odense. Odense Working Papers in Language and Communication, No. 19, vol I.
*Arppe, Antti (2000). "Developing a grammar checker for Swedish". In: Nordgård, T. (ed.) Nodalida'99 Proceedings. Department of Linguistics, University of Trondheim.
*Birn, Jussi (2000). "Detecting grammar errors with Lingsoft's Swedish grammar checker". In: Nordgård, T. (ed.) Nodalida'99 Proceedings. Department of Linguistics, University of Trondheim.
*Lager, Torbjörn (1999). "The µ-TBL System: Logic Programming Tools for Transformation-Based Learning". In: Proceedings of CoNLL'99, Bergen.
*Padró, L.(1996). "POS Tagging Using Relaxation Labelling". In: Proceedings of COLING '96. Copenhagen, Denmark.
*Hurskainen, Arvi (1996). "Disambiguation of morphological analysis in Bantu languages". In: Proceedings of the 16th conference on Computational Linguistics. Copenhagen:ACL. Vol.1,
*Chanod, Jean-Pierre & Tapanainen, Pasi, "Tagging French - comparing a statistical and a constraint- based method", adapted from: Statistical and Constraint- based Taggers for French, Technical report MLTT-016, Rank Xerox Research Centre, Grenoble, 1994
*Voutilainen, Atro, Juha Heikkilä, and Arto Anttila (1992). "Constraint Grammar of English - A Performance-Oriented Introduction". No. 21, Publications of the Department of General Linguistics, University of Helsinki.



[[Category:Documentation]]

Latest revision as of 20:55, 2 April 2021

En français

The installation instructions for Apertium and language pairs below refer to the Ubuntu distribution. For other Linux distributions or other operating systems, see the general Installation page.

Download

Apertium

Sourced from Install Apertium core using packaging. First, remove any Apertium packages you have installed from operating system repositories. They will be out-of-date, sometimes by years.

Add the repository,

# Pick one:

# Nightly, unstable, new, almost always use this:
curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | sudo bash

# Release, stable, old:
curl -sS https://apertium.projectjj.com/apt/install-release.sh | sudo bash


You should see messages.

Install dev tools,

sudo apt-get -f install apertium-all-dev

About the Debian repository install

Check that the script installed the Apertium repository details,

apt-cache policy | grep apertium

Unfortunately, due to the seamless upgrading of Debian packaging, it is difficult to see which packages the new repository has added, and where. Even Synaptic, the wonder GUI, offers no way to find out. You could try this brute-force command line,

find /var/lib/apt/lists/ | grep 'projectjj.*Packages' | xargs grep -h Package

This will, if nothing else, tell you a lot about the byways of the Apertium project.
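
A slightly tidier variant of the search can be sketched as follows. It assumes that the package lists are stored uncompressed and that every entry begins with a `Package:` field; the function name `list_repo_packages` and its optional directory argument are illustrative only, not part of any Apertium tool.

```shell
# List the unique package names found in the projectjj package lists.
# Usage: list_repo_packages [LISTS_DIR]   (defaults to /var/lib/apt/lists)
list_repo_packages() {
  lists_dir="${1:-/var/lib/apt/lists}"
  # Only look at list files whose name mentions projectjj and ends in Packages.
  find "$lists_dir" -type f -name '*projectjj*Packages' \
    -exec grep -h '^Package: ' {} + |
    sed 's/^Package: //' |
    sort -u
}

list_repo_packages
```

Restricting the match to lines starting with `Package: ` leaves out other fields that merely contain the word, so the output is one package name per line.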


Constraint grammar

To use CG we must have lttoolbox (we have it), apertium (we have it too) and ICU (we have to install it now).

To install ICU on Ubuntu, open a terminal and run:

   sudo apt-get install libicu-dev

Now we can install apertium, lttoolbox and CG.

Install

Apertium

Before installing apertium we have to install lttoolbox (which was downloaded together with apertium). To do that, run:

cd apertium

cd lttoolbox/

PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh

make

sudo make install

sudo ldconfig


The terminal will ask for your password again ([sudo] password for user:). Type it and press Enter. Wait until the prompt user@ubuntu:~/apertium/lttoolbox$ reappears, then run:

cd ..

cd apertium/

PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh

make

sudo make install

sudo ldconfig

This starts installing apertium, which takes a few minutes. When the terminal shows

vasil@ubuntu:~/apertium/apertium$ sudo ldconfig

vasil@ubuntu:~/apertium/apertium$

the installation is complete.


Constraint grammar


To install CG, open a terminal and run:

$ svn co --username anonymous --password anonymous http://beta.visl.sdu.dk/svn/visl/tools/vislcg3/trunk vislcg3

$ cd vislcg3

$ sh autogen.sh --prefix=<prefix>

$ make

$ make install

If you run the install step with sudo, it will ask for your password ([sudo] password for user:). Type it and press Enter.

We are ready.
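
Before moving on, it can help to confirm that the tools used in the rest of this page are reachable. The following is a minimal sketch; it assumes the binaries were installed somewhere on your PATH (which depends on the prefix you chose), and the helper name `check_tools` is hypothetical.

```shell
# Report whether each tool used in this HOWTO can be found on the PATH.
check_tools() {
  for tool in lt-proc cg-comp cg-proc vislcg3; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%s: found at %s\n' "$tool" "$(command -v "$tool")"
    else
      printf '%s: not found\n' "$tool"
    fi
  done
}

check_tools
```

If a tool is reported as not found, check the installation prefix, or call the binary by its full path as the examples below do with ./cg-comp.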

Usage

For the examples below, we use the language pair apertium-es-ca, but the principles should be applicable to any language pair. First we have to compile this pair. Go to the directory where you installed Apertium, then run:

   cd apertium/apertium-es-ca
   sh autogen.sh
   make

Let's check that what we installed is working. First run:

   echo "vino a la playa" | lt-proc es-ca.automorf.bin

This should give you:

   ^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ ^playa/playa<n><f><sg>$
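
To see at a glance where the ambiguity lies, the readings of each lexical unit can be counted. The sketch below works on the sample output above, pasted in as a string, so it runs without Apertium installed. The function name `count_readings` is made up for illustration, and the parsing is deliberately naive: it assumes that no lexical unit contains an escaped `/` or `$`.

```shell
# Analyser output copied from the example above.
analysis='^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ ^playa/playa<n><f><sg>$'

# Print "surface form<TAB>number of readings" for every lexical unit.
count_readings() {
  # grep -o puts each ^...$ unit on its own line; awk then splits it on "/":
  # field 1 is the surface form, every further field is one reading.
  grep -o '\^[^$]*\$' |
    awk -F'/' '{ sub(/^\^/, "", $1); print $1 "\t" (NF - 1) }'
}

printf '%s\n' "$analysis" | count_readings
```

For the sample sentence this reports two readings each for vino and la, and one each for a and playa.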

Here we have two ambiguities: one between a noun and a verb, and another between a determiner and a pronoun. We can write rules that choose between the competing readings. First we define our categories; these can be tags, wordforms or lemmas. It might help to think of them as "coarse tags", which may involve a set of fine tags or lemmas. So, create a file grammar.txt, and add the following text:

   DELIMITERS = "<$.>" ;
   LIST NOUN = n;
   LIST VERB = vblex;
   LIST DET = det;
   LIST PRN = prn;
   LIST PREP = pr;
   SECTION

The first rule states: "When the current lexical unit can be a pronoun or a determiner, and it is followed on the right by a lexical unit which could be a noun, choose the determiner". We have to add this rule to the file, and compile using cg-comp:

rule:


   # 1
   SELECT DET IF
           (0 DET)
           (0 PRN)
           (1 NOUN) ;

compile with:

   $ ./cg-comp grammar.txt grammar.bin
   Sections: 1, Rules: 1, Sets: 6, Tags: 7
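
The file creation and compilation steps can also be scripted. The sketch below writes the set definitions and rule 1 exactly as given above into grammar.txt, then compiles only if `cg-comp` is found. Calling `cg-comp` from the PATH, rather than as ./cg-comp, is an assumption about your installation.

```shell
# Write the grammar developed so far: set definitions plus rule 1.
# The quoted 'EOF' stops the shell from expanding the "$" in DELIMITERS.
cat > grammar.txt <<'EOF'
DELIMITERS = "<$.>" ;
LIST NOUN = n;
LIST VERB = vblex;
LIST DET = det;
LIST PRN = prn;
LIST PREP = pr;
SECTION

# 1
SELECT DET IF
        (0 DET)
        (0 PRN)
        (1 NOUN) ;
EOF

# Compile only when cg-comp is available (assumed to be on the PATH here).
if command -v cg-comp >/dev/null 2>&1; then
  cg-comp grammar.txt grammar.bin
else
  echo "cg-comp not found; grammar.txt written but not compiled" >&2
fi
```

Keeping the grammar in a script like this makes it easy to re-run the compile step each time a rule is added.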

To try out what we have done, run:

   $ echo "vino a la playa" | lt-proc es-ca.automorf.bin |  cg-proc grammar.bin
   ^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$


The second rule states: "When the current lexical unit can be a noun or a verb, if the subsequent two units to the right are preposition and determiner, remove the noun reading." Now we have to add this rule:


rule:

   # 2
   REMOVE NOUN IF
           (0 NOUN)
           (0 VERB)
           (1 PREP)
           (2 DET) ;

re-compile the grammar and test:

   $ echo "vino a la playa" | lt-proc es-ca.automorf.bin |  cg-proc grammar.bin
   ^vino/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$

The third rule states "Remove interjection if the preceding word is a modal verb."