Beginner's Constraint Grammar HOWTO
[[Installation et fonctionnement de Constraint Grammar|En français]]
=General for CG=

Constraint Grammar (CG) is a methodological paradigm for [http://en.wikipedia.org/wiki/Natural_language_processing natural language processing] (NLP). Linguist-written, context-dependent rules are compiled into a grammar that assigns grammatical tags ("readings") to words or other tokens in running text. Typical tags address [http://en.wikipedia.org/wiki/Lemmatisation lemmatisation] (lexeme or base form), [http://en.wikipedia.org/wiki/Inflexion inflexion], [http://en.wikipedia.org/wiki/Derivation_%28linguistics%29 derivation], [http://en.wikipedia.org/wiki/Syntactic_function syntactic function], dependency, [http://en.wikipedia.org/wiki/Valency_%28linguistics%29 valency], [http://en.wikipedia.org/wiki/Case_role case roles], [http://en.wikipedia.org/wiki/Semantic semantic] type etc. Each rule either adds, removes, selects or replaces a tag or a set of grammatical tags in a given sentence context. Context conditions can be linked to any tag or tag set of any word anywhere in the sentence, either locally (defined distances) or globally (undefined distances). Context conditions in the same rule may be linked, i.e. conditioned upon each other, negated, or blocked by interfering words or tags. Typical CGs consist of thousands of rules that are applied set-wise in progressive steps, covering ever more advanced levels of analysis. Within each level, safe rules are used before heuristic rules, and no rule is allowed to remove the last reading of a given kind, thus providing a high degree of robustness.

The Constraint Grammar concept was launched by [http://en.wikipedia.org/wiki/Fred_Karlsson Fred Karlsson] in 1990 (Karlsson 1990; Karlsson et al., eds, 1995), and CG taggers and parsers have since been written for a large variety of languages, routinely achieving accuracy F-scores for PoS (word class) of over 99%. A number of syntactic CG systems have reported F-scores of around 95% for syntactic function labels. CG systems can be used to create full syntactic trees in other formalisms by adding small, non-terminal based [http://en.wikipedia.org/wiki/Phrase_structure_grammar phrase structure grammars] or [http://en.wikipedia.org/wiki/Dependency_grammar dependency grammars], and a number of corpus/treebank projects have used Constraint Grammar for automatic annotation. CG methodology has also been used in a number of language technology applications, such as [http://en.wikipedia.org/wiki/Spell_checker spell checkers] and [http://en.wikipedia.org/wiki/Machine_translation machine translation] systems.

=List of CG systems sorted by language=

'''Free software'''

[http://beta.visl.sdu.dk/cg3.html VISL CG-3] Constraint Grammar compiler/parser

*[http://en.wikipedia.org/wiki/Northern_Sami_language North] and [http://en.wikipedia.org/wiki/Lule_Sami_language Lule] Sami, [http://en.wikipedia.org/wiki/Faroese_language Faroese], [http://en.wikipedia.org/wiki/Komi_language Komi] and [http://en.wikipedia.org/wiki/Greenlandic_language Greenlandic] from the [http://en.wikipedia.org/wiki/University_of_Troms%C3%B8 University of Tromsø]
** Fred Karlsson's original Finnish FinCG is also available from the University of Tromsø under the GPL.
*[http://en.wikipedia.org/wiki/Norwegian_language Norwegian] Nynorsk and Bokmål online, Oslo-Bergen tagger
*[http://en.wikipedia.org/wiki/Breton_language Breton], Welsh, Irish Gaelic and [http://en.wikipedia.org/wiki/Norwegian_language Norwegian] (converted from the above) in Apertium (see CG in Apertium)

'''Non-free software'''

*Basque [http://paginaspersonales.deusto.es/abaitua/konzeptu/nlp/MGnag.html Basque]
*Catalan [http://mutis.upf.es/cgi-bin/catcg/demo.pl CATCG]
*Danish [http://beta.visl.sdu.dk/constraint_grammar.html/ DanGram]
*English [http://www2.lingsoft.fi/cgi-bin/engcg ENGCG], ENGCG-2, [http://beta.visl.sdu.dk/constraint_grammar.html/ VISL-ENGCG]
*Esperanto [http://beta.visl.sdu.dk/constraint_grammar.html/ EspGram]
*French [http://beta.visl.sdu.dk/constraint_grammar.html/ FrAG]
*German [http://beta.visl.sdu.dk/constraint_grammar.html/ GerGram]
*Irish [https://www.cs.tcd.ie/Elaine.UiDhonnchadha/irish.htm online]
*Italian [http://beta.visl.sdu.dk/visl/it/parsing/automatic/parse.php ItaGram]
*Spanish [http://beta.visl.sdu.dk/constraint_grammar.html/ HISPAL]
*Swedish [http://www2.lingsoft.fi/doc/swecg/intro/ SWECG]
*Swahili
*Portuguese [http://beta.visl.sdu.dk/constraint_grammar.html/ PALAVRAS]

=Method of annotation=

Both the morphological and syntactic analysers use rule-based linguistic descriptions. The system works in the following way:

1. Tokenisation;

2. Lookup of morphological tags;

* Lexical component;

* Guesser;

3. Resolution of morphological ambiguities;

4. Lookup of syntactic tags;

5. Resolution of syntactic ambiguities.

=Tokenisation=

The tokeniser identifies punctuation and multiword units, and splits enclitic forms into grammatical words.

=Morphological lookup=

This process begins with a lexical analysis based on a large lexicon including all inflected and central derived word forms. The lexical analyser assigns all possible morphological analyses to each word that is in the lexicon, and the remaining words are assigned an analysis by means of the guesser (a heuristic rule-based module). These rules are mainly governed by word shape, and if none of them apply, then a nominal analysis is given.

=Resolution of morphological ambiguities=

The rule-based Constraint Grammar parser is used to resolve some of the ambiguities at this stage. The constraints are partial paraphrases of form definitions of syntactic constructs such as the noun phrase. The English grammar, for example, contains about 1,200 grammar-based constraints, plus 200 heuristic constraints.

=Syntactic lookup=

All possible syntactic tags are introduced for each word. This could, in some cases, mean that more than ten alternatives are given for one morphological reading.

=Resolution of syntactic ambiguities=

The parser finally consults a syntactic disambiguation grammar. The English version of the Constraint Grammar contains 800 syntactic constraints, of a similar form to the rules at the morphological resolution stage.

=Syntactic tags=

The English version of the Constraint Grammar marks the syntactic functions shown in the table below.

'''@+FAUXV''' finite auxiliary verb

'''@-FAUXV''' nonfinite auxiliary verb

'''@+FMAINV''' finite main verb

'''@-FMAINV''' nonfinite main verb

'''@SUBJ''' subject

'''@F-SUBJ''' formal subject

'''@OBJ''' object

'''@I-OBJ''' indirect object

'''@PCOMPL-S''' subject complement

'''@PCOMPL-O''' object complement

'''@APP''' apposition

'''@NPHR''' stray nominal

'''@N''' title

'''@O-ADVL''' object adverbial

'''@ADVL''' adverbial

'''@DN>''' determiner

'''@NN>''' premodifying noun

'''@AN>''' premodifying adjective

'''@QN>''' premodifying quantifier

'''@GN>''' premodifying genitive

'''@AD-A>''' premodifying ad-adjective

'''@<AD-A''' postmodifying ad-adjective

'''@<NOM-FMAINV''' postmodifying nonfinite verb

'''@<NOM''' other postmodifier

'''@<P-FMAINV''' nonfinite verb as complement of preposition

'''@<P''' other complement of preposition

'''@CC''' coordinator

'''@CS''' subordinator

'''@INFMARK''' infinitive marker

(ENGCG tags)
 
 
 
=Example=
 
 
 
As mentioned above, the syntactic tags are distinguished by the use of the `@' sign. The analysis is dependency based, but only partially. As can be seen in the table above, dependency relations are shown by the use of the left and right angle brackets, showing that a word is dependent on another either to the right or to the left. In the example below, Karlsson is marked as `@<P', meaning that it is the complement of a preposition found before Karlsson.
 
 
 
"<*i>"
 
*"i" <*> <NonMod> PRON PERS NOM SG1 SUBJ @SUBJ
 
"<started>"
 
*"start" <SV> <SVO> <P/on>V PAST VFIN @+FMAINV
 
"<work>"
 
*"work" N NOM SG @OBJ
 
"<on>"
 
*"on" PREP @ADVL
 
"<an>"
 
*"an" <Indef> DET CENTRAL ART SG @DN>
 
"<*english>"
 
*"english" <*> <Nominal> A ABS @AN>
 
"<description>"
 
*"description" N NOM SG @<P
 
"<within>"
 
*"within" PREP @<NOM @ADVL
 
"<the>"
 
*"the" <Def> DET CENTRAL ART SG/PL @DN>
 
"<*constraint>"
 
*"constraint" <*> N NOM SG @NN>
 
"<*grammar>"
 
*"grammar" <*> N NOM SG @NN>
 
"<framework>"
 
*"framework" N NOM SG @<P
 
"<proposed>"
 
*"propose" <Vcog> <SVO> <SV> PCP2 @<NOM-FMAINV
 
"<by>"
 
*"by" PREP @ADVL
 
"<*karlsson>"
 
*"karlsson" <*> <Proper> N NOM SG @<P
 
"<$[>"
 
"<1990>"
 
*"1990" <1900> NUM CARD @ADVL
 
"<$;>"
 
"<1994a>"
 
*"1994a" <1994a> NUM CARD @ADVL
 
 
(ENGCG output)
 
 
 
 
 
=Publications=
 
 
 
 
 
'''Early general Constraint Grammar publications:'''
 
 
*Karlsson, Fred (1990). "Constraint grammar as a framework for parsing running text". In: Karlgren, Hans (ed.), Proceedings of 13th International Conference on Computational Linguistics, volume 3, pp. 168-173, Helsinki, Finland.
 
*Karlsson et al. (1995), "Constraint Grammar - A Language-Independent System for Parsing Unrestricted Text". Mouton de Gruyter
 
*Tapanainen, Pasi (1996). "The Constraint Grammar Parser CG-2". No 27, Publications of the Department of General Linguistics, University of Helsinki.
 
 
 
'''Some publications concerning VISL Constraint Grammar systems:'''
 
 
 
*Valverde, Pilar & Bick, Eckhard (2010). "A Web Corpus of Spanish Automatically Annotated with Semantic Roles". In: Sánchez, A. & M. Almela. 2010. A Mosaic of Corpus Linguistics. Selected Approaches. Berlin/Frankfurt: Peter Lang. [Oral presentation at: 1st International Conference on Corpus Linguistics (CILC-09), Murcia May 7-9 2009]
 
*Bick, Eckhard (2009). A Dependency Constraint Grammar for Esperanto. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8,
 
*Bick, Eckhard (2009). Introducing probabilistic information in Constraint Grammar parsing. Proceedings of Corpus Linguistics 2009, Liverpool, UK. Electronically published at ... (forthcoming)
 
*Bick, Eckhard & Valverde, Pilar (2009). Automatic Semantic Role Annotation for Spanish. Proceedings of NODALIDA 2009. NEALT Proceedings Series Vol. 4.
 
*Bick, Eckhard (2007). Automatic Semantic Role Annotation for Portuguese. In: Proceedings of TIL 2007 - 5th Workshop on Information and Human Language Technology / Anais do XXVII Congresso da SBC (Rio de Janeiro, July 5-6, 2007).
 
*Bick, Eckhard (2007), "Functional Aspects in Portuguese NER". In: Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área..
 
*Bick, Eckhard (2007), Dan2eng: Wide-Coverage Danish-English Machine Translation, In: Bente Maegaard (ed.), Proceedings of Machine Translation Summit XI, 10-14. Sept. 2007, Copenhagen, Denmark.
 
*Bick, Eckhard (2007), Tagging and Parsing an Artificial Language: An Annotated Web-Corpus of Esperanto, In: Proceedings of Corpus Linguistics 2007, Birmingham, UK. Electronically published at (http://ucrel.lancs.ac.uk/publications/CL2007/, Nov. 2007)
 
*Bick, Eckhard & Nygaard, Lars (2007). Using Danish as a CG Interlingua. A Wide-Coverage Norwegian-English Machine Translation System. In: Proceedings of the 16th Nordic Conference of Computational Linguistics. Tartu, Estonia. ISBN 978-9985-4-0514-7
 
*Bick, Eckhard (2006), Noun Sense Tagging: Semantic Prototype Annotation of a Portuguese Treebank, In: Hajic, Jan & Nivre, Joakim (red.), Proceedings of the Fifth Workshop on Treebanks and Linguistic Theories (December 1-2, 2006, Prague, Czech Republic),
 
*Bick, Eckhard (2006), A Constraint Grammar-Based Parser for Spanish. In: Proceedings of TIL 2006 - 4th Workshop on Information and Human Language Technology (Ribeirão Preto, October 27-28, 2006).
 
*Bick, Eckhard (2006), "Functional Aspects in Portuguese NER", in: Renata Vieira et al. (eds.) Computational Processing of the Portuguese Language (Proceedings of PROPOR 2006, Itatiaia, May 15th-17th, 2006),
 
*Bick, Eckhard (2006), "A Constraint Grammar Based Spellchecker for Danish with a Special Focus on Dyslexics". In: Suominen, Mickael et al. (ed.) A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday. Special Supplement to SKY Journal of Linguistics, Vol. 19 (ISSN 1796-279X),
 
*Bick, Eckhard (2005), Turning Constraint Grammar Data into Running Dependency Treebanks, In: Civit, Montserrat & Kübler, Sandra & Martí, Ma. Antònia (red.), Proceedings of TLT 2005 (4th Workshop on Treebanks and Linguistic Theory, Barcelona, December 9th - 10th, 2005),
 
*Bick, Eckhard (2005), Gramática Constritiva na Análise Automática de Sintaxe Portuguesa. In: Berber Sardinha, Tony (ed.), A Língua Portuguesa no Computador [The Portuguese Language on the Computer]. Campinas: Mercado de Letras, São Paulo:
 
*Bick, Eckhard (2004), PaNoLa: Integrating Constraint Grammar and CALL, In: Henrik Holmboe (red.), Nordic Language Technology, Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000-2004 (Yearbook 2003).
 
*Bick, Eckhard (2004), Parsing and evaluating the French Europarl corpus, In: Patrick Paroubek, Isabelle Robba & Anne Vilnat (red.): Méthodes et outils pour l'évaluation des analyseurs syntaxiques (Journée ATALA, May 15, 2004).
 
*Bick, Eckhard (2003). "A Constraint Grammar Based Question-Answering System for Portuguese". In: Fernando Moura Pires & Salvador (eds.) Progress in Artificial Intelligence (Proceedings of EPIA'2003, Beja, Dec. 2003)
 
*Bick, Eckhard (2003), A CG & PSG Hybrid Approach to Automatic Corpus Annotation, in Kiril Simow & Petya Osenova: Proceedings of SProLaC2003 (at Corpus Linguistics 2003, Lancaster),
 
*Bick, Eckhard (2001), En Constraint Grammar Parser for Dansk, in Peter Widell & Mette Kunøe (eds.) 8. Møde om Udforskningen af Dansk Sprog, 12.-13. oktober 2000, pp. 40-50, Århus University
 
*Bick, Eckhard (2000), The Parsing System Palavras - Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework, Aarhus: Aarhus University Press (preprint version) -- dr.phil. thesis (cf. the Disputatio for an introduction)
 
*Bick, Eckhard (1998), Tagging Speech Data - Constraint Grammar Analysis of Spoken Portuguese, in: Proceedings of the 17th Scandinavian Conference of Linguistics, (Odense 1998)
 
*Bick, Eckhard (1996), Automatic Parsing of Portuguese. In García, Laura Sánchez (ed.), Anais / II Encontro para o Processamento Computacional de Português Escrito e Falado. Curitiba: CEFET-PR.
 
 
 
'''Other publications concerning Constraint Grammar'''
 
 
 
*Antonsen, Lene & Huhmarniemi, Saara & Trosterud, Trond (2009). Constraint Grammar in Dialogue systems. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp.13-21. Tartu: Tartu University Library.
 
*Antonsen, Lene & Huhmarniemi, Saara & Trosterud, Trond (2009). Interactive pedagogical programs based on Constraint Grammar. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp.10-17. Tartu: Tartu University Library.
 
*Lindström, Liina & Müürisep, Kaili (2009). Parsing Corpus of Estonian Dialects. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp. 22-29. Tartu: Tartu University Library.
 
*Trosterud, Trond (2009). A Constraint Grammar for Faroese. Constraint Grammar Workshop at NODALIDA 2009, Odense. NEALT Proceedings Series, Vol 8, pp.1-7. Tartu: Tartu University Library.
 
*Dhonnchadha, E. Uí (2006). "A Part-of-speech tagger for Irish using Finite-State Morphology and Constraint Grammar Disambiguation". In: Proceedings of LREC'06. Genova, Italy.
 
*Atserias, J. et al. (2006). "FreeLing 1.3: Syntactic and semantic services in an open-source NLP library". In: Proceedings of LREC'06. Genoa, Italy (2006)
 
*Hurskainen, Arvi (2006), Constraint Grammar in Unconventional Use: Handling complex Swahili idioms and proverbs. In: Suominen, Mickael et al. (ed.) A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday. Special Supplement to SKY Journal of Linguistics, Vol. 19, pp. 397-406. Turku: The Linguistic Association of Finland
 
*Müürisep, Kaili and Uibo, Heli. "Shallow Parsing of Spoken Estonian Using Constraint Grammar". In: P.J.Henriksen & P.R.Skadhauge, Proceedings of NODALIDA-2005 special session on treebanking. Copenhagen Studies in Language #33/2006.
 
*Müürisep, Kaili et al. (2003). A New Language for Constraint Grammar: Estonian. In: International Conference Recent Advances in Natural Language Processing. Proceedings. Borovets, Bulgaria, 10-12 September 2003,
 
*Hagen, Kristin & Lane, Pia. & Trosterud, Trond (2001). "En grammatikkontrol for bokmål". In: Kjell Ivar Vannebo & Helge Sandøy (eds.): Språkknyt 3-2001.
 
*Hagen, K., Johannessen, J. B., Nøklestad, A.(2000). "A Constraint-Based Tagger for Norwegian". In: Lindberg, C.-E. og Lund, S.N. (red.): 17th Scandinavian Conference of Linguistic, Odense. Odense Working Papers in Language and Communication, No. 19, vol I.
 
*Arppe, Antti (2000). "Developing a grammar checker for Swedish". In: Nordgård, T. (ed.) Nodalida'99 Proceedings. Department of Linguistics, University of Trondheim.
 
*Birn, Jussi (2000). "Detecting grammar errors with Lingsoft's Swedish grammar checker". In: Nordgård, T. (ed.) Nodalida'99 Proceedings. Department of Linguistics, University of Trondheim.
 
*Lager, Torbjörn (1999). "The µ-TBL System: Logic Programming Tools for Transformation-Based Learning". In: Proceedings of CoNLL'99, Bergen.
 
*Padró, L.(1996). "POS Tagging Using Relaxation Labelling". In: Proceedings of COLING '96. Copenhagen, Denmark.
 
*Hurskainen, Arvi (1996). "Disambiguation of morphological analysis in Bantu languages". In: Proceedings of the 16th conference on Computational Linguistics. Copenhagen:ACL. Vol.1,
 
*Chanod, Jean-Pierre & Tapanainen, Pasi, "Tagging French - comparing a statistical and a constraint- based method", adapted from: Statistical and Constraint- based Taggers for French, Technical report MLTT-016, Rank Xerox Research Centre, Grenoble, 1994
 
*Voutilainen, Atro, Juha Heikkilä, and Arto Anttila (1992). "Constraint Grammar of English - A Performance-Oriented Introduction". No. 21, Publications of the Department of General Linguistics, University of Helsinki.
 
 
 
 
[[Category:Documentation]]
 

Latest revision as of 20:55, 2 April 2021

En français

The installation instructions for Apertium and the language pairs below assume the Ubuntu distribution. For other Linux distributions or other operating systems, see the general Installation page.

Download

Apertium

Sourced from Install Apertium core using packaging.

First, remove any Apertium packages you have installed from operating system repositories. They will be out of date, sometimes by years.

Add the repository,

# Pick one:

# Nightly, unstable, new, almost always use this:
curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | sudo bash

# Release, stable, old:
curl -sS https://apertium.projectjj.com/apt/install-release.sh | sudo bash


You should see messages from the script as it sets up the repository.

Install dev tools,

sudo apt-get -f install apertium-all-dev
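
Once the packages are installed, it can help to confirm that the command-line tools used later in this HOWTO are actually on the PATH. The following is a minimal sketch, not part of the official instructions; the function name check_tools is made up for illustration, and the tool names are the ones this page uses.

```shell
# Report, for each named tool, whether it can be found on the PATH.
# check_tools is a hypothetical helper, not an Apertium command.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
    fi
  done
}

# Tools used in the Usage section of this page.
check_tools lt-proc apertium cg-comp cg-proc
```

Any tool reported as missing probably means the corresponding package did not install, or that it was installed under a prefix that is not on the PATH.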

About the Debian repository install

Check that the script installed the Apertium repository details:

apt-cache policy | grep apertium

Unfortunately, because Debian packaging upgrades seamlessly, it is difficult to see which packages the new repository has added, and where. Even Synaptic, the wonder GUI, offers no way through. You could try this brute-force command line:

find /var/lib/apt/lists/ | grep 'projectjj.*Packages' | xargs grep -h Package

This will, if nothing else, tell you a lot about the byways of the Apertium project.


Constraint grammar

To use CG we need lttoolbox (already installed), apertium (also installed) and ICU (which we install now).

To install ICU on Ubuntu, open a terminal and run:

   sudo apt-get install libicu-dev

Now we can install lttoolbox, apertium and CG.

Install

Apertium

Before installing apertium we have to install lttoolbox (which was downloaded together with apertium). To do that, run:

cd apertium

cd lttoolbox/

PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh

make

sudo make install

sudo ldconfig


The terminal will ask for your password again ([sudo] password for user:). Type it and press Enter. Wait until the prompt user@ubuntu:~/apertium/lttoolbox$ returns, then run:

cd ..

cd apertium/

PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh

make

sudo make install

sudo ldconfig

This builds and installs apertium; it takes a few minutes. When the terminal shows

vasil@ubuntu:~/apertium/apertium$ sudo ldconfig

vasil@ubuntu:~/apertium/apertium$

the process is finished.


Constraint grammar


To install CG, open a terminal and run:

$ svn co --username anonymous --password anonymous http://beta.visl.sdu.dk/svn/visl/tools/vislcg3/trunk vislcg3

$ cd vislcg3

$ sh autogen.sh --prefix=<prefix>

$ make

$ sudo make install

sudo will ask for your password ([sudo] password for user:). Type it and press Enter.

CG is now installed.

Usage

For the examples below, we use the language pair apertium-es-ca, but the principles should be applicable to any language pair. First we have to compile this pair. Go into the directory where you installed Apertium, then

   cd apertium/apertium-es-ca
   sh autogen.sh
   make

Let's check that what we installed works. First run:

   echo "vino a la playa" | lt-proc es-ca.automorf.bin

This should give you:

   ^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ ^playa/playa<n><f><sg>$
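
In this stream format each lexical unit sits between ^ and $, and its readings are separated by /. To spot ambiguous units quickly, the readings per unit can be counted. The sketch below works on the sample output copied from above, so it runs without Apertium installed; count_readings is a hypothetical helper name, and it assumes that no tag or lemma contains a / character.

```shell
# Sample analysis copied from the lt-proc output above.
stream='^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ ^playa/playa<n><f><sg>$'

# Print each surface form followed by its number of readings.
# Assumes "/" only ever separates readings (true for this sample).
count_readings() {
  grep -o '\^[^$]*\$' |
    awk -F'/' '{ surface = substr($1, 2); print surface, NF - 1 }'
}

printf '%s\n' "$stream" | count_readings
```

For the sample this prints "vino 2", "a 1", "la 2" and "playa 1", one per line; every unit with a count above 1 is ambiguous and is a candidate for a disambiguation rule.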

Here we have two ambiguities: one between a noun and a verb, and another between a determiner and a pronoun. We can write rules that choose between the competing readings. First we define our categories; these can be tags, wordforms or lemmas. It might help to think of them as "coarse tags", which may involve a set of fine tags or lemmas. So, create a file grammar.txt, and add the following text:

   DELIMITERS = "<$.>" ;
   LIST NOUN = n;
   LIST VERB = vblex;
   LIST DET = det;
   LIST PRN = prn;
   LIST PREP = pr;
   SECTION

The first rule states: "When the current lexical unit can be a pronoun or a determiner, and it is followed on the right by a lexical unit which could be a noun, choose the determiner". We have to add this rule to the file, and compile using cg-comp:

rule:


   # 1
   SELECT DET IF
           (0 DET)
           (0 PRN)
           (1 NOUN) ;

compile with:

   $ ./cg-comp grammar.txt grammar.bin
   Sections: 1, Rules: 1, Sets: 6, Tags: 7

To test the grammar, run:

   $ echo "vino a la playa" | lt-proc es-ca.automorf.bin |  cg-proc grammar.bin
   ^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$
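
To see exactly what a rule did, the readings before and after cg-proc can be compared. This is a rough sketch that uses the two sample outputs from this page rather than live tool output; list_readings and removed_readings are hypothetical helper names, and the parsing assumes that / only separates readings.

```shell
# Output of lt-proc alone, and of lt-proc piped through cg-proc (rule 1 only),
# both copied from this page.
before='^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ ^playa/playa<n><f><sg>$'
after='^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$'

# Print one "surface reading" pair per line.
list_readings() {
  grep -o '\^[^$]*\$' |
    awk -F'/' '{
      surface = substr($1, 2)
      for (i = 2; i <= NF; i++) {
        reading = $i
        sub(/\$$/, "", reading)   # drop the closing "$" of the unit
        print surface, reading
      }
    }'
}

# Print the readings present in the first stream but not in the second.
removed_readings() {
  tmp="$(mktemp)"
  printf '%s\n' "$2" | list_readings > "$tmp"
  printf '%s\n' "$1" | list_readings | grep -vxF -f "$tmp"
  rm -f "$tmp"
}

removed_readings "$before" "$after"
```

For these two samples the only line printed is "la lo<prn><pro><p3><f><sg>", i.e. the pronoun reading of "la" that the SELECT rule discarded.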


The second rule states: "When the current lexical unit can be a noun or a verb, if the subsequent two units to the right are preposition and determiner, remove the noun reading." Now we have to add this rule:


rule:

   # 2
   REMOVE NOUN IF
           (0 NOUN)
           (0 VERB)
           (1 PREP)
           (2 DET) ;

Recompile the grammar and test:

   $ echo "vino a la playa" | lt-proc es-ca.automorf.bin |  cg-proc grammar.bin
   ^vino/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$
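
For reference, the pieces introduced so far can be assembled into one file in a single step. The sketch below writes the header and both rules exactly as given on this page and compiles them only if cg-comp is available; write_grammar is a hypothetical helper name, and the script assumes it is run in a directory where grammar.txt may be overwritten.

```shell
# Write the complete grammar (header plus rules 1 and 2) to the given file.
write_grammar() {
  cat > "$1" <<'EOF'
DELIMITERS = "<$.>" ;
LIST NOUN = n;
LIST VERB = vblex;
LIST DET = det;
LIST PRN = prn;
LIST PREP = pr;
SECTION
# 1
SELECT DET IF
        (0 DET)
        (0 PRN)
        (1 NOUN) ;
# 2
REMOVE NOUN IF
        (0 NOUN)
        (0 VERB)
        (1 PREP)
        (2 DET) ;
EOF
}

write_grammar grammar.txt

# Compile only when the CG compiler is installed.
if command -v cg-comp >/dev/null 2>&1; then
  cg-comp grammar.txt grammar.bin
else
  echo "cg-comp not found; grammar.txt written but not compiled" >&2
fi
```

Keeping the whole grammar in one generated file makes it easy to recompile after each new rule and rerun the same test sentence.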

The third rule states: "Remove interjection if the preceding word is a modal verb."
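
The page gives this rule only in words. One way it might be written is sketched below, following the pattern of the first two rules. The tag names ij (interjection) and vbmod (modal verb) are assumptions based on common Apertium tag conventions and are not confirmed for es-ca by the text above; the set names IJ and MODAL, the helper write_rule3_grammar and the file name grammar-rule3.txt are likewise made up for illustration.

```shell
# Write a small stand-alone grammar containing only the sketched third rule.
write_rule3_grammar() {
  cat > "$1" <<'EOF'
DELIMITERS = "<$.>" ;
# Assumed tag for interjections:
LIST IJ = ij ;
# Assumed tag for modal verbs:
LIST MODAL = vbmod ;
SECTION
# 3
REMOVE IJ IF
        (-1 MODAL) ;
EOF
}

write_rule3_grammar grammar-rule3.txt
```

The context (-1 MODAL) looks one unit to the left, mirroring how (1 NOUN) looks one unit to the right in the first rule. To try it with the rest of the grammar, the two LIST lines and the rule would be added to grammar.txt before recompiling.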