Apertium and Constraint Grammar

Revision as of 14:16, 13 April 2008 by Francis Tyers

This page describes the use of Constraint Grammar (CG) within the Apertium MT platform. Although Apertium already has a fast, high-accuracy statistical disambiguator (POS tagger), adding CG rules may improve its results.

Requisite software

  • lttoolbox (>= 3.0.5)
  • Apertium (>= 3.0.0)
  • VISL CG3 (from SVN -- vislcg3_apertium branch)

Install VISL CG3

$ svn co --username anonymous --password anonymous http://beta.visl.sdu.dk/svn/visl/tools/vislcg3
$ cd vislcg3/branches/vislcg_apertium
$ sh autogen.sh <prefix>
$ make 
$ make install

You should now have three binaries in <prefix>/bin:

  • vislcg3: the original disambiguator. It has all the features available and uses the CG input/output format.
  • cg-comp: a program that compiles grammars into a binary format.
  • cg-proc: a program that runs binary grammars on an lttoolbox-formatted input stream.

Example usage

Let's take an example from Apertium. We have:

$ echo "vino a la playa" | lt-proc es-ca.automorf.bin 
^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ ^playa/playa<n><f><sg>$

Here we have two ambiguities: the first is between a noun and a verb, and the second is between a determiner and a pronoun. The most appropriate sequence would be verb, preposition, determiner, noun. We can write some rules in CG to enforce this.
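To make the ambiguity concrete, here is a minimal Python sketch that splits the lt-proc output above into lexical units and lists the first tag of each analysis. The helper names (parse_stream, tags_of) are hypothetical and not part of Apertium or lttoolbox, and the parser is a simplification that ignores escaped characters and other details of the real stream format.

```python
import re

# Matches one lexical unit of the form ^surface/analysis1/analysis2$
# (simplified: escaped characters in the stream are not handled).
LU_PATTERN = re.compile(r"\^([^/$]*)/([^$]*)\$")

def parse_stream(stream):
    """Return a list of (surface_form, [analyses]) pairs."""
    units = []
    for match in LU_PATTERN.finditer(stream):
        surface, analyses = match.group(1), match.group(2).split("/")
        units.append((surface, analyses))
    return units

def tags_of(analysis):
    """Return the tags of one analysis, e.g. 'a<pr>' -> ['pr']."""
    return re.findall(r"<([^>]+)>", analysis)

stream = ("^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ "
          "^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ "
          "^playa/playa<n><f><sg>$")

for surface, analyses in parse_stream(stream):
    status = "ambiguous" if len(analyses) > 1 else "unambiguous"
    print(surface, status, [tags_of(a)[0] for a in analyses])
# vino ambiguous ['n', 'vblex']
# a unambiguous ['pr']
# la ambiguous ['det', 'prn']
# playa unambiguous ['n']
```

The two units reported as ambiguous are exactly the ones the CG rules need to resolve.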

First we define our categories, which can be tags, wordforms or lemmas. It might help to think of them as "coarse tags", each of which may cover a set of fine tags or lemmas. Create a file grammar.txt and add the following text:

DELIMITERS = "<$.>" ;

LIST NOUN = n;
LIST VERB = vblex;
LIST DET = det;
LIST PRN = prn;
LIST PREP = pr;

SECTION

Next, we write the two rules:

Rule #1
"When the current lexical unit can be a pronoun or a determiner, and it is followed on the right by a lexical unit which could be a noun, choose the determiner"
# 1
SELECT DET IF
        (0 DET)
        (0 PRN)
        (1 NOUN) ;

Add this rule to the file, and compile it using cg-comp:

$ ./cg-comp grammar.txt grammar.bin
Sections: 1, Rules: 1, Sets: 6, Tags: 7

Now try testing it in the Apertium pipeline:

$ echo "vino a la playa" | lt-proc es-ca.automorf.bin |  cg-proc grammar.bin 2>/dev/null
^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$

As we can see, the determiner reading has been selected over the pronoun reading. Note that 2>/dev/null discards the debugging output, which is written to standard error.
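The effect of Rule #1 can be imitated in a few lines of Python, which may help when reasoning about what SELECT does. This is only a sketch of the rule's logic under a simplified stream parser. It is not how cg-proc is implemented, and all function names are hypothetical.

```python
import re

def parse_stream(stream):
    """Parse '^surface/analysis/...$' units into [surface, [analyses]] lists."""
    return [[m.group(1), m.group(2).split("/")]
            for m in re.finditer(r"\^([^/$]*)/([^$]*)\$", stream)]

def has_tag(analysis, tag):
    return "<" + tag + ">" in analysis

def can_be(unit, tag):
    """True if at least one analysis of the unit carries the tag."""
    return any(has_tag(a, tag) for a in unit[1])

def select_det(units):
    """Imitate: SELECT DET IF (0 DET) (0 PRN) (1 NOUN) ;"""
    for i, unit in enumerate(units):
        has_right_neighbour = i + 1 < len(units)
        if (can_be(unit, "det") and can_be(unit, "prn")
                and has_right_neighbour and can_be(units[i + 1], "n")):
            # SELECT keeps the matching readings and discards the others.
            unit[1] = [a for a in unit[1] if has_tag(a, "det")]
    return units

def to_stream(units):
    return " ".join("^" + s + "/" + "/".join(a) + "$" for s, a in units)

stream = ("^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ "
          "^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ "
          "^playa/playa<n><f><sg>$")

print(to_stream(select_det(parse_stream(stream))))
```

Run on the lt-proc output from the example, the sketch prints the same stream as cg-proc above: only the unit for "la" changes, because it is the only one that satisfies all three contextual tests.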

Rule #2
"When the current lexical unit can be a noun or a verb, if the subsequent two units to the right are preposition and determiner, remove the noun reading."
# 2
REMOVE NOUN IF
        (0 NOUN)
        (0 VERB)
        (1 PREP)
        (2 DET) ;

Add this rule, re-compile the grammar and test:

$ echo "vino a la playa" | lt-proc es-ca.automorf.bin |  cg-proc grammar.bin 2>/dev/null
^vino/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$

Voilà! A fully disambiguated sentence.
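Rule #2 can be imitated in the same way. The Python sketch below applies the REMOVE logic to the stream produced after Rule #1. As before, the parser is simplified and the function names are hypothetical. The guard that keeps at least one reading reflects the usual CG convention that a rule does not delete the last remaining reading of a unit.

```python
import re

def parse_stream(stream):
    """Parse '^surface/analysis/...$' units into [surface, [analyses]] lists."""
    return [[m.group(1), m.group(2).split("/")]
            for m in re.finditer(r"\^([^/$]*)/([^$]*)\$", stream)]

def has_tag(analysis, tag):
    return "<" + tag + ">" in analysis

def can_be(unit, tag):
    """True if at least one analysis of the unit carries the tag."""
    return any(has_tag(a, tag) for a in unit[1])

def remove_noun(units):
    """Imitate: REMOVE NOUN IF (0 NOUN) (0 VERB) (1 PREP) (2 DET) ;"""
    for i, unit in enumerate(units):
        if (can_be(unit, "n") and can_be(unit, "vblex")
                and i + 2 < len(units)
                and can_be(units[i + 1], "pr")
                and can_be(units[i + 2], "det")):
            remaining = [a for a in unit[1] if not has_tag(a, "n")]
            if remaining:  # never delete the last reading of a unit
                unit[1] = remaining
    return units

def to_stream(units):
    return " ".join("^" + s + "/" + "/".join(a) + "$" for s, a in units)

# Stream as it looks after Rule #1 has selected the determiner reading.
after_rule_1 = ("^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ "
                "^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$")

print(to_stream(remove_noun(parse_stream(after_rule_1))))
# ^vino/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$
```

Note that "playa" keeps its noun reading: it cannot be a verb and has no units to its right, so the contextual tests fail for it.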