Difference between revisions of "Apertium and Constraint Grammar"

From Apertium

Revision as of 02:14, 16 April 2010

This page describes the use of Constraint Grammar (CG) within the Apertium MT platform. Although Apertium already has a fast, high-accuracy statistical disambiguator (POS tagger), CG can probably improve the results in many cases. For example, the CG disambiguator could be used as a pre-disambiguator for the Apertium tagger, allowing finer-grained constraints to be imposed than would otherwise be possible.

Requisite software

Installing VISL CG3

$ svn co --username anonymous --password anonymous http://beta.visl.sdu.dk/svn/visl/tools/vislcg3/trunk vislcg3
$ cd vislcg3
$ sh autogen.sh --prefix=<prefix>
$ make 
$ make install

You should now have three binaries in <prefix>/bin:

  • vislcg3: the original disambiguator. It has all the features available and uses the CG input / output format.
  • cg-comp: a program that compiles grammars into a binary format.
  • cg-proc: a program that runs binary grammars on an Apertium-formatted input stream.

Note: The Apertium support in VISL CG is still under development and thus bugs may be found.

Important! You also need to have International Components for Unicode (ICU) installed. If you install a distro package for ICU, you also need to install the corresponding devel package, and you seem to need version 3.4-36 or later. To install from source, see the instructions at http://beta.visl.sdu.dk/cg3/chunked/installation.html#installing_icu. (On a Mac there is no ldconfig command, but it seems to work anyway.)

On Ubuntu/Debian-based systems, you can install the ICU devel package with the following command:

sudo apt-get install libicu-dev

If the package is not installed, make will fail with an error like this:

stdafx.h:120:28: error: unicode/unistr.h: No such file or directory
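
To check for the header before building, you can test for the file named in the error. This is only a minimal sketch: has_icu_headers is a hypothetical helper, and the default include directory /usr/include is an assumption that may not hold on your system.

```shell
# has_icu_headers [INCLUDE_DIR]
# Succeeds if unicode/unistr.h (the header named in the error above) exists.
# INCLUDE_DIR defaults to /usr/include; that default is an assumption.
has_icu_headers() {
    [ -f "${1:-/usr/include}/unicode/unistr.h" ]
}

if has_icu_headers; then
    echo "ICU headers found"
else
    echo "ICU headers missing: install the ICU devel package"
fi
```

If ICU was installed from source under another prefix, pass that prefix's include directory as the first argument.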

Example usage

Let's take an example from Apertium:

$ echo "vino a la playa" | lt-proc es-ca.automorf.bin 
^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ ^playa/playa<n><f><sg>$

Here we have two ambiguities: the first is between a noun and a verb, the second between a determiner and a pronoun. The most appropriate sequence would be verb, preposition, determiner, noun. We can write some rules in CG to enforce this.
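
To see how much ambiguity an analysis contains, you can count the lexical units that carry more than one reading. The following is a rough sketch using only grep and awk. count_ambiguous is a hypothetical helper, and it assumes that no reading contains an escaped slash or dollar sign, which is a simplification.

```shell
# count_ambiguous: read Apertium stream format on stdin and report how many
# lexical units (^surface/reading1/reading2...$) have more than one reading.
# Simplification: escaped "/" or "$" inside a unit is not handled.
count_ambiguous() {
    grep -o '\^[^$]*\$' | awk -F'/' '
        { total++; if (NF > 2) ambiguous++ }   # fields: surface form + readings
        END { printf "%d units, %d ambiguous\n", total, ambiguous }'
}

echo '^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ ^playa/playa<n><f><sg>$' | count_ambiguous
# prints: 4 units, 2 ambiguous
```

In practice, you would pipe the output of lt-proc into the function instead of echoing a fixed string.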

First we define our categories; these can be tags, wordforms or lemmas. It might help to think of them as "coarse tags", each of which may cover a set of fine tags or lemmas. Create a file called grammar.txt and add the following text:

DELIMITERS = "<$.>" ;

LIST NOUN = n;
LIST VERB = vblex;
LIST DET = det;
LIST PRN = prn;
LIST PREP = pr;

SECTION

Note: The DELIMITERS statement defines window boundaries.

The next thing we want to do is write the two rules, so:

Rule #1
"When the current lexical unit can be a pronoun or a determiner, and it is followed on the right by a lexical unit which could be a noun, choose the determiner"
# 1
SELECT DET IF
        (0 DET)
        (0 PRN)
        (1 NOUN) ;

Add this rule to the file and compile it with cg-comp:

$ ./cg-comp grammar.txt grammar.bin
Sections: 1, Rules: 1, Sets: 6, Tags: 7

Now try testing it in the Apertium pipeline:

$ echo "vino a la playa" | lt-proc es-ca.automorf.bin |  cg-proc grammar.bin
^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$

As we can see, the determiner reading has been selected over the pronoun reading.

Rule #2
"When the current lexical unit can be a noun or a verb, if the subsequent two units to the right are preposition and determiner, remove the noun reading."
# 2
REMOVE NOUN IF
        (0 NOUN)
        (0 VERB)
        (1 PREP)
        (2 DET) ;

Add this rule, re-compile the grammar and test:

$ echo "vino a la playa" | lt-proc es-ca.automorf.bin |  cg-proc grammar.bin
^vino/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$

Voilà! A fully disambiguated sentence. It's worth noting that the SELECT and REMOVE statements can be thought of as similar to the forbid / enforce constraints in the TSX format used by apertium-tagger, only much more flexible.
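
For reference, the grammar built up in this section can be written out in one step with a here-document. The content is exactly the declarations and two rules shown above; write_grammar is a hypothetical helper name used only for illustration.

```shell
# write_grammar FILE: write the two-rule grammar from this section to FILE.
write_grammar() {
    cat > "$1" <<'EOF'
DELIMITERS = "<$.>" ;

LIST NOUN = n;
LIST VERB = vblex;
LIST DET = det;
LIST PRN = prn;
LIST PREP = pr;

SECTION

# 1
SELECT DET IF
        (0 DET)
        (0 PRN)
        (1 NOUN) ;

# 2
REMOVE NOUN IF
        (0 NOUN)
        (0 VERB)
        (1 PREP)
        (2 DET) ;
EOF
}

write_grammar grammar.txt
```

The resulting grammar.txt can then be compiled with cg-comp and run with cg-proc as in the steps above.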

Performance

Applying the above two-rule grammar to an input text of 10,000 lines (40,000 words) took approximately 12 seconds (~3,000 words/second). By comparison, apertium-tagger processes the same input in 1.5 seconds (~26,000 words/second). With a larger grammar of 204 rules, written for Faroese, performance drops to ~2,000 words/second.
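
The words-per-second figures follow directly from the word count and the elapsed time. The sketch below shows the arithmetic; words_per_second is a hypothetical helper, and the figures quoted in the text are rounded versions of these results.

```shell
# words_per_second WORDS SECONDS: print throughput rounded to the nearest integer.
words_per_second() {
    awk -v w="$1" -v s="$2" 'BEGIN { printf "%.0f\n", w / s }'
}

words_per_second 40000 12    # CG with the two-rule grammar: prints 3333
words_per_second 40000 1.5   # apertium-tagger: prints 26667
```

To measure your own grammar, time the cg-proc step of the pipeline on a known number of words and pass both numbers to the function.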

Troubleshooting

If you get

/usr/local/bin/cg-proc: invalid option -- 'w'
/usr/local/bin/apertium: line 480:  9764 Avbrutt (SIGABRT)       $APERTIUM_PATH/apertium-re$FORMATADOR > $SALIDA

that means your vislcg3 needs updating.

After you update vislcg3, you're likely to get something like

Error: Grammar revision is 4879, but this loader requires 5465 or later!

You need to recompile your CG grammars each time you update vislcg3, e.g.:

cd apertium-nn-nb
touch *.rlx               # trick make into thinking the grammars need recompiling
make
sudo make install

See also

External links