Difference between revisions of "Apertium and Constraint Grammar"

Latest revision as of 20:57, 2 April 2021

This page describes the use of Constraint Grammar (CG) within the Apertium MT platform. Constraint Grammar is often used as a pre-disambiguator for the Apertium tagger, allowing finer-grained constraints than would otherwise be possible.

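Requisite software

  • lttoolbox (>= 3.0.5)
  • Apertium (>= 3.0.0)
  • ICU (>= 3.8)
  • A language pair (examples use apertium-es-ca)
  • VISL CG3 (from SVN; see below)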

Installing VISL CG3

See Installation: for most operating systems, pre-built packages of CG-3 (as well as other core tools) are now available through your regular package manager.

Debian / Ubuntu derivatives

curl -sS http://apertium.projectjj.com/apt/install-nightly.sh | sudo bash
sudo apt-get install cg3

Other distros

To build from source you will need cmake, libicu-dev and libboost-dev. You may also want tcmalloc, which Debian/Ubuntu ships in libgoogle-perftools-dev.

On Debian/Ubuntu, install them with apt-get install cmake libicu-dev libboost-dev (and, optionally, apt-get install libgoogle-perftools-dev).

$ svn co http://beta.visl.sdu.dk/svn/visl/tools/vislcg3/trunk vislcg3
$ cd vislcg3
$ ./cmake.sh 
$ make -j3
# make install

(For those installing in a prefix, use ./cmake.sh -DCMAKE_INSTALL_PREFIX=<prefix>)

You should now have four binaries in /usr/local/bin:

  • vislcg3: the original disambiguator. It has all the features available and uses the CG input/output format.
  • cg-comp: compiles grammars into a binary format.
  • cg-proc: runs binary grammars on an Apertium-formatted input stream.
  • cg-conv: converts between stream formats on the fly.

Note: The Apertium support in VISL CG is still under development and thus bugs may be found.

Example usage

Let's take an example from Apertium:

$ echo "vino a la playa" | lt-proc es-ca.automorf.bin 
^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ ^playa/playa<n><f><sg>$

Here we have two ambiguities: the first is between a noun and a verb, the second between a determiner and a pronoun. The most appropriate sequence is verb, preposition, determiner, noun. We can write some rules in CG to enforce this.
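To see at a glance which lexical units are still ambiguous, the readings in each unit can be counted with standard tools. The following is only a sketch: count_readings is a hypothetical helper name (not part of Apertium or CG-3), and it assumes that no lemma contains an escaped "/".

```shell
# Sample lt-proc output, copied from the example above.
stream='^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$ ^playa/playa<n><f><sg>$'

# count_readings (hypothetical helper): reads an Apertium stream on stdin and
# prints "surface-form number-of-readings", one lexical unit per line.
count_readings() {
    # grep -o splits the stream into ^...$ units; awk then counts the
    # "/"-separated fields after the surface form.
    grep -o '\^[^$]*\$' | awk -F'/' '{ sub(/^\^/, "", $1); print $1, NF - 1 }'
}

printf '%s\n' "$stream" | count_readings
```

Any unit reported with more than one reading (here "vino" and "la") is a candidate for a disambiguation rule.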

First we define our categories, which can be tags, wordforms or lemmas. It might help to think of them as "coarse tags", each of which may cover a set of fine tags or lemmas. Create a file grammar.txt and add the following text:

DELIMITERS = "<$.>" ;

LIST NOUN = n;
LIST VERB = vblex;
LIST DET = det;
LIST PRN = prn;
LIST PREP = pr;

SECTION

Note: The DELIMITERS statement defines the window boundaries.

The next thing we want to do is write the two rules, so:

Rule #1
"When the current lexical unit can be a pronoun or a determiner, and it is followed on the right by a lexical unit which could be a noun, choose the determiner"
# 1
SELECT DET IF
        (0 DET)
        (0 PRN)
        (1 NOUN) ;

Add this rule to the file, and compile it using cg-comp:

$ ./cg-comp grammar.txt grammar.bin
Sections: 1, Rules: 1, Sets: 6, Tags: 7

Now try testing it in the Apertium pipeline:

$ echo "vino a la playa" | lt-proc es-ca.automorf.bin |  cg-proc grammar.bin
^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$

As we can see, the determiner reading has been selected over the pronoun reading.

Rule #2
"When the current lexical unit can be a noun or a verb, if the subsequent two units to the right are preposition and determiner, remove the noun reading."
# 2
REMOVE NOUN IF
        (0 NOUN)
        (0 VERB)
        (1 PREP)
        (2 DET) ;

Add this rule, re-compile the grammar and test:

$ echo "vino a la playa" | lt-proc es-ca.automorf.bin |  cg-proc grammar.bin
^vino/venir<vblex><ifi><p3><sg>$ ^a/a<pr>$ ^la/el<det><def><f><sg>$ ^playa/playa<n><f><sg>$

Voilà! A fully disambiguated sentence. It's worth noting that the SELECT and REMOVE statements can be thought of as similar to the forbid / enforce constraints in the TSX format used by apertium-tagger, only much more flexible.
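For reference, the definitions and the two rules from this walkthrough can be written out as one complete grammar.txt. This sketch only assembles the pieces shown above; the quoted here-document is just one convenient way to create the file without the shell expanding the "$" in the DELIMITERS line.

```shell
# Write the complete two-rule grammar to grammar.txt.
# The quoted 'EOF' stops the shell from touching "$" inside the grammar.
cat > grammar.txt <<'EOF'
DELIMITERS = "<$.>" ;

LIST NOUN = n;
LIST VERB = vblex;
LIST DET = det;
LIST PRN = prn;
LIST PREP = pr;

SECTION

# 1
SELECT DET IF
        (0 DET)
        (0 PRN)
        (1 NOUN) ;

# 2
REMOVE NOUN IF
        (0 NOUN)
        (0 VERB)
        (1 PREP)
        (2 DET) ;
EOF

# Then compile and test as in the steps above:
#   cg-comp grammar.txt grammar.bin
#   echo "vino a la playa" | lt-proc es-ca.automorf.bin | cg-proc grammar.bin
```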

Matching unknown words in Apertium

lttoolbox prepends a star to unknown words, so you can match unknown words using a simple regexp matching that star:

LIST unknown = ("\\*.*"r) ; 

Now you can have a rule like:

SELECT proper-name IF (1 unknown);
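Before writing such rules it can help to check which words the analyser actually leaves unknown. The sketch below filters an Apertium stream for units whose analysis begins with the star. The name list_unknown and the sample word "Zorbax" are invented for illustration, and the input line is hand-written rather than real lt-proc output.

```shell
# list_unknown (hypothetical helper): prints every lexical unit whose
# analysis starts with "*", i.e. the words the analyser did not recognise.
list_unknown() {
    grep -o '\^[^$]*/\*[^$]*\$'
}

# Hand-written sample: one unknown word followed by one known word.
printf '%s\n' '^Zorbax/*Zorbax$ ^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$' | list_unknown
```

In a real pipeline the printf would be replaced by the output of lt-proc.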

Applying a grammar to lexical selection

Input:

^estació<n><f><sg>/season<n><sg>/station<n><sg>$

Grammar:

SECTION

SELECT ("season") IF (0 ("<estació>")) ;

Output:

$ echo "^estació<n><f><sg>/season<n><sg>/station<n><sg>$" | cg-proc /tmp/cg-test.bin
^estació<n><f><sg>/season<n><sg>$

Performance

Applying the above two-rule grammar to an input text of 10,000 lines (40,000 words) took approximately 12 seconds (~3,000 words/second). By comparison, apertium-tagger processes the same text in 1.5 seconds (~26,000 words/second). With a larger grammar of 204 rules, for Faroese, performance drops to ~2,000 words/second.
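The throughput figures follow directly from the word count and the elapsed time. As a small worked example, the helper below (wps is a hypothetical name) redoes that division. The inputs are the numbers quoted above, not new measurements.

```shell
# wps (hypothetical helper): words per second, rounded to the nearest integer.
# Usage: wps WORDS SECONDS
wps() {
    awk -v words="$1" -v seconds="$2" 'BEGIN { printf "%.0f\n", words / seconds }'
}

wps 40000 12     # cg-proc with the two-rule grammar: prints 3333
wps 40000 1.5    # apertium-tagger: prints 26667
```

On these numbers the tagger is about eight times faster than cg-proc (12 / 1.5 = 8).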

Tracing

vislcg3 provides tracing, showing which rules were applied, although the cg-proc command doesn't support this. However, for development purposes, we can use cg-conv -a to turn Apertium-formatted text into the vislcg3 format, and then just use the vislcg3 command:

$ echo "vino a la playa" | lt-proc es-ca.automorf.bin | cg-conv -a | vislcg3 --trace -g grammar.txt

This is really handy, since you quickly get the line number (and rule name, if you specified one) for each change made to the stream by vislcg3.

Troubleshooting

If you get

/usr/local/bin/cg-proc: invalid option -- 'w'
/usr/local/bin/apertium: line 480:  9764 Avbrutt (SIGABRT)       $APERTIUM_PATH/apertium-re$FORMATADOR > $SALIDA

that means your vislcg3 needs updating.

After you update vislcg3, you're likely to get something like

Error: Grammar revision is 4879, but this loader requires 5465 or later!

You need to recompile your CG grammars each time you update vislcg3, e.g.:

cd apertium-nn-nb
touch *.rlx               # trick make into thinking the grammars need recompiling
make
sudo make install

See also

  • Constraint Grammar
  • Tagger training

External links

  • Constraint Grammar Manual: http://beta.visl.sdu.dk/cg3/single/