Difference between revisions of "Constraint-based lexical selection module"

Revision as of 09:50, 9 October 2012

Lexical transfer

This is the output of lt-proc -b on an ambiguous bilingual dictionary.

[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ 
^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ 
^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ 
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ 
^el<det><def><f><sg>/the<det><def><f><sg>$ 
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ 
^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ 
^el<det><def><m><sg>/the<det><def><m><sg>$ 
^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$

That is, the Catalan sentence

L'estació més plujosa és la tardor, i la més seca l'estiu

comes out of lexical transfer as:

The season/station more rainy is the autumn/fall, and the more dry the summer.
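To make the stream format above concrete, here is a minimal Python sketch that splits `lt-proc -b` output into lexical units and lists the ones that are still ambiguous. The function names are hypothetical and not part of apertium-lex-tools, and the sketch assumes the stream contains no escaped `^`, `$` or `/` characters.

```python
import re

# One lexical unit looks like ^source/target1/target2$ in the stream.
# Assumption: no escaped ^, $ or / inside a unit.
LU_PATTERN = re.compile(r"\^(.*?)\$")

def parse_units(stream):
    """Return a list of (source, [targets]) pairs, one per lexical unit."""
    units = []
    for body in LU_PATTERN.findall(stream):
        source, *targets = body.split("/")
        units.append((source, targets))
    return units

def ambiguous_units(stream):
    """Keep only the units that have more than one target translation."""
    return [(s, t) for s, t in parse_units(stream) if len(t) > 1]

sample = ("^estació<n><f><sg>/season<n><sg>/station<n><sg>$ "
          "^més<preadv>/more<preadv>$ "
          "^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$")

for source, targets in ambiguous_units(sample):
    print(source, "->", targets)
```

On this sample only the units for estació and tardor are reported, which are exactly the words the lexical selection module has to decide on.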

The module only works when transfer is run through the VM, or with apertium-transfer -b.

Compilation

Check out the code:

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-lex-tools

Then build it with:

$ ./autogen.sh
$ make

Usage

Make a simple rule file:

<rules>
  <rule>
    <match lemma="criminal" tags="adj"/>
    <match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match>
  </rule>
</rules>

Then compile it:

$ ./apertium-lrx-comp rules.xml rules.fst
1
Written 1 rules, 3 patterns.

The input is the output of lt-proc -b:

$ echo "^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ 
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ 
^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$" | ./apertium-lrx-proc -t rules.fst
1:SELECT:1 juzgado<n><m><sg>
^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ 
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/juzgado<n><m><sg>$^.<sent>/.<sent>$

Rule format

A rule is made up of an ordered list of:

Matches
Operations (select, remove)

<rule>  
  <match lemma="el"/>  
  <match lemma="dona" tags="n.*">    
    <select lemma="wife"/> 
  </match>  
  <match lemma="de"/>
</rule>

<rule>  
  <match lemma="estació" tags="n.*">    
    <select lemma="season"/> 
  </match>  
  <match lemma="més"/>
  <match lemma="plujós"/>
</rule>

<rule>  
  <match lemma="guanyador"/>
  <match lemma="de"/>
  <match/>
  <match lemma="prova" tags="n.*">    
    <select lemma="event"/> 
  </match>  
</rule>
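The effect of a rule like those above can be sketched in Python. This is only an illustration of the matching idea, not the module's finite-state implementation. It assumes that `tags="n.*"` means "first tag is `n`, followed by any tags", that an empty `<match/>` accepts any word, and it handles only `select`. All function names and the dictionary-based rule encoding are hypothetical.

```python
import re

def split_lu(text):
    """Split 'lemma<t1><t2>' into ('lemma', ['t1', 't2'])."""
    lemma = text.split("<", 1)[0]
    return lemma, re.findall(r"<([^>]*)>", text)

def tags_match(pattern, tags):
    """Assumed semantics: 'n.*' means first tag 'n', then anything."""
    if pattern is None:
        return True
    wanted = pattern.split(".")
    if wanted[-1] == "*":
        wanted = wanted[:-1]
        return tags[:len(wanted)] == wanted
    return tags == wanted

def word_matches(match, text):
    """A match with no lemma and no tags accepts any word."""
    lemma, tags = split_lu(text)
    if match.get("lemma") not in (None, lemma):
        return False
    return tags_match(match.get("tags"), tags)

def apply_rule(rule, units):
    """Return a copy of units with the rule's selects applied wherever it matches."""
    result = [(source, list(targets)) for source, targets in units]
    for start in range(len(units) - len(rule) + 1):
        window = units[start:start + len(rule)]
        if not all(word_matches(m, src) for m, (src, _) in zip(rule, window)):
            continue
        for offset, match in enumerate(rule):
            wanted = match.get("select")
            if wanted is None:
                continue
            source, targets = result[start + offset]
            kept = [t for t in targets if split_lu(t)[0] == wanted]
            if kept:  # never leave a word without a translation
                result[start + offset] = (source, kept)
    return result

# The "estació ... més plujós" rule, encoded as plain dictionaries.
rule = [
    {"lemma": "estació", "tags": "n.*", "select": "season"},
    {"lemma": "més"},
    {"lemma": "plujós"},
]
units = [
    ("estació<n><f><sg>", ["season<n><sg>", "station<n><sg>"]),
    ("més<preadv>", ["more<preadv>"]),
    ("plujós<adj><f><sg>", ["rainy<adj><sint><f><sg>"]),
]
print(apply_rule(rule, units)[0])
```

The sketch applies every matching rule position independently; it does not model how the real processor chooses between competing rules.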

Writing and generating rules

Writing

Main article: How to get started with lexical selection rules

A good way to start writing lexical selection rules is to take a corpus and search for the problem word. You can then look at how the word should be translated and at the contexts it appears in.
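As a rough illustration of the corpus search described above, the following Python sketch prints each occurrence of a problem word with a few words of context on either side. The function name, the whitespace tokenisation and the two sample sentences are assumptions made for the example; a real corpus would be read from a file.

```python
def concordance(lines, word, width=3):
    """Return (left context, hit, right context) for each occurrence of word."""
    hits = []
    for line in lines:
        tokens = line.split()
        for i, token in enumerate(tokens):
            # Ignore surrounding punctuation and case when comparing.
            if token.strip(".,;:!?\"'").lower() == word.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits

# Illustrative sentences only, not taken from any real corpus.
corpus = [
    "The criminal court heard the case on Monday.",
    "They met on the tennis court after lunch.",
]

for left, hit, right in concordance(corpus, "court", width=2):
    print(f"{left:>15} [{hit}] {right}")
```

Reading the left and right contexts side by side makes it easier to see which neighbouring words (here "criminal" versus "tennis") could serve as the `match` elements of a rule.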

Generating

Parallel corpus: Main article: Generating lexical-selection rules from a parallel corpus

Monolingual corpora

Compiled

The general structure is as follows:


LSRRECORD = id, len, weight;

<ALPHABET>
<NUM_TRANSDUCERS>
<TRANSDUCER>
<TRANSDUCER>
<TRANSDUCER>
...
"main"
<TRANSDUCER>
<LSRRECORD>
<LSRRECORD>
<LSRRECORD>

Todo and bugs

~~xml compiler~~
~~compile rule operation patterns, as well as matching patterns~~
~~make rules with gaps work~~
~~optimal coverage~~
~~fix bug with processing multiple sentences~~
~~instead of having regex OR, insert separate paths/states.~~
~~optimise the bestPath function (don't use strings to store the paths)~~
~~autotoolsise build~~
~~add option to compiler to spit out ATT transducers~~
~~fix bug with outputting an extra '\n' at the end~~
~~edit transfer.cc to allow input from lt-proc -b~~
profiling and speed up
- ~~why do the regex transducers have to be minimised ?~~
- ~~retrieve vector of strings corresponding to paths, instead of a single string corresponding to all of the paths~~
- ~~stop using string processing to retrieve rule numbers~~
- ~~retrieve vector of vectors of words, not string of words from lttoolbox~~
- why does the performance drop substantially with more rules ?
- ~~add a pattern -> first letter map so we don't have to call recognise() with every transition~~ (didn't work so well)
~~there is a problem with the regex recognition code: see bug1 in testing.~~
~~there is a problem with two defaults next to each other; bug2 in testing.~~
default to case insensitive ? (perhaps case insensitive for lower case, case sensitive for uppercase) -- see bug4 in testing/.
make sure that -b works with -n too.
testing
null flush
add option to processor to spit out ATT transducers

Performance

2011-12-12: 10,000 words / 97 seconds = 103 words/sec (71,290 words / 14.84 seconds = 4,803 words/sec)
2011-12-19: 10,000 words / 4 seconds = 2,035 words/sec (71,290 words / 8 seconds = 8,911 words/sec)

Preparedness of language pairs

| Pair | LR (L) | LR (L→R) | Fertility | Rules |
|---|---|---|---|---|
| `apertium-is-en` | 18,563 | 22,220 | 1.19 | 115 |
| `apertium-es-fr` | | | | |
| `apertium-eu-es` | 16,946 | 18,550 | 1.09 | 250 |
| `apertium-eu-en` | | | | |
| `apertium-br-fr` | 20,489 | 20,770 | 1.01 | 256 |
| `apertium-mk-en` | 8,568 | 10,624 | 1.24 | 81 |
| `apertium-es-pt` | | | | |
| `apertium-es-it` | | | | |
| `apertium-es-ro` | | | | |
| `apertium-en-es` | 267,469 | 268,522 | 1.003 | 334 |
| `apertium-en-ca` | | | | |
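The Fertility column appears to be the LR (L→R) count divided by the LR (L) count. This is an inference from the figures in the table rather than a definition given on this page. A short Python check under that assumption:

```python
# (LR (L), LR (L→R)) counts copied from the table above.
rows = {
    "apertium-is-en": (18563, 22220),
    "apertium-eu-es": (16946, 18550),
    "apertium-br-fr": (20489, 20770),
    "apertium-mk-en": (8568, 10624),
    "apertium-en-es": (267469, 268522),
}

def fertility(left_entries, left_to_right_entries):
    """Assumed definition: average number of translations per left-side entry."""
    return left_to_right_entries / left_entries

for pair, (left, left_to_right) in rows.items():
    print(f"{pair}: {fertility(left, left_to_right):.3f}")
```

The computed ratios agree with the table to within rounding or truncation of the last digit.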

References

Tyers, F. M., Sánchez-Martínez, F., Forcada, M. L. (2012) "Flexible finite-state lexical selection for rule-based machine translation". Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT 2012)

