Lttoolbox/weights

lttoolbox supports weights on transitions.

Why should I care

Now that we can weight our morphological analysers, generators and bilingual dictionaries, here are some problems that can be solved:

  • Having zero-context rules in your .lrx files. Now you can just put the weights directly in your bilingual dictionary:
$ echo "^estación<n><f><sg>$" | lt-proc -W -b testbidix.bin
^estación<n><f><sg>/season<n><W:1.000000><sg>/station<n><W:1.500000><sg>$

$ echo "^estación<n><f><sg>$" | lt-proc -b testbidix.bin
^estación<n><f><sg>/season<n><sg>/station<n><sg>$

Analyses are output in order of increasing weight, so you can mark your default translation as "1.0" and all others as greater than 1.0. Because transfer always takes the first translation, it will pick the one with the lowest weight.

  • Improving POS-tagging accuracy by ordering analyses by probability. This way, if your CG doesn't resolve all the ambiguity, you get the best remaining analysis. This works much like the unigram tagger, but because the weights live in the analyser itself, it can be easier to control.
  • Dealing with non-standard forms. Instead of using LR/RL direction restrictions, you can give non-standard forms a high weight and ask lt-proc to generate only the surface form with the lowest weight.
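
If you want to inspect the weighted bidix output shown above from a script, the following is a minimal Python sketch rather than anything shipped with lttoolbox. The function name parse_weighted_unit is hypothetical, and the sketch assumes a single lexical unit per line with no escaped "/" characters. It treats a translation without a weight tag as weight 1.0, mirroring the default for an unmarked <e>.

```python
import re

# Matches the weight tag that `lt-proc -W` prints, e.g. <W:1.500000>
WEIGHT_TAG = re.compile(r"<W:([0-9.]+)>")


def parse_weighted_unit(unit):
    """Split one ^source/target1/target2$ unit into (source, [(target, weight)]).

    Simplification (assumption): no escaped '/' inside the unit.
    """
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("expected a single ^...$ lexical unit")
    source, *targets = body[1:-1].split("/")
    parsed = []
    for target in targets:
        match = WEIGHT_TAG.search(target)
        # No weight tag is treated as the default weight of 1.0
        weight = float(match.group(1)) if match else 1.0
        parsed.append((WEIGHT_TAG.sub("", target), weight))
    return source, parsed


line = "^estación<n><f><sg>/season<n><W:1.000000><sg>/station<n><W:1.500000><sg>$"
source, translations = parse_weighted_unit(line)

# Lower weights are better, so the default translation is the minimum
best = min(translations, key=lambda pair: pair[1])
print(source, "->", best[0])
```

This only reads the output; the ordering itself is already done by lt-proc, so in a normal pipeline transfer just takes the first translation.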

How do I use it

Put the w attribute on an <e> like

   <e w="2.0">

and use

   lt-proc -L1

to output only the single best weight class.

Unmarked <e> is the same as <e w="1.0">, and lower weights are better.

Better analyses are output before worse ones.

See also lt-proc --help for other relevant options: -W shows the weights, and -N outputs the N best analyses rather than the L best weight classes.
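
The difference between "N best analyses" and "L best weight classes" can be sketched in Python. This is an illustration only, not lt-proc's actual code: it assumes a weight class is the group of analyses that share the same weight, and the function names and example analyses are made up.

```python
from itertools import groupby


def n_best(analyses, n):
    """Keep the n lowest-weight analyses (ties are cut off arbitrarily)."""
    ranked = sorted(analyses, key=lambda a: a[1])
    return ranked[:n]


def l_best_classes(analyses, l):
    """Keep every analysis belonging to the l lowest distinct weights."""
    ranked = sorted(analyses, key=lambda a: a[1])
    kept = []
    for class_index, (_, group) in enumerate(groupby(ranked, key=lambda a: a[1])):
        if class_index >= l:
            break
        kept.extend(group)
    return kept


# Hypothetical analyses of one surface form, as (analysis, weight) pairs
analyses = [
    ("a<adj>", 2.0),
    ("a<n>", 1.0),
    ("a<adv>", 3.0),
    ("a<vblex>", 1.0),
]

print(n_best(analyses, 1))          # one analysis, even though two tie
print(l_best_classes(analyses, 1))  # both analyses with weight 1.0
```

Under that assumption, selecting by weight class keeps all analyses tied for the best weight, whereas selecting the N best has to drop some of them when more than N tie.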

What are the downsides

Simply turning on weights for one transition means all transitions are now weighted. Weights come at a slight cost to compilation time and fst size; try it and see if it's worth it to you.