Talk:Automatically trimming a monodix

From Apertium
Jump to navigation Jump to search

Compounds vs trimming in HFST

The sme.lexc can't be trimmed using the simple HFST trick, due to compounds.

Say you have cake n sg, cake n pl, beer n pl and beer n sg in monodix, while bidix has beer n and wine n. The HFST method without compounding is to intersect (cake|beer) n (sg|pl) with (beer|wine) n .* to get beer n (sg|pl).

But HFST represents compounding as a transition from the end of the singular noun to the beginning of the (noun) transducer, so a compounding HFST actually looks like

((cake|beer) n sg)*(cake|beer) n (sg|pl)

The intersection of this with

(beer|wine) n .*

is

(beer n sg)*(cake|beer) n (sg|pl) | beer n pl

when it should have been

(beer n sg)*(beer n (sg|pl)


Lttoolbox doesn't represent compounding by extra circular transitions, but instead by a special restart symbol interpreted while analysing. lt-trim is able to understand compounds by simply skipping the compund tags