Talk:Automatically trimming a monodix

Compounds vs trimming in HFST

The sme.lexc can't be trimmed using the simple HFST trick, due to compounds.

Say you have cake n sg, cake n pl, beer n pl and beer n sg in monodix, while bidix has beer n and wine n. The HFST method without compounding is to intersect (cake|beer) n (sg|pl) with (beer|wine) n .* to get beer n (sg|pl).
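As a rough illustration of the non-compounding case, here is a minimal Python sketch that stands in for the HFST intersection. It treats each analysis as a space-separated string and the bidix as a prefix pattern; the names `MONODIX`, `BIDIX_PREFIX` and `trim` are made up for this example and are not part of HFST or lttoolbox.

```python
import re

# Toy monodix analyses from the example above, one token per lemma or tag.
MONODIX = ["cake n sg", "cake n pl", "beer n sg", "beer n pl"]

# The bidix side as a prefix pattern, i.e. (beer|wine) n .*
BIDIX_PREFIX = re.compile(r"(beer|wine) n( .*)?")

def trim(analyses, bidix_pattern):
    """Keep only the analyses that the bidix pattern accepts in full."""
    return [a for a in analyses if bidix_pattern.fullmatch(a)]

trimmed = trim(MONODIX, BIDIX_PREFIX)
print(trimmed)  # ['beer n sg', 'beer n pl']
```

Filtering a finite list this way gives the same result as the intersection only because there are no cycles in the monodix language.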

But HFST represents compounding as a transition from the end of the singular noun to the beginning of the (noun) transducer, so a compounding HFST actually looks like

((cake|beer) n sg)*(cake|beer) n (sg|pl)

The intersection of this with

(beer|wine) n .*

is

beer n (sg|pl) | beer n sg ((cake|beer) n sg)* (cake|beer) n (sg|pl)

when it should have been

(beer n sg)* beer n (sg|pl)
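The over-generation can be checked by brute force. The sketch below is a toy model in Python, not HFST code: it enumerates the compounding language up to two parts, applies the prefix-style bidix filter, and compares the result with a filter that requires every compound part to be in the bidix. All names (`compound_analyses`, `naive_trim`, `wanted_trim`) are hypothetical.

```python
import re
from itertools import product

NOUNS = ["cake", "beer"]          # monodix nouns
BIDIX_LEMMAS = {"beer", "wine"}   # bidix nouns
BIDIX_PREFIX = re.compile(r"(beer|wine) n( .*)?")  # (beer|wine) n .*

def compound_analyses(max_parts):
    """Enumerate ((cake|beer) n sg)* (cake|beer) n (sg|pl) up to max_parts parts."""
    out = []
    for k in range(1, max_parts + 1):
        for lemmas in product(NOUNS, repeat=k):
            for number in ("sg", "pl"):
                # Non-final parts are always singular; the last part carries the number.
                parts = [f"{lemma} n sg" for lemma in lemmas[:-1]]
                parts.append(f"{lemmas[-1]} n {number}")
                out.append(" ".join(parts))
    return out

def naive_trim(analyses):
    """Prefix-style intersection: only the first part is checked against the bidix."""
    return [a for a in analyses if BIDIX_PREFIX.fullmatch(a)]

def wanted_trim(analyses):
    """Keep an analysis only if the lemma of every part is in the bidix."""
    # Each part is three tokens (lemma, n, number), so lemmas sit at every third index.
    return [a for a in analyses
            if all(lemma in BIDIX_LEMMAS for lemma in a.split()[0::3])]

analyses = compound_analyses(2)
naive = naive_trim(analyses)
wanted = wanted_trim(analyses)
leaked = [a for a in naive if a not in wanted]
print(leaked)  # ['beer n sg cake n sg', 'beer n sg cake n pl']
```

The leaked analyses are exactly the compounds whose first part is in the bidix but whose later part (cake) is not.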

Lttoolbox doesn't represent compounding by extra circular transitions, but instead by a special restart symbol that is interpreted while analysing. lt-trim is able to understand compounds by simply skipping the compound tags.
