Difference between revisions of "Talk:Automatically trimming a monodix"
Line 1: | Line 1: | ||
− | == Compounds vs trimming in HFST == |
||
− | |||
− | The sme.lexc can't be trimmed using the simple HFST trick, due to compounds. |
||
− | |||
− | Say you have '''cake n sg''', '''cake n pl''', '''beer n pl''' and '''beer n sg''' in monodix, while bidix has '''beer n''' and '''wine n'''. The HFST method without compounding is to intersect '''(cake|beer) n (sg|pl)''' with '''(beer|wine) n .*''' to get '''beer n (sg|pl)'''. |
||
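A minimal sketch of the non-compounding case, using plain Python sets of strings in place of real HFST transducers. The names `monodix`, `bidix_prefixes` and `trim` are made up for this illustration and are not part of HFST or lttoolbox.

```python
from itertools import product

# Toy monodix: (cake|beer) n (sg|pl), written out as a finite set of analyses.
monodix = {
    f"{lemma} n {number}"
    for lemma, number in product(["cake", "beer"], ["sg", "pl"])
}

# Toy bidix: (beer|wine) n .*  -- only lemma and part of speech are fixed,
# anything may follow.
bidix_prefixes = {"beer n", "wine n"}


def trim(analyses, prefixes):
    """Keep the analyses that start with a bidix prefix (hypothetical helper)."""
    return {
        a for a in analyses
        if any(a == p or a.startswith(p + " ") for p in prefixes)
    }


trimmed = trim(monodix, bidix_prefixes)
print(sorted(trimmed))  # ['beer n pl', 'beer n sg']
```

Only the two '''beer''' analyses survive, which matches '''beer n (sg|pl)''' above.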
− | |||
− | But HFST represents compounding as a transition from the end of the singular noun to the beginning of the (noun) transducer, so a compounding HFST actually looks like |
||
− | : '''((cake|beer) n sg)*(cake|beer) n (sg|pl)''' |
||
− | The intersection of this with |
||
− | : '''(beer|wine) n .*''' |
||
− | is |
||
− | : '''beer n sg ((cake|beer) n sg)*(cake|beer) n (sg|pl) | beer n (sg|pl)''' |
||
− | when it should have been |
||
− | : '''(beer n sg)*(beer n (sg|pl))''' |
||
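The compounding problem can be reproduced with a toy enumeration. This is only a sketch under stated assumptions: compounds are limited to two parts, analyses are lists of (lemma, number) pairs rather than HFST paths, and all function names are invented for the example. The naive trim checks only the first part, as a prefix intersection does, while the wanted trim requires every part to be in the bidix.

```python
from itertools import product

LEMMAS = ["cake", "beer"]        # monodix noun lemmas
BIDIX_LEMMAS = {"beer", "wine"}  # bidix noun lemmas


def compound_analyses(max_parts):
    """Enumerate ((cake|beer) n sg)*(cake|beer) n (sg|pl) up to max_parts parts."""
    analyses = []
    for n_parts in range(1, max_parts + 1):
        for lemmas in product(LEMMAS, repeat=n_parts):
            for number in ("sg", "pl"):
                # Non-final parts are always singular; only the last part varies.
                parts = [(lemma, "sg") for lemma in lemmas[:-1]]
                parts.append((lemmas[-1], number))
                analyses.append(parts)
    return analyses


def render(parts):
    return " ".join(f"{lemma} n {number}" for lemma, number in parts)


def naive_trim(analyses):
    """Intersection with (beer|wine) n .* : only the first part is checked."""
    return [a for a in analyses if a[0][0] in BIDIX_LEMMAS]


def wanted_trim(analyses):
    """What trimming should do: every compound part must be in the bidix."""
    return [a for a in analyses if all(lemma in BIDIX_LEMMAS for lemma, _ in a)]


analyses = compound_analyses(2)
naive = {render(a) for a in naive_trim(analyses)}
wanted = {render(a) for a in wanted_trim(analyses)}
leaked = sorted(naive - wanted)
print(leaked)  # ['beer n sg cake n pl', 'beer n sg cake n sg']
```

The leaked analyses are compounds whose first part is in the bidix but whose later part ('''cake''') is not, which is why the simple intersection is not enough for a compounding transducer.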
− | |||
− | |||
− | Lttoolbox doesn't represent compounding by extra circular transitions, but instead by a special restart symbol interpreted while analysing. |
||
− | lt-trim is able to understand compounds by simply skipping the compound tags. |