Difference between revisions of "Lexical selection in target language"
Latest revision as of 21:06, 1 December 2013
- This module deals with lexical selection; for more information on the topic, see the main page.
This discussion page is deprecated as the functionality now exists.
With apertium-multiple-translations it is possible to get ambiguous text output from transfer. It comes in the following form:
- Language is everyone declaring clear be the capacity to get education through the medium Welsh after increase {constantly|steadily|constant|steady} in the {recent|last} years , and be your government wishing to promote that objective
Original sentence:
- Mae iaith pawb yn datgan yn glir fod y gallu i gael addysg drwy gyfrwng y Gymraeg wedi cynyddu’n gyson yn y blynyddoedd diwethaf , a bod eich llywodraeth yn dymuno hybu’r amcan hwnnw
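To work with the ambiguous output shown above, it first has to be split into plain text and groups of alternatives. The following is a minimal sketch, assuming that groups are written as `{a|b|c}`, are never nested, and need no escaping; the function name `split_ambiguous` is hypothetical and not part of apertium-multiple-translations.

```python
import re

# Matches one ambiguity group such as "{recent|last}" (assumed format).
GROUP = re.compile(r"\{([^{}]*)\}")

def split_ambiguous(text):
    """Split text into plain strings and lists of alternatives, in order."""
    segments = []
    pos = 0
    for match in GROUP.finditer(text):
        plain = text[pos:match.start()].strip()
        if plain:
            segments.append(plain)
        # The alternatives inside the braces are separated by "|".
        segments.append(match.group(1).split("|"))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(tail)
    return segments

example = ("after increase {constantly|steadily|constant|steady} "
           "in the {recent|last} years")
for segment in split_ambiguous(example):
    print(segment)
```

Each list in the result marks one point where a choice has to be made; the strings around it provide the context.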
While the multiple-translation output is useful as it is, one thing we can do with it is rank the options using a target-language model. Each of the options given is acceptable, but some sound more fluent than others.
The ambiguous entries come from a target-language polysemia dictionary which looks something like:
    <e><p><l>constant<s n="adj"/></l><r>constant<s n="adj"/></r></p></e> 
    <e><p><l>constant<s n="adj"/></l><r>constantly<s n="adv"/></r></p></e> 
    <e><p><l>constant<s n="adj"/></l><r>steady<s n="adj"/></r></p></e> 
    <e><p><l>constant<s n="adj"/></l><r>steadily<s n="adv"/></r></p></e> 
    <e><p><l>last<s n="adj"/></l><r>last<s n="adj"/></r></p></e> 
    <e><p><l>last<s n="adj"/></l><r>recent<s n="adj"/></r></p></e> 
This dictionary could be generated automatically from an existing bidix (by taking the restrictions), derived from a thesaurus, or written by hand.
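As an illustration of the thesaurus route, entries in the format shown above could be produced from a simple mapping. This is only a sketch: the `thesaurus` structure and the `entry` helper are assumptions made for the example, not an existing Apertium tool.

```python
from xml.sax.saxutils import escape, quoteattr

def entry(left, left_tag, right, right_tag):
    """Build one dictionary entry in the format used above."""
    return "<e><p><l>{}<s n={}/></l><r>{}<s n={}/></r></p></e>".format(
        escape(left), quoteattr(left_tag),
        escape(right), quoteattr(right_tag))

# Hypothetical thesaurus: (lemma, tag) -> list of (alternative, tag).
thesaurus = {
    ("constant", "adj"): [("constant", "adj"), ("constantly", "adv"),
                          ("steady", "adj"), ("steadily", "adv")],
    ("last", "adj"): [("last", "adj"), ("recent", "adj")],
}

for (lemma, tag), alternatives in thesaurus.items():
    for alt_lemma, alt_tag in alternatives:
        print(entry(lemma, tag, alt_lemma, alt_tag))
```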
To do target-language-based scoring, first build a very basic n-gram model, for example counts of 1- to 5-grams over a target-language corpus. Each line holds the frequency, the n-gram order, and the n-gram itself. It might look something like this:
    $ cat test.ngrams | head
    3086,1,last
    1157,2,the last
    1128,1,recent
    703,1,recently
    501,2,last year
    301,2,in recent
    277,2,recent years
    250,2,the recent
    231,1,constantly
    225,3,in the last
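A file in this `count,n,ngram` layout could be produced along the following lines. This is a minimal sketch, assuming whitespace tokenisation and one sentence per line; the function names are hypothetical and the tool that produced `test.ngrams` may work differently.

```python
from collections import Counter

def count_ngrams(lines, min_n=1, max_n=5):
    """Count every n-gram of order min_n..max_n in the given lines."""
    counts = Counter()
    for line in lines:
        tokens = line.split()  # assumption: tokens are whitespace-separated
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

def format_counts(counts):
    """Return 'count,n,ngram' rows, most frequent first."""
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ["{},{},{}".format(c, len(g.split()), g) for g, c in rows]

# Tiny stand-in corpus; a real model needs far more text.
corpus = ["in the last years", "in the last year", "in recent years"]
for row in format_counts(count_ngrams(corpus))[:5]:
    print(row)
```

Because the n-gram is the last field, a reader can split each row on the first two commas only, which keeps n-grams that themselves contain a comma intact.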
Then you run the ambiguous text through a ranker, which works on a window of ambiguity:
    231.0 Language is everyone declaring clear be the capacity to get education through the medium Welsh after increase constantly
    30.0 Language is everyone declaring clear be the capacity to get education through the medium Welsh after increase steadily
    177.0 Language is everyone declaring clear be the capacity to get education through the medium Welsh after increase constant
    31.0 Language is everyone declaring clear be the capacity to get education through the medium Welsh after increase steady
    1703.0 in the recent
    6075.0 in the last
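The scoring formula used by the ranker is not given here, so the following sketch makes an assumption: the score of a candidate is the sum of the counts of all n-grams that end at the candidate word, given the words to its left. It will not necessarily reproduce the figures above; `score` and `rank` are hypothetical names.

```python
def score(context, candidate, counts, min_n=1, max_n=5):
    """Sum the counts of all n-grams ending at the candidate (assumed scoring)."""
    tokens = context + [candidate]
    total = 0.0
    for n in range(min_n, max_n + 1):
        if n > len(tokens):
            break
        total += counts.get(" ".join(tokens[-n:]), 0)
    return total

def rank(context, candidates, counts, min_n=1, max_n=5):
    """Return (score, candidate) pairs, best first; ties keep input order."""
    scored = [(score(context, c, counts, min_n, max_n), c) for c in candidates]
    return sorted(scored, key=lambda pair: -pair[0])

# Counts taken from the sample n-gram listing; unseen n-grams count as 0.
counts = {"last": 3086, "the last": 1157, "in the last": 225,
          "recent": 1128, "the recent": 250}

for value, word in rank(["in", "the"], ["recent", "last"], counts):
    print(value, "in the", word)
```

Raising `min_n` to 2 ignores unigram frequency altogether, so a word that is common on its own but rare in this context no longer wins by default.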
At each stage you choose the most likely option and construct the final sentence as you go along. The final output for 1--5 grams is:
- Language is everyone declaring clear be the capacity to get education through the medium Welsh after increase constantly in the last years , and be your government wishing to promote that objective
and for 2--5 grams:
- Language is everyone declaring clear be the capacity to get education through the medium Welsh after increase steadily in the last years , and be your government wishing to promote that objective
Original translation, before ranking:
- Language is everyone declaring clear be the capacity to get education through the medium Welsh after increase constant in the last years , and be your government wishing to promote that objective
This is a minor improvement, but one that could be taken further with more work.
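The left-to-right procedure described above (pick the best option at each ambiguity, then continue with that choice as context) can be sketched as follows. The scoring is again an assumption (sum of the counts of n-grams ending at the candidate), the `{a|b}` syntax is assumed to be unnested, and the count for "increase steadily" is a made-up value chosen to show how dropping unigrams can change the result.

```python
import re

GROUP = re.compile(r"\{([^{}]*)\}")  # one group such as "{recent|last}"

def score(context, candidate, counts, min_n, max_n):
    """Sum the counts of n-grams ending at the candidate (assumed scoring)."""
    tokens = context + candidate.split()
    return sum(counts.get(" ".join(tokens[-n:]), 0)
               for n in range(min_n, min(max_n, len(tokens)) + 1))

def choose_greedily(text, counts, min_n=1, max_n=5):
    """Resolve each ambiguity group in turn, left to right."""
    output = []
    pos = 0
    for match in GROUP.finditer(text):
        output.extend(text[pos:match.start()].split())
        options = match.group(1).split("|")
        # max() keeps the first option on ties, i.e. the listed order.
        best = max(options,
                   key=lambda o: score(output, o, counts, min_n, max_n))
        output.extend(best.split())
        pos = match.end()
    output.extend(text[pos:].split())
    return " ".join(output)

counts = {
    "constantly": 231, "steadily": 30, "constant": 177, "steady": 31,
    "increase steadily": 3,  # hypothetical count, for illustration only
    "last": 3086, "the last": 1157, "in the last": 225,
    "recent": 1128, "the recent": 250,
}
text = ("after increase {constantly|steadily|constant|steady} "
        "in the {recent|last} years")

print(choose_greedily(text, counts, min_n=1))
print(choose_greedily(text, counts, min_n=2))
```

A greedy pass never revisits an earlier choice; scoring whole candidate sentences or using a beam would be the obvious next step, at a higher cost.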

