Difference between revisions of "User:Francis Tyers/Sandbox"

From Apertium

Revision as of 16:14, 7 October 2009

Lexical selection


Information

  • Surface form -- tud etc.
  • Lemma -- den etc.
  • Category -- n.f etc.
  • Syntax -- @SUBJ etc.

Ideas

For some things, linguistic knowledge works better or is easier to apply, and it is also better for hacking. For other things, statistics are better: they give wider coverage more cheaply. The lexical selection module(s) should therefore allow both rules and statistics: rules for things we "know", statistics for those we don't.

Inferring rules from collocations

Rules as described below are already used in apertium-cy-en, apertium-br-fr and apertium-sme-smj. This stage would be the first pass of lexical selection.

  • The bilingual dictionary has several translations for each ambiguous word.
  • Rules are created to select between them based on context.
  • For each word in the bilingual dictionary, collocations (n-grams) are extracted from a source language corpus.

+ in, skyldi ég þá á munúð hyggja, þar sem  bóndi minn er einnig gamall?``
+ ,Drottinn hefir séð raunir mínar. Nú mun  bóndi minn elska mig.``
  þunguð og ól son. Þá sagði hún: ,,Nú mun  bóndi minn loks hænast að mér, því að é
   ,,Guð hefir gefið mér góða gjöf. Nú mun  bóndi minn búa við mig, því að ég hefi 
  af, þá haldi hann bótum uppi, slíkum sem  bóndi konunnar kveður á hann, og greiði
  l niður fyrir húsdyrum mannsins, þar sem  bóndi hennar var inni, og lá þar, uns b
-                                27  En er  bóndi hennar reis um morguninn og lauk 
+ kubúinn hafi soltið til þess að franskur  bóndi þurfi ekki
  • For each ambiguous word, these collocations are run through the translator once for each of the word's entries in the bilingual dictionary.
@þar sem  my  farmer is 
Now difference my  farmer @elska
Now difference my  farmer the lid
Now difference my  farmer live
*slíkum #as  the woman's  farmer  composes
@þar sem her  farmer  was
But is  her  farmer  rose
to French  farmer need not
@þar sem  my  husband is
Now difference my  husband @elska
Now difference my  husband the lid
Now difference my  husband live
*slíkum #as  the woman's  husband  composes
@þar sem her  husband  was
But is  her  husband  rose
to French  husband need not
  • Translations are scored on a target language corpus. In some cases the target language corpus would need to be preprocessed first, for example to show each word in its POS or syntactic context: n _farmer_ prn.pos, n _husband_ prn.pos, etc. The number of target words to score would be limited to the number of correspondences in the bilingual dictionary.
  • Where the difference in score between one translation and another reaches a threshold, a rule is created in the form of:
    • MAP (husband) ("bóndi") IF (1 ("minn"));
  • Morphology or syntax could also be included.
    • MAP (husband) ("bóndi") IF (1 PrnPos);
    • MAP (husband) ("bóndi") IF (-1 Genitive);
  • It would be interesting to see whether rules can be learnt that use different discriminators (e.g. surface form, syntax).
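The steps above can be sketched roughly as follows. This is only an illustration in Python, not an existing Apertium tool: the function names `contexts` and `induce_rules`, the toy scores and the threshold are all assumptions, and the scoring of translations on a target language corpus is replaced by hand-supplied numbers.

```python
def contexts(sentences, word, offset=1):
    """Collect the token found `offset` positions away from each occurrence of `word`."""
    found = []
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            j = i + offset
            if tok == word and 0 <= j < len(toks):
                found.append(toks[j])
    return found


def induce_rules(lemma, scores, threshold, offset=1):
    """Emit a MAP rule for each context whose best translation beats the
    runner-up by at least `threshold`.

    `scores` maps a context word to {translation: score}; in the real
    procedure these scores would come from a target language corpus.
    """
    rules = []
    for ctx, by_translation in sorted(scores.items()):
        ranked = sorted(by_translation.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) < 2:
            continue  # nothing to choose between
        (best, s1), (_, s2) = ranked[0], ranked[1]
        if s1 - s2 >= threshold:
            rules.append(f'MAP ({best}) ("{lemma}") IF ({offset} ("{ctx}"));')
    return rules


# Shortened versions of the concordance lines above.
sentences = [
    "Nú mun bóndi minn elska mig",
    "þar sem bóndi hennar var inni",
    "franskur bóndi þurfi ekki",
]
print(contexts(sentences, "bóndi"))

# Hypothetical scores, invented for the example.
toy_scores = {
    "minn": {"husband": 0.9, "farmer": 0.2},
    "þurfi": {"husband": 0.4, "farmer": 0.5},
}
for rule in induce_rules("bóndi", toy_scores, threshold=0.5):
    print(rule)
```

With these invented scores only the "minn" context clears the threshold, so a single rule is produced; the "þurfi" context is left undecided and would fall through to a later stage.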
Advantages
  • Fairly straightforward -- the rules can be created automatically in constraint grammar.
  • Human readable / editable.
  • Doesn't require parallel corpus -- although might work better with one.
  • Unsupervised.
Disadvantages
  • A large number of rules will be slow to apply.
  • Might not work very well.
Relevant prior work
  • Jin Yang (1999) "Towards the Automatic Acquisition of Lexical Selection Rules"
  • Eckhard Bick (2005) "Dan2eng: Wide-Coverage Danish-English Machine Translation"
Examples

Pediñ can translate as 'prier' ("to pray") or 'inviter' ("to invite"). Used transitively it means 'inviter'; used intransitively it means 'prier'.

  • o huñvreal muioc'h eget o pediñ .
    • Leur *huñvreal plus que en train de prier .
  • Koulskoude e tiviz Francis pediñ e zaou vreur d'ober ...
    • Pourtant il décide Francis prier ses deux frères à faire ...
  • O fal a zo pediñ arzourien a bep seurt evel kizellerien
    • Leur objectif il est inviter des artistes de toute sorte comme les sculpteurs
  • ... bleunioù ha peadra da yac'haat o zreid hag o pediñ evito ...
    • ... de fleurs et des moyens à guérir leurs pieds et en train de prier pour eux ...
  • ha tu a oa bet d'al labourerien pediñ o familhoù hag o mignoned
    • ... et il y avait moyen été aux travailleurs prier leurs familles et leurs amis ...
  • Raktresoù all a zo ivez : pediñ skrivagnerien a-benn eskemm ganto
    • ... de Projets autres il est aussi : inviter des écrivains pour échanger avec eux ...
  • Sharon Stone eo bet an hini gwellañ evit pediñ an embregerien da zisammañ
    • *Sharon *Stone il a été les ceux le plus mieux pour prier les entrepreneurs à décharger ...

The current rule is SUBSTITUTE (vblex) (vblex tv) ("pediñ" vblex) (1C NC); that is, "choose 'inviter' if the next word can only be a common noun". This fails when the object is a definite NP such as o familhoù 'their families', because the word immediately after pediñ is then a determiner rather than a common noun.
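The behaviour of the careful context can be modelled with a small Python toy. The tag sets below are hand-assigned for illustration and are not taken from the actual Breton analyser, and `careful_match` is a hypothetical helper rather than part of any constraint grammar implementation.

```python
def careful_match(readings, required_tag):
    """Model of a careful (C) context test: true only if every reading
    of the token carries the required tag."""
    return bool(readings) and all(required_tag in r for r in readings)


# Hypothetical analyses of the word that follows "pediñ" (tags are assumptions).
next_word = {
    "arzourien": [{"n", "NC"}],              # 'artists': unambiguous common noun
    "o": [{"det", "pos"}, {"prn", "obj"}],   # possessive determiner or object pronoun
    "an": [{"det", "def"}],                  # definite article
}

for form, readings in next_word.items():
    choice = "inviter" if careful_match(readings, "NC") else "prier"
    print(f"pediñ {form} ... -> {choice}")
```

Only the bare common noun triggers 'inviter' in this model. The contexts beginning with o or an fall back to 'prier', which matches the incorrect translations in the examples above.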