User:Francis Tyers/Sandbox

From Apertium

Lexical selection


Information

  • Surface form -- tud etc.
  • Lemma -- den etc.
  • Category -- n.f etc.
  • Syntax -- @SUBJ etc.

Ideas

Inferring rules from collocations

  • The bilingual dictionary has several translations for each ambiguous word.
  • Rules are created to select between them based on context.
  • For each ambiguous word in the bilingual dictionary, collocations (n-grams) containing it are extracted from a source-language corpus, for example:
    • reisa þetta hús og fullgjöra
    • niður þetta hús Guðs í
    • gjört fyrir hús Guðs himnanna
    • inn í hús Semaja Delajasonar
  • For each ambiguous word, every collocation is run through the translator once per candidate translation in the bilingual dictionary.
  • The resulting translations are scored against a target-language corpus.
  • Where the difference in score between one translation and another reaches a threshold, a rule is created of the form:
    • MAP (sense1) ("hús") IF (1 ("Guðs"));
  • Syntax could also be included.
    • MAP (sense1) ("hús") IF (1 (@SUBJ));
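
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing Apertium tool: the function names, the word-for-word GLOSS table that stands in for the translator, the bigram-count scorer that stands in for scoring on a target-language corpus, and the tiny English "corpus" are all invented for the example. Only the first word to the right of the ambiguous word is used as context, matching the (1 ...) test in the rules above.

```python
from collections import Counter, defaultdict


def extract_collocations(sentences, word, window=2):
    """Return (left, right) context tuples for every occurrence of `word`."""
    hits = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == word:
                left = tuple(tokens[max(0, i - window):i])
                right = tuple(tokens[i + 1:i + 1 + window])
                hits.append((left, right))
    return hits


def bigram_scorer(target_tokens):
    """Build a scorer that sums target-corpus bigram counts (a crude stand-in)."""
    counts = Counter(zip(target_tokens, target_tokens[1:]))

    def score(sentence):
        return sum(counts[bg] for bg in zip(sentence, sentence[1:]))

    return score


def infer_rules(word, senses, hits, translate, score, threshold):
    """Emit a MAP rule per right-hand neighbour where one sense clearly wins."""
    totals = defaultdict(lambda: defaultdict(int))
    for left, right in hits:
        if not right:
            continue  # nothing follows the word, so there is no (1 ...) context
        for tag, target_word in senses.items():
            totals[right[0]][tag] += score(translate(left, target_word, right))
    rules = []
    for neighbour in sorted(totals):
        # Rank senses by descending score; ties are broken by tag name.
        ranked = sorted(totals[neighbour].items(), key=lambda kv: (-kv[1], kv[0]))
        if len(ranked) > 1 and ranked[0][1] - ranked[1][1] >= threshold:
            rules.append(f'MAP ({ranked[0][0]}) ("{word}") IF (1 ("{neighbour}"));')
    return rules


# Hypothetical toy data: a word-for-word gloss instead of the real translator.
GLOSS = {"þetta": "this", "í": "in", "Guðs": "of God", "Semaja": "of Shemaiah"}


def toy_translate(left, target_word, right):
    """Gloss the context word for word and insert the candidate translation."""
    words = [GLOSS.get(t, t) for t in left] + [target_word] + [GLOSS.get(t, t) for t in right]
    return " ".join(words).split()


sentences = [
    "niður þetta hús Guðs í".split(),
    "inn í hús Semaja Delajasonar".split(),
]
target_corpus = (
    "this temple of God is old . he went in the house of Shemaiah . "
    "the temple of God . the house of Shemaiah"
).split()
senses = {"sense1": "temple", "sense2": "house"}  # invented sense inventory

hits = extract_collocations(sentences, "hús", window=1)
rules = infer_rules("hús", senses, hits, toy_translate,
                    bigram_scorer(target_corpus), threshold=1)
print("\n".join(rules))
```

With this toy data a rule is produced only for the neighbour "Guðs"; for "Semaja" both candidates score the same, so the threshold is not reached and no rule is written. In practice the gloss table would be replaced by the translator itself and the bigram counts by a proper target-language model.
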
Advantages
  • Fairly straightforward -- the rules can be created automatically in constraint grammar.
  • Human readable / editable.
  • Doesn't require a parallel corpus.
  • Unsupervised
Disadvantages
  • A large rule set will be slow to apply.
  • Might not work very well.