Difference between revisions of "User:Francis Tyers/Sandbox"
Revision as of 11:38, 7 October 2009
Lexical selection
Information
- Surface form -- tud etc.
- Lemma -- den etc.
- Category -- n.f etc.
- Syntax -- @SUBJ etc.
Ideas
Inferring rules from collocations
- The bilingual dictionary has several translations for each ambiguous word.
- Rules are created to select between them based on context.
- For each word in the bilingual dictionary, collocations (n-grams) are extracted from a source language corpus.
- reisa þetta hús og fullgjöra
- reisa þetta hús og fullgjöra
- niður þetta hús Guðs í
- gjört fyrir hús Guðs himnanna
- inn í hús Semaja Delajasonar
- For each ambiguous word, each collocation is run through the translator once per candidate translation in the bilingual dictionary.
- Translations are scored on a target language corpus.
- Where the difference in score between one translation and another reaches a threshold, a rule is created in the form of:
MAP (sense1) ("hús") IF (1 ("Guðs"));
- Syntax could also be included.
MAP (sense1) ("hús") IF (1 @SUBJ);
- Advantages
- Fairly straightforward -- the rules can be created automatically in constraint grammar.
- Human readable / editable.
- Doesn't require parallel corpus -- although might work better with one.
- Unsupervised.
- Disadvantages
- A large rule set will be slow to apply.
- Might not work very well.
- Relevant prior work
- Jin Yang (1999) "Towards the Automatic Acquisition of Lexical Selection Rules"
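
The procedure under "Inferring rules from collocations" can be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing implementation: the function names, the window size, the `sense1`/`sense2` labels, and the toy `toy_translate` and `toy_score` stand-ins (in place of the real translator and a target language corpus score) are all hypothetical. It only generates rules conditioned on the word immediately to the right, as in the first example rule.

```python
from typing import Callable, List, Sequence, Tuple

# A window is the tuple of tokens plus the index of the ambiguous word in it.
Window = Tuple[Tuple[str, ...], int]


def extract_collocations(tokens: Sequence[str], target: str,
                         left: int = 2, right: int = 2) -> List[Window]:
    """Collect the n-gram window around every occurrence of `target`."""
    windows = []
    for i, token in enumerate(tokens):
        if token == target:
            start = max(0, i - left)
            windows.append((tuple(tokens[start:i + right + 1]), i - start))
    return windows


def infer_rules(target: str, senses: Sequence[str], windows: Sequence[Window],
                translate: Callable[[Tuple[str, ...], int, str], Tuple[str, ...]],
                score: Callable[[Tuple[str, ...]], float],
                threshold: float) -> List[str]:
    """Emit one rule per right-hand neighbour where the best sense wins clearly."""
    rules = []
    covered = set()  # neighbours that already have a rule
    for window, pos in windows:
        if pos + 1 >= len(window) or len(senses) < 2:
            continue  # no right-hand context, or nothing to choose between
        neighbour = window[pos + 1]
        if neighbour in covered:
            continue
        # Translate the window once per candidate sense and score each result.
        ranked = sorted(((score(translate(window, pos, s)), s) for s in senses),
                        reverse=True)
        (best, best_sense), (runner_up, _) = ranked[0], ranked[1]
        if best - runner_up >= threshold:
            covered.add(neighbour)
            rules.append(f'MAP ({best_sense}) ("{target}") IF (1 ("{neighbour}"));')
    return rules


# Hypothetical stand-ins for the translator and the target language corpus score.
def toy_translate(window, pos, sense):
    return window[:pos] + (sense,) + window[pos + 1:]


def toy_score(translated):
    return 2.0 if "sense1 Guðs" in " ".join(translated) else 0.5


corpus = "niður þetta hús Guðs í".split()
windows = extract_collocations(corpus, "hús")
rules = infer_rules("hús", ["sense1", "sense2"], windows,
                    toy_translate, toy_score, threshold=1.0)
print(rules)  # ['MAP (sense1) ("hús") IF (1 ("Guðs"));']
```

Passing the translator and scorer in as callables keeps the sketch independent of any particular toolchain; the threshold value would need tuning against real scores.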