Ideas for Google Summer of Code/Detect hidden unknown words
Detect hidden unknown words by using the probabilities of the HMM-based part-of-speech tagger in Apertium
Apertium dictionaries may have incomplete entries, that is, surface forms for which the dictionary does not provide all the possible lexical forms. The problem is that a surface form with at least one lexical form in the dictionary is never treated as unknown, and there is no way to know whether the set of lexical forms provided for it is complete.
The idea of this project is to use the transition and emission probabilities of the HMM-based part-of-speech tagger of Apertium to work out whether one or more entries in the morphological dictionary are missing. Missing entries usually correspond to open-class part-of-speech tags, i.e. nouns, verbs, adjectives, adverbs, etc.
Apertium's part-of-speech tagger is based on first-order hidden Markov models, which are implemented in class HMM (files: hmm.h and hmm.cpp). Given an input sentence, one can work out the most likely sequence of part-of-speech tags and use this information to suggest missing entries in the dictionaries. To do so, one can extend the set of part-of-speech tags provided for each surface form with the set of open-class tags before disambiguation. Note, however, that this implies dealing with new ambiguity classes, which may require some changes to the code.
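The approach described above can be illustrated with a small, self-contained sketch. It is not Apertium code and does not use the HMM class API: it is a toy first-order Viterbi decoder written in Python, and every name in it (OPEN_CLASS, HIDDEN_WEIGHT, TRANS, viterbi, suggest_missing) as well as all the probabilities are assumptions made up for illustration. It also simplifies the emission model: instead of emitting ambiguity classes, as the real tagger does, it gives weight 1.0 to tags found in the dictionary and a small hypothetical weight to the open-class tags that were added.

```python
import math

# Assumption: a toy set of open-class tags (the tag names follow Apertium style).
OPEN_CLASS = {"n", "vblex", "adj", "adv"}

# Assumption: emission weight of a tag that is NOT in the dictionary entry.
HIDDEN_WEIGHT = 0.1

# Assumption: probability floor for transitions missing from the table.
FLOOR = 1e-6

# Assumption: hand-written toy transition probabilities P(tag | previous tag).
# "sent" stands for the beginning of the sentence.
TRANS = {
    "sent":  {"det": 0.60, "n": 0.20, "vblex": 0.10, "adj": 0.05, "adv": 0.05},
    "det":   {"det": 0.01, "n": 0.70, "vblex": 0.01, "adj": 0.25, "adv": 0.03},
    "n":     {"det": 0.05, "n": 0.20, "vblex": 0.60, "adj": 0.05, "adv": 0.10},
    "vblex": {"det": 0.40, "n": 0.20, "vblex": 0.10, "adj": 0.10, "adv": 0.20},
    "adj":   {"det": 0.02, "n": 0.80, "vblex": 0.05, "adj": 0.10, "adv": 0.03},
    "adv":   {"det": 0.10, "n": 0.10, "vblex": 0.40, "adj": 0.30, "adv": 0.10},
}


def viterbi(weights, trans=TRANS, start="sent"):
    """Return the most likely tag sequence.

    weights: one dict per word, mapping each candidate tag to its
             emission weight.
    """
    # best[tag] = (log-probability, best path ending in tag)
    best = {start: (0.0, [])}
    for candidates in weights:
        new_best = {}
        for tag, weight in candidates.items():
            scored = []
            for prev, (logp, path) in best.items():
                p = trans.get(prev, {}).get(tag, FLOOR) * weight
                scored.append((logp + math.log(p), path + [tag]))
            new_best[tag] = max(scored, key=lambda item: item[0])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def suggest_missing(words, analyses):
    """Return (word, tag) pairs whose best tag is absent from the dictionary.

    analyses: one set of tags per word, as given by the dictionary.
    """
    weights = []
    for tags in analyses:
        # Extend every entry with the open-class tags before disambiguation.
        candidates = {tag: HIDDEN_WEIGHT for tag in OPEN_CLASS}
        candidates.update({tag: 1.0 for tag in tags})
        weights.append(candidates)
    path = viterbi(weights)
    return [(w, t) for w, t, tags in zip(words, path, analyses) if t not in tags]


# Toy dictionary in which "work" is listed as a verb only.
words = ["the", "work", "ends"]
analyses = [{"det"}, {"vblex"}, {"vblex"}]
print(suggest_missing(words, analyses))  # -> [('work', 'n')]
```

With these toy figures, the best path is det, n, vblex, so the noun reading of "work" is reported as a candidate missing entry. In a real implementation the extended tag sets would form new ambiguity classes for which the trained model has no emission probabilities, which is the difficulty noted above; the fixed HIDDEN_WEIGHT only stands in for whatever estimate is chosen.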
Further reading
- Sánchez-Martínez, F. (2008) "Appendix B: Hidden Markov models for part-of-speech tagging" of PhD thesis "Using unsupervised corpus-based methods to build rule-based machine translation systems", June, Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, Spain.