Automated extraction of lexical resources
(Thanks to spectie and jimregan for the input.)
Some ideas for (semi-)automatically extracting lexical resources from corpora.
Things we want to extract:
- Morphological analysers
- Constraint rules (sensible ones)
- Bilingual dictionaries
- Transfer rules
Morphological resource extraction
Closed categories such as pronouns, prepositions, or even very irregular verbs are extremely important because they are so frequent, but they are not that numerous. While it requires considerable effort and good knowledge of the language in question to generate dictionaries of these words, it is doable. Much of the remaining work is the largely mechanical task of extending this dictionary with open-class words, of which there are a great number.
Because the morphology of these open-class words is more regular, one usually has a reasonably small number of classes (what we would define as paradigms) that behave in the same way. For example, upon finding a new word in the corpus, one first has to discover whether it is a verb, a noun, etc.; then one has to discover its inflections and attributes such as gender and animacy. With this information in hand, we can add an entry for the word to the monolingual dictionary, under the correct paradigm.
Getting concrete, let's say I want to learn the plural forms of some words in Portuguese:
class1: carro -> carros
class2: hospital -> hospitais
If we have the classes already defined, when we process a corpus and find a new noun, we can generate its plural according to each class and check which of the forms is attested in the corpus. This would also work well for verb conjugations, declensions, etc. More generally, upon finding an unknown word, we can productively generate all its inflections according to every available paradigm and see which of them "fits" best.
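A minimal sketch of this paradigm-fitting idea, assuming suffix-based paradigms and a precomputed corpus frequency table (the paradigm encoding, words, and counts here are illustrative, not real Apertium data):

```python
# Each paradigm is encoded as (singular suffix, plural suffix).
PARADIGMS = {
    "class1": ("o", "os"),    # carro -> carros
    "class2": ("al", "ais"),  # hospital -> hospitais
}

def candidate_forms(lemma, corpus_freq):
    """Score each applicable paradigm by the attested corpus frequency
    of the plural form it predicts for the lemma."""
    scores = {}
    for name, (sg_suffix, pl_suffix) in PARADIGMS.items():
        if lemma.endswith(sg_suffix):
            plural = lemma[: -len(sg_suffix)] + pl_suffix
            scores[name] = (plural, corpus_freq.get(plural, 0))
    return scores

corpus_freq = {"carros": 120, "jornais": 35}
print(candidate_forms("jornal", corpus_freq))  # -> {'class2': ('jornais', 35)}
```

In the real tool the paradigms would come from the monolingual dictionary and the frequencies from the corpus itself; the paradigm whose predicted forms are best attested wins.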
Such a technique has been successfully used, for example, in the Wortschatz project (http://www.wortschatz.uni-leipzig.de) to detect inflection classes of German words (http://wortschatz.uni-leipzig.de/Papers/ToolForLex.pdf).
Things get slightly more complicated when the surface forms encountered admit ambiguous interpretations, as in English, where noun and verb inflection often coincide: car -> cars (noun) and give -> gives (verb) follow the same surface pattern. This could be tackled with constraints.
Constraint rule extraction
Language is always structured in some way, which restricts the word order of some classes. Of course, some languages have more restricted word order (English) than others (Russian), but all of them have some constraints, which we can use to gather information about an unknown word. More generally, we can gather information about a word by knowing which words (or classes) most often appear immediately before or after it.
Word order constraints sometimes allow us to disambiguate between classes; in English, for example, a noun can never come directly after a personal pronoun:
*I car something.
*He cars something-else.
I give something.
He sells cars.
A car ...
The car ...
*The give ...
Give it ...
*Car it ...
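A minimal sketch of this kind of constraint-based filtering, assuming a hand-written set of forbidden tag bigrams (the tag names and constraints are illustrative):

```python
# Forbidden tag bigrams: (tag of the previous word, candidate tag).
FORBIDDEN = {("PRN", "N"), ("DET", "V")}

def filter_readings(prev_tag, candidate_tags):
    """Keep only the candidate tags licensed by the previous word's tag."""
    return [t for t in candidate_tags if (prev_tag, t) not in FORBIDDEN]

# "cars" could be a noun or a verb; after a personal pronoun ("he cars ...")
# only the verb reading survives, after a determiner only the noun.
print(filter_readings("PRN", ["N", "V"]))  # -> ['V']
print(filter_readings("DET", ["N", "V"]))  # -> ['N']
```

In the envisioned framework the forbidden bigrams would be supplied by the dictionary engineer or inferred statistically from already-known words.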
Agreement rules allow us to discover things like number or gender (Portuguese):
(a|as|uma|duas) noun -> feminine
(o|os|um|dois) noun -> masculine
Preposition governing case:
na something-a -> animate noun
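The determiner-agreement idea above can be sketched as follows; the determiner sets match the Portuguese example, while the token stream and vote counting are illustrative assumptions:

```python
from collections import Counter

FEM_DETS = {"a", "as", "uma", "duas"}
MASC_DETS = {"o", "os", "um", "dois"}

def guess_gender(tokens, noun):
    """Vote on the gender of a noun from the determiners preceding it."""
    votes = Counter()
    for prev, word in zip(tokens, tokens[1:]):
        if word == noun:
            if prev in FEM_DETS:
                votes["f"] += 1
            elif prev in MASC_DETS:
                votes["m"] += 1
    return votes.most_common(1)[0][0] if votes else None

tokens = "o carro bateu em uma casa e a casa caiu".split()
print(guess_gender(tokens, "casa"))   # -> 'f'
print(guess_gender(tokens, "carro"))  # -> 'm'
```

A real implementation would of course work on tagged text and weigh the evidence, but the principle is the same: agreement contexts reveal the attributes of the unknown word.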
And so on. Such techniques are already used in Apertium, e.g. the preposition governing case mentioned above, but in an ad-hoc way. Our goal should then be to streamline the process, creating an easy-to-use framework for applying constraints to corpora in order to obtain information about words of interest, and later for gathering statistical evidence of hitherto unnoticed constraints from the example words we already have.
Bilingual dictionary extraction
I believe the main part of the work here would be some sensible way to extract GOOD alignments. NATools and GIZA generate translation probabilities; those can be a start, but I believe more has to be done. See Parallel_corpus_pruning for some related ideas.
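As a rough sketch of what alignment pruning might look like, assuming bidirectional translation-probability tables of the kind NATools or GIZA produce (the tables, word pairs, and threshold here are made up for illustration):

```python
def prune(pairs, p_st, p_ts, threshold=0.5):
    """Keep only (src, tgt) pairs that are highly probable in both
    translation directions."""
    return [
        (s, t) for s, t in pairs
        if p_st.get((s, t), 0) >= threshold and p_ts.get((t, s), 0) >= threshold
    ]

p_st = {("carro", "car"): 0.8, ("carro", "wagon"): 0.1}  # src -> tgt
p_ts = {("car", "carro"): 0.7, ("wagon", "carro"): 0.3}  # tgt -> src
print(prune([("carro", "car"), ("carro", "wagon")], p_st, p_ts))
# -> [('carro', 'car')]
```

Requiring agreement in both directions is one cheap way to discard noisy alignments before they reach the bilingual dictionary.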
Transfer rule extraction
Not much to add here, but good alignments will help here too.
What follows is a plan to leverage the phenomena described above to speed up the creation of monolingual and bilingual dictionaries for Apertium:
1. Given a large monolingual corpus and some examples of words already added under the correct paradigm, statistically infer constraint rules like the ones mentioned above. Allow the dictionary engineer to change them and add his own.
2. Given a large monolingual corpus and a list of paradigms and constraints, find words which unambiguously match one paradigm and add them to the dictionary. This would entail using/hacking the lttoolbox code. For ambiguous words, allow the engineer to check them manually. Alternatively, use the Google API, treating Google searches as a stand-in for a big corpus.
3. Since we now have more examples of words, we may repeat steps 1-2.
4. Given a bilingual corpus and a list of paradigms and constraints, word-align the corpus and then tag it with the correct paradigms (or the other way around).
5. Prune bad alignments, keeping only those sections with highly probable matches.
6. Add the lexical correspondences found to the bilingual dictionary and to the translation templates, allowing the engineer to edit them.
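The last step might look roughly like the following sketch, which emits entries in a simplified approximation of Apertium's bidix format; the alignment list, the paradigm tables, and the fixed noun tag are illustrative assumptions:

```python
def bidix_entries(alignments, src_paradigms, tgt_paradigms):
    """Turn reliable word alignments, filtered to words whose paradigm is
    already known on both sides, into simplified bidix-style entries."""
    entries = []
    for src, tgt in alignments:
        if src in src_paradigms and tgt in tgt_paradigms:
            entries.append(
                f'<e><p><l>{src}<s n="n"/></l><r>{tgt}<s n="n"/></r></p></e>'
            )
    return entries

alignments = [("carro", "car"), ("bateu", "hit")]
pt_paradigms = {"carro": "class1"}   # from the monolingual step
en_paradigms = {"car": "house__n"}
for entry in bidix_entries(alignments, pt_paradigms, en_paradigms):
    print(entry)
```

Pairs whose paradigm is unknown on either side ("bateu"/"hit" above) are held back for the engineer to review rather than added automatically.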
Throughout, the tools implemented should also have options allowing the engineer to prioritize precision or recall. As a side effect of the project, we should end up with an initial pt_BR-en translator. There will be three closely related tools: one for step 1, another for step 2, and another for steps 4-6.
These techniques have some limitations, though:
- I don't think we can learn derivational morphology this way
- Non-concatenative morphology is not supported, though that is the case with Apertium as a whole
- Morphologically complex languages like Turkish would also pose a problem
Even so, I think such a project would be quite useful.