Automated extraction of lexical resources
Revision as of 12:17, 1 April 2009
(Thanks for spectie and jimregan for the input)
Some ideas for (semi-)automatically extracting lexical resources from corpora.
Things we want to extract:
- Morphological analysers
- Constraint rules (sensible ones)
- Bilingual dictionaries
- Transfer rules
Morphological resource extraction
First, I should state that our main aim will be to extract information about the open categories, not the closed ones. While it would be interesting to try to learn everything from scratch, it would probably be counter-productive, if it is possible at all.
So, we leave closed-class items like prepositions, pronouns, and irregular (very frequent) verbs such as "to be" to be manually constructed, which should be doable. Our focus shall instead be on the less frequent, but regular and much more numerous verbs, nouns, adjectives, etc. Because their morphology is more regular, there is usually a reasonably small number of classes (each of which we would define as a paradigm) that behave in the same way.
Some examples should make it clearer:
Let's say I want to learn the plural forms of some words in Portuguese:
class1: carro -> carros
class2: hospital -> hospitais
If we have the classes already defined, then when we process a corpus and find a new noun, we can generate its plural according to each candidate class and check which of the forms is attested in the corpus.
This could also work well for verb conjugations and declensions. Things get slightly more complicated when the surface form encountered is ambiguous, as in English, where verbs often look like nouns: car - cars (noun) or give - gives (verb). This could be tackled with constraints...
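The paradigm-and-attestation idea above can be sketched in a few lines. The paradigm list and the toy corpus here are illustrative assumptions, not resources from this page:

```python
# Each paradigm maps a singular suffix to a plural suffix.
PARADIGMS = [
    ("al", "ais"),   # class2: hospital -> hospitais
    ("", "s"),       # class1: carro -> carros (default: append -s)
]

def candidate_plurals(singular):
    """Generate one plural hypothesis per matching paradigm."""
    for sg_suffix, pl_suffix in PARADIGMS:
        if singular.endswith(sg_suffix):
            yield singular[:len(singular) - len(sg_suffix)] + pl_suffix

def attested_plurals(singular, corpus_tokens):
    """Keep only the hypotheses actually seen in the corpus."""
    vocab = set(corpus_tokens)
    return [p for p in candidate_plurals(singular) if p in vocab]

corpus = "os hospitais e os carros da cidade".split()
print(attested_plurals("hospital", corpus))  # ['hospitais']
print(attested_plurals("carro", corpus))     # ['carros']
```

A real version would rank paradigms by how many of their generated forms are attested, rather than accepting the first hit.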
Constraints
Some stuff can, sometimes, only appear after some other stuff ;-) Really, what we are looking for is some small snippets of context that, because of the language's grammar, tell us what our word of interest is, or pin down some of its characteristics.
Some examples:
*I car something.
*He cars something-else.
I give something.
He sells cars.
A car ...
The car ...
*The give ...
Give it ...
*Car it ...
Discovering gender (Portuguese):
(a|as|uma|duas) noun -> feminine
(o|os|um|dois) noun -> masculine
Preposition governing case:
na something-a -> animate noun
And so on. The good thing is, the more examples of one class (paradigm) we have, the better we can learn such constraints automatically from corpora. A tool that would learn some of these constraints automatically would be quite useful, in conjunction with another tool that creates monolingual dictionaries from the generated constraints + human-created constraints.
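The gender-discovery pattern above amounts to scanning for a gender-marked determiner immediately before the noun and taking a majority vote. A rough sketch, where the determiner lists come from the patterns above and everything else (the function name, the toy text) is an illustrative assumption:

```python
import re
from collections import Counter

FEM_DETS = {"a", "as", "uma", "duas"}
MASC_DETS = {"o", "os", "um", "dois"}

def guess_gender(noun, corpus_text):
    """Count determiner+noun co-occurrences and return the majority gender."""
    votes = Counter()
    # Capture the word directly preceding each occurrence of the noun.
    for det in re.findall(r"(\w+)\s+" + re.escape(noun), corpus_text, re.IGNORECASE):
        if det.lower() in FEM_DETS:
            votes["feminine"] += 1
        elif det.lower() in MASC_DETS:
            votes["masculine"] += 1
    return votes.most_common(1)[0][0] if votes else None

text = "a casa era grande . uma casa nova . o carro e a casa ."
print(guess_gender("casa", text))   # feminine
print(guess_gender("carro", text))  # masculine
```

Voting matters because determiners are themselves ambiguous ("a" is also a preposition), so a single co-occurrence is weak evidence.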
Bilingual dictionaries
I believe the main part of the work here would be finding some sensible way to extract GOOD alignments. NATools and GIZA++ generate translation probabilities; those can be a start, but I believe more has to be done. See Parallel_corpus_pruning for some related ideas.
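One simple way to go beyond raw translation probabilities is to keep only pairs whose best translation is both probable and clearly ahead of the runner-up. A hypothetical sketch; the thresholds and the sample table are invented for illustration, not output of any real aligner:

```python
def extract_pairs(trans_probs, min_prob=0.4, min_margin=0.2):
    """Keep (src, tgt) pairs whose best translation is both probable
    and well separated from the second-best candidate."""
    pairs = []
    for src, cands in trans_probs.items():
        ranked = sorted(cands.items(), key=lambda kv: kv[1], reverse=True)
        best_tgt, best_p = ranked[0]
        second_p = ranked[1][1] if len(ranked) > 1 else 0.0
        if best_p >= min_prob and best_p - second_p >= min_margin:
            pairs.append((src, best_tgt))
    return pairs

table = {
    "carro": {"car": 0.8, "wagon": 0.1},
    "banco": {"bank": 0.45, "bench": 0.40},  # too close to call: pruned
}
print(extract_pairs(table))  # [('carro', 'car')]
```

The ambiguous entries this prunes ("banco") are exactly the ones worth routing to a human for semi-automatic review.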
Transfer rules
Not much to add here, but good alignments will help here too.