Difference between revisions of "Automated extraction of lexical resources"

From Apertium

Revision as of 03:29, 1 April 2009

(Thanks to spectie and jimregan for the input.)

Some ideas for (semi-)automatically extracting lexical resources from corpora.

Things we want to extract:

  1. Morphological analysers
  2. Constraint rules (sensible ones)
  3. Bilingual dictionaries
  4. Transfer rules


Morphological resource extraction

First, I should state that our main aim will be to extract information about the open categories, not the closed ones. While it would be interesting to try to learn everything from scratch, that would probably be counter-productive, if it is possible at all.

So, we leave things like prepositions, pronouns, and irregular (very frequent) verbs like to be to be constructed manually, which should be doable. Our focus shall instead be on less frequent, but regular and much more numerous verbs, nouns, adjectives, etc. Because their morphology is more regular, one usually has a reasonably small number of classes (for each of which we would define a paradigm) that behave in the same way.

Some examples should make it clearer:

Say I want to learn the plural forms of some words in Portuguese:

class1: carro -> carros
class2: hospital -> hospitais

If we have the classes already defined, then when we process a corpus and find a new noun, we can generate its plural under each class and check which of the forms is attested in the corpus.
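This attestation check can be sketched roughly as follows. The paradigm table and the toy corpus are made-up illustrations, not Apertium data structures; a real system would read paradigms from a monolingual dictionary and counts from a large corpus.

```python
from collections import Counter

# Hypothetical paradigm classes for Portuguese noun plurals.
# Each class maps a singular ending to a plural ending.
PARADIGMS = {
    "class1": ("o", "os"),     # carro -> carros
    "class2": ("al", "ais"),   # hospital -> hospitais
}

def candidate_plurals(singular):
    """Generate one plural candidate per paradigm class that matches."""
    candidates = {}
    for name, (sg_end, pl_end) in PARADIGMS.items():
        if singular.endswith(sg_end):
            candidates[name] = singular[:-len(sg_end)] + pl_end
    return candidates

def guess_class(singular, corpus_counts):
    """Pick the paradigm whose generated plural is attested most often."""
    attested = {name: corpus_counts[form]
                for name, form in candidate_plurals(singular).items()
                if corpus_counts[form] > 0}
    return max(attested, key=attested.get) if attested else None

corpus = Counter("o carro passou os carros pararam perto do hospital "
                 "e dois hospitais fecharam".split())
print(guess_class("carro", corpus))     # -> class1
print(guess_class("hospital", corpus))  # -> class2
```

When several paradigms match the singular, the corpus counts act as the vote between them.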

This could also work well for verb conjugations and declensions. Things get slightly more complicated when the surface form encountered is ambiguous, as in English, where verbs often look like nouns: car - cars (noun) vs. give - gives (verb). This could be tackled with constraints...

Constraints

Some words can, sometimes, only appear after certain other words ;-) Really, what we are looking for are small snippets that, because of the language's grammar, specify what our word of interest is, or define some of its characteristics.

Some examples (an asterisk marks an ungrammatical sequence):

*I car something.
*He cars something-else.
I give something.
He sells cars.
A car ...
The car ...
*The give ...
Give it ...
*Car it ...
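A minimal sketch of using such contexts as votes: the patterns below are invented stand-ins for real constraints, and the corpus is a toy string, but the idea is the one above, that "the car" is attested while "*car it" is not.

```python
import re

# Toy corpus; in practice this would be a large monolingual corpus.
CORPUS = "the car stopped . he sells cars . give it to me . i give something ."

# Hypothetical constraint patterns: contexts that only a noun or
# only a verb can fill (cf. "The car ..." vs. "*The give ...").
NOUN_CONTEXTS = [r"\bthe {w}\b", r"\ba {w}\b"]
VERB_CONTEXTS = [r"\b{w} it\b", r"\b(i|he|she|we|they) {w}s?\b"]

def vote(word, contexts):
    """Count how many constraint contexts attest the word in the corpus."""
    return sum(bool(re.search(p.format(w=word), CORPUS)) for p in contexts)

def classify(word):
    noun, verb = vote(word, NOUN_CONTEXTS), vote(word, VERB_CONTEXTS)
    if noun > verb:
        return "noun"
    if verb > noun:
        return "verb"
    return "unknown"

print(classify("car"))   # noun: "the car" is attested, "*car it" is not
print(classify("give"))  # verb: "give it" and "i give" are attested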

Discovering gender (Portuguese):

(a|as|uma|duas) noun (feminine)
(o|os|um|dois) noun (masculine)
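The gender cue above translates almost directly into a pattern match. This is a sketch under the assumption that a preceding agreeing determiner or numeral is a reliable vote; the corpus here is a toy string.

```python
import re

# Portuguese determiners/numerals that agree in gender with the
# noun they precede (from the patterns above).
FEM_DETS = r"(?:a|as|uma|duas)"
MASC_DETS = r"(?:o|os|um|dois)"

def guess_gender(noun, corpus):
    """Vote on a noun's gender from the determiners seen before it."""
    fem = len(re.findall(rf"\b{FEM_DETS} {noun}\b", corpus))
    masc = len(re.findall(rf"\b{MASC_DETS} {noun}\b", corpus))
    if fem > masc:
        return "f"
    if masc > fem:
        return "m"
    return None

corpus = "a casa é grande . duas casas . o carro é novo . um carro velho ."
print(guess_gender("casa", corpus))   # -> f
print(guess_gender("carro", corpus))  # -> m
```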


Preposition governing case:

na something-a  -> animate noun

And so on. The good thing is that the more examples of one class (paradigm) we have, the better we can learn such constraints automatically from corpora. A tool that learned some of these constraints automatically would be quite useful, in conjunction with another tool that creates monolingual dictionaries from the generated constraints plus human-created constraints.
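One very simple way such learning could start, sketched under obvious simplifying assumptions (a seed set of words known to belong to one class, a tokenised toy corpus, left-context words only):

```python
from collections import Counter

# Hypothetical seed words known to belong to one class (nouns),
# and a toy corpus; real input would be a tagged seed lexicon
# and a large monolingual corpus.
SEED_NOUNS = {"car", "house", "dog"}
corpus = ("the car stopped . a house burned . the dog barked . "
          "the house fell . a dog ran . he sells cars .").split()

def learn_left_contexts(seeds, tokens, min_count=2):
    """Collect words that frequently precede seed-class members;
    frequent ones become candidate constraints for that class."""
    counts = Counter(prev for prev, word in zip(tokens, tokens[1:])
                     if word in seeds)
    return {ctx for ctx, n in counts.items() if n >= min_count}

print(sorted(learn_left_contexts(SEED_NOUNS, corpus)))  # -> ['a', 'the']
```

The more seed words the class has, the more reliable the frequency cutoff becomes, which is exactly the point made above.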

Bilingual dictionaries

I believe the main part of the work here would be finding some sensible way to extract good alignments. NATools and GIZA generate translation probabilities; those can be a start, but I believe more has to be done. See Parallel_corpus_pruning for some related ideas.
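One common way to go from raw translation probabilities to dictionary-worthy pairs is to keep only pairs on which both translation directions agree with high probability. The tables below are invented numbers, not NATools or GIZA output formats; this only sketches the filtering idea.

```python
# Hypothetical translation-probability tables, as a word aligner
# might produce; keys are (source, target) word pairs.
p_src2tgt = {("casa", "house"): 0.81, ("casa", "home"): 0.12,
             ("carro", "car"): 0.90, ("carro", "the"): 0.04}
p_tgt2src = {("house", "casa"): 0.77, ("home", "casa"): 0.35,
             ("car", "carro"): 0.88, ("the", "carro"): 0.01}

def good_pairs(s2t, t2s, threshold=0.5):
    """Keep only pairs that both directions agree on with high
    probability -- a simple filter for bilingual-dictionary entries."""
    return sorted((s, t) for (s, t), p in s2t.items()
                  if p >= threshold and t2s.get((t, s), 0.0) >= threshold)

print(good_pairs(p_src2tgt, p_tgt2src))
# -> [('carro', 'car'), ('casa', 'house')]
```

The threshold trades dictionary coverage for precision; noisy pairs like ("carro", "the") are dropped because neither direction supports them strongly.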


Transfer rules

Not much to add here, but good alignments will help here too.