Automated extraction of lexical resources


(Thanks to spectie and jimregan for the input.)

Some ideas for (semi-)automatically extracting lexical resources from corpora.

Things we want to extract:

  1. Morphological analysers
  2. Constraint rules (sensible ones)
  3. Bilingual dictionaries
  4. Transfer rules


Morphological resource extraction

Closed categories such as pronouns, prepositions, or even very irregular verbs are extremely important because they are so frequent, but they are not that numerous. While it requires considerable effort and good knowledge of the language in question to write dictionaries for these words, it is doable. The real bulk of the work is then mechanically extending the dictionary with open-class words, of which there are a great number.

Because the morphology of these open-class words is more regular, one usually has a reasonably small number of classes (for each of which we would define a paradigm) that behave in the same way. For example, upon finding a new word in the corpus, we first have to discover whether it is a new verb, noun, etc.; then we have to discover its inflections and attributes such as gender and animateness. With this information in hand, we can add an entry for the word to the monolingual dictionary, under the correct paradigm.

Getting concrete, let's say I want to learn the plural forms of some words in Portuguese:

class1: carro -> carros
class2: hospital -> hospitais

If we have the classes already defined, when we process a corpus and find a new noun, we can generate its plural and check which of the forms is attested in the corpus. This would also work for verb conjugations, declensions, etc. More generally, upon finding an unknown word, we can productively generate all its inflections according to every available paradigm and see which of them "fits" best.
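To make the idea concrete, here is a minimal sketch in Python: the paradigm table, the toy corpus, and the scoring are all illustrative assumptions, not part of any existing Apertium tool.

 from collections import Counter
 
 # Hypothetical paradigms: each maps a singular ending to the ending
 # that replaces it in the plural (the two Portuguese classes above).
 PARADIGMS = {
     "class1": ("o", "os"),    # carro -> carros
     "class2": ("al", "ais"),  # hospital -> hospitais
 }
 
 def guess_paradigm(lemma, corpus_counts):
     """Generate the plural predicted by each paradigm and score it by
     how often the generated form is attested in the corpus."""
     scores = {}
     for name, (sg, pl) in PARADIGMS.items():
         if lemma.endswith(sg):
             plural = lemma[:-len(sg)] + pl
             scores[name] = (corpus_counts[plural], plural)
     # Prefer the paradigm whose generated form is attested most often.
     return max(scores.items(), key=lambda kv: kv[1][0]) if scores else None
 
 corpus = "o hospital fechou e os hospitais vizinhos receberam os doentes".split()
 print(guess_paradigm("hospital", Counter(corpus)))
 # -> ('class2', (1, 'hospitais'))

In a real tool the counts would come from a large analysed corpus and the paradigms from the existing monolingual dictionary, but the generate-and-check loop stays the same.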

Such a technique has been successfully used, for example, in the Wortschatz project (http://www.wortschatz.uni-leipzig.de), in order to detect inflection classes of German words (http://wortschatz.uni-leipzig.de/Papers/ToolForLex.pdf).

Things get slightly more complicated if the surface forms encountered give rise to ambiguous interpretations, as in English, where verbs often look like nouns: car - cars (noun) vs. give - gives (verb). This could be tackled with constraints.

Constraints

Language is always structured in some way, and this restricts the word order of certain classes. Of course, some languages have more restricted word order (English) than others (Russian), but all of them have some constraints, which we can use to gather information about an unknown word. More generally, we can gather information about a word by knowing which words (or classes) most often appear immediately before or after it.

Word order constraints sometimes allow us to disambiguate between classes; a noun can never come directly after a personal pronoun, for example (an asterisk marks an ungrammatical sequence):

*I car something.
*He cars something-else.
I give something.
He sells cars.
A car ...
The car ...
*The give ...
Give it ...
*Car it ...
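One rough way to exploit such contexts is to count which closed-class words precede an unknown form in a corpus. The sketch below assumes tiny hand-written lists of "noun-like" and "verb-like" left contexts; both the lists and the voting scheme are illustrative.

 from collections import Counter
 
 # Illustrative left contexts: determiners tend to precede nouns,
 # personal pronouns tend to precede finite verbs.
 NOUN_CONTEXTS = {"a", "the", "this", "my"}
 VERB_CONTEXTS = {"i", "you", "he", "she", "we", "they"}
 
 def guess_category(word, tokens):
     """Vote noun vs. verb based on the word immediately to the left."""
     votes = Counter()
     for prev, cur in zip(tokens, tokens[1:]):
         if cur == word:
             if prev in NOUN_CONTEXTS:
                 votes["noun"] += 1
             elif prev in VERB_CONTEXTS:
                 votes["verb"] += 1
     return votes.most_common(1)[0][0] if votes else "unknown"
 
 tokens = "he sells cars and i give the car to my brother".split()
 print(guess_category("car", tokens))   # -> noun (preceded by "the")
 print(guess_category("give", tokens))  # -> verb (preceded by "i")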

Agreement rules allow us to discover things like number or gender (Portuguese):

   (a|as|uma|duas) noun (feminine)
   (o|os|um|dois) noun (masculine)

Preposition governing case:

na something-a  -> animate noun
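A small sketch of the agreement idea for gender, matching the patterns above with regular expressions over raw Portuguese text; the vote counting is an assumption, and a real tool would need better tokenization and frequency thresholds.

 import re
 from collections import Counter
 
 # The agreement patterns above, as regular expressions over raw text.
 FEMININE = re.compile(r"\b(?:a|as|uma|duas)\s+(\w+)", re.IGNORECASE)
 MASCULINE = re.compile(r"\b(?:o|os|um|dois)\s+(\w+)", re.IGNORECASE)
 
 def guess_gender(word, text):
     """Count how often the word follows feminine vs. masculine
     determiners/numerals and return the more frequent reading."""
     votes = Counter()
     votes["f"] = sum(1 for w in FEMININE.findall(text) if w == word)
     votes["m"] = sum(1 for w in MASCULINE.findall(text) if w == word)
     if votes["f"] == votes["m"]:
         return "unknown"
     return votes.most_common(1)[0][0]
 
 text = "a casa e as casas; o carro e os carros; duas casas novas"
 print(guess_gender("casa", text))   # -> f
 print(guess_gender("carro", text))  # -> m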

And so on. Such techniques are already used in Apertium, e.g. the preposition governing case mentioned above, but in an ad-hoc way. Our goal should be to streamline the process, creating an easy-to-use framework for applying constraints to corpora in order to obtain information about words of interest, and later to provide statistical evidence of hitherto unnoticed constraints from the example words we already have.
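As a hint of what such statistical evidence could look like, one could count, for each paradigm, the words that most often appear immediately before its known members, and present the top candidates to the engineer as constraint suggestions. The function name, toy corpus, and paradigm labels below are made up for illustration.

 from collections import Counter, defaultdict
 
 def suggest_constraints(tokens, known_words, top_n=3):
     """For each paradigm, find the words that most often appear right
     before its known members; these become candidate constraints."""
     contexts = defaultdict(Counter)
     for prev, cur in zip(tokens, tokens[1:]):
         if cur in known_words:
             contexts[known_words[cur]][prev] += 1
     return {paradigm: counts.most_common(top_n)
             for paradigm, counts in contexts.items()}
 
 tokens = "o carro novo e os carros velhos ; a casa e as casas".split()
 known = {"carro": "noun-masc", "carros": "noun-masc",
          "casa": "noun-fem", "casas": "noun-fem"}
 print(suggest_constraints(tokens, known))
 # -> {'noun-masc': [('o', 1), ('os', 1)], 'noun-fem': [('a', 1), ('as', 1)]}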


Bilingual dictionaries

I believe the main part of the work here would be finding some sensible way to extract GOOD alignments. NATools and GIZA generate translation probabilities; those can be a start, but I believe more has to be done. See Parallel_corpus_pruning for some related ideas.
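For instance, starting from lexical translation probabilities of the kind GIZA or NATools produce, one could keep only the word pairs that score high in both translation directions. The dictionaries, numbers, and threshold below are invented; the table format is an assumption, not the actual tool output.

 # Hypothetical lexical translation probabilities in both directions,
 # in the spirit of what GIZA/NATools output; the numbers are invented.
 p_tgt_given_src = {("carro", "car"): 0.81, ("carro", "wagon"): 0.07}
 p_src_given_tgt = {("car", "carro"): 0.74, ("car", "automobile"): 0.11}
 
 def good_pairs(p_st, p_ts, threshold=0.5):
     """Keep only word pairs whose translation probability is above the
     threshold in both directions; score them by the weaker direction."""
     kept = []
     for (src, tgt), p in p_st.items():
         back = p_ts.get((tgt, src), 0.0)
         if p >= threshold and back >= threshold:
             kept.append((src, tgt, min(p, back)))
     return sorted(kept, key=lambda x: -x[2])
 
 print(good_pairs(p_tgt_given_src, p_src_given_tgt))
 # -> [('carro', 'car', 0.74)]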


Transfer rules

Not much to add here, but good alignments will help here too.

A plan

What follows is a plan to leverage the phenomena described above to speed up the creation of monolingual and bilingual dictionaries for Apertium:

  • Given a large monolingual corpus and some examples of words already assigned to the correct paradigm, statistically infer constraint rules like those mentioned above. Allow the dictionary engineer to change them and add his own.
  • Given a large monolingual corpus and a list of paradigms and constraints, find words which unambiguously match one paradigm, and add them to the dictionary (a small sketch of this step follows the list). This would entail using/hacking the lttoolbox code. For ambiguous words, allow the engineer to check them manually. Alternatively, use the Google API, treating Google searches as a stand-in for a big corpus.
  • Since we now have more examples of words, we might repeat 1-2.
  • Given a bilingual corpus and a list of paradigms and constraints, word-align the corpus and then tag it with the correct paradigms (or the other way around).
  • Prune bad alignments, keeping only the sections with highly probable matches.
  • Add the lexical correspondences found to the bilingual dictionary and to the translation templates, allowing the engineer to edit them.
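A minimal sketch of step 2, combining the paradigm-guessing and constraint ideas from above; the paradigms, determiner list, and toy corpus are all illustrative, and a real implementation would sit on top of lttoolbox rather than plain Python dictionaries.

 from collections import Counter
 
 # Toy paradigms (singular ending -> plural ending) and a toy constraint:
 # to be accepted as a noun, the word must occur after a determiner.
 PARADIGMS = {"class1": ("o", "os"), "class2": ("al", "ais")}
 DETERMINERS = {"o", "a", "os", "as", "um", "uma"}
 
 def classify(word, tokens):
     """Step 2 sketch: return the paradigms compatible with `word`: the
     generated plural must be attested and the constraint must hold."""
     counts = Counter(tokens)
     after_det = any(prev in DETERMINERS and cur == word
                     for prev, cur in zip(tokens, tokens[1:]))
     matches = [name for name, (sg, pl) in PARADIGMS.items()
                if word.endswith(sg) and after_det
                and counts[word[:-len(sg)] + pl] > 0]
     return matches  # one match: add automatically; several: ask the engineer
 
 tokens = "o hospital e os hospitais ficam perto do centro".split()
 print(classify("hospital", tokens))  # -> ['class2']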

Throughout, the tools implemented should also have options allowing the engineer to prioritize precision or recall. As a "collateral effect" of the project, we should get an initial pt_BR-en translator. We will have three (closely related) tools: one for step 1, another for step 2, and another for steps 4-6.

Notes

These techniques have some limitations, though:

  • I don't think we can learn derivational morphology this way
  • Non-concatenative morphology is not supported, though that is the case with Apertium as a whole
  • Morphologically complex languages like Turkish would also pose a problem

Even so, I think such a project would be quite useful.