Automated extraction of lexical resources


(Thanks to spectie and jimregan for the input.)

Some ideas for (semi-)automatically extracting lexical resources from corpora.

Things we want to extract:

  1. Morphological analysers
  2. Constraint rules (sensible ones)
  3. Bilingual dictionaries
  4. Transfer rules


Morphological resource extraction

Closed categories such as pronouns, prepositions, or even very irregular verbs are extremely important because they are so frequent, but they are not that numerous. While it requires considerable effort and good knowledge of the language in question to write dictionaries for these words, it is doable. The real bulk of the work is then mechanically extending the dictionary with open-class words, of which there are a great number.

Because the morphology of these open-class words is more regular, one usually has a reasonably small number of classes (for each of which we would define a paradigm) that behave in the same way. For example, upon finding a new word in the corpus, we first have to discover whether it is a new verb, noun, etc.; then we have to discover its inflections and attributes such as gender and animateness. With this information in hand, we can add an entry for the word to the monolingual dictionary, under the correct paradigm.

Getting concrete, let's say I want to learn the plural forms of some words in Portuguese:

class1: carro -> carros
class2: hospital -> hospitais

If we have the classes already defined, when we process a corpus and find a new noun, we can generate its plural and check which of the forms is attested in the corpus. This would also work for verb conjugations, declensions, etc. More generally, upon finding an unknown word, we can productively generate all its inflections according to every available paradigm and see which of them "fits" best.
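To make the idea concrete, here is a minimal sketch in Python: the paradigm table, the toy corpus, and the scoring are all illustrative assumptions, not part of any existing Apertium tool.

 from collections import Counter
 
 # Hypothetical paradigms: each maps a singular ending to the ending
 # that replaces it in the plural (the two Portuguese classes above).
 PARADIGMS = {
     "class1": ("o", "os"),    # carro -> carros
     "class2": ("al", "ais"),  # hospital -> hospitais
 }
 
 def guess_paradigm(lemma, corpus_counts):
     """Generate the plural predicted by each paradigm and score it by
     how often the generated form is attested in the corpus."""
     scores = {}
     for name, (sg, pl) in PARADIGMS.items():
         if lemma.endswith(sg):
             plural = lemma[:-len(sg)] + pl
             scores[name] = (corpus_counts[plural], plural)
     # Prefer the paradigm whose generated form is attested most often.
     return max(scores.items(), key=lambda kv: kv[1][0]) if scores else None
 
 corpus = "o hospital fechou e os hospitais vizinhos receberam os doentes".split()
 print(guess_paradigm("hospital", Counter(corpus)))
 # -> ('class2', (1, 'hospitais'))

In a real tool the counts would come from a large analysed corpus and the paradigms from the existing monolingual dictionary, but the generate-and-check loop stays the same.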

Such a technique has been successfully used, for example, in the Wortschatz project (http://www.wortschatz.uni-leipzig.de), in order to detect inflection classes of German words (http://wortschatz.uni-leipzig.de/Papers/ToolForLex.pdf).

Things get slightly more complicated if the surface forms encountered give rise to ambiguous interpretations, as in English, where verbs often look like nouns: car - cars (noun) vs. give - gives (verb). This could be tackled with constraints.

Constraints

Language is always structured in some way, and this restricts the word order of certain classes. Of course, some languages have more restricted word order (English) than others (Russian), but all of them have some constraints, which we can use to gather information about an unknown word. More generally, we can gather information about a word by knowing which words (or classes) most often appear immediately before or after it.

Word order constraints sometimes allow us to disambiguate between classes; a noun can never come directly after a personal pronoun, for example (an asterisk marks an ungrammatical sequence):

*I car something.
*He cars something-else.
I give something.
He sells cars.
A car ...
The car ...
*The give ...
Give it ...
*Car it ...
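One rough way to exploit such contexts is to count which closed-class words precede an unknown form in a corpus. The sketch below assumes tiny hand-written lists of "noun-like" and "verb-like" left contexts; both the lists and the voting scheme are illustrative.

 from collections import Counter
 
 # Illustrative left contexts: determiners tend to precede nouns,
 # personal pronouns tend to precede finite verbs.
 NOUN_CONTEXTS = {"a", "the", "this", "my"}
 VERB_CONTEXTS = {"i", "you", "he", "she", "we", "they"}
 
 def guess_category(word, tokens):
     """Vote noun vs. verb based on the word immediately to the left."""
     votes = Counter()
     for prev, cur in zip(tokens, tokens[1:]):
         if cur == word:
             if prev in NOUN_CONTEXTS:
                 votes["noun"] += 1
             elif prev in VERB_CONTEXTS:
                 votes["verb"] += 1
     return votes.most_common(1)[0][0] if votes else "unknown"
 
 tokens = "he sells cars and i give the car to my brother".split()
 print(guess_category("car", tokens))   # -> noun (preceded by "the")
 print(guess_category("give", tokens))  # -> verb (preceded by "i")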

Agreement rules allow us to discover things like number or gender (Portuguese):

   (a|as|uma|duas) noun (feminine)
   (o|os|um|dois) noun (masculine)

Preposition governing case:

na something-a  -> animate noun
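A small sketch of the agreement idea for gender, matching the patterns above with regular expressions over raw Portuguese text; the vote counting is an assumption, and a real tool would need better tokenization and frequency thresholds.

 import re
 from collections import Counter
 
 # The agreement patterns above, as regular expressions over raw text.
 FEMININE = re.compile(r"\b(?:a|as|uma|duas)\s+(\w+)", re.IGNORECASE)
 MASCULINE = re.compile(r"\b(?:o|os|um|dois)\s+(\w+)", re.IGNORECASE)
 
 def guess_gender(word, text):
     """Count how often the word follows feminine vs. masculine
     determiners/numerals and return the more frequent reading."""
     votes = Counter()
     votes["f"] = sum(1 for w in FEMININE.findall(text) if w == word)
     votes["m"] = sum(1 for w in MASCULINE.findall(text) if w == word)
     if votes["f"] == votes["m"]:
         return "unknown"
     return votes.most_common(1)[0][0]
 
 text = "a casa e as casas; o carro e os carros; duas casas novas"
 print(guess_gender("casa", text))   # -> f
 print(guess_gender("carro", text))  # -> m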

And so on. Such techniques are already used in Apertium, e.g. the preposition governing case mentioned above, but in an ad-hoc way. Our goal should be to streamline the process, creating an easy-to-use framework for applying constraints to corpora in order to obtain information about words of interest, and later to provide statistical evidence of hitherto unnoticed constraints from the example words we already have.
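As a hint of what such statistical evidence could look like, one could count, for each paradigm, the words that most often appear immediately before its known members, and present the top candidates to the engineer as constraint suggestions. The function name, toy corpus, and paradigm labels below are made up for illustration.

 from collections import Counter, defaultdict
 
 def suggest_constraints(tokens, known_words, top_n=3):
     """For each paradigm, find the words that most often appear right
     before its known members; these become candidate constraints."""
     contexts = defaultdict(Counter)
     for prev, cur in zip(tokens, tokens[1:]):
         if cur in known_words:
             contexts[known_words[cur]][prev] += 1
     return {paradigm: counts.most_common(top_n)
             for paradigm, counts in contexts.items()}
 
 tokens = "o carro novo e os carros velhos ; a casa e as casas".split()
 known = {"carro": "noun-masc", "carros": "noun-masc",
          "casa": "noun-fem", "casas": "noun-fem"}
 print(suggest_constraints(tokens, known))
 # -> {'noun-masc': [('o', 1), ('os', 1)], 'noun-fem': [('a', 1), ('as', 1)]}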


Bilingual dictionaries

I believe the main part of the work here would be finding some sensible way to extract GOOD alignments. NATools and GIZA generate translation probabilities; those can be a start, but I believe more has to be done. See Parallel_corpus_pruning for some related ideas.
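For instance, starting from lexical translation probabilities of the kind GIZA or NATools produce, one could keep only the word pairs that score high in both translation directions. The dictionaries, numbers, and threshold below are invented; the table format is an assumption, not the actual tool output.

 # Hypothetical lexical translation probabilities in both directions,
 # in the spirit of what GIZA/NATools output; the numbers are invented.
 p_tgt_given_src = {("carro", "car"): 0.81, ("carro", "wagon"): 0.07}
 p_src_given_tgt = {("car", "carro"): 0.74, ("car", "automobile"): 0.11}
 
 def good_pairs(p_st, p_ts, threshold=0.5):
     """Keep only word pairs whose translation probability is above the
     threshold in both directions; score them by the weaker direction."""
     kept = []
     for (src, tgt), p in p_st.items():
         back = p_ts.get((tgt, src), 0.0)
         if p >= threshold and back >= threshold:
             kept.append((src, tgt, min(p, back)))
     return sorted(kept, key=lambda x: -x[2])
 
 print(good_pairs(p_tgt_given_src, p_src_given_tgt))
 # -> [('carro', 'car', 0.74)]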


Transfer rules

Not much to add here, but good alignments will help here too.

A plan

What follows is a plan to leverage the phenomena described above to speed up the creation of monolingual and bilingual dictionaries for Apertium:

  • Given a large monolingual corpus and some examples of words already assigned to the correct paradigm, statistically infer constraint rules like those mentioned above. Allow the dictionary engineer to change them and add his own.
  • Given a large monolingual corpus and a list of paradigms and constraints, find words which unambiguously match one paradigm, and add them to the dictionary (a small sketch of this step follows the list). This would entail using/hacking the lttoolbox code. For ambiguous words, allow the engineer to check them manually. Alternatively, use the Google API, treating Google searches as a stand-in for a big corpus.
  • Since we now have more examples of words, we might repeat 1-2.
  • Given a bilingual corpus and a list of paradigms and constraints, word-align the corpus and then tag it with the correct paradigms (or the other way around).
  • Prune bad alignments, keeping only the sections with highly probable matches.
  • Add the lexical correspondences found to the bilingual dictionary and to the translation templates, allowing the engineer to edit them.
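A minimal sketch of step 2, combining the paradigm-guessing and constraint ideas from above; the paradigms, determiner list, and toy corpus are all illustrative, and a real implementation would sit on top of lttoolbox rather than plain Python dictionaries.

 from collections import Counter
 
 # Toy paradigms (singular ending -> plural ending) and a toy constraint:
 # to be accepted as a noun, the word must occur after a determiner.
 PARADIGMS = {"class1": ("o", "os"), "class2": ("al", "ais")}
 DETERMINERS = {"o", "a", "os", "as", "um", "uma"}
 
 def classify(word, tokens):
     """Step 2 sketch: return the paradigms compatible with `word`: the
     generated plural must be attested and the constraint must hold."""
     counts = Counter(tokens)
     after_det = any(prev in DETERMINERS and cur == word
                     for prev, cur in zip(tokens, tokens[1:]))
     matches = [name for name, (sg, pl) in PARADIGMS.items()
                if word.endswith(sg) and after_det
                and counts[word[:-len(sg)] + pl] > 0]
     return matches  # one match: add automatically; several: ask the engineer
 
 tokens = "o hospital e os hospitais ficam perto do centro".split()
 print(classify("hospital", tokens))  # -> ['class2']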

Throughout, the tools implemented should also have options allowing the engineer to prioritize precision or recall. As a "collateral effect" of the project, we should get an initial pt_BR-en translator. We will have three (closely related) tools: one for step 1, another for step 2, and another for steps 4-6.

Notes

These techniques have some limitations, though:

  • I don't think we can learn derivational morphology this way
  • Non-concatenative morphology is not supported, though that is the case with Apertium as a whole
  • Morphologically complex languages like Turkish would also pose a problem

Even so, I think such a project would be quite useful.