Difference between revisions of "Toolkit for dictionary development"

Revision as of 06:51, 7 July 2011

We all do similar things when making dictionaries: we write a load of scripts, hack them to do a specific job, then throw them away at the end. It would be nice to have some maintained scripts that could lighten the load.

When we're making a new dictionary, what resources do we have and use?

Descriptive grammar of some variety (best case: a reference grammar; worst case: a collection of 'teach yourself' material from the web)
Glosses (either from the web, or from the grammar)
Monolingual corpus
- Frequency list
Partially made analyser
Spellchecker (list of validated surface forms)
Bilingual wordlists
"full form" lists
partial-full form lists (e.g. category but not gender)
Wikipedia (e.g. to get lists of categorised proper names, and to categorise proper names)
Wiktionary
Existing MT system(s)
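
As a rough illustration of how two of these resources combine, the sketch below builds a frequency list from a monolingual corpus and filters out forms that are already validated (for instance by a spellchecker list). The function names and the `\w+` tokeniser are assumptions made for the example, not part of any existing toolkit; a real corpus would need language-specific tokenisation.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (form, count) pairs, most frequent first; ties sorted alphabetically."""
    # Crude tokeniser (assumption): lowercase everything and keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def unknown_frequency_list(text, known_forms):
    """Frequency list restricted to forms that are not in `known_forms`."""
    return [(form, n) for form, n in frequency_list(text) if form not in known_forms]

corpus = "la casa blanca y la casa roja"
print(unknown_frequency_list(corpus, {"la", "y"}))
```

Sorting the unknown forms by frequency gives a natural order for deciding what to add to the dictionary first.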

What kind of things might we want to do?

Assign possible categories to surface forms from the corpus
- e.g. DET *UNK* ADJ → DET N ADJ
- these classes/categories could be extracted somehow from a well-trained tagger.
Assign possible features to surface forms in the corpus
- e.g. DET *UNK* ADJ.M.SG → DET N.M.SG ADJ.M.SG
- DET *UNK* ADJ.MF.PL → DET N.GD.PL ADJ.MF.PL
Relate forms in the corpus to each other by means of paradigms
- Sometimes you might get conflicts. In this case each paradigm<->stem combination could get a score related to its predictive power according to the corpus.
  - features that are known to be ambiguous could get lower scores (this can be calculated from the partial dictionary)
Sometimes you can just get good candidates from an ending + paradigm, e.g. all words that end in -ió, -joni, etc.
Blacklist words from the corpus / hitparade, e.g. barbarismes.
You might want to verify your analyser output on a corpus. Say you add a load of nouns and some of them have more than one gender: some of those could be erroneous, so you might want to run the analyser over a corpus and pull out the instances of gender mismatch.
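
The paradigm scoring idea from the list above could be sketched as follows. This is only one possible reading of "predictive power": the score is the fraction of forms a stem + paradigm pair generates that are actually attested in the corpus. The paradigm names, the suffix lists, and all function names are hypothetical, and the lower weighting of known-ambiguous features is not modelled.

```python
def candidate_stems(form, paradigms):
    """Yield (stem, paradigm_name) pairs that could have produced `form`."""
    for name, suffixes in paradigms.items():
        for suffix in suffixes:
            # Require a non-empty stem; slicing this way also handles an empty suffix.
            if form.endswith(suffix) and len(form) > len(suffix):
                yield form[: len(form) - len(suffix)], name

def score(stem, suffixes, corpus_forms):
    """Fraction of the forms generated by stem + paradigm that occur in the corpus."""
    attested = sum(1 for suffix in suffixes if stem + suffix in corpus_forms)
    return attested / len(suffixes)

def rank_candidates(form, paradigms, corpus_forms):
    """Return ((stem, paradigm_name), score) pairs, best score first."""
    scored = {
        (stem, name): score(stem, paradigms[name], corpus_forms)
        for stem, name in candidate_stems(form, paradigms)
    }
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical paradigms: a name mapped to the suffixes it generates.
paradigms = {"o/a/os/as": ["o", "a", "os", "as"], "a/as": ["a", "as"]}
corpus_forms = {"gato", "gata", "gatos", "gatas", "casa", "casas"}
print(rank_candidates("casa", paradigms, corpus_forms))
```

Here "casa" fits both paradigms, but only the "a/as" paradigm has all of its predicted forms attested, so it is ranked first.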

Making a dictionary can be an iterative process: generate some candidates, add them to the dictionary, then run the scripts again because you now have more context.
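
A minimal sketch of that loop, using the DET *UNK* ADJ → DET N ADJ idea: each pass guesses a category for unknown forms whose neighbours are both known, adds the guesses to the lexicon, and runs again until nothing new turns up. The context rules here are hand-written assumptions for the example (the notes suggest they could instead be extracted from a well-trained tagger), and every name in the code is hypothetical.

```python
UNKNOWN = "*UNK*"

def tag(sentence, lexicon):
    """Map each surface form to its category, or UNKNOWN if it is not in the lexicon."""
    return [lexicon.get(form, UNKNOWN) for form in sentence]

def guess_pass(sentences, lexicon, rules):
    """One pass: propose a category for unknowns whose two neighbours match a rule."""
    found = {}
    for sentence in sentences:
        tags = tag(sentence, lexicon)
        for i in range(1, len(tags) - 1):
            if tags[i] == UNKNOWN:
                guess = rules.get((tags[i - 1], tags[i + 1]))
                if guess is not None:
                    # Keep the first guess made for a form within this pass.
                    found.setdefault(sentence[i], guess)
    return found

def iterate(sentences, lexicon, rules):
    """Repeat guess_pass, growing a copy of the lexicon, until no new guesses appear."""
    lexicon = dict(lexicon)
    while True:
        found = guess_pass(sentences, lexicon, rules)
        if not found:
            return lexicon
        lexicon.update(found)

# Hypothetical context rules: (left category, right category) -> guessed category.
rules = {("DET", "ADJ"): "N", ("N", "ADJ"): "ADJ"}
lexicon = {"la": "DET", "una": "DET", "blanca": "ADJ", "grande": "ADJ"}
sentences = [["la", "casa", "blanca"], ["una", "casa", "roja", "grande"]]
print(iterate(sentences, lexicon, rules))
```

In this example "roja" can only be guessed on the second pass, once "casa" has been added as a noun by the first, which is the "more context" effect described above.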

Payoffs

You might want to add _all_ the possible analyses, but also be able to produce a "trimmed" dictionary.
- For example, you might add all the adjectives in -oso, even the low-frequency ones that are annoying or hard to find translations for, because they might give you context for adding some higher-frequency noun.
In the end, though, you want to be able to produce a dictionary trimmed to what the bilingual dictionary covers. Also, if it needs to be manually revised, it should be trimmed.
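
Trimming could be sketched as below, assuming (purely for illustration) that monolingual entries are (lemma, category) pairs and that the bilingual dictionary is keyed by the same pairs. The data layout and the `trim` function are hypothetical and do not reflect any existing dictionary format.

```python
def trim(monolingual, bilingual):
    """Split monolingual entries into those with a bilingual entry and those without."""
    kept = [entry for entry in monolingual if entry in bilingual]
    dropped = [entry for entry in monolingual if entry not in bilingual]
    return kept, dropped

# Hypothetical data: the full monolingual list keeps every analysis,
# while the bilingual dictionary only covers some of them.
monolingual = [("famoso", "adj"), ("caluroso", "adj"), ("casa", "n")]
bilingual = {("famoso", "adj"): "famous", ("casa", "n"): "house"}

kept, dropped = trim(monolingual, bilingual)
print(kept)
print(dropped)
```

Returning the dropped entries as well keeps the untrimmed material available for later passes, while only the kept entries need manual revision.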
