Difference between revisions of "Toolkit for dictionary development"

Revision as of 18:45, 4 July 2011

We all do similar things when making dictionaries: we write a load of scripts, hack them to do a specific job, then throw them away at the end. It would be nice to have some maintained scripts that could lighten the load.

When we're making a new dictionary, what resources do we have and use?

Descriptive grammar of some variety (best case: a reference grammar; worst case: a collection of 'teach yourself' material from the web)
Glosses (either from the web, or from the grammar)
Monolingual corpus
- Frequency list
Partially made analyser
Spellchecker (list of validated surface forms)
Bilingual wordlists
"full form" lists
partial-full form lists (e.g. category but not gender)
Wikipedia (e.g. to get lists of categorised proper names, and to categorise proper names)
Wiktionary

What kind of things might we want to do?

Assign possible categories to surface forms from the corpus
- e.g. DET *UNK* ADJ → DET N ADJ
- these classes/categories could be extracted somehow from a well-trained tagger.
Assign possible features to surface forms in the corpus
- e.g. DET *UNK* ADJ.M.SG → DET N.M.SG ADJ.M.SG
- DET *UNK* ADJ.MF.PL → DET N.GD.PL ADJ.MF.PL
Relate forms in the corpus to each other by means of paradigms
- Sometimes you will get conflicts; in that case each paradigm<->stem combination could get a score related to its predictive power according to the corpus.
  - features that are known to be ambiguous could get lower scores (this can be calculated from the partial dictionary)
Sometimes you can just get good candidates from an ending + paradigm, e.g. all words that end in -ió, -joni, etc.
Blacklist words from the corpus / hitparade (frequency list), e.g. barbarisms.
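
The paradigm<->stem scoring idea in the list above could be sketched roughly as follows. This is only an illustration: the paradigm names, the suffix lists, the toy corpus and the scoring rule (share of predicted forms attested in the corpus, with ties broken by the number of attested forms) are all assumptions, not an existing tool.

```python
from collections import Counter

# Hypothetical toy paradigms: paradigm name -> suffixes it generates.
PARADIGMS = {
    "n_o_os": ["o", "os"],
    "adj_o_a": ["o", "a", "os", "as"],
}

def candidate_stems(form, paradigms):
    """Yield (stem, paradigm) pairs that could have produced `form`."""
    for name, suffixes in paradigms.items():
        for suffix in suffixes:
            if form.endswith(suffix) and len(form) > len(suffix):
                yield form[: len(form) - len(suffix)], name

def score_candidates(form, freq, paradigms=PARADIGMS):
    """Rank (stem, paradigm) pairs by how well the corpus supports them.

    The score is (share of predicted forms attested, number attested);
    this is an assumed heuristic for "predictive power".
    """
    scores = {}
    for stem, name in set(candidate_stems(form, paradigms)):
        predicted = [stem + suffix for suffix in paradigms[name]]
        attested = sum(1 for f in predicted if freq[f] > 0)
        scores[(stem, name)] = (attested / len(predicted), attested)
    # Best score first; the key itself makes the order deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1][0], -kv[1][1], kv[0]))

corpus = "el libro rojo y los libros rojos la casa roja y las casas rojas".split()
freq = Counter(corpus)

best_libro = score_candidates("libro", freq)[0][0]
best_rojo = score_candidates("rojo", freq)[0][0]
```

In this toy corpus "libro" is best explained by the noun paradigm, because "libra" and "libras" never occur, while "rojo" goes to the adjective paradigm because all four predicted forms are attested. Lowering the score for features known to be ambiguous would be a further weighting step that is not shown here.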

Making a dictionary can be an iterative process: generate some candidates, add them to the dictionary, then run the scripts again, because you now have more context.
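
A minimal sketch of that loop, combined with the DET *UNK* ADJ → DET N ADJ idea from the list above, might look like this. The lexicon format, the function names and the rule "take the most frequent category seen between the same two neighbours" are assumptions made for illustration; a real setup would more likely take its contexts from a trained tagger.

```python
from collections import Counter, defaultdict

UNK = "*UNK*"

def tag(tokens, lexicon):
    """Replace each token by its category, or *UNK* if it is not in the lexicon."""
    return [lexicon.get(token, UNK) for token in tokens]

def learn_contexts(sentences, lexicon):
    """Count which category occurs between each (left, right) pair of known categories."""
    contexts = defaultdict(Counter)
    for tokens in sentences:
        tags = tag(tokens, lexicon)
        for left, mid, right in zip(tags, tags[1:], tags[2:]):
            if UNK not in (left, mid, right):
                contexts[(left, right)][mid] += 1
    return contexts

def propose(sentences, lexicon):
    """Propose a category for unknown words that sit between two known categories."""
    contexts = learn_contexts(sentences, lexicon)
    proposals = {}
    for tokens in sentences:
        tags = tag(tokens, lexicon)
        for i in range(1, len(tokens) - 1):
            if tags[i] == UNK and UNK not in (tags[i - 1], tags[i + 1]):
                seen = contexts.get((tags[i - 1], tags[i + 1]))
                if seen:
                    # If a word gets several proposals, the last one wins here.
                    proposals[tokens[i]] = seen.most_common(1)[0][0]
    return proposals

def grow(sentences, lexicon, max_rounds=10):
    """Add proposals and re-run until nothing new turns up."""
    lexicon = dict(lexicon)
    for _ in range(max_rounds):
        new = propose(sentences, lexicon)
        if not new:
            break
        lexicon.update(new)
    return lexicon

seed = {"el": "DET", "gato": "N", "negro": "ADJ", "corre": "V"}
sentences = [
    "el gato negro corre".split(),
    "el perro negro corre".split(),
    "el perro blanco corre".split(),
]
first_round = propose(sentences, seed)
final = grow(sentences, seed)
```

In the first round only "perro" can be guessed, since "blanco" still has an unknown neighbour. Once "perro" is in the lexicon, the second round has enough context to guess "blanco" as well. In practice the proposals would be checked by hand before being added.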

Payoffs

You might want to add _all_ the possible analyses, but also be able to produce a "trimmed" dictionary.
- For example, adding all the adjectives in -oso, even if they are low frequency, because they might give you context for adding some higher frequency noun.
In the end, though, you want to be able to produce a dictionary trimmed to the bilingual dictionary. Also, if it needs to be manually evaluated, it should be trimmed.
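
As a rough sketch of what trimming could mean here: keep only the monolingual entries whose lemma and category also appear on the source side of the bilingual dictionary. The tuple layout of the entries and the example words are assumptions made for illustration, not the format of any existing dictionary.

```python
def trim(monolingual, bilingual):
    """Keep monolingual entries that have a matching bilingual entry.

    monolingual: iterable of (lemma, category, paradigm) tuples
    bilingual:   iterable of (source_lemma, category, target_lemma) tuples
    """
    covered = {(source, category) for source, category, _target in bilingual}
    return [entry for entry in monolingual if (entry[0], entry[1]) in covered]

# The "full" dictionary keeps the low-frequency adjective for context...
full = [
    ("gato", "n", "n_o_os"),
    ("famoso", "adj", "adj_o_a"),
    ("tormentoso", "adj", "adj_o_a"),
]
# ...but only some entries have a translation so far.
bidix = [
    ("gato", "n", "cat"),
    ("famoso", "adj", "famous"),
]
trimmed = trim(full, bidix)
```

Here "tormentoso" stays in the full dictionary, where it can still provide context for other candidates, but is left out of the trimmed one.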

@@ Line 13: / Line 13: @@
 * partial-full form lists (e.g. category but not gender)
 * Wikipedia (e.g. to get lists of categorised proper names, and to categorise proper names)
+* Wiktionary
 ;What kind of things might we want to do ?
