Toolkit for dictionary development

We all do similar things when making dictionaries: we write a load of scripts, hack them to do one specific job, and then throw them away at the end. It would be nice to have some maintained scripts that could lighten the load.

When we're making a new dictionary, what resources do we have and use?

Descriptive grammar of some variety (best case: a reference grammar; worst case: a collection of 'teach yourself' material from the web)
Glosses (either from the web, or from the grammar)
Monolingual corpus
- Frequency list
Partially made analyser
Spellchecker (list of validated surface forms)
Bilingual wordlists
"full form" lists
partial-full form lists (e.g. category but not gender)

What kind of things might we want to do?

Assign possible categories to surface forms from the corpus
- e.g. DET *UNK* ADJ → DET N ADJ
- these classes/categories could be extracted somehow from a well-trained tagger.
Assign possible features to surface forms in the corpus
- e.g. DET *UNK* ADJ.M.SG → DET N.M.SG ADJ.M.SG
- DET *UNK* ADJ.MF.PL → DET N.GD.PL ADJ.MF.PL
Relate forms in the corpus to each other by means of paradigms
- Sometimes you might get conflicts. In this case, each paradigm<->stem combination could get a score related to its predictive power according to the corpus.
  - features that are known to be ambiguous could get lower scores (this can be calculated from the partial dictionary)
Sometimes you can just get good candidates from an ending + paradigm, e.g. all words that end in -ió, -joni, etc.
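
As a rough illustration of the paradigm<->stem scoring idea above, here is a minimal sketch. Everything in it is an assumption made for the example: the paradigm names, the suffix lists, the function names `score_candidates` and `best_paradigm`, and the choice of score (the fraction of a paradigm's forms attested in the corpus, with the raw number of attested forms as a tie-breaker). It assumes non-empty suffixes and does not model the lower weighting of ambiguous features.

```python
# Hypothetical paradigms: name -> suffixes of the surface forms it generates.
# Real paradigms would come from the partially made analyser.
PARADIGMS = {
    "cas/a__n": ["a", "es"],
    "cant/ar__vblex": ["ar", "a", "es", "em"],
}


def score_candidates(freq, paradigms):
    """Score every (stem, paradigm) pair that some corpus form suggests.

    freq maps surface forms to corpus counts. The score is a tuple
    (coverage, attested): coverage is the fraction of the paradigm's
    forms found in the corpus, attested is how many forms were found.
    """
    scores = {}
    for name, suffixes in paradigms.items():
        # Strip each suffix from each matching form to propose stems.
        stems = set()
        for form in freq:
            for suf in suffixes:
                if form.endswith(suf) and len(form) > len(suf):
                    stems.add(form[: -len(suf)])
        # Check how many forms each stem + paradigm predicts correctly.
        for stem in stems:
            attested = sum(1 for suf in suffixes if stem + suf in freq)
            scores[(stem, name)] = (attested / len(suffixes), attested)
    return scores


def best_paradigm(stem, scores):
    """Resolve a conflict by picking the highest-scoring paradigm."""
    options = {name: s for (st, name), s in scores.items() if st == stem}
    return max(options, key=options.get) if options else None


# Toy frequency list standing in for a real monolingual corpus.
freq = {"casa": 10, "cases": 4, "cantar": 3, "canta": 5, "cantes": 1, "cantem": 2}
scores = score_candidates(freq, PARADIGMS)
```

In this toy data the stem `cant` is fully covered by both paradigms, so the tie-breaker prefers the one that explains more attested forms. A real scorer would probably also use the frequencies themselves, which this sketch ignores.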

Making a dictionary can be an iterative process: generate some candidates, add them to the dictionary, then run the scripts again, because you now have more context.
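
The loop could look something like the sketch below, which reuses the DET *UNK* ADJ → DET N ADJ idea from earlier. The rule table, the toy lexicon and the function names `guess_round` and `iterate` are all hypothetical; in practice the rules might be extracted from a tagger, and the candidates would presumably be checked by a person before being added.

```python
# Hypothetical context rules: (tag on the left, tag on the right) -> tag
# guessed for an unknown word sitting between them.
CONTEXT_RULES = {
    ("DET", "ADJ"): "N",
    ("DET", "N"): "ADJ",
}


def guess_round(sentences, lexicon, rules):
    """One pass over the corpus: return {form: tag} for unknown forms
    whose left and right neighbours are both known and match a rule."""
    new = {}
    for sent in sentences:
        for i in range(1, len(sent) - 1):
            form = sent[i]
            if form in lexicon:
                continue
            context = (lexicon.get(sent[i - 1]), lexicon.get(sent[i + 1]))
            if context in rules:
                new.setdefault(form, rules[context])
    return new


def iterate(sentences, lexicon, rules):
    """Repeat guess_round, adding candidates to a copy of the lexicon,
    until a pass produces nothing new."""
    lexicon = dict(lexicon)
    while True:
        new = guess_round(sentences, lexicon, rules)
        if not new:
            return lexicon
        lexicon.update(new)


# Toy data: "gran" can only be guessed once "casa" is known.
sentences = [["la", "casa", "blanca"], ["la", "gran", "casa"]]
seed = {"la": "DET", "blanca": "ADJ"}
result = iterate(sentences, seed, CONTEXT_RULES)
```

Here the first pass can only tag `casa`, because `gran` has an unknown word to its right. Once `casa` is in the lexicon, the second pass has enough context to tag `gran`, which is the effect the iterative process relies on.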
