Toolkit for dictionary development

From Apertium
Revision as of 12:55, 4 July 2011 by Francis Tyers

We all do similar things when making dictionaries: we make a load of scripts that we hack to do a specific job, then throw them away at the end. It would be nice to have some maintained scripts that could lighten the load.

When we're making a new dictionary, what resources do we have and use?

  • Descriptive grammar of some variety (best case: a reference grammar; worst case: a collection of 'teach yourself' material from the web)
  • Glosses (either from the web, or from the grammar)
  • Monolingual corpus
    • Frequency list
  • Partially made analyser
  • Spellchecker (list of validated surface forms)
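
A frequency list like the one mentioned above can be derived directly from the monolingual corpus. The following is only a minimal sketch: the function name `frequency_list` is hypothetical, and it assumes that lowercasing and splitting on runs of word characters is an acceptable tokenisation, which will not hold for every language.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (surface form, count) pairs, most frequent first."""
    # Assumption: a token is a maximal run of word characters, lowercased.
    # Real corpora will usually need language-specific tokenisation.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so the order is stable.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
```

Sorting ties alphabetically keeps the output reproducible between runs, which helps when the list is checked into version control alongside the dictionary.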

What kind of things might we want to do?

  • Assign possible categories to surface forms from the corpus
    • e.g. DET *UNK* ADJ → DET N ADJ
  • Assign possible features to surface forms in the corpus
    • e.g. DET *UNK* ADJ.M.SG → DET N.M.SG ADJ.M.SG
    • DET *UNK* ADJ.MF.PL → DET N.GD.PL ADJ.MF.PL
  • Relate forms in the corpus between each other by means of paradigms
    • Sometimes you might get conflicts; in that case, each paradigm<->stem combination could be given a score reflecting its predictive power according to the corpus.
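
One possible way to score paradigm<->stem combinations is sketched below. It is an illustration under stated assumptions, not an existing Apertium tool: the function name `score_paradigms` is hypothetical, a paradigm is simplified to a plain list of suffixes, and "predictive power" is taken to mean the fraction of forms a paradigm predicts for a stem that are actually attested in the corpus.

```python
def score_paradigms(forms, paradigms):
    """Score each candidate (stem, paradigm) pair against attested forms.

    forms: iterable of surface forms seen in the corpus.
    paradigms: dict mapping a paradigm name to its list of suffixes
               (a simplifying assumption; real paradigms also carry tags).
    Returns {(stem, paradigm name): score}, where the score is the fraction
    of the forms predicted for that stem that are attested in the corpus.
    """
    forms = set(forms)
    scores = {}
    for name, suffixes in paradigms.items():
        # Collect every non-empty stem obtained by stripping a suffix.
        stems = set()
        for form in forms:
            for suffix in suffixes:
                if form.endswith(suffix) and len(form) > len(suffix):
                    stems.add(form[:len(form) - len(suffix)])
        for stem in stems:
            predicted = {stem + suffix for suffix in suffixes}
            scores[(stem, name)] = len(predicted & forms) / len(predicted)
    return scores


# Hypothetical toy data: two competing noun paradigms.
corpus_forms = ["gato", "gatos", "gata", "gatas", "casa", "casas"]
toy_paradigms = {
    "gat/o__n": ["o", "os", "a", "as"],
    "cas/a__n": ["a", "as"],
}
scores = score_paradigms(corpus_forms, toy_paradigms)
```

With this toy data the stem `cas` is a conflict: it scores 1.0 with `cas/a__n` but only 0.5 with `gat/o__n`, because `caso` and `casos` are not attested. The stem `gat` scores 1.0 with both paradigms, so a practical version would probably also need to prefer the paradigm that accounts for more attested forms, or weight forms by corpus frequency.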