Difference between revisions of "Toolkit for dictionary development"
Revision as of 12:59, 4 July 2011
We all do similar things when making dictionaries: we write a load of scripts, hack them to do a specific job, then throw them away at the end. It would be nice to have some maintained scripts that could lighten the load.
When we're making a new dictionary, what resources do we have and use?
- A descriptive grammar of some kind (best case: a reference grammar; worst case: a collection of 'teach yourself' material from the web)
- Glosses (either from the web, or from the grammar)
- Monolingual corpus
- Frequency list
- Partially made analyser
- Spellchecker (list of validated surface forms)
What kinds of things might we want to do?
- Assign possible categories to surface forms from the corpus
  - e.g. DET *UNK* ADJ → DET N ADJ
  - These classes/categories could be extracted somehow from a well-trained tagger.
- Assign possible features to surface forms in the corpus
  - e.g. DET *UNK* ADJ.M.SG → DET N.M.SG ADJ.M.SG
  - DET *UNK* ADJ.MF.PL → DET N.GD.PL ADJ.MF.PL
- Relate forms in the corpus to each other by means of paradigms
  - Sometimes you might get conflicts, where one stem fits more than one paradigm. In this case each paradigm<->stem combination could get a score related to its predictive power according to the corpus.
- Sometimes you can get good candidates just from an ending + paradigm, e.g. all words that end in -ió, -joni, etc.
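The paradigm scoring and ending-based candidate ideas above could be sketched roughly as follows. This is only a minimal illustration, not an existing tool: the paradigm table, the frequency list, and all function names (`candidates_by_ending`, `score_candidates`, `best_paradigm`) are hypothetical, and "predictive power" is approximated here as the share of forms a stem + paradigm predicts that are actually attested in the corpus.

```python
from collections import Counter

# Hypothetical toy paradigms: paradigm name -> suffixes that attach to a stem.
PARADIGMS = {
    "n_regular": ["", "s"],
    "v_regular": ["", "s", "ed", "ing"],
}

# Hypothetical frequency list (surface form -> corpus count).
FREQ = Counter({"walk": 5, "walks": 3, "walked": 2, "walking": 1,
                "cat": 4, "cats": 2})


def candidates_by_ending(freq, ending):
    """Return attested forms with the given ending, most frequent first."""
    return sorted((f for f in freq if f.endswith(ending)),
                  key=lambda f: -freq[f])


def score_candidates(freq, paradigms):
    """Score each (stem, paradigm) pair as (coverage, attested count).

    Coverage is the fraction of the forms predicted by stem + paradigm
    that occur in the frequency list (an assumed proxy for predictive power).
    """
    scores = {}
    for form in freq:
        for name, suffixes in paradigms.items():
            for suffix in suffixes:
                if not form.endswith(suffix):
                    continue
                stem = form[: len(form) - len(suffix)]
                if not stem:
                    continue
                predicted = [stem + s for s in suffixes]
                attested = sum(1 for p in predicted if p in freq)
                scores[(stem, name)] = (attested / len(predicted), attested)
    return scores


def best_paradigm(scores, stem):
    """Resolve a conflict by picking the highest-scoring paradigm for a stem."""
    options = {name: s for (st, name), s in scores.items() if st == stem}
    return max(options, key=options.get) if options else None
```

With the toy data, the stem `walk` fits both paradigms with full coverage, so the tie is broken by the number of attested forms. A real scorer would probably also want to weight by frequency and penalise very short stems.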
Making a dictionary can be an iterative process: generate some candidates, add them to the dictionary, then run the scripts again, because you now have more context.
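As a rough sketch of that loop, the following guesses a category for an unknown word from the categories of its neighbours (the DET *UNK* ADJ → DET N ADJ idea) and repeats until no new guesses appear. Everything here is an assumption made for illustration: the seed lexicon, the toy sentences, and the function names are hypothetical, and in practice a person would check the candidates before adding them to the dictionary.

```python
from collections import Counter, defaultdict

# Hypothetical seed lexicon (surface form -> category) and toy corpus.
SEED_LEXICON = {"el": "DET", "gato": "N", "negro": "ADJ", "duerme": "V"}
SENTENCES = [
    ["el", "gato", "negro", "duerme"],
    ["el", "perro", "negro", "duerme"],
    ["el", "perro", "blanco", "duerme"],
]


def learn_contexts(sentences, lexicon):
    """Count which category appears between each (left, right) category pair."""
    contexts = defaultdict(Counter)
    for sent in sentences:
        tags = [lexicon.get(w) for w in sent]
        for left, mid, right in zip(tags, tags[1:], tags[2:]):
            if left and mid and right:  # only fully known trigrams
                contexts[(left, right)][mid] += 1
    return contexts


def guess_unknowns(sentences, lexicon, contexts):
    """Propose a category for unknown words whose neighbours are both known."""
    votes = defaultdict(Counter)
    for sent in sentences:
        tags = [lexicon.get(w) for w in sent]
        for i in range(1, len(sent) - 1):
            if tags[i] is None and tags[i - 1] and tags[i + 1]:
                seen = contexts.get((tags[i - 1], tags[i + 1]))
                if seen:
                    votes[sent[i]].update(seen)
    return {word: c.most_common(1)[0][0] for word, c in votes.items()}


def iterate(sentences, lexicon, max_rounds=10):
    """Add guessed candidates and rerun until nothing new turns up."""
    lexicon = dict(lexicon)  # leave the caller's lexicon untouched
    for _ in range(max_rounds):
        contexts = learn_contexts(sentences, lexicon)
        new = guess_unknowns(sentences, lexicon, contexts)
        if not new:
            break
        lexicon.update(new)  # a real workflow would validate these first
    return lexicon
```

In the toy corpus, `blanco` cannot be guessed in the first round because its left neighbour `perro` is still unknown; it only becomes guessable once `perro` has been added, which is the "more context" effect described above.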