Toolkit for dictionary development
Latest revision as of 11:46, 24 March 2012
We all do similar things when making dictionaries: we write a load of scripts, hack them to do a specific job, then throw them away at the end. It would be nice to have some maintained scripts that could lighten the load.
- When we're making a new dictionary, what resources do we have and use?
- Descriptive grammar of some variety (best case: a reference grammar; worst case: a collection of 'teach yourself' material from the web)
- Glosses (either from the web, or from the grammar)
- Monolingual corpus
- Frequency list
- Partially made analyser
- Spellchecker (list of validated surface forms)
- Bilingual wordlists
- "full form" lists
- partial-full form lists (e.g. category but not gender)
- Wikipedia (e.g. to get lists of categorised proper names, and to categorise proper names)
- Wiktionary
- Existing MT system(s)
Monolingual dictionaries
- What kind of things might we want to do?
- Assign possible categories to surface forms from the corpus
- e.g. DET *UNK* ADJ → DET N ADJ
- these classes/categories could be extracted somehow from a well-trained tagger.
- Assign possible features to surface forms in the corpus
- e.g. DET *UNK* ADJ.M.SG → DET N.M.SG ADJ.M.SG
- DET *UNK* ADJ.MF.PL → DET N.GD.PL ADJ.MF.PL
- Relate forms in the corpus between each other by means of paradigms
- Sometimes you might get conflicts; in this case the paradigm<->stem combinations could get a score related to their predictive power according to the corpus.
- features that are known to be ambiguous could get lower scores (this can be calculated from the partial dictionary)
- How to find good rules for determining POS? Run the patterns through a corpus and count how frequently the POS you want occurs in the slot you think is good.
- Sometimes you can just get good candidates from an ending + paradigm, e.g. all words that end in -ió, -joni, etc.
- Blacklist words from the corpus / hitparade, e.g. barbarismes.
- You might want to verify your analyser output on a corpus. Let's say you add a load of nouns and some of them have more than one gender -- it could be that some of these are erroneous, so you might want to run the analyser over a corpus and pull out the instances of gender mismatch.
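The slot-counting idea above (checking how often the POS you want appears in a pattern such as DET _ ADJ, then using that to guess a category for unknown words) could be sketched roughly as follows. This is only an illustration: the corpus format (lists of `(form, tag)` pairs), the function names and the toy sentences are all assumptions, not part of any existing tool.

```python
from collections import Counter

def slot_tag_counts(tagged_sentences, left_tag, right_tag):
    """Count which tags occur between left_tag and right_tag in a tagged corpus."""
    counts = Counter()
    for sentence in tagged_sentences:
        for i in range(1, len(sentence) - 1):
            if sentence[i - 1][1] == left_tag and sentence[i + 1][1] == right_tag:
                counts[sentence[i][1]] += 1
    return counts

def guess_unknowns(tagged_sentences, left_tag, right_tag, unknown="*UNK*"):
    """Propose the most frequent known tag of the slot for unknown words in that slot."""
    counts = slot_tag_counts(tagged_sentences, left_tag, right_tag)
    counts.pop(unknown, None)  # unknown words must not vote for themselves
    if not counts:
        return {}
    best_tag, best_count = counts.most_common(1)[0]
    confidence = best_count / sum(counts.values())
    guesses = {}
    for sentence in tagged_sentences:
        for i in range(1, len(sentence) - 1):
            form, tag = sentence[i]
            in_slot = sentence[i - 1][1] == left_tag and sentence[i + 1][1] == right_tag
            if tag == unknown and in_slot:
                guesses[form] = (best_tag, confidence)
    return guesses

# Hypothetical toy corpus, one list of (form, tag) pairs per sentence.
corpus = [
    [("la", "DET"), ("casa", "N"), ("blanca", "ADJ")],
    [("el", "DET"), ("gat", "N"), ("negre", "ADJ")],
    [("la", "DET"), ("molt", "ADV"), ("bona", "ADJ")],
    [("la", "DET"), ("taula", "*UNK*"), ("rodona", "ADJ")],
]
counts = slot_tag_counts(corpus, "DET", "ADJ")
guesses = guess_unknowns(corpus, "DET", "ADJ")
```

The confidence value is simply the share of known slot fillers that carry the winning tag, so a pattern where the wanted POS is rare in the slot shows up as a low number and can be discarded.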
Making a dictionary can be an iterative process: generate some candidates, add them to the dictionary, then run the scripts again because you now have more context.
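One way the candidate-generation step might look, for the case of relating corpus forms through paradigms, is sketched below. It scores each stem/paradigm pair by the share of the forms the paradigm predicts that are actually attested in the corpus. The paradigm names, the suffix-list representation and the scoring formula are assumptions made for illustration; lowering scores for features known to be ambiguous is not attempted here.

```python
def score_candidates(surface_forms, paradigms):
    """Score every (stem, paradigm) pair by the share of its predicted forms seen in the corpus."""
    forms = set(surface_forms)
    scores = {}
    for name, suffixes in paradigms.items():
        for form in forms:
            for suffix in suffixes:
                if not form.endswith(suffix):
                    continue
                stem = form[: len(form) - len(suffix)]
                if not stem:
                    continue  # a suffix cannot be the whole word
                predicted = {stem + s for s in suffixes}
                attested = predicted & forms
                scores[(stem, name)] = len(attested) / len(predicted)
    return scores

# Hypothetical paradigms: name -> list of suffixes that attach to the stem.
paradigms = {
    "gat__n": ["", "s"],
    "cas/a__n": ["a", "es"],
}
forms = ["gat", "gats", "casa", "cases", "mes"]
scores = score_candidates(forms, paradigms)

# Keep only the pairs whose predicted forms are all attested.
confident = sorted(pair for pair, score in scores.items() if score == 1.0)
```

Pairs such as ("m", "cas/a__n"), produced by the unrelated word "mes", end up with a lower score because only half of their predicted forms occur, which is the kind of conflict the score is meant to rank.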
- Payoffs
- You might want to add _all_ the possible analyses, but also be able to produce a "trimmed" dictionary.
- For example, adding all the adjectives in -oso, even the low-frequency ones that are annoying or hard to find translations for, because they might give you context for adding some higher-frequency noun.
- In the end, though, you want to be able to produce a monolingual dictionary trimmed to what the bilingual dictionary covers. Also, if the dictionary needs to be manually revised, it should be trimmed.
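A minimal sketch of the trimming step, assuming monolingual entries are held as simple records with a lemma and that the bilingual dictionary is available as a set of lemmas. Real dictionaries are stored in their own file formats, so the record layout, the paradigm names and the function name here are hypothetical.

```python
def trim_monodix(mono_entries, bidix_lemmas):
    """Split monolingual entries into those covered by the bidix and those that are not."""
    kept, dropped = [], []
    for entry in mono_entries:
        if entry["lemma"] in bidix_lemmas:
            kept.append(entry)
        else:
            dropped.append(entry)
    return kept, dropped

# Hypothetical entries: the full dictionary keeps everything, the trimmed one
# keeps only lemmas that also have a bilingual entry.
mono_entries = [
    {"lemma": "casa", "paradigm": "cas/a__n"},
    {"lemma": "ventoso", "paradigm": "absolut/o__adj"},
    {"lemma": "gat", "paradigm": "gat__n"},
]
bidix_lemmas = {"casa", "gat"}
kept, dropped = trim_monodix(mono_entries, bidix_lemmas)
```

Returning the dropped entries as well keeps the full list available for later iterations, while only the kept entries go into the trimmed dictionary.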
Bilingual dictionaries
- Some kind of web/command line interface which gives you one word at a time from a missing-bidix frequency list, along with:
- a source sentence with that word (it should be possible to cycle through the available sentences containing the word with a keypress),
- the output of the existing MT system,
- possibilities from a probabilistic dictionary,
- translation(s) of the sentence with other MT systems.
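The data side of such an interface could be prototyped along these lines: build a queue of words missing from the bidix, most frequent first, each with a few example sentences that can be cycled through. The function names and data layout are assumptions, and the MT output and probabilistic-dictionary suggestions are left out because they depend on external systems.

```python
from itertools import cycle

def build_queue(frequency_list, bidix_words, sentences, max_sentences=3):
    """Return missing-bidix words, most frequent first, each with example sentences."""
    queue = []
    for word, freq in sorted(frequency_list, key=lambda pair: pair[1], reverse=True):
        if word in bidix_words:
            continue  # already has a bilingual entry
        examples = [s for s in sentences if word in s.split()][:max_sentences]
        queue.append({"word": word, "freq": freq, "sentences": examples})
    return queue

def sentence_cycler(item):
    """Iterator over the item's sentences that wraps around; next() stands in for a keypress."""
    return cycle(item["sentences"])

# Hypothetical input data.
frequency_list = [("gat", 12), ("casa", 40), ("taula", 7)]
bidix_words = {"casa"}
sentences = [
    "el gat dorm",
    "la casa és gran",
    "el gat negre menja",
    "la taula rodona",
]
queue = build_queue(frequency_list, bidix_words, sentences)
viewer = sentence_cycler(queue[0])
shown = [next(viewer), next(viewer), next(viewer)]
```

A web or command-line front end would then only need to display one queue item at a time and call `next()` on the cycler whenever the user asks for another sentence. A word with no example sentences yields an empty cycler, so a front end would need to handle that case.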