Toolkit for dictionary development

We all do similar things when making dictionaries: we make a load of scripts, each hacked to do a specific job, then throw them away at the end. It would be nice to have some maintained scripts that could lighten the load.

When we're making a new dictionary, what resources do we have and use?
  • Descriptive grammar of some variety (best case a reference grammar, worst case a collection of 'teach yourself' material from the web)
  • Glosses (either from the web, or from the grammar)
  • Monolingual corpus
    • Frequency list
  • Partially made analyser
  • Spellchecker (list of validated surface forms)
  • Bilingual wordlists
  • "full form" lists
  • partial-full form lists (e.g. category but not gender)
  • Wikipedia (e.g. to get lists of categorised proper names, and to categorise proper names)
  • Wiktionary
  • Existing MT system(s)
What kind of things might we want to do?
  • Assign possible categories to surface forms from the corpus (see the category-guessing sketch after this list)
    • e.g. DET *UNK* ADJ → DET N ADJ
    • these classes/categories could be extracted from the output of a well-trained tagger.
  • Assign possible features to surface forms in the corpus (the category-guessing sketch below works unchanged with full feature tags)
    • e.g. DET *UNK* ADJ.M.SG → DET N.M.SG ADJ.M.SG
    • DET *UNK* ADJ.MF.PL → DET N.GD.PL ADJ.MF.PL
  • Relate forms in the corpus between each other by means of paradigms
    • Sometimes you might get conflicts; in this case the paradigm<->stem combinations could get a score related to their predictive power over the corpus (see the paradigm-scoring sketch after this list).
      • features that are known to be ambiguous could get lower scores (this can be calculated from the partial dictionary)
  • How to find good rules for determining POS? Run the patterns through a corpus and count how frequently the POS you want appears in the slot you think is good (see the slot-counting sketch after this list).
  • Sometimes you can just get good candidates from an ending + paradigm, e.g. all words that end in -ió, -joni, etc. (see the endings sketch after this list).
  • Blacklist words from the corpus / hitparade, e.g. barbarismes (non-standard loanwords); the endings sketch below includes a blacklist filter.
  • You might want to verify your analyser output on a corpus. Let's say you add a load of nouns and some of them have multiple genders -- it could be that some of them are erroneous, so you might want to run the analyser over a corpus and pull out the instances of gender mismatch (see the gender-check sketch after this list).
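
Below are a few sketches of what such scripts might look like. First, category guessing: a minimal sketch, assuming the corpus has already been run through the partial analyser and tagger, one sentence per line, tokens as form/TAG, and unknown forms tagged *UNK*. The file name and tag conventions are illustrative, not a fixed Apertium format, and the same counting works unchanged when the tags carry full features (ADJ.M.SG and friends).

from collections import Counter, defaultdict

context_votes = defaultdict(Counter)   # (prev_tag, next_tag) -> tags seen in that slot
unknown_slots = defaultdict(Counter)   # unknown surface form -> slots it occurs in

for line in open('tagged_corpus.txt', encoding='utf-8'):
    toks = [t.rsplit('/', 1) for t in line.split() if '/' in t]
    toks = [['#BOS#', '#BOS#']] + toks + [['#EOS#', '#EOS#']]
    for (_, pt), (form, tag), (_, nt) in zip(toks, toks[1:], toks[2:]):
        if tag == '*UNK*':
            unknown_slots[form][(pt, nt)] += 1
        else:
            context_votes[(pt, nt)][tag] += 1

# Each unknown form collects the tag distribution of every slot it occurs in,
# so DET *UNK* ADJ mostly votes N if DET _ ADJ slots mostly contain nouns.
for form, slots in sorted(unknown_slots.items()):
    votes = Counter()
    for slot, n in slots.items():
        for tag, c in context_votes[slot].items():
            votes[tag] += n * c
    print(form + '\t' + ', '.join('%s:%d' % tc for tc in votes.most_common(3)))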
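
For the paradigm-scoring idea, here is a sketch that scores stem<->paradigm combinations by how many of the forms the paradigm predicts are actually attested in the frequency list. The toy suffix paradigms and the hitparade.txt format (count and form per line) are assumptions; in practice the suffix sets would be derived from the partial .dix, and down-weighting features known to be ambiguous would be a refinement on top of this.

PARADIGMS = {
    'cas/a__n_f': ['a', 'es'],   # e.g. casa, cases
    'gat__n_m': ['', 's'],       # e.g. gat, gats
}

freq = {}
for line in open('hitparade.txt', encoding='utf-8'):
    count, form = line.split()
    freq[form] = int(count)

seen = set()
for form in freq:
    for name, sufs in PARADIGMS.items():
        for suf in sufs:
            if suf and not form.endswith(suf):
                continue
            stem = form[:len(form) - len(suf)]
            if (stem, name) in seen:
                continue
            seen.add((stem, name))
            predicted = [stem + s for s in sufs]
            attested = [f for f in predicted if f in freq]
            # Predictive power: fraction of predicted forms actually seen.
            score = len(attested) / float(len(predicted))
            total = sum(freq.get(f, 0) for f in predicted)
            print('%.2f\t%d\t%s\t%s' % (score, total, stem, name))

Pipe the output through sort -rn to get the most promising candidates first.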
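
To test how good a candidate POS rule is, count which tags actually fill the slot in the tagged corpus; here the hypothesis is "whatever stands between DET and ADJ is a noun". Same illustrative form/TAG corpus format as above.

from collections import Counter

def slot_counts(path, prev_tag, next_tag):
    counts = Counter()
    for line in open(path, encoding='utf-8'):
        # Strip features so DET.M.SG counts as DET.
        tags = [t.rsplit('/', 1)[1].split('.')[0] for t in line.split() if '/' in t]
        for p, m, n in zip(tags, tags[1:], tags[2:]):
            if p == prev_tag and n == next_tag:
                counts[m] += 1
    return counts

counts = slot_counts('tagged_corpus.txt', 'DET', 'ADJ')
total = sum(counts.values()) or 1
for tag, n in counts.most_common():
    # A slot makes a good rule if the POS you want dominates it.
    print('%s\t%d\t%.1f%%' % (tag, n, 100.0 * n / total))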
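
Harvesting candidates by ending is mostly a filter over the frequency list; this sketch also applies the blacklist mentioned above. The endings stand in for one -ió-type paradigm, and both file names are hypothetical.

blacklist = set(l.strip() for l in open('blacklist.txt', encoding='utf-8'))
ENDINGS = ('ió', 'ions')   # e.g. the -ió noun paradigm

for line in open('hitparade.txt', encoding='utf-8'):
    count, form = line.split()
    if form in blacklist:
        continue
    if any(form.endswith(e) for e in ENDINGS):
        # Candidate for the -ió paradigm; queue it for manual confirmation.
        print('%s\t%s' % (count, form))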
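
Finally, a sketch of the verification pass: flag nouns whose analysed gender clashes with that of a neighbouring determiner or adjective, as a hint that one of the dictionary's gender assignments may be wrong. Tags such as N.F.SG follow the same illustrative convention as the sketches above.

from collections import defaultdict

def gender(tag):
    parts = tag.split('.')
    if 'M' in parts:
        return 'M'
    if 'F' in parts:
        return 'F'
    return None            # MF, GD etc. don't clash with anything

suspects = defaultdict(list)
for line in open('tagged_corpus.txt', encoding='utf-8'):
    toks = [t.rsplit('/', 1) for t in line.split() if '/' in t]
    for (pform, ptag), (form, tag) in zip(toks, toks[1:]):
        if tag.split('.')[0] == 'N' and ptag.split('.')[0] in ('DET', 'ADJ'):
            gnoun, gprev = gender(tag), gender(ptag)
            if gnoun and gprev and gnoun != gprev:
                suspects[form].append('%s %s' % (pform, form))

for form, contexts in sorted(suspects.items()):
    print('%s\t%d\t%s' % (form, len(contexts), '; '.join(contexts[:3])))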

Making a dictionary can be an iterative process: generate some candidates, add them to the dictionary, then run the scripts again, because now you have more context.

Payoffs
  • You might want to add _all_ the possible analyses, but also be able to produce a "trimmed" dictionary.
    • For example, add all the adjectives in -oso even if they are low-frequency (and therefore annoying or hard to find translations for), because they might give you context for adding some higher-frequency noun.
  • In the end, though, you want to be able to produce a dictionary trimmed to the bilingual dictionary; and if it needs to be manually revised, it should also be trimmed (see the trimming sketch below).
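
As a sketch of that trimming step on plain lemma lists (both file names are hypothetical; on compiled dictionaries the same idea is what lt-trim does in current Apertium):

mono = set(l.strip() for l in open('monodix_lemmas.txt', encoding='utf-8'))
bidix = set(l.strip() for l in open('bidix_lemmas.txt', encoding='utf-8'))

# The trimmed dictionary: only entries that can actually be translated.
with open('trimmed_lemmas.txt', 'w', encoding='utf-8') as out:
    for lemma in sorted(mono & bidix):
        out.write(lemma + '\n')

# Entries worth keeping in the full analyser but not yet translatable:
for lemma in sorted(mono - bidix):
    print('no-bidix\t' + lemma)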