Difference between revisions of "Speeding up monodix creation"
Revision as of 11:25, 24 April 2008
This page outlines some ideas for increasing the speed at which monolingual dictionaries (analysers) can be created.
Extract
Tag transfer
Try this at some point:
<spectie> you have an aligned corpus
<spectie> polish--czech, czech--slovak, danish--swedish
<spectie> and you have an analyser for polish, czech or danish
<spectie> you want to make an analyser for swedish
<spectie> you make templates from the paradigms in the danish analyser
<spectie> tag the danish of the corpus
<spectie> that you have
<spectie> align it with the swedish side
<spectie> then read off the alignments, taking the surface forms from the right side and the tags from the left side -- note you need to specify the lemma in a config file, e.g. for nouns n.*.sg.nom

Another variation of this without parallel corpora might be to use extract and then use a bilingual dictionary (or even just a wordlist) and a comparable corpus to disambiguate the possibilities.

-- e.g. you have a surface form in language X which can be either Noun or Verb. You look up the surface form in a dictionary X--Y (you have an analyser + tagger for Y). You disambiguate the right analysis for X based on the analysis in Y.
-- you could extend this to >1 languages, e.g. you want to build a Danish analyser and you have English--Danish, Swedish--Danish wordlists and analysers for Swedish, English. You can check both.
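The two ideas above can be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing tool. The input format (triples of source lemma, source tags, aligned target surface form), the function names `transfer_tags` and `disambiguate`, the `LEMMA_PATTERNS` config and the toy Danish--Swedish data are all hypothetical. The lemma pattern `n.*.sg.nom` is taken from the note above and is treated here as a shell-style glob.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

# Hypothetical config: per part of speech, the tag pattern whose aligned
# target surface form should be used as the lemma (nouns: n.*.sg.nom).
LEMMA_PATTERNS = {"n": "n.*.sg.nom"}


def transfer_tags(alignments, lemma_patterns=LEMMA_PATTERNS):
    """Read off (surface, tags) pairs for the target language.

    alignments: iterable of (source_lemma, source_tags, target_surface),
    i.e. tags come from the left side, surface forms from the right side.
    Returns {target_lemma: set of (target_surface, tags)}.
    """
    # Forms aligned to the same source lemma are assumed to belong together.
    groups = defaultdict(set)
    for src_lemma, tags, tgt_surface in alignments:
        groups[src_lemma].add((tgt_surface, tags))

    entries = {}
    for forms in groups.values():
        lemma = None
        for surface, tags in sorted(forms):
            pattern = lemma_patterns.get(tags.split(".")[0])
            if pattern and fnmatchcase(tags, pattern):
                lemma = surface
                break
        # Groups with no form matching the lemma pattern are skipped.
        if lemma is not None:
            entries.setdefault(lemma, set()).update(forms)
    return entries


def disambiguate(candidates, evidence):
    """Narrow the candidate categories for a surface form in language X.

    candidates: set of categories proposed for the form, e.g. {"n", "vblex"}.
    evidence: one set of categories per helper language Y, obtained by
    analysing the dictionary translations of the form in Y.
    A helper language that would rule out every candidate is ignored.
    """
    remaining = set(candidates)
    for categories in evidence:
        narrowed = remaining & categories
        if narrowed:
            remaining = narrowed
    return remaining


# Toy data (invented for illustration): Danish tags, Swedish surface forms.
toy_alignments = [
    ("bil", "n.ut.sg.nom", "bil"),
    ("bil", "n.ut.pl.nom", "bilar"),
]
toy_entries = transfer_tags(toy_alignments)
toy_choice = disambiguate({"n", "vblex"}, [{"n"}, {"n", "adj"}])
```

With the toy data, `toy_entries` groups both Swedish forms under the lemma `bil`, and `toy_choice` keeps only `n`, since both helper languages agree on the noun reading. Real alignments are noisy, so a frequency threshold on the (surface, tags) pairs would probably be needed before anything is added to a dictionary.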