Speeding up monodix creation
Latest revision as of 15:36, 26 January 2020
This page outlines some ideas for increasing the speed at which monolingual dictionaries (analysers) can be created.
Extract
- Extract + constraints
- Extract + constraints + corpus
Tag transfer
Try this at some point:
**Issues** -- the corpora may not be well aligned (e.g. JRC-Acquis Czech--Slovak) -- try to discard shoddy alignments.

<spectie> you have an aligned corpus
<spectie> polish--czech, czech--slovak, danish--swedish
<spectie> and you have an analyser for polish, czech or danish
<spectie> you want to make an analyser for swedish
<spectie> you make templates from the paradigms in the danish analyser
<spectie> tag the danish of the corpus
<spectie> that you have
<spectie> align it with the swedish side
<spectie> then read off the alignments, taking the surface forms from the right side and the tags from the left side

-- note you need to specify the lemma in a config file, e.g. for nouns n.*.sg.nom

Another variation of this, without parallel corpora, might be to use extract and then use a bilingual dictionary (or even just a wordlist) and a comparable corpus to disambiguate the possibilities.

-- e.g. you have a surface form in language X which can be either Noun or Verb.
   You look up the surface form in language X in a dictionary X--Y (you have an analyser + tagger for Y).
   You disambiguate the right analysis for X based on the analysis in Y.

-- you could extend this to >1 languages, e.g. you want to build a Danish analyser and you have English--Danish and Swedish--Danish wordlists and analysers for Swedish and English. You can check both.
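The "read off the alignments" step might look roughly like the sketch below. It is only an illustration under assumptions: the tag strings, the `LEMMA_PATTERNS` list, the function names and the `min_count` threshold are all hypothetical, and the word alignment is assumed to arrive as (source index, target index) pairs from some external aligner. The lemma pattern from the note above (`n.*.sg.nom`) is treated as a shell-style wildcard.

```python
from collections import Counter, defaultdict
from fnmatch import fnmatchcase

# Hypothetical "config file" content: tag patterns that mark the lemma form.
LEMMA_PATTERNS = ["n.*.sg.nom"]


def transfer_tags(src_tags, tgt_tokens, alignment):
    """Count, for each target surface form, the source tags aligned to it.

    src_tags[i] is the tag string of source token i; alignment is a list of
    (source_index, target_index) pairs.
    """
    counts = defaultdict(Counter)
    for i, j in alignment:
        counts[tgt_tokens[j].lower()][src_tags[i]] += 1
    return counts


def reliable_tags(counts, min_count=2):
    """Keep the most frequent tag per form, dropping rarely seen (shoddy) links."""
    best = {}
    for form, tags in counts.items():
        tag, n = tags.most_common(1)[0]
        if n >= min_count:
            best[form] = tag
    return best


def lemma_candidates(counts):
    """Target forms whose transferred tags match one of the lemma patterns."""
    return sorted(
        form
        for form, tags in counts.items()
        if any(fnmatchcase(t, p) for t in tags for p in LEMMA_PATTERNS)
    )


# Toy example: Danish "en hund sover" tagged, aligned 1:1 with Swedish.
src_tags = ["det.ut.sg", "n.ut.sg.nom", "vblex.pres"]
tgt_tokens = ["en", "hund", "sover"]
alignment = [(0, 0), (1, 1), (2, 2)]

counts = transfer_tags(src_tags, tgt_tokens, alignment)
```

Raising `min_count` is one crude way to act on the alignment-quality issue noted above: a tag is only trusted for a form once the same link has been seen several times.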
- I'd volunteer Upper and Lower Sorbian as ideal candidates for a test run of this: the grammar is almost 100% the same, I have an almost complete paradigm list (minus verbal participles) for Lower Sorbian, and a complete forms list for Upper Sorbian - but very little information mapping endings to cases. What little I have of Kashubian adjectives comes from a manual version of the above process, BTW. -- Jimregan 01:29, 5 June 2008 (BST)
- Problem is that there is little aligned text for Upper/Lower Sorbian no? - Francis Tyers 10:21, 5 June 2008 (BST)
- True, but I did say for a test run :) Jimregan 13:00, 12 June 2008 (BST)
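The dictionary-based variation described under Tag transfer (no parallel corpus, one or more wordlists plus analysers for the other languages) could be sketched as a simple vote. Everything here is an assumption made for illustration: the function name, the toy wordlists keyed by the Danish form, and the part-of-speech sets standing in for real analyser output.

```python
from collections import Counter


def disambiguate(form, candidates, resources):
    """Pick the candidate analyses of `form` supported by other languages.

    candidates: set of possible part-of-speech tags for the form in language X.
    resources: list of (wordlist, analyses) pairs, one per language Y, where
    wordlist maps an X form to Y words and analyses maps a Y word to its tags.
    """
    votes = Counter()
    for wordlist, analyses in resources:
        for translation in wordlist.get(form, []):
            for pos in analyses.get(translation, ()):
                if pos in candidates:
                    votes[pos] += 1
    if not votes:
        # No evidence at all: leave the ambiguity untouched.
        return set(candidates)
    top = max(votes.values())
    return {pos for pos, n in votes.items() if n == top}


# Toy data: Danish "fisk" could be a noun or a verb form.
da_en = {"fisk": ["fish"]}
en_pos = {"fish": {"n", "vblex"}}  # English is ambiguous too
da_sv = {"fisk": ["fisk"]}
sv_pos = {"fisk": {"n"}}           # Swedish analyser only gives a noun

result = disambiguate("fisk", {"n", "vblex"}, [(da_en, en_pos), (da_sv, sv_pos)])
```

With English alone the ambiguity survives, since "fish" is itself noun or verb; adding Swedish tips the vote towards the noun reading, which is the point of checking more than one language.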