Difference between revisions of "Speeding up monodix creation"

From Apertium
(New page: This page outlines some ideas for increasing the speed at which monolingual dictionaries (analysers) can be created. ==Extract== ==Tag transfer== Category:Documentation)
 
(Oops)
 
(15 intermediate revisions by 4 users not shown)
Latest revision as of 15:36, 26 January 2020


This page outlines some ideas for increasing the speed at which monolingual dictionaries (analysers) can be created.

==Extract==

Extract + constraints
Extract + constraints + corpus

==Tag transfer==

Try this at some point:

**Issues** -- the corpora may be poorly aligned (e.g. JRC-Acquis Czech--Slovak), so try to discard bad alignments before using them.
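
One crude way to discard poor alignments, sketched here only as an illustration, is to drop sentence pairs whose lengths differ too much. For closely related languages a translated sentence should have roughly the same number of tokens. The function name `plausible_pair` and the threshold `max_ratio` are assumptions made for this example, not part of any Apertium tool.

```python
def plausible_pair(src, tgt, max_ratio=2.0):
    """Keep a sentence pair only if both sides are non-empty and of similar length.

    Lengths are counted in whitespace-separated tokens. max_ratio is an
    assumed threshold and would need tuning for a real corpus.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


# Toy Danish--Swedish pairs; the second is deliberately misaligned.
pairs = [
    ("jeg læser en bog", "jag läser en bok"),
    ("ja", "detta är en mycket lång mening"),
]
kept = [pair for pair in pairs if plausible_pair(*pair)]
```

A length check like this only catches gross misalignments; pairs that pass it can still be wrong.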

<spectie> you have an aligned corpus
<spectie> polish--czech, czech--slovak, danish--swedish
<spectie> and you have an analyser for polish, czech or danish
<spectie> you want to make an analyser for swedish
<spectie> you make templates from the paradigms in the danish analyser
<spectie> tag the danish of the corpus
<spectie> that you have
<spectie> align it with the swedish side
<spectie> then read off the alignments, taking the surface forms from the right (swedish) side and the tags from the left (danish) side -- note you need to specify in a config file which tags mark the lemma, e.g. for nouns n.*.sg.nom
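
The steps above could be sketched roughly as follows. This is a toy illustration under stated assumptions, not an existing tool: the function names, the tag strings and the hand-written alignment are all made up for the example, and the lemma pattern is treated as a shell-style wildcard (here `n.*.sg.ind`, adapted from the `n.*.sg.nom` example because the toy Danish tags carry no case).

```python
from collections import Counter
from fnmatch import fnmatchcase


def transfer_tags(tagged_src, tgt_tokens, alignment):
    """Count (target surface form, source tags) pairs read off the alignment.

    tagged_src : list of (surface form, tag string) for the analysed side
    tgt_tokens : list of surface forms for the side without an analyser
    alignment  : list of (source index, target index) pairs from a word aligner
    """
    counts = Counter()
    for src_i, tgt_i in alignment:
        _, tags = tagged_src[src_i]
        counts[(tgt_tokens[tgt_i], tags)] += 1
    return counts


def lemma_candidates(counts, lemma_pattern):
    """Return target forms whose transferred tags match the lemma pattern."""
    return sorted({form for (form, tags) in counts
                   if fnmatchcase(tags, lemma_pattern)})


# Hypothetical tagged Danish sentence and its Swedish counterpart.
tagged_danish = [
    ("jeg", "prn.pers.p1.sg.nom"),
    ("læser", "vblex.pres"),
    ("en", "det.ind.ut.sg"),
    ("bog", "n.ut.sg.ind"),
]
swedish = ["jag", "läser", "en", "bok"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]  # assumed aligner output

counts = transfer_tags(tagged_danish, swedish, alignment)
lemmas = lemma_candidates(counts, "n.*.sg.ind")
```

Counting the pairs rather than just listing them leaves room to ignore tag assignments that are seen only once or twice, which is one way of coping with noisy alignments.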

Another variation of this, without parallel corpora, might be to run extract and then use a bilingual dictionary (or 
even just a wordlist) and a comparable corpus to disambiguate the possibilities.

-- e.g. you have a surface form in language X which can be either a noun or a verb. 
        You look up the surface form in an X--Y dictionary (you have an analyser + tagger for Y).
        You pick the right analysis for X based on the analysis in Y.

        -- you could extend this to more than one language, e.g. you want to build a Danish analyser and you have English--Danish and Swedish--Danish
           wordlists and analysers for Swedish and English. You can check against both.
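
A minimal sketch of the wordlist variation, assuming each helper language votes for the parts of speech its analyser gives to the translations. Everything here is hypothetical: `disambiguate`, `toy_analyser`, the wordlists and the analyser tables are toy stand-ins, not real Apertium data or APIs.

```python
from collections import Counter


def disambiguate(form, candidates, wordlists, analysers):
    """Pick the candidate part of speech best supported by the other languages.

    candidates : set of POS tags proposed for `form` (e.g. by extract)
    wordlists  : {language: {form: [translations of form in that language]}}
    analysers  : {language: function mapping a word to a set of POS tags}
    Returns None when no language supports any candidate or the vote is tied.
    """
    votes = Counter()
    for lang, wordlist in wordlists.items():
        for translation in wordlist.get(form, []):
            for pos in analysers[lang](translation) & candidates:
                votes[pos] += 1
    if not votes:
        return None
    top = max(votes.values())
    winners = [pos for pos, n in votes.items() if n == top]
    return winners[0] if len(winners) == 1 else None


def toy_analyser(table):
    """Wrap a {word: set of POS tags} table as an analyser function."""
    return lambda word: table.get(word, set())


# Building a Danish analyser with English and Swedish as helper languages.
wordlists = {
    "eng": {"bog": ["book"]},
    "swe": {"bog": ["bok"]},
}
analysers = {
    "eng": toy_analyser({"book": {"n", "vblex"}}),  # ambiguous in English
    "swe": toy_analyser({"bok": {"n"}}),            # unambiguous in Swedish
}
result = disambiguate("bog", {"n", "vblex"}, wordlists, analysers)
```

In this toy run English alone cannot decide, since "book" is both noun and verb, but adding Swedish tips the vote towards the noun reading, which is the point of checking more than one language.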
 
:I'd volunteer Upper and Lower Sorbian as ideal candidates for a test run of this: the grammar is almost 100% the same, I have an almost complete paradigm list (minus verbal participles) for Lower Sorbian, and a complete forms list for Upper Sorbian - but very little information mapping endings to cases. What little I have of Kashubian adjectives comes from a manual version of the above process, BTW. -- Jimregan 01:29, 5 June 2008 (BST)

::Problem is that there is little aligned text for Upper/Lower Sorbian no? - Francis Tyers 10:21, 5 June 2008 (BST)

::: True, but I did say for a test run :) Jimregan 13:00, 12 June 2008 (BST)