Difference between revisions of "Speeding up monodix creation"

From Apertium

Latest revision as of 15:36, 26 January 2020


This page outlines some ideas for increasing the speed at which monolingual dictionaries (analysers) can be created.

Extract

Extract + constraints
Extract + constraints + corpus

Tag transfer

Try this at some point:

**Issues** -- the corpora may not be well aligned (e.g. JRC-Acquis Czech--Slovak), so poor alignments should be detected and discarded.
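One cheap way to discard poor alignments is a sentence-length ratio filter. This is only a sketch of one possible heuristic, not something this page prescribes: the function name `plausible_pair` and the threshold `max_ratio=2.0` are assumptions chosen for illustration, and a real setup might rely on alignment scores from the aligner instead.

```python
def plausible_pair(source_sentence, target_sentence, max_ratio=2.0):
    """Return True if the two sentences have roughly comparable token counts.

    Assumption: closely related languages (e.g. Czech--Slovak) should give
    sentence pairs of similar length, so a large mismatch suggests a bad alignment.
    """
    source_len = len(source_sentence.split())
    target_len = len(target_sentence.split())
    if source_len == 0 or target_len == 0:
        return False  # an empty side can never be a usable pair
    return max(source_len, target_len) / min(source_len, target_len) <= max_ratio


def filter_corpus(sentence_pairs, max_ratio=2.0):
    """Keep only the sentence pairs that pass the length-ratio check."""
    return [pair for pair in sentence_pairs if plausible_pair(*pair, max_ratio=max_ratio)]
```

The threshold would need tuning per language pair; a stricter value loses more data but leaves fewer misaligned pairs to pollute the tag transfer.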

<spectie> you have an aligned corpus
<spectie> polish--czech, czech--slovak, danish--swedish
<spectie> and you have an analyser for polish, czech or danish
<spectie> you want to make an analyser for swedish
<spectie> you make templates from the paradigms in the danish analyser
<spectie> tag the danish of the corpus
<spectie> that you have
<spectie> align it with the swedish side
<spectie> then read off the alignments, taking the surface forms from the right side and the tags from the left side -- note you need to specify the 
          lemma in a config file, e.g. for nouns n.*.sg.nom
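The read-off step described in the chat could be sketched roughly as follows. It assumes the word alignment has already been flattened into (source tags, target surface form) pairs; the function names, the `LEMMA_PATTERNS` config and the toy tag strings are all hypothetical. Only the noun pattern `n.*.sg.nom` comes from the discussion above.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

# Hypothetical config: per part of speech, the tag pattern whose surface form
# is taken as the lemma (the noun pattern is the example given in the chat).
LEMMA_PATTERNS = {"n": "n.*.sg.nom"}


def transfer_tags(aligned_tokens):
    """Map each target-side surface form to the source-side tag strings aligned with it."""
    forms = defaultdict(set)
    for source_tags, target_surface in aligned_tokens:
        forms[target_surface.lower()].add(source_tags)
    return dict(forms)


def find_lemmas(forms):
    """Return {surface form: part of speech} for forms whose tags match a lemma pattern."""
    lemmas = {}
    for surface, tag_strings in forms.items():
        for tags in tag_strings:
            part_of_speech = tags.split(".", 1)[0]
            pattern = LEMMA_PATTERNS.get(part_of_speech)
            if pattern is not None and fnmatchcase(tags, pattern):
                lemmas[surface] = part_of_speech
    return lemmas


# Toy data: tags from the tagged (left) side, surface forms from the right side.
ALIGNED = [("n.ut.sg.nom", "bil"), ("n.ut.pl.nom", "bilar"), ("vblex.inf", "köra")]
```

Calling `find_lemmas(transfer_tags(ALIGNED))` on the toy data picks out `bil` as a noun lemma; the remaining forms would still have to be grouped under lemmas and matched against the paradigm templates.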

Another variation of this, without parallel corpora, might be to use extract and then use a bilingual dictionary (or
even just a wordlist) and a comparable corpus to disambiguate the possibilities.

-- e.g. you have a surface form in language X which can be either a noun or a verb.
        You look up the surface form in an X--Y dictionary (you have an analyser + tagger for Y).
        You pick the right analysis for X based on the analysis in Y.

        -- you could extend this to more than one language, e.g. you want to build a Danish analyser and you have English--Danish and Swedish--Danish
           wordlists and analysers for Swedish and English. You can check both.
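The dictionary-based variation could look something like the sketch below. Everything here is an assumption made for illustration: the function name `disambiguate`, the use of plain dicts for wordlists and analysers, the simple voting scheme across pivot languages, and the toy entries.

```python
from collections import Counter


def disambiguate(surface, candidates, wordlists, analysers):
    """Pick the candidate parts of speech best supported by the pivot languages.

    candidates: set of parts of speech proposed for the surface form in X
    wordlists:  {language: {word in X: [translations in that language]}}
    analysers:  {language: {word in that language: set of parts of speech}}
    """
    votes = Counter()
    for language, wordlist in wordlists.items():
        supported = set()
        for translation in wordlist.get(surface, []):
            supported |= analysers[language].get(translation, set())
        # Each pivot language votes once for every candidate it supports.
        for part_of_speech in candidates & supported:
            votes[part_of_speech] += 1
    if not votes:
        return set(candidates)  # no evidence, so keep the ambiguity
    best = max(votes.values())
    return {pos for pos, count in votes.items() if count == best}


# Toy data: Danish "hus" with English and Swedish as pivot languages.
WORDLISTS = {"eng": {"hus": ["house"]}, "swe": {"hus": ["hus"]}}
ANALYSERS = {"eng": {"house": {"n", "vblex"}}, "swe": {"hus": {"n"}}}
```

In the toy data English alone leaves both readings open, because "house" is itself a noun or a verb, while Swedish supports only the noun reading, so `disambiguate("hus", {"n", "vblex"}, WORDLISTS, ANALYSERS)` settles on the noun. This is the benefit of checking more than one language.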
 
I'd volunteer Upper and Lower Sorbian as ideal candidates for a test run of this: the grammar is almost 100% the same, I have an almost complete paradigm list (minus verbal participles) for Lower Sorbian, and a complete forms list for Upper Sorbian - but very little information mapping endings to cases. What little I have of Kashubian adjectives comes from a manual version of the above process, BTW. -- Jimregan 01:29, 5 June 2008 (BST)
Problem is that there is little aligned text for Upper/Lower Sorbian no? - Francis Tyers 10:21, 5 June 2008 (BST)
True, but I did say for a test run :) Jimregan 13:00, 12 June 2008 (BST)