User:Ergaurav3/Discussion Gsoc Unify Metadix

From Apertium

Basic understanding of the Metadix and other formats

I have been looking at the following language pairs:

en-ca, en-es, fr-es, oc-ca, oc-es, es-pt, hbs-slv, nn-nb

The basic information that I found:

We need to pre-process the metadix files with XSLT stylesheets to convert them into dictionary (.dix) files.

The pairs en-ca, en-es, fr-es, oc-ca and oc-es contain metadix files.

The metadix files are converted into the dictionary format using essentially two XSLT files, buscPar.xsl and principal.xsl.

buscPar.xsl : It sorts the entries and creates an intermediate XSL file with the list of verbs that use metaparadigms. A uniq step is also applied to delete duplicates.

principal.xsl : This stylesheet contains the actual rules for converting the <par> and <sa> tags to produce the dictionary format (.dix).
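The sort-and-uniq step attributed to buscPar.xsl above can be pictured with a small Python sketch. This is only an illustration of the idea, not a port of the stylesheet: the section content, the lemmas "foo", "bar" and "baz", and the helper name metaparadigm_uses are all made up, and the sketch assumes a metaparadigm call is any <par> that carries a prm attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary section: two entries call the same metaparadigm
# with the same prm value, one uses a different value, one uses no prm.
SECTION = ET.fromstring(
    '<section id="main" type="standard">'
    '<e lm="foo"><i>fo</i><par n="m/é[T]er__vblex" prm="lh"/></e>'
    '<e lm="bar"><i>ba</i><par n="m/é[T]er__vblex" prm="lh"/></e>'
    '<e lm="baz"><i>ba</i><par n="m/é[T]er__vblex" prm="b"/></e>'
    '<e lm="time"><i>time</i><par n="house__n"/></e>'
    '</section>'
)

def metaparadigm_uses(section):
    """Return the sorted, de-duplicated (paradigm name, prm) pairs."""
    uses = {
        (par.get("n"), par.get("prm"))
        for par in section.iter("par")
        if par.get("prm") is not None  # plain <par> calls are skipped
    }
    return sorted(uses)

for name, prm in metaparadigm_uses(SECTION):
    print(name, prm)
```

Each pair in the result corresponds to one concrete paradigm that would have to be generated, however many entries call it.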

There are two types of meta dictionary (metadix) files:

One uses the <prm/> tags and the other uses the <sa/> tags

  • <sa/> tags: en-ca, en-es (en.metadix)
  • <prm/> tags: br-fr, fr-es, oc-ca, oc-es (fr.metadix, oc.metadix)

The tag <prm/> is a marker that shows where a variable piece of text goes in the paradigm definition. The tag <sa/> is placed where an optional grammatical symbol (tag) should appear.

For the case of <sa/>, we have: <e lm="time"><i>time</i><par n="house__n" sa="unc"/></e>

In the pardef, <sa/> will be replaced by <s n="unc"/>
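A minimal Python sketch of this substitution, assuming a made-up pardef body for house__n (the real paradigm in en.metadix may look different). The helper name expand_sa is hypothetical, and this is not how principal.xsl is written; it only shows the effect on the paradigm.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical metaparadigm: <sa/> marks where an optional symbol may go.
PARDEF = ET.fromstring(
    '<pardef n="house__n">'
    '<e><p><l/><r><s n="n"/><sa/><s n="sg"/></r></p></e>'
    '<e><p><l>s</l><r><s n="n"/><sa/><s n="pl"/></r></p></e>'
    '</pardef>'
)

def expand_sa(pardef, sa):
    """Return a copy of pardef with each <sa/> turned into <s n="..."/>.

    When sa is None (the calling <par> has no sa attribute), the <sa/>
    placeholders are simply removed.
    """
    result = copy.deepcopy(pardef)
    for parent in list(result.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != "sa":
                continue
            if sa is None:
                parent.remove(child)
            else:
                symbol = ET.Element("s", {"n": sa})
                symbol.tail = child.tail
                parent[index] = symbol  # swap the placeholder in place
    return result

print(ET.tostring(expand_sa(PARDEF, "unc"), encoding="unicode"))
```

The original pardef is left untouched, so the same metaparadigm can be expanded again for entries that pass a different sa value or none at all.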

For the case <prm/>, if we have something like:

<e lm="acuélher"><i>acu</i><par n="m/é[T]er__vblex" prm="lh"/></e>

In the pardef, <prm/> will be replaced by the string "lh".
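The same kind of sketch for <prm/>, again in Python and again only as an illustration. The pardef body below is an assumption (a single infinitive entry invented for the example), and expand_prm is a hypothetical helper, not something taken from principal.xsl. Because <prm/> sits inside running text, the replacement string has to be merged into the surrounding text rather than inserted as a new element.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical metaparadigm body: <prm/> marks the variable part of the stem.
PARDEF = ET.fromstring(
    '<pardef n="m/é[T]er__vblex">'
    '<e><p><l>é<prm/>er</l>'
    '<r>é<prm/>er<s n="vblex"/><s n="inf"/></r></p></e>'
    '</pardef>'
)

def expand_prm(pardef, prm):
    """Return a copy of pardef with every <prm/> replaced by the string prm."""
    result = copy.deepcopy(pardef)
    for parent in list(result.iter()):
        previous = None  # last surviving sibling before the placeholder
        for child in list(parent):
            if child.tag != "prm":
                previous = child
                continue
            text = prm + (child.tail or "")
            if previous is None:
                parent.text = (parent.text or "") + text
            else:
                previous.tail = (previous.tail or "") + text
            parent.remove(child)
    return result

expanded = expand_prm(PARDEF, "lh")
print("acu" + expanded.find(".//l").text)
```

With this invented pardef, joining the entry's invariant part "acu" to the expanded left side gives back the lemma of the example entry.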


This language pair contains a special type of compound tag. rem-compounds.xsl : This XSLT file is used to remove the compounds from the dictionary files.

Basic Understanding of the project:

Problems with the current scenario:

1. The XSLT and other pre-processing steps need to be applied manually to the meta dictionary files before we actually compile the dictionary files.

Not manually! --Mlforcada (talk) 12:20, 20 March 2014 (UTC)

2. The XSLT files vary from one language pair to another.


1. There should be common XSLT files and pre-processing for all types of meta dictionary files (we currently have two types). Target: buscPar.xsl and principal.xsl should be common to all pairs.

2. lttoolbox should be updated so that we don’t need to do the pre-processing separately when working with a pair that contains meta dictionary files.

Approach 1

Modify the XSLT files to make them common.

The Makefile should be updated accordingly, so that the dictionary file is generated with the same kind of name as for the other (non-metadix) pairs. That way everything looks the same once the installation is done.

Approach 2

Modify the XSLT files to make them common.

In lttoolbox, lt-comp should be updated so that it can compile metadictionary files as well as dictionary files. When it is given a metadictionary file, it would run the pre-processing first and then the usual compilation of the resulting dictionary file. But doesn’t this mean the pre-processing is always done at compilation time? We could pre-process only when it is needed (that is, when the pre-processing has not already been done).
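One way to decide whether the pre-processing "has already been done" is the timestamp comparison that make uses for its targets. The Python sketch below only illustrates that check; the function name needs_preprocessing and the file names are hypothetical, and nothing like this is claimed to exist in lt-comp today.

```python
import os

def needs_preprocessing(metadix_path, dix_path):
    """Return True when the .dix has to be (re)generated from the .metadix.

    That is the case when the .dix does not exist yet, or when the
    .metadix was modified after the .dix was last written.
    """
    if not os.path.exists(dix_path):
        return True
    return os.path.getmtime(metadix_path) > os.path.getmtime(dix_path)

# Hypothetical usage: regenerate only when stale, then compile as usual.
# if needs_preprocessing("apertium-oc-ca.oc.metadix", "apertium-oc-ca.oc.dix"):
#     ...run the XSLT pre-processing here...
```

A fuller check would probably also treat the stylesheets themselves as inputs, so that a change to buscPar.xsl or principal.xsl triggers regeneration too.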


Please share your comments on my understanding of the metadix formats and the project.

[Comment Shared by Francis] Information about Portuguese v="" and Occitan alt="" is missing.

Is it possible / necessary to make metadix more general? Currently, you can insert _one_ string or tag per pardef call. What if we discover a language where it would really help to insert two strings? Something like <code><e><i>foo</i><par n="foo/a[X]b[Y]c" prm="a" prm="b"/></e></code> where the first attribute replaces the first <prm/> and the second replaces the second. Or maybe they should be named … If we're reworking metadix anyway, we shouldn't miss opportunities to improve the functionality, if it can be usefully improved. --unhammer (talk) 07:38, 17 March 2014 (UTC)
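To make the positional variant suggested above concrete, here is a rough Python sketch. Everything in it is hypothetical: the helper expand_prms, the pardef body, and the choice to pass the values as a list (an XML element cannot carry two attributes with the same name, so the values would need some other encoding in the file). The sketch also assumes that "first" and "second" are counted separately inside each element that contains <prm/> placeholders.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical metaparadigm with two variable slots per side.
PARDEF = ET.fromstring(
    '<pardef n="foo/a[X]b[Y]c">'
    '<e><p><l>a<prm/>b<prm/>c</l>'
    '<r>a<prm/>b<prm/>c<s n="n"/></r></p></e>'
    '</pardef>'
)

def expand_prms(pardef, values):
    """Replace the i-th <prm/> inside each parent element with values[i]."""
    result = copy.deepcopy(pardef)
    for parent in list(result.iter()):
        previous = None
        position = 0  # restarts for every parent element
        for child in list(parent):
            if child.tag != "prm":
                previous = child
                continue
            text = values[position] + (child.tail or "")
            position += 1
            if previous is None:
                parent.text = (parent.text or "") + text
            else:
                previous.tail = (previous.tail or "") + text
            parent.remove(child)
    return result

print(ET.tostring(expand_prms(PARDEF, ["x", "y"]), encoding="unicode"))
```

Named placeholders would remove the dependence on order, at the cost of a slightly more verbose pardef.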