Difference between revisions of "User:Ergaurav3/Discussion Gsoc Unify Metadix"




Revision as of 22:27, 16 March 2014

Basic understanding of the Metadix and other formats

I have been looking at the following language pairs:

en-ca, en-es, fr-es, oc-ca, oc-es, es-pt, hbs-slv, nn-nb

The basic information that I found:

The metadix files must be pre-processed with XSLT stylesheets to convert them into the dictionary format (.dix).

The pairs en-ca, en-es, fr-es, oc-ca and oc-es contain metadix files.

The metadix files are converted into the dictionary format using two XSLT stylesheets, buscPar.xsl and principal.xsl.

buscPar.xsl : Sorts the entries and creates an intermediate XSL file listing the verbs that use metaparadigms; uniq is then run on the list to delete duplicates.

principal.xsl : This stylesheet contains the rules that convert the <par> and <sa> tags to produce the dictionary format (.dix).
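The sort-and-uniq step attributed to buscPar.xsl above can be pictured with a small Python sketch. This is not the stylesheet itself, only an illustration of the idea: collect every (paradigm, prm) combination used by the entries, drop duplicates, and sort. The sample entries, the ASCII paradigm name m/e[T]er__vblex and the function name metaparadigm_uses are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadix fragment: three entries use a metaparadigm
# (two of them with the same prm value), one uses a plain paradigm.
SAMPLE = """
<dictionary>
  <section id="main" type="standard">
    <e lm="aaa"><i>a</i><par n="m/e[T]er__vblex" prm="rr"/></e>
    <e lm="bbb"><i>b</i><par n="m/e[T]er__vblex" prm="lh"/></e>
    <e lm="ccc"><i>c</i><par n="m/e[T]er__vblex" prm="lh"/></e>
    <e lm="time"><i>time</i><par n="house__n"/></e>
  </section>
</dictionary>
"""


def metaparadigm_uses(xml_text):
    """Return the sorted, duplicate-free list of (paradigm, prm) pairs."""
    root = ET.fromstring(xml_text)
    uses = {
        (par.get("n"), par.get("prm"))
        for par in root.iter("par")
        if par.get("prm") is not None  # only entries that use a metaparadigm
    }
    return sorted(uses)


if __name__ == "__main__":
    for name, prm in metaparadigm_uses(SAMPLE):
        print(name, prm)
```

Each pair in the resulting list corresponds to one concrete paradigm that has to be generated from the metaparadigm.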

There are two types of meta dictionary (metadix) files:

One contains the <prm/> tags and the other contains the <sa/> tags.

<prm/> tags: en-ca, en-es

<sa/> tags: fr-es, oc-ca, oc-es

The <prm/> tag is a marker that shows where the variable text part goes in the paradigm definition. The <sa/> tag is placed where the optional grammatical symbol should appear.

For the case of <sa/>, we have: <e lm="time"> time <par n="house__n" sa="unc"/> </e>

Here <sa/> will be replaced by <s n="unc"/>.
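A minimal Python sketch of this substitution, assuming a paradigm definition in which <sa/> sits between two ordinary symbols. The contents of the house__n paradigm shown here and the function name expand_sa are hypothetical, and the real stylesheets may name or store the generated paradigm differently.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical paradigm definition containing the <sa/> marker.
PARDEF_SA = ET.fromstring(
    '<pardef n="house__n">'
    '<e><p><l/><r><s n="n"/><sa/><s n="sg"/></r></p></e>'
    '</pardef>'
)


def expand_sa(pardef, sa_value):
    """Return a copy of pardef in which every <sa/> becomes <s n="sa_value"/>."""
    expanded = copy.deepcopy(pardef)  # leave the metaparadigm untouched
    for marker in list(expanded.iter("sa")):
        marker.tag = "s"              # <sa/> turns into an ordinary symbol
        marker.set("n", sa_value)
    return expanded


if __name__ == "__main__":
    result = expand_sa(PARDEF_SA, "unc")
    print([s.get("n") for s in result.iter("s")])
```

Working on a copy keeps the original definition available for other entries that pass a different sa value.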

For the case <prm/>, if we have something like:

<e lm="acu ́lher"> e acu <par n="m/ ́[T]er__vblex" prm="lh"/> e </e>

Here <prm/> will be replaced by the text "lh".
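The <prm/> case differs from <sa/> in that the marker is replaced by plain text rather than by another element. A rough Python sketch follows, under the assumption that <prm/> appears inside the <l> and <r> parts of the paradigm definition. The paradigm name (written in ASCII), its contents and the function name expand_prm are invented for illustration and do not describe real morphology.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical metaparadigm with <prm/> in both the left and right side.
PARDEF_PRM = ET.fromstring(
    '<pardef n="m/e[T]er__vblex">'
    '<e><p><l>e<prm/>o</l><r>e<prm/>er<s n="vblex"/></r></p></e>'
    '</pardef>'
)


def expand_prm(pardef, prm_value):
    """Return a copy of pardef with every <prm/> replaced by prm_value."""
    expanded = copy.deepcopy(pardef)
    for parent in list(expanded.iter()):
        last_kept = None  # last sibling that was not a <prm/> marker
        for child in list(parent):
            if child.tag != "prm":
                last_kept = child
                continue
            # The marker's text is prm_value followed by whatever came after it.
            text = prm_value + (child.tail or "")
            if last_kept is None:
                parent.text = (parent.text or "") + text
            else:
                last_kept.tail = (last_kept.tail or "") + text
            parent.remove(child)
    return expanded


if __name__ == "__main__":
    result = expand_prm(PARDEF_PRM, "lh")
    print(result.find(".//l").text, result.find(".//r").text)
```

The text handling is needed because ElementTree stores the characters that follow an element in that element's tail, so removing the marker alone would also drop them.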

nn-nb:

This language pair contains a special type of compound tag. rem-compounds.xsl : This XSLT file is used to remove the compounds from the dictionary files.



Basic Understanding of the project:

Problem with Current Scenario:


1. The XSLT and other pre-processing steps need to be applied manually to the meta dictionary files before the dictionary files are actually compiled.

2. The XSLT files vary from one language pair to another.

Expected:


1. There should be common XSLT files and pre-processing for all types of meta dictionary files (we currently have two types). Target: buscPar.xsl and principal.xsl should be common.

2. lttoolbox should be updated so that no separate pre-processing is needed when working with a pair that contains meta dictionary files.

Approach 1

Modify the XSLT files to make them common.

The makefile should be updated accordingly so that the generated dictionary file is named in the same way as for the other (non-metadictionary) pairs. That way everything looks the same once the installation is done.

Approach 2

Modify the XSLT files to make them common.

In lttoolbox, lt-comp should be updated so that it can compile metadictionary files as well as dictionary files. When it is given a metadictionary file, it would perform the pre-processing and then the usual compilation of the dictionary file. But doesn't this mean that the pre-processing always happens at compilation time? We could pre-process only when it is needed (that is, when the pre-processing has not already been done).
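The "pre-process only when needed" idea could be sketched roughly as below. This is Python for illustration only and not lt-comp code; the function names is_metadix and needs_preprocessing, the marker test and the timestamp rule are all assumptions about how such a check might work, not an existing lttoolbox feature.

```python
import os
import re

# Assumption: a file counts as metadix if it contains a <prm/> or <sa/>
# marker, or a prm="..." / sa="..." attribute.
META_MARKER = re.compile(r'<(prm|sa)\s*/>|\s(prm|sa)="')


def is_metadix(xml_text):
    """Guess whether the dictionary text uses metadix features."""
    return META_MARKER.search(xml_text) is not None


def needs_preprocessing(metadix_path, dix_path):
    """True if the generated .dix is missing or older than the metadix."""
    if not os.path.exists(dix_path):
        return True
    return os.path.getmtime(metadix_path) > os.path.getmtime(dix_path)
```

With a check like this, the pre-processing would run for a metadictionary only when the generated dictionary is missing or out of date, which is the same rule a makefile dependency would express.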




Comments/Suggestion

Please share your comments on my understanding about the metadix formats and the project.

[Comment Shared by Francis] Information about Portuguese v="" and Occitan alt="" is missing.