Difference between revisions of "Unification of metadix and parametrized dictionaries"

(New page: Different language-pair packages use different strategies to generate .dix dictionaries (monodix) and (bidix) from XML files using features not supported by the .dix format. The ob...)
 
(I'm still writing up)

Revision as of 15:02, 30 October 2007

Different language-pair packages use different strategies to generate .dix dictionaries (monodix and bidix) from XML files that use features not supported by the .dix format. The objectives of these new dix-like formats are:

  • being able to use parametrized paradigms (so that a general paradigm can be defined and reused with small parametrized variations), as discussed in the metadix page. Paradigms are of two kinds: word form paradigms, as found in the oc-ca pair, and grammatical symbol paradigms, as found in the en-ca dictionaries.
  • being able to generate different versions of a translator (for instance, for two different varieties of a language, such as Brazilian and European Portuguese), whose names could ideally be tied to mode names.

There is currently a debate about unifying these formats into a single metadix format, which in turn could also be used to support other desirable features, such as:

  • having metadata (headers) in dictionaries that state whether the dictionary is bilingual or monolingual and which language pairs and modes it supports (perhaps this could be added to the basic .dix format).

Here is a proposal (open to discussion) on the first two issues.

Variants

  • endowing the e element with a vnt (variant) attribute, so that the corresponding metadix entry goes to the generated .dix only if that variant is selected (entries without a vnt attribute go to the .dix unconditionally). This would replace the v attribute currently used in the es-ca dictionaries to handle the Valencian dialect and in the es-pt dictionaries to distinguish European and Brazilian Portuguese. Here is an example:
       <e vnt="cat">
         <p>
           <l>haguéssim</l>
           <r>haver<s n="vbhaver"/><s n="pis"/><s n="p1"/><s n="pl"/><j/></r>
         </p>
       </e>

       <e vnt="val">
         <p>
           <l>haguérem</l>
           <r>haver<s n="vbhaver"/><s n="pis"/><s n="p1"/><s n="pl"/><j/></r>
         </p>
       </e>  
  • having a way to mark a block of entries as belonging to a certain variant. To that end, a wrapping element use (name under discussion; currently named aversion [sic]) would contain a set of entries, and this would be equivalent to marking each of those entries with the same value of the vnt attribute, so that writing
<use vnt="xx">
  <e> ... </e>
  <e> ... </e>
</use>

would be equivalent to having

<e vnt="xx"> ... </e>
<e vnt="xx"> ... </e>

This may be seen as a way to factor out a common value of vnt. Perhaps the use element could also be extended to factor out other attributes, such as r.

The variant mechanism could also be extended to other linguistic data files, such as structural transfer files (.t1x, .t2x, etc.).
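
To make the variant proposal in this section concrete, here is a minimal Python sketch of how a preprocessor might turn a metadix section into a plain .dix section for one selected variant. It is only an illustration, not an existing Apertium tool: the function names select_variant and left_sides, the sample section and the extra "casa" entry are all hypothetical. Two behaviours are assumptions that the proposal does not settle: an entry inside use that carries its own vnt keeps that value, and the vnt attribute is stripped from the generated output.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical metadix fragment; the entries are simplified for illustration.
METADIX = """
<section id="main" type="standard">
  <e><p><l>casa</l><r>casa<s n="n"/></r></p></e>
  <e vnt="cat"><p><l>haguéssim</l><r>haver<s n="vbhaver"/></r></p></e>
  <use vnt="val">
    <e><p><l>haguérem</l><r>haver<s n="vbhaver"/></r></p></e>
  </use>
</section>
"""


def select_variant(section, variant):
    """Return a copy of `section` holding only the entries for `variant`.

    Entries without vnt are always kept. A <use vnt="xx"> wrapper is
    treated as if each entry inside it carried vnt="xx".
    """
    out = ET.Element(section.tag, section.attrib)
    for child in section:
        if child.tag == "use":
            wrapper_vnt = child.get("vnt")
            # Assumption: an entry's own vnt wins over the wrapper's value.
            candidates = [(e, e.get("vnt", wrapper_vnt)) for e in child]
        else:
            candidates = [(child, child.get("vnt"))]
        for entry, vnt in candidates:
            if vnt is None or vnt == variant:
                kept = copy.deepcopy(entry)
                # Assumption: the generated .dix should not contain vnt.
                kept.attrib.pop("vnt", None)
                out.append(kept)
    return out


def left_sides(section):
    """Collect the <l> text of every entry, for a quick inspection."""
    return [e.find("p/l").text for e in section.findall("e")]


root = ET.fromstring(METADIX)
for variant in ("cat", "val"):
    print(variant, left_sides(select_variant(root, variant)))
```

Building a new section rather than editing the parsed tree in place keeps the metadix source intact, so the same tree can be filtered once per variant.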

Parametrized paradigms

(to be written)