Difference between revisions of "Unification of metadix and parametrized dictionaries"

From Apertium
Line 31: Line 31:
 
</pre>
 
</pre>
   
  +
==Alternatives==
==Variant (alternatives)==
 
   
 
* Endowing the <code>e</code> element with a <code>alt</code> (variant) attribute (currently <code>v</code> in some dictionaries and <code>aversion</code> [sic] in others), so that the corresponding [[metadix]] entry will go to the generated .dix only if that alternative or variant is selected (entries without an <code>alt</code> will go to the .dix unconditionally); this will replace the use of the <code>v</code> or <code>aversion</code> attributes as found in the es-ca dictionaries to treat the Valencian dialect or in the es-pt dictionaries to distinguish European and Brazilian Portuguese, or in the oc-ca dictionaries to distinguish the Aranese variety. Here is an es-ca [[monodix]] example:
:After talking with some people and getting some more ideas, I kind of like having the idea of "alternative" (<code>alt</code>) instead of variant or variety. so in the following examples we'd have:
 
   
 
<pre>
 
<pre>
<e alt="ca@valencian">
+
<e alt="ca">
...
 
<e alt="ca">
 
...
 
</pre>
 
 
* Endowing the <code>e</code> element with a <code>vnt</code> (variant) attribute (currently <code>v</code> in some dictionaries and <code>aversion</code> [sic] in others), so that the corresponding [[metadix]] entry will go to the generated .dix only if that variant is selected (entries without a <code>vnt</code> will go to the .dix unconditionally); this will replace the use of the <code>v</code> attribute as found in the es-ca dictionaries to treat the Valencian dialect or in the es-pt dictionaries to distinguish European and Brazilian Portuguese. Here is an es-ca [[monodix]] example:
 
 
<pre>
 
<e vnt="cat">
 
 
<p>
 
<p>
 
<l>haguéssim</l>
 
<l>haguéssim</l>
Line 52: Line 43:
 
</e>
 
</e>
   
<e vnt="val">
+
<e alt="ca@valencian">
 
<p>
 
<p>
 
<l>haguérem</l>
 
<l>haguérem</l>
Line 63: Line 54:
   
 
<pre>
 
<pre>
<e lm="correto" vnt="pt_BR">
+
<e lm="correto" alt="pt_BR">
 
<i>corret</i>
 
<i>corret</i>
 
<par n="abert/o__adj"/>
 
<par n="abert/o__adj"/>
 
</e>
 
</e>
 
 
<e lm="correcto" vnt="pt_PT">
+
<e lm="correcto" alt="pt_PT">
 
<i>correct</i>
 
<i>correct</i>
 
<par n="abert/o__adj"/>
 
<par n="abert/o__adj"/>
Line 74: Line 65:
 
</pre>
 
</pre>
   
In structural transfer rule files, <code>rule</code>'s may also have variants as observed in the es-pt package. The proposal is to endow element <code>rule</code> with the <code>vnt</code> attribute (currently <code>v</code>):
+
In structural transfer rule files, <code>rule</code>'s may also have variants as observed in the es-pt package. The proposal is to endow element <code>rule</code> with the <code>alt</code> attribute (currently <code>v</code> or <code>aversion</code>):
   
* having a way to mark a block of entries as belonging to a certain variant. To that end, a wrapping element <code>e-group</code> (name in discussion; currently named <code>aversion</code> [sic]) would contain a set of entries and this will be equivalent to having all these entries marked with a certain value of the <code>vnt</code> attribute so that writing
+
* having a way to mark a group of entries as belonging to a certain variant. To that end, a wrapping element <code>e-group</code> (name in discussion; currently named <code>aversion</code> [sic]) would contain a set of entries and this will be equivalent to having all these entries marked with a certain value of the <code>alt</code> attribute so that writing
   
 
<pre>
 
<pre>
<e-group vnt="xx">
+
<e-group alt="xx">
 
<e> ... </e>
 
<e> ... </e>
 
<e> ... </e>
 
<e> ... </e>
Line 88: Line 79:
   
 
<pre>
 
<pre>
<e vnt="xx"> ... </e>
+
<e alt="xx"> ... </e>
<e vnt="xx"> ... </e>
+
<e alt="xx"> ... </e>
 
</pre>
 
</pre>
   
This may be seen as a way to ''factor out'' a common value of <code>vnt</code>. Perhaps one could extend the use of <code>e-group</code> to factor out other attributes such as <code>r</code>.
+
This may be seen as a way to ''factor out'' a common value of <code>alt</code>. Perhaps one could extend the use of <code>e-group</code> to factor out other attributes such as <code>r</code>.
   
 
This could be extended to other linguistic data files such as structural transfer files (.t1x, .t2x, etc.). The corresponding element would be <code>rule-group</code>.
 
This could be extended to other linguistic data files such as structural transfer files (.t1x, .t2x, etc.). The corresponding element would be <code>rule-group</code>.
   
A generalization of this would be a <code>wrap</code> element having three attributes: <code>element</code> (the name of the element affected), <code>attribute</code> (the name of the attribute) and <code>value</code> (the value of the attribute), so that <code><wrap element="e" attribute="vnt" value="xx"></code> would be equivalent to <code><e-group vnt="xx"></code> and <code><wrap element="rule" attribute="vnt" value="xx"></code> would be equivalent to <code><rule-group vnt="xx"></code>, but perhaps this is too general.
+
A generalization of this would be a <code>wrap</code> element having three attributes: <code>element</code> (the name of the element affected), <code>attribute</code> (the name of the attribute) and <code>value</code> (the value of the attribute), so that <code><wrap element="e" attribute="alt" value="xx"></code> would be equivalent to <code><e-group alt="xx"></code> and <code><wrap element="rule" attribute="alt" value="xx"></code> would be equivalent to <code><rule-group alt="xx"></code>, but perhaps this is too general.
   
 
==Parametrized paradigms==
 
==Parametrized paradigms==
Line 188: Line 179:
   
 
<pre>
 
<pre>
<pardef n="ab/i[T]ar__vblex" form-prm-list="cons">
+
<pardef n="ab/i[T]ar__vblex" prm-list="cons">
 
<e vnt="oc@aran">
 
<e vnt="oc@aran">
 
<p>
 
<p>
Line 194: Line 185:
 
<r>i</r>
 
<r>i</r>
 
</p>
 
</p>
<i><form-prm n="cons"/></i>
+
<i><txt-prm n="cons"/></i>
 
<par n="cànt/iga__vblex"/>
 
<par n="cànt/iga__vblex"/>
 
</e>
 
</e>
 
<e vnt="oc@aran">
 
<e vnt="oc@aran">
<i>i<form-prm n="cons"/></i>
+
<i>i<txt-prm n="cons"/></i>
 
<par n="cant/ar__vblex"/>
 
<par n="cant/ar__vblex"/>
 
</e>
 
</e>
 
<e vnt="oc">
 
<e vnt="oc">
<i>i<form-prm n="cons"/></i>
+
<i>i<txt-prm n="cons"/></i>
 
<par n="cant/as__vblex"/>
 
<par n="cant/as__vblex"/>
 
</e>
 
</e>
Line 213: Line 204:
 
<e lm="abitar">
 
<e lm="abitar">
 
<i>ab</i>
 
<i>ab</i>
<par n="ab/i[T]ar__vblex" form-prms="cons='t'"/>
+
<par n="ab/i[T]ar__vblex" prms="cons='t'"/>
 
</e>
 
</e>
 
<e lm="abocinar">
 
<e lm="abocinar">
 
<i>aboc</i>
 
<i>aboc</i>
<par n="ab/i[T]ar__vblex" form-prms="cons='n'"/>
+
<par n="ab/i[T]ar__vblex" prms="cons='n'"/>
 
</e>
 
</e>
 
<e lm="originar">
 
<e lm="originar">
 
<i>orig</i>
 
<i>orig</i>
<par n="ab/i[T]ar__vblex" form-prms="cons='n'"/>
+
<par n="ab/i[T]ar__vblex" prms="cons='n'"/>
 
</e>
 
</e>
 
 
 
<e lm="brilhar">
 
<e lm="brilhar">
 
<i>br</i>
 
<i>br</i>
<par n="ab/i[T]ar__vblex" form-prms="cons='lh'"/>
+
<par n="ab/i[T]ar__vblex" prms="cons='lh'"/>
 
</e>
 
</e>
 
</pre>
 
</pre>
   
The use of named parameters allows for a simple mechanism for multiple parameters. For instance the parameterized paradigm could have <code>form-prm-list="cons vowel"</code> and calls could have, for instance, <code>form-prms="cons='lh' vowel='a'"</code>.
+
The use of named parameters allows for a simple mechanism for multiple parameters. For instance the parameterized paradigm could have <code>prm-list="cons vowel"</code> and calls could have, for instance, <code>prms="cons='lh' vowel='a'"</code>.
 
;Fran
 
 
I think it would be better to have just one attribute for prm. It would be passed as text, and then have two elements to write out either a symbol or text.
 
 
<pre>
 
<pardef n="ab/i[T]ar__vblex" prm-list="cons">
 
<e vnt="oc@aran">
 
<p>
 
<l>í</l>
 
<r>i</r>
 
</p>
 
<i><prm-txt n="cons"/></i>
 
<par n="cànt/iga__vblex"/>
 
</e>
 
<e vnt="oc">
 
<i>i<prm-txt n="cons"/></i>
 
<par n="cant/as__vblex"/>
 
</e>
 
</pardef>
 
<pardef n="house__n" prm-list="count">
 
<e>
 
<p>
 
<l/>
 
<r><s n="n"/><prm-sym n="count"/><s n="sg"/></r>
 
</p>
 
</e>
 
<e>
 
<p>
 
<l>s</l>
 
<r><s n="n"/><prm-sym n="count"/><s n="pl"/></r>
 
</p>
 
</e>
 
</pardef>
 
 
Calling them:
 
 
<e lm="time"><i>time</i><par n="house__n" prms="countable='unc'"/></e>
 
<e lm="brilhar"><i>br</i><par n="ab/i[T]ar__vblex" prms="cons='lh'"/></e>
 
 
</pre>
 
   
 
====Symbol parameters====
 
====Symbol parameters====
Line 280: Line 230:
   
 
<pre>
 
<pre>
<pardef n="house__n" symbol-prm-list="countable">
+
<pardef n="house__n" prm-list="countable">
 
<e c="CP: nouns which add -s">
 
<e c="CP: nouns which add -s">
 
<p>
 
<p>
Line 299: Line 249:
   
 
<pre>
 
<pre>
<e lm="time"><i>time</i><par n="house__n" symbol-prms="countable='unc'"/></e>
+
<e lm="time"><i>time</i><par n="house__n" prms="countable='unc'"/></e>
 
</pre>
 
</pre>
   
in a completely analogous way to word-form parameters.
+
in a completely analogous way to word-form parameters. The only difference is that <code><symbol-prm n="x"></code> generates <code><s n="y"></code> if the value of <code>x</code> is <code>y</code>, whereas <code><txt-prm n="x"></code> generates just <code>y</code>.
   
   

Revision as of 06:34, 1 November 2007

Different language-pair packages use different strategies to generate .dix dictionaries (monodix and bidix) from XML files that use features not supported by the .dix format. The objectives of these new dix-like formats are:

  • being able to use parametrized paradigms (so that a general paradigm may be defined and used with small parametrized variations), as discussed in the metadix page;
  • being able to generate different versions of a translator (for instance, for two different varieties of a language, such as Brazilian and European Portuguese), whose names could ideally be tied to mode names.

There is currently a debate on a unification of these formats into a single metadix format which in turn could also be used to support other desirable features such as

  • having metadata (headers) in dictionaries which defines whether the dictionary is a bilingual or monolingual dictionary and the language pairs and modes it supports (perhaps this could be added to the basic .dix format)

Here is a proposal (open to discussion) on the first two issues.

Header

It would be useful to have the language name(s) and probably some other information (maybe on varieties?) specified in some kind of a header file.


<dictionary>
  <header>
    <type>monolingual</type>
    <language code="ca" full="Català"/>
    <alternatives>
      <alternative code="ca@valencian" full="Valencià"/>
      <alternative code="ca" full="Català"/>
    </alternatives>
  </header>
 
  ...

Alternatives

  • Endowing the e element with an alt (alternative, or variant) attribute (currently v in some dictionaries and aversion [sic] in others), so that the corresponding metadix entry goes to the generated .dix only if that alternative or variant is selected (entries without an alt go to the .dix unconditionally). This will replace the v or aversion attributes found in the es-ca dictionaries to treat the Valencian dialect, in the es-pt dictionaries to distinguish European and Brazilian Portuguese, and in the oc-ca dictionaries to distinguish the Aranese variety. Here is an es-ca monodix example:
       <e alt="ca">
         <p>
           <l>haguéssim</l>
           <r>haver<s n="vbhaver"/><s n="pis"/><s n="p1"/><s n="pl"/><j/></r>
         </p>
       </e>

       <e alt="ca@valencian">
         <p>
           <l>haguérem</l>
           <r>haver<s n="vbhaver"/><s n="pis"/><s n="p1"/><s n="pl"/><j/></r>
         </p>
       </e>  

And an es-pt monodix example:

       <e lm="correto" alt="pt_BR">
         <i>corret</i>
         <par n="abert/o__adj"/>
       </e>

       <e lm="correcto" alt="pt_PT">
         <i>correct</i>
         <par n="abert/o__adj"/>
       </e>
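
The selection rule described above (keep an entry if it has no alt, or if its alt matches the chosen alternative) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an existing Apertium tool: the function name select_entries, the enclosing section element and the unmarked "aberto" entry are hypothetical and were added for the example.

```python
import xml.etree.ElementTree as ET

def select_entries(section, alt):
    """Return the <e> children of `section` that belong in the .dix for `alt`.

    Hypothetical helper: entries without an alt attribute are kept
    unconditionally; entries with one are kept only when it matches.
    """
    kept = []
    for e in section.findall("e"):
        entry_alt = e.get("alt")
        if entry_alt is None or entry_alt == alt:
            # The generated .dix has no alt attribute, so drop it.
            e.attrib.pop("alt", None)
            kept.append(e)
    return kept

# The two es-pt entries from the text plus one unmarked entry
# (the "aberto" entry is an assumption made for this example).
SECTION = ET.fromstring("""
<section>
  <e lm="correto" alt="pt_BR"><i>corret</i><par n="abert/o__adj"/></e>
  <e lm="correcto" alt="pt_PT"><i>correct</i><par n="abert/o__adj"/></e>
  <e lm="aberto"><i>abert</i><par n="abert/o__adj"/></e>
</section>
""")

brazilian = select_entries(SECTION, "pt_BR")
print([e.get("lm") for e in brazilian])
```

Selecting "pt_BR" keeps the Brazilian entry and the unmarked one, and drops the entry marked "pt_PT".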

In structural transfer rule files, rules may also have variants, as observed in the es-pt package. The proposal is to endow the rule element with the alt attribute (currently v or aversion).

  • Having a way to mark a group of entries as belonging to a certain variant. To that end, a wrapping element e-group (name under discussion; currently named aversion [sic]) would contain a set of entries, and this would be equivalent to marking each of these entries with the same value of the alt attribute, so that writing
<e-group alt="xx">
<e> ... </e>
<e> ... </e>
</e-group>

would be equivalent to having

<e alt="xx"> ... </e>
<e alt="xx"> ... </e>

This may be seen as a way to factor out a common value of alt. Perhaps one could extend the use of e-group to factor out other attributes such as r.
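
As a rough sketch of this factoring-out, the following Python function flattens e-group elements into plain entries carrying the group's alt value. The function name expand_groups and the lm values "a", "b" and "c" are hypothetical; the choice to leave an entry's own alt untouched when it already has one is an assumption, since the proposal does not say how such a conflict should be resolved.

```python
import xml.etree.ElementTree as ET

def expand_groups(parent):
    """Replace each <e-group alt="x"> child of `parent` by its entries,
    each marked with alt="x" (hypothetical helper, modifies in place)."""
    new_children = []
    for child in list(parent):
        if child.tag == "e-group":
            alt = child.get("alt")
            for e in child.findall("e"):
                # Assumption: an alt already present on the entry wins.
                if alt is not None and e.get("alt") is None:
                    e.set("alt", alt)
                new_children.append(e)
        else:
            new_children.append(child)
    parent[:] = new_children

section = ET.fromstring(
    '<section>'
    '<e lm="a"/>'
    '<e-group alt="xx"><e lm="b"/><e lm="c"/></e-group>'
    '</section>'
)
expand_groups(section)
flattened = [(e.tag, e.get("lm"), e.get("alt")) for e in section]
print(flattened)
```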

This could be extended to other linguistic data files such as structural transfer files (.t1x, .t2x, etc.). The corresponding element would be rule-group.

A generalization of this would be a wrap element having three attributes: element (the name of the element affected), attribute (the name of the attribute) and value (the value of the attribute), so that <wrap element="e" attribute="alt" value="xx"> would be equivalent to <e-group alt="xx"> and <wrap element="rule" attribute="alt" value="xx"> would be equivalent to <rule-group alt="xx">, but perhaps this is too general.

Parametrized paradigms

There are two kinds of parametrized paradigms, which may be called word form paradigms and symbol paradigms (word form paradigms are observed in the oc-ca pair and grammatical symbol paradigms are observed in the en-ca dictionaries).

Current usage of word form paradigms

Here is an example of word form paradigms as currently used in the oc-ca dictionaries:

<pardef n="ab/i[T]ar__vblex">
  <e aversion="oc@aran-ca">
    <p>
      <l>í</l>
      <r>i</r>
    </p>
    <i><prm/></i>
    <par n="cànt/iga__vblex"/>
  </e>
  <e aversion="oc@aran-ca">
    <i>i<prm/></i>
    <par n="cant/ar__vblex"/>
  </e>
  <e aversion="oc-ca">
    <i>i<prm/></i>
    <par n="cant/as__vblex"/>
  </e>
</pardef>

In the example, <prm/> will be substituted by the value of the single parameter, as in:

<e lm="abitar">
  <i>ab</i>
  <par n="ab/i[T]ar__vblex" prm="t"/>
</e>
<e lm="abocinar">
  <i>aboc</i>
  <par n="ab/i[T]ar__vblex" prm="n"/>
</e>
<e lm="originar">
  <i>orig</i>
  <par n="ab/i[T]ar__vblex" prm="n"/>
</e>

<e lm="brilhar">
  <i>br</i>
  <par n="ab/i[T]ar__vblex" prm="lh"/>
</e>

There's only a single unnamed parameter <prm/>, which may be a limitation in some applications. Don't be misled by the naming of paradigms: it isn't parsed, and it is just for the dictionary writer to remember what is substituted and kept.


Current usage of symbol paradigms

Here's an example of symbol paradigms as used in the en-ca dictionary.

 <pardef n="house__n">
   <e c="CP: nouns which add -s">
      <p>
         <l/>
         <r><s n="n"/><sa/><s n="sg"/></r>
      </p>
   </e>
   <e>
      <p>
         <l>s</l>
         <r><s n="n"/><sa/><s n="pl"/></r>
      </p>
   </e>
 </pardef>

The placeholder <sa/> is then replaced by a symbol <s n="..."/> whose name is the value of the sa attribute in the call:

<e lm="time"><i>time</i><par n="house__n" sa="unc"/></e>

This has the same limitations as word form parameters (a single unnamed parameter, etc.). Furthermore, the two mechanisms are heterogeneous and could be unified.

A unifying proposal

Word-form parameters

Here are the two examples above in the new proposed notation, which will be explained below:

<pardef n="ab/i[T]ar__vblex" prm-list="cons">
  <e vnt="oc@aran">
    <p>
      <l>í</l>
      <r>i</r>
    </p>
    <i><txt-prm n="cons"/></i>
    <par n="cànt/iga__vblex"/>
  </e>
  <e vnt="oc@aran">
    <i>i<txt-prm n="cons"/></i>
    <par n="cant/ar__vblex"/>
  </e>
  <e vnt="oc">
    <i>i<txt-prm n="cons"/></i>
    <par n="cant/as__vblex"/>
  </e>
</pardef>

And the call:

<e lm="abitar">
  <i>ab</i>
  <par n="ab/i[T]ar__vblex" prms="cons='t'"/>
</e>
<e lm="abocinar">
  <i>aboc</i>
  <par n="ab/i[T]ar__vblex" prms="cons='n'"/>
</e>
<e lm="originar">
  <i>orig</i>
  <par n="ab/i[T]ar__vblex" prms="cons='n'"/>
</e>

<e lm="brilhar">
  <i>br</i>
  <par n="ab/i[T]ar__vblex" prms="cons='lh'"/>
</e>

The use of named parameters allows for a simple mechanism for multiple parameters. For instance, the parametrized paradigm could have prm-list="cons vowel" and calls could have, for instance, prms="cons='lh' vowel='a'".
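
To make the named-parameter notation concrete, here is a minimal Python sketch of how a preprocessor might read the prms attribute of a call and check it against the prm-list of the paradigm. The function name bind_parameters and the decision to reject names that are not declared in prm-list are assumptions made for this example; the proposal itself does not specify any error handling.

```python
import re

# One name='value' pair, as in prms="cons='lh' vowel='a'".
_PAIR = re.compile(r"(\w+)='([^']*)'")

def bind_parameters(prm_list, prms):
    """Map parameter names to values for one <par> call (hypothetical helper).

    `prm_list` is the pardef's prm-list attribute, `prms` the call's prms
    attribute. Names not declared in prm-list are rejected (an assumption).
    """
    declared = prm_list.split()
    values = dict(_PAIR.findall(prms))
    undeclared = sorted(set(values) - set(declared))
    if undeclared:
        raise ValueError("undeclared parameter(s): " + ", ".join(undeclared))
    return values

bound = bind_parameters("cons vowel", "cons='lh' vowel='a'")
print(bound)
```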

Symbol parameters

Here are the two symbol-parameter examples above in the new notation.

The definition of the paradigm:

 <pardef n="house__n" prm-list="countable">
   <e c="CP: nouns which add -s">
      <p>
         <l/>
         <r><s n="n"/><symbol-prm n="countable"/><s n="sg"/></r>
      </p>
   </e>
   <e>
      <p>
         <l>s</l>
         <r><s n="n"/><symbol-prm n="countable"/><s n="pl"/></r>
      </p>
   </e>
 </pardef>

And the call:

<e lm="time"><i>time</i><par n="house__n" prms="countable='unc'"/></e>

This works in a completely analogous way to word-form parameters. The only difference is that <symbol-prm n="x"/> generates <s n="y"/> if the value of x is y, whereas <txt-prm n="x"/> generates just y.
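
The difference between the two placeholders can be illustrated with a short Python sketch that expands both kinds inside an element. The function name expand_placeholders is hypothetical, and dropping a symbol-prm whose parameter has no value is an assumption, not something the proposal states.

```python
import xml.etree.ElementTree as ET

def _splice(elem, prev, text):
    """Append `text` where a removed child used to sit."""
    if prev is None:
        elem.text = (elem.text or "") + text
    else:
        prev.tail = (prev.tail or "") + text

def expand_placeholders(elem, values):
    """Expand <txt-prm/> and <symbol-prm/> below `elem` in place.

    `values` maps parameter names to their values (hypothetical helper).
    """
    prev = None
    for child in list(elem):
        value = values.get(child.get("n"))
        if child.tag == "symbol-prm" and value is not None:
            # A symbol parameter becomes a symbol element <s n="value"/>.
            child.tag = "s"
            child.set("n", value)
            prev = child
        elif child.tag in ("txt-prm", "symbol-prm"):
            # A text parameter becomes plain text; a symbol parameter
            # without a value is dropped (assumption).
            text = (value or "") if child.tag == "txt-prm" else ""
            _splice(elem, prev, text + (child.tail or ""))
            elem.remove(child)
        else:
            expand_placeholders(child, values)
            prev = child

right = ET.fromstring('<r><s n="n"/><symbol-prm n="countable"/><s n="sg"/></r>')
expand_placeholders(right, {"countable": "unc"})
symbols = [(c.tag, c.get("n")) for c in right]

stem = ET.fromstring('<i>i<txt-prm n="cons"/></i>')
expand_placeholders(stem, {"cons": "lh"})
print(symbols, stem.text)
```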


Fran's summary

This is a brief summary of the various ways in which the new format could be used. The examples come from different languages, but any single language pair might use all of these features.

<pardef n="ab/i[T]ar__vblex" prm-list="cons">
  <e alt="oc@aran">
    <p>
      <l>í</l>
      <r>i</r>
    </p>
    <i><prm-txt n="cons"/></i>
    <par n="cànt/iga__vblex"/>
  </e>
  <e>
    <i>i<prm-txt n="cons"/></i>
    <par n="cant/as__vblex"/>
  </e>
</pardef>

<pardef n="house__n" prm-list="count">
  <e>
    <p>
      <l/>
      <r><s n="n"/><prm-sym n="count"/><s n="sg"/></r>
    </p>
  </e>
  <e>
    <p>
      <l>s</l>
      <r><s n="n"/><prm-sym n="count"/><s n="pl"/></r>
    </p>
  </e>
</pardef>

Calling them:

  <e lm="house"><i>house</i><par n="house__n"/></e>
  <e lm="time"><i>time</i><par n="house__n" prms="count='unc'"/></e>
  <e lm="brilhar"><i>br</i><par n="ab/i[T]ar__vblex" prms="cons='lh'"/></e>
  <e lm="aerodrom"><i>aerodrom</i><par n="aerodrom__n"/></e>
  <e lm="avion"><i>avion<b/>luka</i><par n="aerodrom__n"/></e>
  <e-group alt="sh_HR">
    <e lm="zrakoplov"><i>zrakoplov</i><par n="aerodrom__n"/></e>
    <e lm="zračna luka"><i>zračna<b/>luka</i><par n="luka__n"/></e>
  </e-group>
  <e lm="fajl" alt="sh_BS"><i>fajl</i><par n="aerodrom__n"/></e>
  <e lm="datoteka"><i>datoteka</i><par n="luka__n"/></e>