Difference between revisions of "User:David Nemeskey/CG XML brainstorming"

== Alternative set handling ==

Sets are defined ([[User:Unhammer|unhammer]] is right in that actual modification never happens) in the <code>'''&lt;sets&gt;'''...'''&lt;/sets&gt;'''</code> section.

There are two kinds of sets in CG: named and temporary. The obvious first question is: do we represent them as two different tags, or not? The two options are listed below:

{| class="wikitable"
! Set type
! CG syntax
! XML syntax 1
! XML syntax 2
|-
|Named
|<code>''Nominal''</code>
|'''<code>&lt;nset name="</code>'''''Nominal'''''<code>"/&gt;</code>
|'''<code>&lt;set name="</code>'''''Nominal'''''<code>"/&gt;</code>
|-
|Temporary
|<code>''(n) OR (adj)''</code>
|'''<code>&lt;set&gt;</code>'''&lt;or&gt;&lt;tag n="''n''"/&gt;&lt;tag n="''adj''"/&gt;&lt;/or&gt;'''<code>&lt;/set&gt;</code>'''
|'''<code>&lt;set&gt;</code>'''&lt;or&gt;&lt;tag n="''n''"/&gt;&lt;tag n="''adj''"/&gt;&lt;/or&gt;'''<code>&lt;/set&gt;</code>'''
|}

Well, the two options are almost the same, and both are very easy to write a parser for. However, I am a bit concerned about the (human) readability of the latter.

As for set definition, there are currently two ways to do that: with the ''LIST'' and the ''SET'' keywords. The former is a big OR of tags (including lemmas and sequences), while the latter builds **TO BE CONTINUED**

{| class="wikitable"
! Item
! CG syntax
! XML syntax
|-
|Set definition
|<code>LIST ''set-name'' = ''...'' ;</code>
|'''<code>&lt;define-set name="</code>'''''set-name'''''<code>"&gt;</code>'''''...'''''<code>&lt;/define-set&gt;</code>'''<br>
'''<code>&lt;dset name="</code>'''''set-name'''''<code>"&gt;</code>'''''...'''''<code>&lt;/dset&gt;</code>'''
|-
|Set modification
|<code>SET ''set-name'' = ''...'' ;</code>
|'''<code>&lt;modify-set name="</code>'''''set-name'''''<code>"&gt;</code>'''''...'''''<code>&lt;/modify-set&gt;</code>'''<br>
'''<code>&lt;mset name="</code>'''''set-name'''''<code>"&gt;</code>'''''...'''''<code>&lt;/mset&gt;</code>'''
|}

The '''<code>define-set</code>''' tag works exactly like '''<code>set</code>'''; the only difference is that the former is named and can only be used in the SETS section. The ''...'' in set modification can include the following set operations:

{| class="wikitable"
! Operation
! CG syntax
! XML syntax
|-
|Union
|''A OR B''
|'''<code>&lt;union&gt;</code>???A???B???<code>&lt;/union&gt;</code>'''<br>
'''<code>&lt;or&gt;</code>???A???B???<code>&lt;/or&gt;</code>'''
|-
|Concatenation
|''A + B''
|'''<code>&lt;concat&gt;</code>???A???B???<code>&lt;/concat&gt;</code>'''
|-
|Difference
|''A - B''
|'''<code>&lt;diff&gt;</code>???A???B???<code>&lt;/diff&gt;</code>'''
|-
|}

Note: I imagine the above to be akin to Lisp operators, e.g. <code>(or A (concat B C) (diff D E))</code>. This format has the benefit of explicitly encoding the precedence in the formula, so grammarians won't have to memorize it.



Revision as of 08:46, 1 July 2013

This page lists my (and others') ideas of what the CG XML format could or should look like.

Sets and lists

The words set and list are used interchangeably in CG. This is in contrast to how these terms are used in computer science, and partly to the commonsensical meanings of the words as well. The current planning process might be just the right time to fix this issue. I propose to say good-bye to list.

The (XML) tags below will be used throughout the grammar for specifying tags and sets in e.g. constraint conditions.

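{| class="wikitable"
! Item
! CG syntax
! XML syntax
! Fran's suggestion
|-
|Regular tag
|<code>nom</code>
|<code>&lt;tag&gt;nom&lt;/tag&gt;</code>
|<code>&lt;tag n="nom"/&gt;</code>
|-
|Sequence tag
|<code>(n pl)</code>
|<code>&lt;seq&gt;&lt;tag&gt;n&lt;/tag&gt;&lt;tag&gt;pl&lt;/tag&gt;&lt;/seq&gt;</code>
|
|-
|Reading base-form
|<code>"dog"</code>
|<code>&lt;lemma&gt;dog&lt;/lemma&gt;</code>
|
|-
|Word-form
|<code>"&lt;dogs&gt;"</code>
|<code>&lt;word&gt;dogs&lt;/word&gt;</code>
|
|-
|Set
|<code>(...)</code>
|see below
|
|-
|Special tags
|<code>&gt;&gt;&gt;</code> and <code>&lt;&lt;&lt;</code>
|<code>&lt;sbegin/&gt;</code> and <code>&lt;send/&gt;</code>
|
|}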

Observations:

  1. seq and set are very similar, which might be a problem when skimming through a CG
  2. I don't know if we even need set -- in the construction rules, you have to put sets everywhere, and those will have separate XML tags anyway.
  3. seq could be combined(-tag)?
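
To get a feel for how the proposed elements map back to CG, here is a minimal Python sketch. It assumes the element names from the table above and accepts both spellings of a regular tag (text content and Fran's <code>n</code> attribute); the function name <code>to_cg</code> is made up for this illustration and is not part of any existing tool.

```python
import xml.etree.ElementTree as ET


def to_cg(elem):
    """Render one element of the proposed XML as CG syntax (illustrative only)."""
    if elem.tag == "tag":
        # Accept both proposed spellings: <tag>nom</tag> and <tag n="nom"/>.
        return elem.get("n") or (elem.text or "").strip()
    if elem.tag == "seq":
        # A sequence tag is a parenthesised, space-separated list of tags.
        return "(" + " ".join(to_cg(child) for child in elem) + ")"
    if elem.tag == "lemma":
        return '"' + (elem.text or "") + '"'
    if elem.tag == "word":
        return '"<' + (elem.text or "") + '>"'
    if elem.tag == "sbegin":
        return ">>>"
    if elem.tag == "send":
        return "<<<"
    raise ValueError("unknown element: " + elem.tag)


examples = [
    "<tag>nom</tag>",
    '<tag n="nom"/>',
    "<seq><tag>n</tag><tag>pl</tag></seq>",
    "<lemma>dog</lemma>",
    "<word>dogs</word>",
    "<sbegin/>",
]
rendered = [to_cg(ET.fromstring(x)) for x in examples]
```

Both spellings of the regular tag come out the same, so from the converter's point of view the choice between them is purely about readability.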

Delimiters

Probably the easiest of the bunch:

<delimiters>(word forms, sets, etc.)</delimiters>

Alternative set handling

Sets are defined (unhammer is right in that actual modification never happens) in the <sets>...</sets> section.

There are two kinds of sets in CG: named and temporary. The obvious first question is: do we represent them as two different tags, or not? The two options are listed below:

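{| class="wikitable"
! Set type
! CG syntax
! XML syntax 1
! XML syntax 2
|-
|Named
|<code>Nominal</code>
|<code>&lt;nset name="Nominal"/&gt;</code>
|<code>&lt;set name="Nominal"/&gt;</code>
|-
|Temporary
|<code>(n) OR (adj)</code>
|<code>&lt;set&gt;&lt;or&gt;&lt;tag n="n"/&gt;&lt;tag n="adj"/&gt;&lt;/or&gt;&lt;/set&gt;</code>
|<code>&lt;set&gt;&lt;or&gt;&lt;tag n="n"/&gt;&lt;tag n="adj"/&gt;&lt;/or&gt;&lt;/set&gt;</code>
|}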

Well, the two options are almost the same, and both are very easy to write a parser for. However, I am a bit concerned about the (human) readability of the latter.
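
To illustrate how little the two options differ for a parser, here is a minimal Python sketch that handles both. The function name <code>classify_set</code> and its return values are made up for this illustration; only the element and attribute names come from the table above.

```python
import xml.etree.ElementTree as ET


def classify_set(elem):
    """Tell a named set reference from a temporary set (illustrative only)."""
    if elem.tag == "nset":
        # Syntax 1: a dedicated tag marks the named set.
        return ("named", elem.get("name"))
    if elem.tag == "set" and elem.get("name") is not None:
        # Syntax 2: the same tag, told apart by the name attribute.
        return ("named", elem.get("name"))
    if elem.tag == "set":
        # No name: a temporary set whose children spell out its content.
        return ("temporary", [child.tag for child in elem])
    raise ValueError("not a set element: " + elem.tag)


syntax1 = classify_set(ET.fromstring('<nset name="Nominal"/>'))
syntax2 = classify_set(ET.fromstring('<set name="Nominal"/>'))
temporary = classify_set(
    ET.fromstring('<set><or><tag n="n"/><tag n="adj"/></or></set>')
)
```

With syntax 2 the parser needs one extra attribute check, while a human reader has to look for the <code>name</code> attribute to tell the two kinds apart.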

As for set definition, there are currently two ways to do that: with the LIST and the SET keywords. The former is a big OR of tags (including lemmas and sequences), while the latter builds **TO BE CONTINUED**

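{| class="wikitable"
! Item
! CG syntax
! XML syntax
|-
|Set definition
|<code>LIST set-name = ... ;</code>
|<code>&lt;define-set name="set-name"&gt;...&lt;/define-set&gt;</code><br>
<code>&lt;dset name="set-name"&gt;...&lt;/dset&gt;</code>
|-
|Set modification
|<code>SET set-name = ... ;</code>
|<code>&lt;modify-set name="set-name"&gt;...&lt;/modify-set&gt;</code><br>
<code>&lt;mset name="set-name"&gt;...&lt;/mset&gt;</code>
|}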

The define-set tag works exactly like set; the only difference is that the former is named and can only be used in the SETS section. The ... in set modification can include the following set operations:

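{| class="wikitable"
! Operation
! CG syntax
! XML syntax
|-
|Union
|<code>A OR B</code>
|<code>&lt;union&gt;???A???B???&lt;/union&gt;</code><br>
<code>&lt;or&gt;???A???B???&lt;/or&gt;</code>
|-
|Concatenation
|<code>A + B</code>
|<code>&lt;concat&gt;???A???B???&lt;/concat&gt;</code>
|-
|Difference
|<code>A - B</code>
|<code>&lt;diff&gt;???A???B???&lt;/diff&gt;</code>
|}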

Note: I imagine the above to be akin to Lisp operators, e.g. (or A (concat B C) (diff D E)). This format has the benefit of explicitly encoding the precedence in the formula, so grammarians won't have to memorize it.
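
The correspondence with Lisp can be made concrete with a minimal Python sketch that prints a nested operator element as an s-expression. It assumes that operands are written as named set references in XML syntax 2 (<code>&lt;set name="A"/&gt;</code>), which is only one of the options discussed above; the function name <code>to_sexpr</code> is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Operator element names taken from the table of set operations.
OPERATORS = {"union", "or", "concat", "diff"}


def to_sexpr(elem):
    """Render a nested set expression as a Lisp-style s-expression."""
    if elem.tag == "set" and elem.get("name") is not None:
        # Assumption: operands are named set references (XML syntax 2).
        return elem.get("name")
    if elem.tag in OPERATORS:
        # The nesting of the XML carries the precedence; no rules needed.
        parts = [elem.tag] + [to_sexpr(child) for child in elem]
        return "(" + " ".join(parts) + ")"
    raise ValueError("unexpected element: " + elem.tag)


xml_expr = """
<or>
  <set name="A"/>
  <concat><set name="B"/><set name="C"/></concat>
  <diff><set name="D"/><set name="E"/></diff>
</or>
""".strip()

sexpr = to_sexpr(ET.fromstring(xml_expr))
```

The converter never consults a precedence table, which is the point of the note: the grouping is already spelled out by the element nesting.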

Sets

Set definitions and modifications. The section itself is enclosed in a <sets>...</sets> tag.

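{| class="wikitable"
! Item
! CG syntax
! XML syntax
|-
|Set definition
|<code>LIST set-name = ... ;</code>
|<code>&lt;define-set name="set-name"&gt;...&lt;/define-set&gt;</code><br>
<code>&lt;dset name="set-name"&gt;...&lt;/dset&gt;</code>
|-
|Set modification
|<code>SET set-name = ... ;</code>
|<code>&lt;modify-set name="set-name"&gt;...&lt;/modify-set&gt;</code><br>
<code>&lt;mset name="set-name"&gt;...&lt;/mset&gt;</code>
|}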

The define-set tag works exactly like set; the only difference is that the former is named and can only be used in the SETS section. The ... in set modification can include the following set operations:

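{| class="wikitable"
! Operation
! CG syntax
! XML syntax
|-
|Union
|<code>A OR B</code>
|<code>&lt;union&gt;???A???B???&lt;/union&gt;</code><br>
<code>&lt;or&gt;???A???B???&lt;/or&gt;</code>
|-
|Concatenation
|<code>A + B</code>
|<code>&lt;concat&gt;???A???B???&lt;/concat&gt;</code>
|-
|Difference
|<code>A - B</code>
|<code>&lt;diff&gt;???A???B???&lt;/diff&gt;</code>
|}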

Note: I imagine the above to be akin to Lisp operators, e.g. (or A (concat B C) (diff D E)). This format has the benefit of explicitly encoding the precedence in the formula, so grammarians won't have to memorize it.
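
As a rough check that a set definition round-trips, here is a minimal Python sketch that turns a <code>define-set</code> (or <code>dset</code>) element into a CG <code>LIST</code> line. The content model is an assumption: since the ... is still open, the sketch only handles <code>tag</code> (with the <code>n</code> attribute), <code>seq</code> and <code>lemma</code> children. The function names are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def item_to_cg(elem):
    """Render one list item; only tag, seq and lemma are handled here."""
    if elem.tag == "tag":
        return elem.get("n")
    if elem.tag == "seq":
        return "(" + " ".join(item_to_cg(child) for child in elem) + ")"
    if elem.tag == "lemma":
        return '"' + (elem.text or "") + '"'
    raise ValueError("unsupported list item: " + elem.tag)


def define_set_to_list(elem):
    """Render <define-set> or <dset> as a CG LIST definition."""
    if elem.tag not in ("define-set", "dset"):
        raise ValueError("not a set definition: " + elem.tag)
    items = " ".join(item_to_cg(child) for child in elem)
    return "LIST " + elem.get("name") + " = " + items + " ;"


definition = ET.fromstring(
    '<define-set name="Nominal">'
    '<tag n="n"/><tag n="adj"/>'
    '<seq><tag n="n"/><tag n="pl"/></seq>'
    "</define-set>"
)
list_line = define_set_to_list(definition)
```

The long and the short element names go through the same code path, so which of the two gets picked does not matter to a converter.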

CG never modifies sets. You can define one set based on other sets, but that's a new set definition, not an old set being changed. --unhammer 05:42, 28 June 2013 (UTC)

Constraints