User:David Nemeskey/CG XML brainstorming
This page lists my (and others') ideas of what the CG XML format could or should look like.
Sets and lists
The words set and list are used interchangeably in CG. This is in contrast to how these terms are used in CS, and partly to the commonsense meanings of the words as well. The current planning process might be just the right time to fix this issue. I propose to say good-bye to list.
The (XML) tags below will be used throughout the grammar for specifying tags and sets in e.g. constraint conditions.
Item | CG syntax | XML syntax | Fran's suggestion |
---|---|---|---|
Regular tag | nom | <tag>nom</tag> | <tag n="nom"/> |
Sequence tag | (n pl) | <seq><tag>n</tag><tag>pl</tag></seq> | |
Reading base-form | "dog" | <lemma>dog</lemma> | |
Word-form | "<dogs>" | <word>dogs</word> | |
Set | (...) | see below | |
Special tags | >>> and <<< | <sbegin/> and <send/> | |
Votes: | | | |
Observations:
- seq and set are very similar, which might be a problem when skimming through a CG.
- I don't know if we even need set: in the construction rules, you have to put sets everywhere, and those will have separate XML tags anyway.
- seq could be combined(-tag)?
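To make the mapping in the table above concrete, here is a minimal Python sketch that turns a single CG item into the proposed XML element. The function names `cg_item_to_xml` and `render` are made up for illustration, and the sketch assumes that any parenthesised item is a sequence tag (it does not handle sets).

```python
import xml.etree.ElementTree as ET


def cg_item_to_xml(item: str) -> ET.Element:
    """Map one CG item to the XML element proposed in the table (hypothetical helper)."""
    if item == ">>>":
        return ET.Element("sbegin")
    if item == "<<<":
        return ET.Element("send")
    if item.startswith('"<') and item.endswith('>"'):
        # Word-form: "<dogs>" becomes <word>dogs</word>
        el = ET.Element("word")
        el.text = item[2:-2]
        return el
    if item.startswith('"') and item.endswith('"'):
        # Reading base-form: "dog" becomes <lemma>dog</lemma>
        el = ET.Element("lemma")
        el.text = item[1:-1]
        return el
    if item.startswith("(") and item.endswith(")"):
        # Assumption: a parenthesised item is a sequence of plain tags.
        seq = ET.Element("seq")
        for name in item[1:-1].split():
            ET.SubElement(seq, "tag").text = name
        return seq
    el = ET.Element("tag")
    el.text = item
    return el


def render(item: str) -> str:
    """Serialise the element for one CG item as a string."""
    return ET.tostring(cg_item_to_xml(item), encoding="unicode")
```

For example, `render("(n pl)")` gives the `seq` form from the table, and `render('"<dogs>"')` gives the `word` form.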
Delimiters
Probably the easiest of the bunch:
<delimiters>(word forms, sets, etc.)</delimiters>
Alternative set handling
Sets are defined (unhammer is right in that actual modification never happens) in the <sets>...</sets> section.
There are two kinds of sets in CG: named and temporary. The obvious first question is: do we represent them as two different tags, or not? The two options are listed below:
Set type | CG syntax | XML syntax 1 | XML syntax 2 |
---|---|---|---|
Named | Nominal | <nset name="Nominal"/> | <set name="Nominal"/> |
Temporary | (n) OR (adj) | <set><or><tag n="n"/><tag n="adj"/></or></set> | <set><or><tag n="n"/><tag n="adj"/></or></set> |
Votes: | | | |
Well, the two options are almost the same, and both are very easy to write a parser for. However, I am a bit concerned about the (human) readability of the latter.
As for set definition, there are currently two ways to do that: with the LIST and the SET keywords. The former is a big OR of tags (incl. lemmas and sequences), while the latter builds sets from other sets. Again, we have two options here: we can either just have one tag (e.g. set), and it is up to the user to follow the conventions; or we can have separate tags for the two. I propose the latter:
Item | CG syntax | XML syntax |
---|---|---|
Basic set | LIST set-name = ... ; | <basic-set name="set-name">...</basic-set> |
"Meta" set* | SET set-name = ... ; | <meta-set name="set-name">...</meta-set> |
(*) Suggestions on what to call this tag are welcome.
The ... in a meta set definition, as well as in case of temporary sets, can include the following set operations:
Operation | CG syntax | XML syntax |
---|---|---|
Union | A OR B | <union><set name="A"/><set name="B"/></union> |
Concatenation | A + B | <concat><set name="A"/><set name="B"/></concat> |
Difference | A - B | <diff><set name="A"/><set name="B"/></diff> |
Note: The operation tags above can be thought of as functions that return a new set, e.g. or(A, concat(B, C), diff(D, E)). This format has the benefit of explicitly encoding the precedence in the formula, so grammarians won't have to memorize it.
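The "operations as functions" reading can be sketched in a few lines of Python. The helper names (`set_ref`, `union`, `concat`, `diff`, `to_call`) are invented for this illustration, and the sketch assumes named operands are written as `<set name="..."/>`. It only shows how nesting the elements fixes the precedence.

```python
import xml.etree.ElementTree as ET


def set_ref(name: str) -> ET.Element:
    """Reference to a named set, assumed to be written as <set name="..."/>."""
    return ET.Element("set", {"name": name})


def _op(tag: str, *operands: ET.Element) -> ET.Element:
    """Build an operation element whose children are its operands, in order."""
    el = ET.Element(tag)
    el.extend(operands)
    return el


def union(*operands: ET.Element) -> ET.Element:
    return _op("union", *operands)


def concat(*operands: ET.Element) -> ET.Element:
    return _op("concat", *operands)


def diff(*operands: ET.Element) -> ET.Element:
    return _op("diff", *operands)


def to_call(el: ET.Element) -> str:
    """Render an element tree back into function-call notation."""
    if el.tag == "set":
        return el.get("name")
    return f"{el.tag}({', '.join(to_call(child) for child in el)})"


# The nesting of the calls is the nesting of the XML, so no precedence table is needed.
expr = union(set_ref("A"), concat(set_ref("B"), set_ref("C")), diff(set_ref("D"), set_ref("E")))
xml_text = ET.tostring(expr, encoding="unicode")
```

Parsing `xml_text` again and passing the result to `to_call` reproduces the same nested expression, which is the property the note relies on.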
Constraints
Similar to its SETS counterpart, the CONSTRAINTS section is enclosed in a <constraints>...</constraints> tag.
At first I am going to cover only the three CG-2 constraint types: SELECT, REMOVE and IFF. Each type has its own tag: select, remove and iff, respectively (the other option, of course, would be <constraint type="select">). Each rule shall have a target and zero or more cond(ition)s.
An example: "<fly>" REMOVE (V) IF (-1C (DET));
<remove>
  <target><tag n="V"/></target>
  <cond><word n="fly"/></cond>
  <cond pos="-1" type="safe"><tag n="DET"/></cond>
</remove>
Observations / questions:
- target is always a set, so in the case of a simple tag, there is no need to convert it to a set manually (the parentheses in the original format).
- cond has two parameters: the position (pos) and the type, which is empty by default, but can be C (safe?), *, **, etc.
- Word-form conditions, which were traditionally written before the constraint name, are now on the same level as the other conditions. We could enforce that one rule can only have one word-form condition, or not.
The question of link tags is a tricky one. I don't think it makes sense to create <link><link><link><cond ... /></link></link></link>-type monstrosities. If we stick to SAX parsing, we could simply write them after the <cond> tag, i.e. <cond ... /><link ... /><link ... />. Since this wouldn't look very nice if there is more than one condition, I would enclose each such block in a <condition> tag, e.g.
<remove>
  <target><tag n="V"/></target>
  <condition>
    <cond><word n="fly"/></cond>
  </condition>
  <condition>
    <cond pos="-1" type="safe"><tag n="DET"/></cond>
    <link pos="-1" type="safe"><tag n="PREDET"/></link>
  </condition>
</remove>
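Since SAX parsing was mentioned, here is a rough sketch of how a handler could collect each condition block as a flat chain of one cond followed by its links. The class name `ConditionHandler`, the function `parse_conditions` and the dictionary layout are all invented for this illustration, and only `tag` and `word` children with an `n` attribute are read.

```python
import xml.sax

RULE = """
<remove>
  <target><tag n="V"/></target>
  <condition>
    <cond><word n="fly"/></cond>
  </condition>
  <condition>
    <cond pos="-1" type="safe"><tag n="DET"/></cond>
    <link pos="-1" type="safe"><tag n="PREDET"/></link>
  </condition>
</remove>
"""


class ConditionHandler(xml.sax.ContentHandler):
    """Collects each <condition> as a list of cond/link tests, in document order."""

    def __init__(self):
        super().__init__()
        self.conditions = []
        self._chain = None  # tests of the <condition> currently open
        self._test = None   # the <cond> or <link> currently open

    def startElement(self, name, attrs):
        if name == "condition":
            self._chain = []
        elif name in ("cond", "link") and self._chain is not None:
            self._test = {
                "kind": name,
                "pos": attrs.get("pos", "0"),
                "type": attrs.get("type", ""),
                "items": [],
            }
            self._chain.append(self._test)
        elif name in ("tag", "word") and self._test is not None:
            self._test["items"].append((name, attrs.get("n")))

    def endElement(self, name):
        if name in ("cond", "link"):
            self._test = None
        elif name == "condition":
            self.conditions.append(self._chain)
            self._chain = None


def parse_conditions(xml_text: str):
    """Return the list of condition chains found in one rule."""
    handler = ConditionHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.conditions
```

Because the links are siblings of the cond rather than nested inside it, the handler needs no stack deeper than "current chain, current test", which is the main appeal of the flat layout.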
Having a condition tag also makes it easier to include negation: the cond and link tags can have an attribute not="true", while condition can have negate="true".
BARRIERs somehow belong to the cond and link tags. This relation could perhaps be best represented as an attribute, but that wouldn't allow the user to specify a temporary set as the barrier. So we need a barrier tag, which can be placed in two ways: under the condition tag, interposed between the cond and link tags, or under the latter. If we go with the first option, there's still the question of whether to place the barrier tag before ("semantic" ordering) or after (same as in the original format) the conditions/links. If we opt for the second, we have to introduce new tags for the condition part in cond and link.
An example from Pasi Tapanainen[1], with the two alternative formats described above, semantic ordering in the first case:
(*1 A BARRIER C LINK *1 B BARRIER C)
Interposed | Separate tag | |
---|---|---|
|
| |
Votes: |
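For discussion purposes, here is one possible reading of the two placements in Python, applied to the example above. Everything here is a guess: in particular, the `test` tag for the "condition part" in the second variant is a hypothetical name, A, B and C are assumed to be named sets, and the scanning position `*1` is assumed to be encoded as pos="1" with type="*".

```python
import xml.etree.ElementTree as ET

# Variant 1 (assumed): barrier interposed under <condition>, placed before the test it belongs to.
INTERPOSED = """
<condition>
  <barrier><set name="C"/></barrier>
  <cond pos="1" type="*"><set name="A"/></cond>
  <barrier><set name="C"/></barrier>
  <link pos="1" type="*"><set name="B"/></link>
</condition>
"""

# Variant 2 (assumed): barrier inside cond/link, next to a hypothetical <test> tag.
SEPARATE = """
<condition>
  <cond pos="1" type="*">
    <test><set name="A"/></test>
    <barrier><set name="C"/></barrier>
  </cond>
  <link pos="1" type="*">
    <test><set name="B"/></test>
    <barrier><set name="C"/></barrier>
  </link>
</condition>
"""


def _set_name(el: ET.Element) -> str:
    return el.find("set").get("name")


def _head(el: ET.Element) -> str:
    prefix = "LINK " if el.tag == "link" else ""
    return f"{prefix}{el.get('type', '')}{el.get('pos')}"


def interposed_to_cg(xml_text: str) -> str:
    parts, barrier = [], None
    for el in ET.fromstring(xml_text):
        if el.tag == "barrier":
            barrier = _set_name(el)  # applies to the next cond/link
            continue
        text = f"{_head(el)} {_set_name(el)}"
        if barrier is not None:
            text += f" BARRIER {barrier}"
            barrier = None
        parts.append(text)
    return "(" + " ".join(parts) + ")"


def separate_to_cg(xml_text: str) -> str:
    parts = []
    for el in ET.fromstring(xml_text):
        text = f"{_head(el)} {_set_name(el.find('test'))}"
        barrier = el.find("barrier")
        if barrier is not None:
            text += f" BARRIER {_set_name(barrier)}"
        parts.append(text)
    return "(" + " ".join(parts) + ")"
```

Both converters yield the same CG condition for their input, so the choice between the variants is about readability and ease of parsing rather than expressiveness. The interposed variant needs the parser to remember a pending barrier, while the separate variant keeps everything local to one element.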
CG-3 features
Since not all features of CG-3 are used by the grammars in Apertium (or so I've heard), first I'd like to cover those that are.
Sub-readings
Sub-reading in | CG syntax | XML syntax |
---|---|---|
target | SELECT SUB:1 | |
condition | IF (0/1 ...) | |
Sets (the original idea)
Set definitions and modifications. The section itself is enclosed in a <sets>...</sets> tag.
Item | CG syntax | XML syntax |
---|---|---|
Set definition | LIST set-name = ... ; | <define-set name="set-name">...</define-set> |
Set modification | SET set-name = ... ; | <modify-set name="set-name">...</modify-set> |
The define-set tag works exactly like set; the only difference is that the former is named and can only be used in the SETS section. The ... in set modification can include the following set operations:
Operation | CG syntax | XML syntax |
---|---|---|
Union | A OR B | <union>???A???B???</union> |
Concatenation | A + B | <concat>???A???B???</concat> |
Difference | A - B | <diff>???A???B???</diff> |
Note: I imagine the above to be akin to Lisp operators, e.g. (or A (concat B C) (diff D E)). This format has the benefit of explicitly encoding the precedence in the formula, so grammarians won't have to memorize it.
- CG never modifies sets. You can define one set based on other sets, but that's a new set definition, not an old set being changed. --unhammer 05:42, 28 June 2013 (UTC)
- A nice way to annotate set operations could be something like this:
<or>
  <set n="A"/>
  <set n="B"/>
  <set>...</set>
</or>
It's not as nice as the s-expressions, but it is after all XML :) --Krvoje 19:23, 1 July 2013 (UTC)
- Krvoje: if you look at the examples for temporary sets and or above, you see it is exactly how I imagined it would look. Of course there's still the question of whether the first two should be nset or not. As for XML, I completely agree. :)
[1] Tapanainen, P. 1996. The Constraint Grammar Parser CG-2. Publications 27, Department of General Linguistics, University of Helsinki.