Difference between revisions of "TSX format"

m (TSX moved to TSX format)

Revision as of 19:09, 2 March 2008

The TSX format is used in Apertium to define a tagger description file.

Defining tags

The 'tagset' section defines the correspondence between the simple or multiple morphological categories that define a lexical form and the coarser categories with which the part-of-speech tagger works.

Here is an excerpt from a TSX file for Norwegian Bokmål:

<?xml version="1.0" encoding="UTF-8"?>
<tagger name="Norwegian Bokmål">
  <tagset>
    <def-label name="NOMM">
      <tags-item tags="n.m.*"/>
    </def-label>
    <def-label name="BARE-ADV">
      <tags-item lemma="bare" tags="adv"/>
    </def-label>
    <def-label name="DETF" closed="true">
      <tags-item tags="det.*.f.*"/>
    </def-label>
  </tagset>
</tagger>

Each def-label element defines one coarse tag in terms of a list of fine tags and has a mandatory unique name. The optional attribute closed="true" specifies that the defined fine tags belong to a closed list. Each tags-item element gives, in its tags attribute, a dot-separated subsequence of the morphological tags corresponding to the coarse tag, optionally associated with a given lemma through its lemma attribute.

Each 'def-mult' defines one coarse tag in terms of a sequence of coarse tags previously defined as 'def-label' elements or a sequence of fine tags. Each 'def-mult' requires a name and may also have the optional attribute 'closed="true"' if it belongs to a closed list.

Element 'sequence' encloses a set of tags or labels that defines a unit with more than one label.

The 'label' attribute of each 'label-item' corresponds to a coarse tag previously defined by name as a 'def-label'.
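As a rough illustration of how these pieces fit together, the fragment below sketches a 'def-mult' built from one label and one fine tag sequence. The name DETF-NOMM and the choice of labels are hypothetical: they reuse the labels from the excerpt above purely for illustration and are not taken from a real Apertium language pair.

```xml
<!-- Hypothetical multi-label unit; the name and contents are illustrative only -->
<def-mult name="DETF-NOMM" closed="true">
  <sequence>
    <!-- refers to a coarse tag already defined with def-label -->
    <label-item label="DETF"/>
    <!-- a fine tag sequence given directly -->
    <tags-item tags="n.m.*"/>
  </sequence>
</def-mult>
```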

Element 'forbid' contains sequences of morphological categories that are not allowed in a given language.

Each 'label-sequence' is restricted to two 'label-item' elements.
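A forbidden pair might look like the sketch below. The labels are borrowed from the excerpt above, and the restriction itself is an assumption made for illustration, not a claim about Norwegian Bokmål: it would forbid the coarse tag DETF from being followed by NOMM.

```xml
<forbid>
  <!-- exactly two label-items per label-sequence -->
  <label-sequence>
    <label-item label="DETF"/>
    <label-item label="NOMM"/>
  </label-sequence>
</forbid>
```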

Element 'enforce-rules' defines sets of coarse tags that must follow specified ones.

Each 'enforce-after' element encloses the set of coarse tags ('label-set') that must follow the coarse tag named in its mandatory 'label' attribute.

The set of 'label-item' elements enforced after a 'label' is enclosed in a 'label-set' element.
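The nesting described above can be sketched as follows. The rule is hypothetical and only reuses labels from the excerpt above: it would require BARE-ADV to be followed by one of the coarse tags listed in the 'label-set'.

```xml
<enforce-rules>
  <!-- label names the coarse tag that triggers the rule -->
  <enforce-after label="BARE-ADV">
    <!-- coarse tags allowed to follow it (illustrative choice) -->
    <label-set>
      <label-item label="NOMM"/>
      <label-item label="DETF"/>
    </label-set>
  </enforce-after>
</enforce-rules>
```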

Element 'preferences' makes it possible to choose between two or more fine tag sequences that are grouped under the same coarse tag.

Each 'prefer' element has a mandatory attribute 'tags' consisting of a sequence of fine tags.
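A minimal sketch of a preference follows. The fine tag sequence is hypothetical, and the use of the * wildcard in 'tags' is assumed to work as it does in 'tags-item'. The idea is that, when several fine tag sequences fall under the same coarse tag, the one matching the 'prefer' element is chosen.

```xml
<preferences>
  <!-- hypothetical fine tag sequence to prefer within its coarse tag -->
  <prefer tags="n.m.sg.*"/>
</preferences>
```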