Difference between revisions of "TSX format"

From Apertium

Revision as of 19:12, 2 March 2008

The TSX format is used in Apertium to define a tagger description file. The file is used in tagger training to provide definitions of coarse tags and basic constraints in the form of forbid and enforce rules.

Defining tags

The 'tagset' section defines the correspondence between the simple or multiple morphological categories that define a lexical form and the coarser ones with which the part-of-speech tagger works.

Here is an excerpt from a TSX file for Norwegian Bokmål:

<?xml version="1.0" encoding="UTF-8"?>
<tagger name="Norwegian Bokmål">
  <tagset>
    <def-label name="NOMM">
      <tags-item tags="n.m.*"/>
    </def-label>
    <def-label name="BARE-ADV">
      <tags-item lemma="bare" tags="adv"/>
    </def-label>
    <def-label name="DETF" closed="true">
      <tags-item tags="det.*.f.*"/>
    </def-label>
  </tagset>
</tagger>

Each def-label element defines one coarse tag in terms of a list of fine tags and has a mandatory unique name. The optional attribute closed="true" may be used to specify that the defined fine tags belong to a closed list. Each tags-item element gives a dot-separated subsequence of the morphological tags corresponding to the coarse tag, optionally in association with a given lemma.

Each 'def-mult' defines one coarse tag in terms of a sequence of coarse tags previously defined as 'def-label' elements or a sequence of fine tags. Each 'def-mult' requires a name and may also have the optional attribute 'closed="true"' if it belongs to a closed list.

The 'sequence' element encloses a set of tags or labels that defines a unit with more than one label.

Each 'label' of a 'label-item' corresponds to a coarse tag previously defined by name as a 'def-label'.
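
As a minimal sketch of the elements described above, the following tagset defines a 'def-mult' whose 'sequence' joins two labels. The label names PREP, DET and PREP-DET and their tag patterns are illustrative assumptions, not taken from an actual Apertium language pair:

<?xml version="1.0" encoding="UTF-8"?>
<tagger name="Example">
  <tagset>
    <!-- Hypothetical coarse tags referenced by the def-mult below -->
    <def-label name="PREP" closed="true">
      <tags-item tags="pr"/>
    </def-label>
    <def-label name="DET" closed="true">
      <tags-item tags="det.*"/>
    </def-label>
    <!-- A multi-label unit: a PREP immediately followed by a DET -->
    <def-mult name="PREP-DET" closed="true">
      <sequence>
        <label-item label="PREP"/>
        <label-item label="DET"/>
      </sequence>
    </def-mult>
  </tagset>
</tagger>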

Forbid

The 'forbid' element contains sequences of morphological categories that are not allowed in a given language.

Each 'label-sequence' is restricted to two 'label-items'.
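
A forbid rule might look like the sketch below, which reuses the DETF and NOMM labels from the Norwegian Bokmål excerpt above. The rule itself is an illustrative assumption (a feminine determiner directly followed by a masculine noun), not one copied from a real tagger file, and the section is assumed to sit inside 'tagger' after 'tagset':

<forbid>
  <!-- Illustrative rule: DETF may not be immediately followed by NOMM -->
  <label-sequence>
    <label-item label="DETF"/>
    <label-item label="NOMM"/>
  </label-sequence>
</forbid>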

Enforce

The 'enforce-rules' element defines sets of coarse tags that must follow specified ones.

Each 'enforce-after' encloses the set of coarse tags ('label-set') that must follow the one named in its mandatory 'label' attribute.

The set of 'label-items' enforced after a 'label' is enclosed in the 'label-set' element.
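
The nesting described above could be sketched as follows. The labels PRNPERS, VERB and ADV are hypothetical and would each need a matching 'def-label' in the 'tagset' section; the rule is an assumption made for illustration only:

<enforce-rules>
  <!-- Hypothetical rule: PRNPERS must be followed by one of the labels in the label-set -->
  <enforce-after label="PRNPERS">
    <label-set>
      <label-item label="VERB"/>
      <label-item label="ADV"/>
    </label-set>
  </enforce-after>
</enforce-rules>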

Prefer

The 'preferences' element makes it possible to choose among two or more fine tag sequences that are grouped in the same coarse tag.

Each 'prefer' element has a mandatory attribute 'tags' made of a sequence of fine tags.
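
A preferences section might look like the sketch below. The fine tag sequence is a made-up example, assumed here to compete with other sequences grouped under the same coarse tag:

<preferences>
  <!-- Hypothetical preference: choose this fine tag sequence within its coarse tag -->
  <prefer tags="vblex.pri.p3.sg"/>
</preferences>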