TSX format

From Apertium
Revision as of 19:10, 2 March 2008 by Francis Tyers (talk | contribs)

Apertium uses the TSX format to define tagger description files.

Defining tags

The 'tagset' section defines the correspondence between the simple or multiple morphological categories that define a lexical form and the coarser categories that the part-of-speech tagger works with.

Here is an excerpt from a TSX file for Norwegian Bokmål:

<?xml version="1.0" encoding="UTF-8"?>
<tagger name="Norwegian Bokmål">
  <tagset>
    <def-label name="NOMM">
      <tags-item tags="n.m.*"/>
    </def-label>
    <def-label name="BARE-ADV">
      <tags-item lemma="bare" tags="adv"/>
    </def-label>
    <def-label name="DETF" closed="true">
      <tags-item tags="det.*.f.*"/>
    </def-label>
  </tagset>
</tagger>

Each def-label element defines one coarse tag in terms of a list of fine tags and has a mandatory, unique name. The optional attribute closed="true" specifies that the defined fine tags belong to a closed list. Each tags-item element gives a dot-separated subsequence of the morphological tags corresponding to the coarse tag, optionally in association with a given lemma.

Each 'def-mult' element defines one coarse tag in terms of a sequence of coarse tags previously defined as 'def-label' elements, or a sequence of fine tags. Each 'def-mult' requires a name and may also have the optional attribute 'closed="true"' if it belongs to a closed list.

The 'sequence' element encloses a set of tags or labels that defines a unit with more than one label.

The 'label' attribute of each 'label-item' corresponds to a coarse tag previously defined by name as a 'def-label'.
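
As a rough sketch of how these elements fit together, the fragment below defines a coarse tag for a two-part unit. The label names PREP, DET and PREP-DET, and the fine tags used to define them, are hypothetical and are not taken from an actual language pair.

```xml
<tagset>
  <!-- Hypothetical coarse tags, defined first so that def-mult can refer to them -->
  <def-label name="PREP" closed="true">
    <tags-item tags="pr"/>
  </def-label>
  <def-label name="DET" closed="true">
    <tags-item tags="det.*"/>
  </def-label>
  <!-- Hypothetical coarse tag for a unit made of a preposition and a determiner -->
  <def-mult name="PREP-DET" closed="true">
    <!-- A sequence built only from previously defined labels -->
    <sequence>
      <label-item label="PREP"/>
      <label-item label="DET"/>
    </sequence>
    <!-- A sequence mixing a label with a fine-tag pattern -->
    <sequence>
      <label-item label="PREP"/>
      <tags-item tags="det.*.f.*"/>
    </sequence>
  </def-mult>
</tagset>
```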

Forbid

The 'forbid' element contains sequences of morphological categories that are not allowed in a given language.

Each 'label-sequence' is restricted to two 'label-item' elements.
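
A minimal sketch of a 'forbid' section, reusing the DETF and NOMM labels from the Norwegian Bokmål excerpt above. Whether this particular restriction is appropriate for the language is an assumption made only for illustration.

```xml
<forbid>
  <!-- Illustrative only: disallow a feminine determiner directly followed by a masculine noun -->
  <label-sequence>
    <label-item label="DETF"/>
    <label-item label="NOMM"/>
  </label-sequence>
</forbid>
```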

Enforce

The 'enforce-rules' element defines sets of coarse tags that must follow specified ones.

Each 'enforce-after' element encloses the set of coarse tags ('label-set') that must follow the coarse tag named in its mandatory 'label' attribute.

The set of 'label-item' elements enforced after a 'label' is enclosed in a 'label-set' element.
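
A possible 'enforce-rules' section might look like the sketch below. DETF comes from the excerpt above; NOMF and ADJ are hypothetical labels that would need their own 'def-label' definitions in the 'tagset' section.

```xml
<enforce-rules>
  <!-- Illustrative only: after DETF, the next coarse tag must be one of the labels in the set -->
  <enforce-after label="DETF">
    <label-set>
      <label-item label="NOMF"/>
      <label-item label="ADJ"/>
    </label-set>
  </enforce-after>
</enforce-rules>
```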

Prefer

The 'preferences' element makes it possible to decide between two or more fine tag sequences that are grouped under the same coarse tag.

Each 'prefer' element has a mandatory 'tags' attribute consisting of a sequence of fine tags.
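
A minimal sketch of a 'preferences' section follows. The fine tag sequences are hypothetical values chosen for illustration and would have to match analyses actually produced by the dictionary.

```xml
<preferences>
  <!-- Hypothetical fine tag sequences to prefer when several share one coarse tag -->
  <prefer tags="vblex.pri.p3.sg"/>
  <prefer tags="n.m.sg.ind"/>
</preferences>
```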