Difference between revisions of "TSX format"
Revision as of 19:10, 2 March 2008
The TSX format is used in Apertium to define a tagger description file.
Defining tags
The 'tagset' section defines the correspondence between the simple or multiple morphological categories that define a lexical form (fine tags) and the coarser tags with which the part-of-speech tagger works.
Here is a clip of a TSX file for Norwegian Bokmål:
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="Norwegian Bokmål">
  <tagset>
    <def-label name="NOMM">
      <tags-item tags="n.m.*"/>
    </def-label>
    <def-label name="BARE-ADV">
      <tags-item lemma="bare" tags="adv"/>
    </def-label>
    <def-label name="DETF" closed="true">
      <tags-item tags="det.*.f.*"/>
    </def-label>
  </tagset>
</tagger>
Each def-label element defines one coarse tag in terms of a list of fine tags and has a mandatory unique name. The optional attribute closed="true" may be used to specify that the defined fine tags belong to a closed list. Each tags-item element may be a dot-separated subsequence of the morphological tags corresponding to a coarse tag, optionally in association with a given lemma.
Each 'def-mult' defines one coarse tag in terms of a sequence of coarse tags previously defined as 'def-label' elements, or a sequence of fine tags. Each 'def-mult' requires a mandatory name and may also have the optional attribute 'closed="true"' if it belongs to a closed list.
The 'sequence' element encloses a set of tags or labels that defines a unit with more than one label.
The 'label' attribute of each 'label-item' corresponds to a coarse tag previously defined by name as a 'def-label'.
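As a rough sketch of how these elements nest, the fragment below defines a multiple-unit coarse tag from one previously defined label followed by one fine-tag pattern. The names INF_PRNENC and INF and the pattern "prn.enc.*" are hypothetical and chosen only for illustration; INF is assumed to have been declared earlier with a 'def-label'.

```xml
<!-- Hypothetical example: INF is assumed to be defined earlier as a def-label -->
<def-mult name="INF_PRNENC" closed="true">
  <sequence>
    <!-- first part: a coarse tag referenced by name -->
    <label-item label="INF"/>
    <!-- second part: a fine-tag pattern (illustrative only) -->
    <tags-item tags="prn.enc.*"/>
  </sequence>
</def-mult>
```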
Forbid
The 'forbid' element contains sequences of morphological categories that are not allowed in a given language.
Each 'label-sequence' is restricted to two 'label-item' elements.
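A minimal sketch of a 'forbid' section, based on the description above, might look as follows. DETF comes from the Norwegian Bokmål clip earlier on this page; VBLEX is a hypothetical label assumed to be defined in the 'tagset'. The pair is illustrative and is not a claim about Norwegian grammar.

```xml
<forbid>
  <!-- Each label-sequence holds exactly two label-items: the first may not be followed by the second -->
  <label-sequence>
    <label-item label="DETF"/>
    <!-- VBLEX is a hypothetical coarse tag used only for illustration -->
    <label-item label="VBLEX"/>
  </label-sequence>
</forbid>
```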
Enforce
The 'enforce-rules' element defines sets of coarse tags that must follow specified ones.
Each 'enforce-after' encloses the set of coarse tags ('label-set') that must follow the coarse tag named in its mandatory 'label' attribute.
The set of 'label-item' elements enforced after a 'label' is enclosed inside the 'label-set' element.
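The nesting described above can be sketched as follows. All three label names (PRNPERS, VBLEX, VBMOD) are hypothetical and assumed to be defined in the 'tagset'; the rule reads as "after PRNPERS, only VBLEX or VBMOD may follow".

```xml
<enforce-rules>
  <!-- label names below are hypothetical, for illustration only -->
  <enforce-after label="PRNPERS">
    <label-set>
      <label-item label="VBLEX"/>
      <label-item label="VBMOD"/>
    </label-set>
  </enforce-after>
</enforce-rules>
```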
Prefer
The 'preferences' element makes it possible to choose between two or more fine tag sequences that are grouped in the same coarse tag.
Each 'prefer' element has a mandatory attribute 'tags' made of a sequence of fine tags.
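A short sketch of a 'preferences' section, assuming two fine tag sequences that fall under the same coarse tag. The tag sequences shown are hypothetical and serve only to illustrate the shape of the 'tags' attribute.

```xml
<preferences>
  <!-- hypothetical fine tag sequences; replace with ones from the language's dictionaries -->
  <prefer tags="vblex.pri.p3.sg"/>
  <prefer tags="n.m.sg"/>
</preferences>
```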