Difference between revisions of "TSX format"
Revision as of 19:09, 2 March 2008
The TSX format is used in Apertium to define a tagger description file.
Defining tags
The 'tagset' section defines the correspondence between the simple or multiple morphological categories that define a lexical form and the coarser ones with which the part-of-speech tagger works.
Here is an excerpt from a TSX file for Norwegian Bokmål:
<pre>
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="Norwegian Bokmål">
  <tagset>
    <def-label name="NOMM">
      <tags-item tags="n.m.*"/>
    </def-label>
    <def-label name="BARE-ADV">
      <tags-item lemma="bare" tags="adv"/>
    </def-label>
    <def-label name="DETF" closed="true">
      <tags-item tags="det.*.f.*"/>
    </def-label>
  </tagset>
</tagger>
</pre>
Each 'def-label' element defines one coarse tag in terms of a list of fine tags and has a mandatory, unique name. The optional attribute 'closed="true"' specifies that the defined fine tags belong to a closed list. Each 'tags-item' element is a dot-separated subsequence of the morphological tags corresponding to a coarse tag, optionally in association with a given lemma.
Each 'def-mult' defines one coarse tag in terms of a sequence of coarse tags previously defined as 'def-labels' or a sequence of fine tags. Each 'def-mult' requires a name and may also have the optional attribute 'closed="true"' if it belongs to a closed list.
The 'sequence' element encloses a set of tags or labels that defines a unit with more than one label.
The 'label' attribute of each 'label-item' refers, by name, to a coarse tag previously defined as a 'def-label'.
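As an illustration of the description above, a 'def-mult' could look like the following sketch. The names PRPDET and PREP and the fine tag pattern are hypothetical and are not taken from any real TSX file; PREP is assumed to have been defined earlier as a 'def-label'.

<pre>
<!-- Hypothetical coarse tag for a unit made of a preposition plus a determiner -->
<def-mult name="PRPDET" closed="true">
  <sequence>
    <!-- refers to a coarse tag assumed to be defined earlier as a def-label -->
    <label-item label="PREP"/>
    <!-- a sequence of fine tags given directly -->
    <tags-item tags="det.*"/>
  </sequence>
</def-mult>
</pre>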
The 'forbid' element contains sequences of morphological categories that are not allowed in a given language.
Each 'label-sequence' is restricted to two 'label-items'.
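A minimal sketch of a 'forbid' section, reusing the labels DETF and NOMM from the Norwegian Bokmål excerpt. The rule itself is only an illustration and is not claimed to appear in the actual Norwegian Bokmål file: it would state that a feminine determiner may not be immediately followed by a masculine noun.

<pre>
<forbid>
  <!-- exactly two label-items: the first label may not be followed by the second -->
  <label-sequence>
    <label-item label="DETF"/>
    <label-item label="NOMM"/>
  </label-sequence>
</forbid>
</pre>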
The 'enforce-rules' element defines sets of coarse tags that must follow specified ones.
Each 'enforce-after' element encloses the set of coarse tags ('label-set') that must follow the one named in its mandatory 'label' attribute.
The 'label-items' enforced after a 'label' are enclosed in a 'label-set' element.
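The structure just described might be written as in the sketch below. DETF comes from the excerpt above, while NOMF and ADJF are hypothetical labels assumed to be defined elsewhere as 'def-labels'; the rule is illustrative only.

<pre>
<enforce-rules>
  <!-- after DETF, only the coarse tags listed in the label-set may follow -->
  <enforce-after label="DETF">
    <label-set>
      <label-item label="NOMF"/>
      <label-item label="ADJF"/>
    </label-set>
  </enforce-after>
</enforce-rules>
</pre>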
The 'preferences' element makes it possible to choose between two or more fine tag sequences that are grouped under the same coarse tag.
Each 'prefer' element has a mandatory 'tags' attribute consisting of a sequence of fine tags.
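For example, a 'preferences' section could be sketched as follows. The fine tag sequence shown is hypothetical: it assumes that several analyses matching the same coarse tag (such as NOMM, defined as n.m.* in the excerpt above) can occur for one word and that the listed one should be chosen.

<pre>
<preferences>
  <!-- hypothetical fine tag sequence to prefer among those sharing a coarse tag -->
  <prefer tags="n.m.sg.ind"/>
</preferences>
</pre>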