Difference between revisions of "TSX format"
Latest revision as of 09:55, 6 October 2014
The TSX format is used in Apertium to define a tagger description file. The file is used in tagger training to provide definitions of coarse tags and basic constraints in the form of forbid and enforce rules.
Defining tags
The 'tagset' section defines the correspondence between the simple or multiple morphological categories that define a lexical form and the coarser ones with which the part-of-speech tagger works.
Here is an excerpt from a TSX file for Norwegian Bokmål:
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="Norwegian Bokmål">
  <tagset>
    <def-label name="NOMM">
      <tags-item tags="n.m.*"/>
    </def-label>
    <def-label name="BARE-ADV">
      <tags-item lemma="bare" tags="adv"/>
    </def-label>
    <def-label name="DETF" closed="true">
      <tags-item tags="det.*.f.*"/>
    </def-label>
  </tagset>
  ...
</tagger>
Each 'def-label' element defines one coarse tag in terms of a list of fine tags and has a mandatory, unique name. The optional attribute closed="true" specifies that the defined fine tags belong to a closed list. Each 'tags-item' element gives a dot-separated subsequence of the morphological tags corresponding to the coarse tag, optionally in association with a given lemma.
Under the 'tagset' element you can also define sequences of fine tags and coarse tags, using 'def-mult'. Each 'def-mult' defines one coarse tag in terms of a sequence of coarse tags previously defined with 'def-label', or a sequence of fine tags. Each 'def-mult' requires a name and may also have the optional attribute closed="true" if it belongs to a closed list.
For example, to define a group consisting of a preposition followed by a masculine definite article:
<def-mult name="PREPDET" closed="true">
  <sequence>
    <label-item label="PREP"/>
    <tags-item tags="det.def.m.*"/>
  </sequence>
</def-mult>
The 'sequence' element encloses a set of tags or labels that define a unit with more than one label. The 'label' attribute of each 'label-item' corresponds to a coarse tag previously defined by name in a 'def-label'.
Forbid
The 'forbid' element contains sequences of morphological categories that are not allowed in a given language.
Each 'label-sequence' is restricted to two 'label-item' elements.
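As a minimal sketch of how such a rule might look, the fragment below forbids one coarse tag from being immediately followed by another. It reuses the DETF and NOMM labels from the Norwegian Bokmål excerpt above; the rule itself is only an illustration and is not taken from a real tagger file.

```xml
<!-- Illustrative only: forbid a feminine determiner (DETF)
     immediately followed by a masculine noun (NOMM). -->
<forbid>
  <label-sequence>
    <!-- A label-sequence holds exactly two label-items, in order. -->
    <label-item label="DETF"/>
    <label-item label="NOMM"/>
  </label-sequence>
</forbid>
```

Both labels have to be defined in the 'tagset' section for the rule to refer to them.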
Enforce
The 'enforce-rules' element defines sets of coarse tags that must follow specified ones.
Each 'enforce-after' element has a mandatory 'label' attribute and encloses the set of coarse tags ('label-set') that must follow the coarse tag named in 'label'.
The 'label-item' elements enforced after a 'label' are enclosed inside a 'label-set' element.
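The nesting described above can be sketched as follows. DETF comes from the Norwegian Bokmål excerpt; NOMF and ADJF are hypothetical labels invented for this illustration and would need their own 'def-label' definitions. The rule is an assumption made for the example, not one from a real tagger file.

```xml
<!-- Illustrative only: after a DETF, require one of the labels in the set.
     NOMF and ADJF are hypothetical coarse tags, not defined in the excerpt. -->
<enforce-rules>
  <enforce-after label="DETF">
    <label-set>
      <label-item label="NOMF"/>
      <label-item label="ADJF"/>
    </label-set>
  </enforce-after>
</enforce-rules>
```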
Prefer
The 'preferences' element makes it possible to choose among two or more fine tag sequences that are grouped under the same coarse tag.
Each 'prefer' element has a mandatory 'tags' attribute made up of a sequence of fine tags.
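A minimal sketch of a 'preferences' section is shown below. The fine tag sequence n.m.sg.ind is an assumed example of an analysis that would fall under the NOMM coarse tag (n.m.*) from the excerpt above; it is not taken from a real tagger file.

```xml
<!-- Illustrative only: when several fine tag sequences share a coarse tag,
     prefer the one listed here. The tag sequence is a hypothetical example. -->
<preferences>
  <prefer tags="n.m.sg.ind"/>
</preferences>
```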
See also
- Constraint Grammar
- tag-clusterer (https://github.com/jimregan/tag-clusterer): takes pre-tagged text (unambiguously tagged text produced with, e.g., apertium-tagger or CG) and turns it into a TSX file.