MTX format

From Apertium

Latest revision as of 18:42, 29 August 2016

This page serves as a reference to the MTX format. The MTX format describes features to be used by the Perceptron tagger.

Example

Here is an example of the basic outline of an MTX file to illustrate the structure and some common constructs:

<?xml version="1.0" ?>①
<!DOCTYPE metatag [
  <!ENTITY commondefns SYSTEM "commondefns.mtx">②
]>
<!-- Comment -->③
<metatag>
  <coarse-tags tag="mytsx.tsx" />④
  <beam-width val="10" />⑤
  <defns>⑥
    &commondefns;②
    <def-str name="plus" val="+" />
    <def-macro name="foo">
      ...
    </def-macro>
    ...
  </defns>
  <feats>⑦
    <!-- Major tag (all wordoids) -->
    <feat>⑧
      ...
      <pred>...</pred>
      <out>
        <macro name="foo"></macro>
        ...
      </out>
      <out-many>...</out-many>
    </feat>
  </feats>
</metatag>

  1. The format is XML.
  2. Other files can therefore be included using XML entities, as illustrated.
  3. XML comments can also be used.
  4. To make use of coarse tags, reference a TSX file using a relative file path.
  5. This tag changes the beam width used in decoding.
  6. The defns section contains constants and macros.
  7. The feats section contains feature definitions.
  8. Each feature definition can contain many boolean predicates with <pred>, normal output with <out>, and generation of many features from an array type with <out-many>.
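
As a rough illustration of the outline above, the following Python sketch parses a cut-down MTX-like document with the standard library and lists its top-level sections. The document string is an assumption modelled on the example, not a complete or validated MTX file, and the external entity include is left out to keep the snippet self-contained.

```python
import xml.etree.ElementTree as ET

# Cut-down outline modelled on the example above (hypothetical content).
MTX_OUTLINE = """<?xml version="1.0" ?>
<metatag>
  <coarse-tags tag="mytsx.tsx" />
  <beam-width val="10" />
  <defns>
    <def-str name="plus" val="+" />
  </defns>
  <feats>
    <feat>
      <out></out>
    </feat>
  </feats>
</metatag>"""

root = ET.fromstring(MTX_OUTLINE)

# Names of the top-level sections, in document order.
sections = [child.tag for child in root]

# The beam width is stored as a string attribute, so convert it explicitly.
beam_width = int(root.find("beam-width").get("val"))

# Number of feature definitions inside the feats section.
feat_count = len(root.findall("./feats/feat"))
```

This only inspects the structure; it does not interpret the feature definitions the way the tagger does.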

Operational explanation

Features are generated for each word/subword/inflection group (hereafter referred to as wordoids). Note that each lexical unit (as defined in the Apertium stream format) can have many possible analyses, and each analysis can be made up of many wordoids, each with a lemma and a list of tags.

Each <feat> tag can generate zero, one or many features for each wordoid.
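
The nesting described above can be pictured with a minimal Python sketch. The names `Wordoid`, `Analysis`, `LexicalUnit` and `tag_features` are illustrative assumptions and not part of Apertium; the toy feature only shows how a single feature definition may yield zero, one or many features for a wordoid.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Wordoid:
    lemma: str
    tags: List[str] = field(default_factory=list)


@dataclass
class Analysis:
    wordoids: List[Wordoid]


@dataclass
class LexicalUnit:
    surface: str
    analyses: List[Analysis]


def tag_features(wordoid: Wordoid) -> List[str]:
    """Hypothetical feature: emit one feature string per tag on the wordoid."""
    return [f"tag={tag}" for tag in wordoid.tags]


# An ambiguous lexical unit with a noun reading and a verb reading.
unit = LexicalUnit(
    surface="books",
    analyses=[
        Analysis([Wordoid("book", ["n", "pl"])]),
        Analysis([Wordoid("book", ["vblex", "pri", "p3", "sg"])]),
    ],
)

# One list of features per wordoid, across all analyses.
features_per_wordoid = [
    tag_features(w) for a in unit.analyses for w in a.wordoids
]
```

A wordoid with no tags would produce an empty list here, which corresponds to the "zero features" case.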

Designing new features

One process to design a new feature might go as follows:

  1. Start from an existing feature file.
  2. Find a mistagged word.
  3. Run the tagger in debug mode as detailed on the Perceptron tagger page.
  4. Come up with a feature that would fire for the correct analysis and end up with a positive weight given the training data, or a feature that would fire for the incorrect analysis and end up with a negative weight given the training data.
  5. Rerun the tagger to see whether the mistagged word is now fixed.
  6. Run cross validation on your whole training corpus to check the overall accuracy hasn't gone down.
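
Step 4 depends on how perceptron training moves weights. The sketch below shows a simplified, textbook perceptron update in Python; it is not the tagger's actual code, and the function `update` and the feature strings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def update(weights: Dict[str, float],
           gold_feats: Iterable[str],
           predicted_feats: Iterable[str]) -> None:
    """Textbook perceptron step: reward gold features, penalise predicted ones."""
    for f in gold_feats:
        weights[f] += 1.0
    for f in predicted_feats:
        weights[f] -= 1.0


weights: Dict[str, float] = defaultdict(float)

# Hypothetical mistagging: the tagger chose the verb reading, the corpus says noun.
gold = ["prev=det&tag=n", "tag=n"]
predicted = ["prev=det&tag=vblex", "tag=vblex"]
update(weights, gold, predicted)
```

After this single update the features that fired for the correct analysis carry positive weight and those that fired for the incorrect analysis carry negative weight, which is the behaviour a newly designed feature should exhibit over the whole training corpus.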

Tag reference

Macro definition and use

To define a macro foo with arguments bar and baz:

<def-macro as="foo" args="bar baz">
  ...
  <var name="bar" />
  ...
  <var name="baz" />
  ...
</def-macro>

To use it in a feature:

<feat>
  <macro name="foo">
    <int val="42" />
    <str val="ook?" />
  </macro>
</feat>
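
One way to read the definition and use above is that the operands inside <macro> are bound to the names in args by position. That reading is an assumption rather than something this page states, and the helper `bind_macro_args` below is a hypothetical Python sketch of it.

```python
from typing import Dict, List, Union

Value = Union[int, str]


def bind_macro_args(arg_spec: str, operands: List[Value]) -> Dict[str, Value]:
    """Bind space-separated argument names to operands by position (assumed)."""
    names = arg_spec.split()
    if len(names) != len(operands):
        raise ValueError(f"expected {len(names)} operands, got {len(operands)}")
    return dict(zip(names, operands))


# args="bar baz" used with <int val="42" /> and <str val="ook?" />
bindings = bind_macro_args("bar baz", [42, "ook?"])
```

Under this assumption, each <var name="bar" /> in the macro body would stand for 42 and each <var name="baz" /> for "ook?".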

Other definitions and their use

<def-set> <def-str>

Boolean operators

<and>, <not>, <or>

The commutative operators <and> and <or> take two or more operands; <not> takes exactly one.
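
The operand counts can be mirrored in a small Python sketch. The function names are hypothetical and the sketch says nothing about evaluation order or short-circuiting in the real tagger.

```python
def op_and(*operands: bool) -> bool:
    """True only if every operand is true; needs two or more operands."""
    if len(operands) < 2:
        raise ValueError("<and> needs two or more operands")
    return all(operands)


def op_or(*operands: bool) -> bool:
    """True if any operand is true; needs two or more operands."""
    if len(operands) < 2:
        raise ValueError("<or> needs two or more operands")
    return any(operands)


def op_not(operand: bool) -> bool:
    """Negate exactly one operand."""
    return not operand


# An <and> of three operands, two of which are a nested <not> and <or>.
result = op_and(True, op_not(False), op_or(False, False, True))
```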

Arithmetic operators

<add>, <sub>

Feature extraction

<ex-wordoids>

Wordoid addressing

Lexical units are addressed with integers, but addressing wordoids must be done by...

<addr-of-ints> <clamp> <carry>

Sets

<set-has>

String operators

<substr>

Loops

<for-each>