Tools for TMX

strip out translation units for any given two languages (tmx-extract). Bitextor and other tools generate TMX files with many possible combinations of languages; for a file containing "no is en da sv", return only the TUs which are "en-da".
strip out duplicate translation units (tmx-uniq).
sort the file by line length, language, etc. (tmx-sort).
trim the file of dubious TUs (tmx-trim): very short translations of long segments, very different punctuation, translations which are exactly the same as the reference, translations which consist only of numbers, etc.
re-perform language identification (tmx-rident) of all segments given a set of candidate languages (e.g. you know the file is in either Swedish or Danish, but some entries come up as Icelandic).
re-format a TMX so that it fits the standard (tmx-reformat), e.g. turns '&' into '&amp;' etc.
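
As an illustration of the kind of clean-up tmx-reformat is meant to do, here is a minimal sketch that escapes bare ampersands while leaving existing character references alone. The name escape_bare_ampersands and the regular expression are assumptions made for this example, not taken from the actual script.

```python
import re

# Matches '&' that does not already start a named or numeric character
# reference such as '&amp;', '&#38;' or '&#x26;'.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Return text with every bare '&' replaced by '&amp;'."""
    return BARE_AMP.sub("&amp;", text)
```

Applying it to segment text before writing the file avoids double-escaping segments that were already well-formed.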

Code

The code can be found in the apertium-tools/apertium-tmx-tools sub-directory in Subversion.

Examples

tmx-extract

$ python tmx-extract.py bitext.tmx en da
  <tu tuid="532" datatype="Text">
    <note>norden_org/start/start00a8.html_norden_org/start/start77a6.html</note>
    <tuv xml:lang="en">
      <seg>Prizes for literature, music, film and the environment.</seg>
    </tuv>
    <tuv xml:lang="da">
      <seg>Priser inden for litteratur, musik, film og miljø.</seg>
    </tuv>
  </tu>
  ...
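
The selection step could be sketched roughly as follows, using the standard library's xml.etree.ElementTree. This is not the code of tmx-extract.py; the function name extract_pair is hypothetical, and the sketch assumes each language appears in its own <tuv> with an xml:lang attribute, as in the output above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pair(root, lang_a, lang_b):
    """Yield every <tu> element that has a <tuv> for both languages."""
    for tu in root.iter("tu"):
        langs = {tuv.get(XML_LANG) for tuv in tu.findall("tuv")}
        if lang_a in langs and lang_b in langs:
            yield tu
```

For a file on disk, root would be ET.parse(path).getroot(), and each yielded element can be serialised with ET.tostring(tu, encoding="unicode").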

tmx-uniq

$ python tmx-uniq.py bitext.en-da.tmx1 > bitext-en-da.uniq.tmx
Total: 11111
Unique: 1164

$ cat bitext-en-da.uniq.tmx
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
<body>  
  <tu tuid="38" datatype="Text">
    <note>norden_org/printbf9b.html_norden_org/printed72.html</note>
    <tuv xml:lang="en">
      <seg>Tools for authors - Nordic Council of Ministers/Nordic Council</seg>
    </tuv>
    <tuv xml:lang="da">
      <seg>Verkfæri fyrir höfunda - Norræna ráðherranefndin/Noðurlandaráð</seg>
    </tuv>
  </tu>
  ...
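
One way to de-duplicate is to remember each pair of segments already seen and keep only the first occurrence. The sketch below assumes that two TUs count as duplicates when their segments are identical after whitespace normalisation; the actual criterion used by tmx-uniq.py may differ, and the function name unique_units is hypothetical.

```python
def unique_units(units):
    """Yield (source, target) pairs, skipping pairs already seen.

    units is an iterable of (source_segment, target_segment) strings.
    Whitespace is normalised before comparison (an assumption).
    """
    seen = set()
    for src, tgt in units:
        key = (" ".join(src.split()), " ".join(tgt.split()))
        if key not in seen:
            seen.add(key)
            yield src, tgt
```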

Typical errors

The following are some typical errors from automated TMX generation.

Language identification

The translation is good, but the language codes are wrong: in this case "sv" should be "is" and "da" should be "en".

  <tu tuid="148" datatype="Text">
    <note>norden_org/printedfa.html_norden_org/printb9a7.html</note>
    <tuv xml:lang="sv">
      <seg>Markmiðið er að upplýsa um norræna stefnuskrá og setja í brennidepil atburði og niðurstöður opinbers samstarfs á Norðurlöndum.</seg>
    </tuv>
    <tuv xml:lang="da">
      <seg>The purpose is to inform about issues on the Nordic agenda and focus on some of the events and results of the official Nordic cooperation.</seg>
    </tuv>
  </tu>
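
Re-identification restricted to a known set of languages amounts to picking the best-scoring language among the candidates only. The sketch below shows that restriction with a deliberately crude scorer that counts distinctive characters; the character sets, char_score and reidentify are all assumptions for illustration, and a real tool would plug in a proper language identifier as the score function.

```python
# Toy evidence (assumption): characters that hint at each language.
DISTINCTIVE = {
    "is": set("ðþáéíóúýæö"),
    "sv": set("äö"),
    "da": set("æø"),
}


def char_score(segment, lang):
    """Count characters in segment that are distinctive for lang."""
    chars = DISTINCTIVE.get(lang, set())
    return sum(1 for ch in segment.lower() if ch in chars)


def reidentify(segment, candidates, score=char_score):
    """Return the candidate language with the highest score.

    Ties go to the earliest candidate in the list.
    """
    return max(candidates, key=lambda lang: score(segment, lang))
```

Passing only the languages known to be in the file, e.g. ["sv", "da"], means an entry can never come back labelled with a language outside that set.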

Segment is too short

The TU is too short to be useful, although the "translation" is correct. In this case it would probably be good to keep ordinary words, but discard acronyms and the like.

  <tu tuid="165" datatype="Text">
    <note>norden_org/printedfa.html_norden_org/printb9a7.html</note>
    <tuv xml:lang="sv">
      <seg>HTML</seg>
    </tuv>
    <tuv xml:lang="da">
      <seg>HTML</seg>
    </tuv>
  </tu>
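
The "keep words, discard acronyms" idea could be approximated as below. The heuristic (an all-uppercase alphabetic token that is identical on both sides) is an assumption for illustration, not the rule used by tmx-trim, and the function names are hypothetical.

```python
def looks_like_acronym(segment):
    """True for a single all-uppercase alphabetic token such as 'HTML'."""
    token = segment.strip()
    return token.isalpha() and token.isupper()


def keep_short_unit(src, tgt):
    """Keep a short TU unless both sides are the same acronym."""
    same = src.strip() == tgt.strip()
    return not (same and looks_like_acronym(src))
```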

Segment only consists of numbers

Aside from the fact that it is impossible[1] to do language identification on numbers, it isn't much use having these in the TMX file.

  <tu tuid="108" datatype="Text">
    <note>norden_org/start/start00a8.html_norden_org/start/start.html</note>
    <tuv xml:lang="sv">
      <seg>2009-02-27</seg>
    </tuv>
    <tuv xml:lang="da">
      <seg>27-02-2009</seg>
    </tuv>
  </tu>
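
A check for such segments can be as simple as requiring at least one digit and no letters. This is a minimal sketch under that assumption; the name is_numeric_only is hypothetical and the actual test in tmx-trim may be stricter or looser.

```python
def is_numeric_only(segment):
    """True when the segment has at least one digit and no letters."""
    has_digit = any(ch.isdigit() for ch in segment)
    has_letter = any(ch.isalpha() for ch in segment)
    return has_digit and not has_letter
```

A TU would then be dropped when both of its segments satisfy the check, as with the two date formats above.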

Notes

[1] Ok, just really difficult