Tools for TMX

From Apertium
{{Github-unmigrated-tool}}
{{TOCD}}
As it is now quite easy to make lots of large TMXes with [[Bitextor]] and other tools, it would be good to have some tools for processing them. There are various things that it would be useful to do. For example, given a TMX file:


* extract the translation units (TUs) for any given pair of languages (<code>tmx-extract</code>). Bitextor and other tools generate TMX files with many possible combinations of languages; for a file containing "no is en da sv", this would return only the TUs which are "en-da".
* strip out duplicate translation units (<code>tmx-uniq</code>).
* sort the file by line length, language, etc. (<code>tmx-sort</code>).
* trim the file of dubious TUs (<code>tmx-trim</code>): very short translations of long segments, very different punctuation, translations that are exactly the same as the original segment, translations which consist only of numbers, etc. It would also be nice to have an option to supply an MT system for the target language, to try to get a better edit-distance comparison.
* re-perform language identification (<code>tmx-rident</code>) of all segments given a number of options (e.g. you know the file is in either Swedish or Danish, but some entries come up as Icelandic).
* re-format a TMX so that it fits the standard (<code>tmx-clean</code>), e.g. turn '&' into <code>&amp;amp;</code> etc., and optionally remove formatting.<ref>Can you do this with <code>xmllint</code> maybe?</ref>
* merge TMX files (<code>tmx-merge</code>), removing duplicate TUs on the way.
* split a TMX file with many different languages (<code>tmx-split</code>) into TMX files for each of the different language pairs, optionally re-identifying the language of each segment before placing it in a separate file.
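As a rough illustration of the <code>tmx-clean</code> idea, the following Python sketch escapes bare ampersands while leaving existing entities alone. The function name <code>escape_bare_ampersands</code> and the regular expression are assumptions made for this example; they are not taken from the apertium-tmx-tools code, and a real cleaner would need to handle more than ampersands.

```python
import re

# Matches an '&' that does not already start an XML entity or character
# reference such as &amp;, &lt;, &#38; or &#x26;.
# (Hypothetical pattern written for this example.)
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Return text with every bare '&' replaced by '&amp;'."""
    return BARE_AMP.sub("&amp;", text)
```

Because the pattern skips anything that already looks like an entity, running the function twice on the same text should give the same result as running it once.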


==Code==

You can find some example code in the <code>apertium-tools/apertium-tmx-tools</code> sub-directory in [[subversion]].

;Efficiency

* <code>tmx-extract</code>
** For 188,807,877 TUs (file size: 16G), processed in 3 hours using max. 4M RAM.
* <code>tmx-uniq</code>
** For 1,055,574 TUs (file size: 298M), processed in 12 minutes using max. 1.7G RAM, resulting in 25,546 TUs
* <code>tmx-trim</code>
** For 25,546 TUs (file size: 8.5M), processed in 11 seconds resulting in 13,009 TUs
** For 16,373,421 TUs (file size: 1.5G), processed in 57 minutes resulting in 9,492,081 TUs

;Pending jobs

* Make the tools work with <code>stdin</code> and <code>stdout</code>
* Make <code>tmx-uniq</code> work with a previously sorted TMX and not do the sorting itself.
* Make <code>tmx-sort</code> faster (maybe use a similar strategy to paradigm chopper?)
* Make better heuristics for <code>tmx-trim</code>
* Rewrite python in C++ where it makes sense.

==Examples==

===tmx-trim===

<code>tmx-trim</code> takes a TMX file and applies some basic filters. It discards a TU if:

* the segments are exactly the same
* one segment is twice as long (or more) as the other
* the number of punctuation marks in the two segments differs by two or more
* either one of the segments contains no alphabetic characters

It would be quite easy to add other heuristics and to tweak the ones above.
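The four checks can be sketched in Python roughly as follows. This is not the code of <code>tmx-trim</code> itself: the function names are invented for the example, and it assumes that length is measured in characters, that "punctuation" means any character in a Unicode <code>P*</code> category, and that the thresholds are inclusive.

```python
import unicodedata


def count_punctuation(text):
    """Count characters whose Unicode category starts with 'P'."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("P"))


def has_alphabetic(text):
    """Return True if at least one character is alphabetic."""
    return any(ch.isalpha() for ch in text)


def should_discard(source, target):
    """Return True if the pair trips any of the four heuristics.

    Thresholds are assumptions: lengths are in characters and both
    comparisons are inclusive.
    """
    # 1. the segments are exactly the same
    if source == target:
        return True
    # 2. one segment is at least twice as long as the other
    shorter, longer = sorted((len(source), len(target)))
    if longer >= 2 * shorter:
        return True
    # 3. punctuation counts differ by two or more
    if abs(count_punctuation(source) - count_punctuation(target)) >= 2:
        return True
    # 4. either segment has no alphabetic characters
    if not has_alphabetic(source) or not has_alphabetic(target):
        return True
    return False
```

Keeping each heuristic as a separate early return makes it simple to add a new check or to change a threshold without touching the others.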

===tmx-extract===

<pre>
$ python tmx-extract.py bitext.tmx en da
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
<body>
<tu tuid="532" datatype="Text">
<note>norden_org/start/start00a8.html_norden_org/start/start77a6.html</note>
<tuv xml:lang="en">
<seg>Prizes for literature, music, film and the environment.</seg>
</tuv>
<tuv xml:lang="da">
<seg>Priser inden for litteratur, musik, film og miljø.</seg>
</tuv>
</tu>
...
</pre>
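A minimal sketch of how such an extraction could be done with the Python standard library is given below. It is an illustration only, not the contents of <code>tmx-extract.py</code>: the function name <code>iter_pair</code> and the choice of <code>xml.etree.ElementTree.iterparse</code> are assumptions, and it yields plain tuples rather than writing a TMX file.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_pair(source, lang_a, lang_b):
    """Yield (tuid, seg_a, seg_b) for every TU that has both languages.

    source is a filename or a binary file object. iterparse reads the
    file incrementally instead of building the whole tree first.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if lang_a in segs and lang_b in segs:
            yield elem.get("tuid"), segs[lang_a], segs[lang_b]
        # Drop this TU's children; the emptied <tu> element itself stays
        # attached to its parent.
        elem.clear()


# Tiny in-memory demo; a real run would pass a filename instead.
DEMO = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><body>
<tu tuid="1"><tuv xml:lang="en"><seg>Hello</seg></tuv><tuv xml:lang="da"><seg>Hej</seg></tuv></tu>
<tu tuid="2"><tuv xml:lang="en"><seg>Hello</seg></tuv><tuv xml:lang="sv"><seg>Hej</seg></tuv></tu>
</body></tmx>"""

for row in iter_pair(io.BytesIO(DEMO), "en", "da"):
    print(row)
```

Reading incrementally is what would keep memory use low on very large files, although this sketch makes no claim to match the figures reported under Efficiency.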

===tmx-uniq===

<pre>
$ python tmx-uniq.py bitext.en-da.tmx1 > bitext-en-da.tmx.uniq
Total: 11111
Unique: 1164

$ cat bitext-en-da.tmx.uniq
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
<body>
<tu tuid="38" datatype="Text">
<note>norden_org/printbf9b.html_norden_org/printed72.html</note>
<tuv xml:lang="en">
<seg>Tools for authors - Nordic Council of Ministers/Nordic Council</seg>
</tuv>
<tuv xml:lang="da">
<seg>Verkfæri fyrir höfunda - Norræna ráðherranefndin/Noðurlandaráð</seg>
</tuv>
</tu>
...
</pre>
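The core of a duplicate filter can be sketched as below, assuming the TUs have already been reduced to (source, target) string pairs. The function name <code>unique_pairs</code> and the decision to ignore leading and trailing whitespace are assumptions for this example, not a description of what <code>tmx-uniq.py</code> does.

```python
def unique_pairs(pairs):
    """Yield each (source, target) pair once, in first-seen order.

    Two pairs count as duplicates if they match after stripping
    surrounding whitespace (an assumption made for this sketch).
    """
    seen = set()
    for source, target in pairs:
        key = (source.strip(), target.strip())
        if key in seen:
            continue
        seen.add(key)
        yield source, target
```

This approach keeps every distinct pair in memory, so its memory use grows with the number of unique TUs.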

===tmx-sort===

<pre>
$ python tmx-sort.py bitext.en-da.tmx > bitext.en-da.tmx.sort
Total: 11111

$ cat bitext.en-da.tmx.sort
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
<body>
<tu tuid="991" datatype="Text">
<note>norden_org/print9b35.html_norden_org/print0eb2.html</note>
<tuv xml:lang="en">
<seg>to this address:</seg>
</tuv>
<tuv xml:lang="da">
<seg>til adressen:</seg>
</tuv>
</tu>
<tu tuid="1012" datatype="Text">
<note>norden_org/printb26a.html_norden_org/print0eb2.html</note>
<tuv xml:lang="en">
<seg>to this address:</seg>
</tuv>
<tuv xml:lang="da">
<seg>til adressen:</seg>
</tuv>
</tu>
...
</pre>
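In the output above the two identical TUs (tuid 991 and 1012) end up next to each other. The following sketch shows one possible sort key and how adjacent duplicates could then be dropped in a single pass. The names <code>sort_pairs</code> and <code>uniq_sorted</code> and the sort key (source length, then text) are assumptions for illustration; the actual ordering used by <code>tmx-sort.py</code> is not documented here.

```python
from itertools import groupby


def sort_pairs(pairs):
    """Sort (source, target) pairs by source length, then by text."""
    return sorted(pairs, key=lambda p: (len(p[0]), p[0], p[1]))


def uniq_sorted(sorted_pairs):
    """Yield one pair per run of identical neighbours in sorted input."""
    for pair, _run in groupby(sorted_pairs):
        yield pair
```

Because <code>uniq_sorted</code> only compares neighbours, it does not need to remember pairs it has already seen, but it is only correct when the input is already sorted.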

==Typical errors==

There follow some typical errors from automated TMX generation.

;Language identification

The translation is good, but the languages are wrong: in this case <code>sv</code> should be <code>is</code> and <code>da</code> should be <code>en</code>.

<pre>
<tu tuid="148" datatype="Text">
<note>norden_org/printedfa.html_norden_org/printb9a7.html</note>
<tuv xml:lang="sv">
<seg>Markmiðið er að upplýsa um norræna stefnuskrá og setja í brennidepil atburði og niðurstöður opinbers samstarfs á Norðurlöndum.</seg>
</tuv>
<tuv xml:lang="da">
<seg>The purpose is to inform about issues on the Nordic agenda and focus on some of the events and results of the official Nordic cooperation.</seg>
</tuv>
</tu>
</pre>

;Segment is too short

The TU is too short to be useful, although the "translation" is correct. Probably in this case it would be good to keep "words", but discard acronyms etc.

<pre>
<tu tuid="165" datatype="Text">
<note>norden_org/printedfa.html_norden_org/printb9a7.html</note>
<tuv xml:lang="sv">
<seg>HTML</seg>
</tuv>
<tuv xml:lang="da">
<seg>HTML</seg>
</tuv>
</tu>
</pre>

;Segment only consists of numbers

Aside from the fact that it is impossible<ref>Ok, just really difficult</ref> to do language identification on numbers, it isn't much use having these in the TMX file.

<pre>
<tu tuid="108" datatype="Text">
<note>norden_org/start/start00a8.html_norden_org/start/start.html</note>
<tuv xml:lang="sv">
<seg>2009-02-27</seg>
</tuv>
<tuv xml:lang="da">
<seg>27-02-2009</seg>
</tuv>
</tu>
</pre>

==Notes==
<references/>

==See also==
* [[TMX]]
* [[Translation memory]]

[[Category:Tools]]
[[Category:TMX]]

Latest revision as of 02:43, 10 March 2018

Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.
