Difference between revisions of "Tools for TMX"
Revision as of 10:42, 11 March 2009
As it is now quite easy to make lots of large TMX files with Bitextor and other tools, it would be good to have some tools for processing them. Several operations would be useful. For example, given a TMX file:
- extract the translation units (TUs) for any given pair of languages (tmx-extract). Bitextor and other tools generate TMX files with many possible combinations of languages, so it would be good to be able to say, for a file of "no is en da sv", just give me all TUs which are "en-da".
- strip out duplicate translation units (tmx-uniq).
- sort the file by line length, language, etc. (tmx-sort).
- trim the file of dubious TUs (tmx-trim): very short translations of long segments, very different punctuation, translations that are exactly the same as the original segment, translations which consist only of numbers, etc.
- re-perform language identification of all segments given a number of options (e.g. you know the file is in either Swedish or Danish, but some entries come up as Icelandic).
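Two of the ideas above, tmx-uniq and tmx-trim, can be sketched on plain (source, target) string pairs. This is only a rough illustration: the function names `is_dubious` and `unique_pairs` and the `max_ratio` threshold are assumptions made for this example, not part of any existing tool, and the punctuation check is left out.

```python
def is_dubious(src, tgt, max_ratio=3.0):
    """Return True if a segment pair looks like a bad translation unit."""
    src, tgt = src.strip(), tgt.strip()
    # Empty segments cannot be a useful translation.
    if not src or not tgt:
        return True
    # The translation is exactly the same as the original segment.
    if src == tgt:
        return True
    # A segment without any letters (numbers or punctuation only).
    if not any(c.isalpha() for c in src) or not any(c.isalpha() for c in tgt):
        return True
    # Very short translation of a long segment, or the reverse.
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    return longer / shorter > max_ratio


def unique_pairs(pairs):
    """Drop exact duplicate pairs, keeping the first occurrence of each."""
    seen = set()
    result = []
    for pair in pairs:
        key = tuple(s.strip() for s in pair)
        if key not in seen:
            seen.add(key)
            result.append(pair)
    return result
```

A character-length ratio is a crude proxy for "very short translation of a long segment"; the threshold of 3.0 is arbitrary and would need tuning per language pair.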
Examples
tmx-extract
    $ python tmx-extract.py bitext.tmx en da
    <tu tuid="532" datatype="Text">
      <note>norden_org/start/start00a8.html_norden_org/start/start77a6.html</note>
      <tuv xml:lang="en">
        <seg>Prizes for literature, music, film and the environment.</seg>
      </tuv>
      <tuv xml:lang="da">
        <seg>Priser inden for litteratur, musik, film og miljø.</seg>
      </tuv>
    </tu>
    ...
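One way the extraction step could work is sketched below with Python's standard `xml.etree.ElementTree`. This is not the actual tmx-extract.py script: the function name `extract_pairs` is hypothetical, the sample document is a shortened version of the units shown on this page, and the sketch assumes a TU matches only when its `<tuv>` languages are exactly the two requested.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_text, lang_a, lang_b):
    """Return the <tu> elements whose <tuv> languages are exactly the two given."""
    wanted = {lang_a, lang_b}
    root = ET.fromstring(tmx_text)
    return [
        tu for tu in root.iter("tu")
        if {tuv.get(XML_LANG) for tuv in tu.findall("tuv")} == wanted
    ]


# Hypothetical miniature TMX document; segments shortened for illustration.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu tuid="532" datatype="Text">
  <tuv xml:lang="en"><seg>Prizes for literature, music, film and the environment.</seg></tuv>
  <tuv xml:lang="da"><seg>Priser inden for litteratur, musik, film og miljø.</seg></tuv>
</tu>
<tu tuid="148" datatype="Text">
  <tuv xml:lang="sv"><seg>Markmiðið er að upplýsa um norræna stefnuskrá.</seg></tuv>
  <tuv xml:lang="da"><seg>The purpose is to inform about issues on the Nordic agenda.</seg></tuv>
</tu>
</body></tmx>"""

# Print every en-da translation unit as XML.
for tu in extract_pairs(SAMPLE, "en", "da"):
    print(ET.tostring(tu, encoding="unicode"))
```

Because the filter trusts the `xml:lang` labels, any TU with wrongly identified languages is selected or skipped according to its label rather than its content.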
Typical errors in Bitextor
- Language identification
The translation is good, but the languages are wrong: in this case the segment labelled "sv" is actually Icelandic ("is"), and the segment labelled "da" is actually English ("en").
    <tu tuid="148" datatype="Text">
      <note>norden_org/printedfa.html_norden_org/printb9a7.html</note>
      <tuv xml:lang="sv">
        <seg>Markmiðið er að upplýsa um norræna stefnuskrá og setja í brennidepil atburði og niðurstöður opinbers samstarfs á Norðurlöndum.
        </seg>
      </tuv>
      <tuv xml:lang="da">
        <seg>The purpose is to inform about issues on the Nordic agenda and focus on some of the events and results of the official Nordic cooperation.</seg>
      </tuv>
    </tu>
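Re-identifying the language of a segment from a restricted set of candidates could, in its simplest form, count common function words. The sketch below is only a toy: the stopword lists are tiny and hand-picked for this example (an assumption, far too small for real use), and `guess_language` is a hypothetical name. A real tool would more likely use character n-gram models.

```python
import re

# Tiny illustrative stopword lists (an assumption, not a real language model).
STOPWORDS = {
    "is": {"er", "að", "og", "um", "á", "í", "sem", "til", "við", "ekki"},
    "en": {"the", "is", "to", "and", "of", "on", "in", "about", "some", "for"},
    "da": {"er", "og", "at", "det", "en", "til", "af", "på", "for", "ikke"},
    "sv": {"är", "och", "att", "det", "en", "till", "av", "på", "för", "inte"},
}


def guess_language(segment, candidates):
    """Pick the candidate language whose stopwords occur most often.

    Returns None when no stopword from any candidate language is found.
    """
    tokens = re.findall(r"\w+", segment.lower())
    scores = {
        lang: sum(tok in STOPWORDS[lang] for tok in tokens)
        for lang in candidates
    }
    best = max(candidates, key=lambda lang: scores[lang])
    return best if scores[best] > 0 else None


IS_SEG = ("Markmiðið er að upplýsa um norræna stefnuskrá og setja í brennidepil "
          "atburði og niðurstöður opinbers samstarfs á Norðurlöndum.")
EN_SEG = ("The purpose is to inform about issues on the Nordic agenda and focus "
          "on some of the events and results of the official Nordic cooperation.")

for seg in (IS_SEG, EN_SEG):
    print(guess_language(seg, ["sv", "da", "is", "en"]))
```

Restricting `candidates` to the languages known to be in the file is what keeps closely related languages from being confused with ones that cannot occur; the correct language must of course be among the candidates.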