Difference between revisions of "Tools for TMX"


Revision as of 10:25, 11 March 2009

As it is now quite easy to make lots of large TMXes with Bitextor and other tools, it would be good to have some tools for processing them. There are various things that it would be useful to do. For example, given a TMX file:

  • extract the translation units for any given pair of languages. Bitextor and other tools generate TMX files with many possible combinations of languages, so it would be good to be able to say, for a file containing "no is en da sv", just give me all the TUs which are "en-da".
  • strip out duplicate translation units, a kind of uniq for TMX.
  • sort the file by line length, language, etc., a kind of sort for TMX.
  • trim the file of dubious TUs: very short translations of long segments, very different punctuation, etc.
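
As a rough illustration of what such tools might look like, here is a minimal Python sketch of the four operations above, using only the standard library's xml.etree.ElementTree. None of this is an existing Apertium tool: the function names (segments, select_pair, drop_duplicates, sort_by_length, drop_dubious) and the thresholds (max_ratio, max_punct_diff) are invented for the example, and the heuristics for "dubious" are only one possible reading of the description. It also assumes the whole file fits in memory, which may not hold for very large TMXes.

```python
import string
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tuv_lang(tuv):
    """Language code of a <tuv>; falls back to a plain lang attribute."""
    return (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()

def segments(tu):
    """Map language code -> segment text for one <tu> element."""
    result = {}
    for tuv in tu.findall("tuv"):
        seg = tuv.find("seg")
        if tuv_lang(tuv) and seg is not None:
            result[tuv_lang(tuv)] = "".join(seg.itertext()).strip()
    return result

def select_pair(body, lang_a, lang_b):
    """Keep only TUs that have both languages, dropping other variants."""
    wanted = {lang_a.lower(), lang_b.lower()}
    for tu in body.findall("tu"):
        if not wanted <= set(segments(tu)):
            body.remove(tu)
            continue
        for tuv in tu.findall("tuv"):
            if tuv_lang(tuv) not in wanted:
                tu.remove(tuv)

def drop_duplicates(body):
    """uniq for TMX: keep the first TU for each set of (language, text)."""
    seen = set()
    for tu in body.findall("tu"):
        key = tuple(sorted(segments(tu).items()))
        if key in seen:
            body.remove(tu)
        else:
            seen.add(key)

def sort_by_length(body, lang):
    """sort for TMX: order TUs by the length of the segment in one language."""
    tus = body.findall("tu")
    for tu in tus:
        body.remove(tu)
    tus.sort(key=lambda tu: len(segments(tu).get(lang.lower(), "")))
    body.extend(tus)

def drop_dubious(body, max_ratio=3.0, max_punct_diff=3):
    """Drop TUs whose segments differ too much in length or punctuation.
    Both thresholds are arbitrary values chosen for this example."""
    for tu in body.findall("tu"):
        texts = list(segments(tu).values())
        if len(texts) < 2:
            continue  # nothing to compare, so leave the TU alone
        lengths = [len(t) for t in texts]
        puncts = [sum(ch in string.punctuation for ch in t) for t in texts]
        if (min(lengths) == 0
                or max(lengths) / min(lengths) > max_ratio
                or max(puncts) - min(puncts) > max_punct_diff):
            body.remove(tu)

# A tiny made-up TMX document to try the functions on.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Good morning</seg></tuv><tuv xml:lang="da"><seg>God morgen</seg></tuv><tuv xml:lang="sv"><seg>God morgon</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Good morning</seg></tuv><tuv xml:lang="da"><seg>God morgen</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Hi</seg></tuv><tuv xml:lang="da"><seg>Hej</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Thank you very much for all of your help yesterday</seg></tuv><tuv xml:lang="da"><seg>Tak</seg></tuv></tu>
<tu><tuv xml:lang="no"><seg>Takk</seg></tuv><tuv xml:lang="sv"><seg>Tack</seg></tuv></tu>
</body></tmx>"""

root = ET.fromstring(SAMPLE)
body = root.find("body")
select_pair(body, "en", "da")
drop_duplicates(body)
drop_dubious(body)
sort_by_length(body, "en")
print([segments(tu)["en"] for tu in body.findall("tu")])  # ['Hi', 'Good morning']
```

The operations are kept as separate functions so that each could become its own small command-line filter, in the spirit of uniq and sort. For files too large to load at once, a streaming parser such as ET.iterparse would probably be a better starting point than ET.fromstring.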