Difference between revisions of "Bitextor"
(6 intermediate revisions by 5 users not shown) | |||
Line 1: | Line 1: | ||
'''Bitextor''' is an application that generates translation memories using multilingual websites as a corpus source. It downloads all the HTML files in a website, preprocesses them into a consistent format and, finally, applies a set of heuristics (based mainly on HTML tag structure and text block length) to pair files that are likely to contain the same text in different languages. From these candidate pairs, translation memories are generated in TMX format using the LibTagAligner library, which uses the HTML tags and the length of text chunks to perform the alignment. |
|||
'''Bitextor''' is a program for crawling multilingual websites and creating TMX files from aligned pages. |
|||
==See also== |
||
Line 7: | Line 7: | ||
==External links== |
||
* [http:// |
* [http://bitex2tmx.sourceforge.net Bitext2tmx project web] |
||
* [http://bitextor.sourceforge.net Bitextor project web] or [http://sourceforge.net/p/bitextor/wiki/Home/ Bitextor's wiki (documentation)] |
|||
* [http://omegatplus.sourceforge.net OmegaT+ project web] |
|||
* [http://tag-aligner.sourceforge.net TagAligner's project web] or [http://tag-aligner.wiki.sourceforge.net TagAligner's wiki (documentation)] |
|||
* [http://en.wikipedia.org/wiki/Translation_Memory_eXchange TMX on Wikipedia] |
||
* [http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm TMX 1.4b specification] |
|||
==Tools used by Bitextor== |
|||
* [http://software.wise-guys.nl/libtextcat/ LibTextCat (non official)] |
|||
* [http://tidy.sourceforge.net/ TidyHTML] |
|||
* [http://xmlsoft.org/ LibXML2] |
|||
* [https://sourceforge.net/projects/freshmeat_enca/ LibEnca] |
|||
* [https://sourceforge.net/projects/freshmeat_tre/ LibTRE] |
|||
[[Category:Tools]] |
Latest revision as of 13:49, 9 September 2015
Bitextor is an application that generates translation memories using multilingual websites as a corpus source. It downloads all the HTML files in a website, preprocesses them into a consistent format and, finally, applies a set of heuristics (based mainly on HTML tag structure and text block length) to pair files that are likely to contain the same text in different languages. From these candidate pairs, translation memories are generated in TMX format using the LibTagAligner library, which uses the HTML tags and the length of text chunks to perform the alignment.
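The pairing step described above can be illustrated with a minimal Python sketch. This is not Bitextor's actual code: the names fingerprint, pair_score and candidate_pairs, the scoring formula (tag-sequence similarity multiplied by a text-length ratio) and the 0.5 threshold are all assumptions chosen only to show how HTML tag structure and text length might be combined to find candidate page pairs.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Records the sequence of opening tags and the total length of visible text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_length = 0

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text_length += len(data.strip())


def fingerprint(html):
    """Return (list of opening tags, total text length) for one HTML page."""
    collector = TagAndTextCollector()
    collector.feed(html)
    collector.close()
    return collector.tags, collector.text_length


def pair_score(html_a, html_b):
    """Hypothetical score in [0, 1]: tag-sequence similarity times text-length ratio."""
    tags_a, len_a = fingerprint(html_a)
    tags_b, len_b = fingerprint(html_b)
    tag_similarity = SequenceMatcher(None, tags_a, tags_b).ratio()
    longest = max(len_a, len_b)
    length_ratio = min(len_a, len_b) / longest if longest else 1.0
    return tag_similarity * length_ratio


def candidate_pairs(pages_a, pages_b, threshold=0.5):
    """Compare every page in pages_a with every page in pages_b.

    Both arguments map a page name to its HTML. The threshold is an
    assumed value, not one taken from Bitextor. Returns a list of
    (name_a, name_b, score) tuples, best score first.
    """
    pairs = []
    for name_a, html_a in pages_a.items():
        for name_b, html_b in pages_b.items():
            score = pair_score(html_a, html_b)
            if score >= threshold:
                pairs.append((name_a, name_b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

Two pages with the same markup skeleton and similar amounts of text score close to 1, while pages with different layouts score near 0. A real system would also need language identification and a sentence-level aligner, which this sketch leaves out.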
See also
External links
- Bitext2tmx project web
- Bitextor project web or Bitextor's wiki (documentation)
- OmegaT+ project web
- TagAligner's project web or TagAligner's wiki (documentation)
- TMX on Wikipedia
- TMX 1.4b specification