Ideas for Google Summer of Code/Regular expressions in lt-tmxproc

From Apertium
Revision as of 17:51, 27 March 2011 by Jimregan (some blurb)

Gintrowicz and Jassem describe a way to get more reuse out of translation memories by extending them with regular expressions.

For example, the sample rule:

Rule 1:
<instance>([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})</instance>
<source>([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})</source>
<target>([0-9]{1,2})[\/]([0-9]{1,2})[\/]([0-9]{2,4})</target>
<orders>
<order sourceGroup="1" suffix="/" />
<order sourceGroup="2" suffix="/" />
<order sourceGroup="3" suffix="" />
</orders>

takes a Polish date (27.03.2011) and reformats it as an English date (27/03/2011).
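The effect of Rule 1 can be sketched in a few lines of Python. This is only an illustration of what the rule does, not code from lt-tmxproc or from the paper: the names SOURCE_PATTERN, ORDERS and apply_rule are made up for this sketch, and the interpretation of each <order> element as "emit this source group, then its suffix" is an assumption based on the example above.

```python
import re

# Source pattern copied from the <source> element of Rule 1:
# day, month and year separated by dots.
SOURCE_PATTERN = re.compile(r"([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})")

# The <orders> list of Rule 1 as (sourceGroup, suffix) pairs.
# Assumption: each entry means "emit this group, then this suffix".
ORDERS = [(1, "/"), (2, "/"), (3, "")]


def apply_rule(text, pattern=SOURCE_PATTERN, orders=ORDERS):
    """Rewrite every match of the source pattern according to the orders."""
    def rewrite(match):
        return "".join(match.group(group) + suffix for group, suffix in orders)
    return pattern.sub(rewrite, text)


print(apply_rule("27.03.2011"))  # prints 27/03/2011
```

Keeping the orders as data rather than as a replacement string mirrors the rule format, where groups could in principle be emitted in a different order than they were matched.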

lttoolbox has support for simple regexes; lt-tmxproc builds on lttoolbox to build a finite state transducer from TMX files. At present, it offers a limited form of this generalisation for numbers only: the special symbol <n> is inserted in place of each number in the transducer, and at runtime, when this symbol is encountered, the number is copied straight from input to output.
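The <n> mechanism described above can be approximated as follows. This is a rough sketch under several assumptions, not lt-tmxproc's implementation: a plain dictionary stands in for the finite state transducer, whole segments are matched rather than substrings, a number is assumed to be a run of ASCII digits, and numbers are assumed to appear in the same order in source and target. The function names and the sample segment pair are invented for the illustration.

```python
import re

NUMBER = re.compile(r"[0-9]+")  # assumption: a number is a run of digits
PLACEHOLDER = "<n>"


def generalise(segment):
    """Replace every number in a segment with the placeholder symbol."""
    return NUMBER.sub(PLACEHOLDER, segment)


def build_memory(pairs):
    """Map generalised source segments to generalised target segments.

    A dict stands in here for the transducer built from the TMX file.
    """
    return {generalise(source): generalise(target) for source, target in pairs}


def translate(memory, text):
    """Look up text with its numbers masked, then copy the numbers back."""
    template = memory.get(generalise(text))
    if template is None:
        return None  # no matching translation unit
    numbers = iter(NUMBER.findall(text))
    # Fill placeholders left to right with the input's numbers; if the
    # target has more placeholders than the input has numbers, the extra
    # placeholders are left in place.
    return re.sub(re.escape(PLACEHOLDER),
                  lambda _match: next(numbers, PLACEHOLDER),
                  template)


# Made-up translation unit, for illustration only.
memory = build_memory([("page 3 of 10", "strona 3 z 10")])
print(translate(memory, "page 7 of 12"))  # prints strona 7 z 12
```

A single stored unit thus covers every input that differs from it only in its numbers, which is the kind of reuse that regular expression rules would extend to other patterns.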

The idea of this project is to extend lt-tmxproc so that it can make use of the regular expression support in lttoolbox.