Ideas for Google Summer of Code/Regular expressions in lt-tmxproc

Gintrowicz and Jassem describe an idea for getting more reuse from translation memories, by extending them with regular expressions.

For example, the sample rule:

Rule 1:
<instance>([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})</instance>
<source>([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})</source>
<target>([0-9]{1,2})[\/]([0-9]{1,2})[\/]([0-9]{2,4})</target>
<orders>
  <order sourceGroup="1" suffix="/" />
  <order sourceGroup="2" suffix="/" />
  <order sourceGroup="3" suffix="" />
</orders>

takes a Polish date (27.03.2011) and reformats it as an English date (27/03/2011).
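
To make the effect of the rule concrete, here is a minimal Python sketch of how such a rule could be applied. It is not the Gintrowicz/Jassem implementation: the names `SOURCE`, `ORDERS` and `apply_rule` are made up for illustration, and the rule is hard-coded rather than parsed from XML.

```python
import re

# Source pattern copied from the <source> element of Rule 1.
SOURCE = re.compile(r"([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})")

# (sourceGroup, suffix) pairs copied from the <orders> element.
ORDERS = [(1, "/"), (2, "/"), (3, "")]


def apply_rule(text):
    """Rewrite every match of SOURCE by emitting each group plus its suffix."""
    def rebuild(match):
        return "".join(match.group(group) + suffix for group, suffix in ORDERS)
    return SOURCE.sub(rebuild, text)


print(apply_rule("27.03.2011"))  # prints 27/03/2011
```

Text that does not match the source pattern is passed through unchanged.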

lttoolbox has support for simple regexes; lt-tmxproc builds on lttoolbox to build a finite-state transducer from TMX files. At present, it includes similar support for numbers only: the special symbol <n> is inserted in place of each number in the transducer, and at runtime, when this symbol is encountered, the number is copied straight from input to output.
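
The <n> mechanism can be modelled roughly as follows. This is only a sketch of the behaviour described above, using a Python dictionary in place of a transducer; the function names and the sample segment pair are invented for illustration, and it assumes numbers appear in the same order on both sides.

```python
import re

NUMBER = re.compile(r"[0-9]+")


def generalise(segment):
    """Replace each number with <n>; return the template and the numbers."""
    numbers = NUMBER.findall(segment)
    return NUMBER.sub("<n>", segment), numbers


def build_memory(pairs):
    """Stand-in for compilation: store generalised source -> target pairs."""
    memory = {}
    for source, target in pairs:
        key, _ = generalise(source)
        template, _ = generalise(target)
        memory[key] = template
    return memory


def translate(segment, memory):
    """Look up the generalised segment and copy its numbers to the output."""
    key, numbers = generalise(segment)
    template = memory.get(key)
    if template is None:
        return None
    remaining = iter(numbers)
    return re.sub("<n>", lambda match: next(remaining, ""), template)


memory = build_memory([("Strona 12 z 40", "Page 12 of 40")])
print(translate("Strona 3 z 7", memory))  # prints Page 3 of 7
```

One stored segment pair thus covers every input that differs from it only in its numbers, which is the kind of reuse the regular expression rules aim to generalise.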

The idea of this project is to extend lt-tmxproc to use the regular expression support in lttoolbox.

Because regular expressions are already used in Apertium's dictionaries, it would be desirable to reuse the existing dictionary format, so that regular expressions can be shared between the translator and translation memories. Instead of rules in the Gintrowicz/Jassem format, we would use more complex dictionary entries: the changes between source and target can be given as <p> elements, or even as a <pardef>.

The idea is to extend the TMX compiler to take a precompiled set of regex entries. For each <tu>, check whether the regex transducers match the text; if they do, insert the relevant transducer. This would probably only be feasible if entries in the regex "dictionary" are treated like <section>: each entry gets its own transducer, and the transducers are unified at runtime (see fst_processor.cc in lttoolbox).
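
The compile-time check and runtime lookup could look something like the sketch below. It is a rough model only: Python regexes and a dictionary stand in for lttoolbox transducers, and `REGEX_ENTRIES`, `compile_tu`, `lookup` and the `{date}` slot notation are all hypothetical, not part of lt-tmxproc or lttoolbox.

```python
import re

# Hypothetical regex "dictionary": entry name -> (source pattern, target template).
REGEX_ENTRIES = {
    "date": (r"([0-9]{1,2})\.([0-9]{1,2})\.([0-9]{2,4})", r"\1/\2/\3"),
}


def compile_tu(source, target, entries=REGEX_ENTRIES):
    """Replace the first regex match in a translation unit with a named slot."""
    for name, (pattern, template) in entries.items():
        match = re.search(pattern, source)
        if match is None:
            continue
        expected = match.expand(template)
        if expected not in target:
            continue  # the target side does not contain the rewritten form
        slot = "{" + name + "}"
        new_source = source[:match.start()] + slot + source[match.end():]
        return new_source, target.replace(expected, slot, 1)
    return source, target


def lookup(segment, memory, entries=REGEX_ENTRIES):
    """Try an exact match first, then each regex entry in turn."""
    if segment in memory:
        return memory[segment]
    for name, (pattern, template) in entries.items():
        match = re.search(pattern, segment)
        if match is None:
            continue
        slot = "{" + name + "}"
        key = segment[:match.start()] + slot + segment[match.end():]
        if key in memory:
            return memory[key].replace(slot, match.expand(template), 1)
    return None


memory = dict([compile_tu("Spotkanie 27.03.2011", "Meeting on 27/03/2011")])
print(lookup("Spotkanie 01.12.2012", memory))  # prints Meeting on 01/12/2012
```

Keeping each entry separate, as in the loop over `entries`, mirrors the suggestion that every regex entry gets its own transducer; a real implementation would run those transducers inside the FST processor rather than re-scanning the segment once per entry.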