Difference between revisions of "Ideas for Google Summer of Code/Regular expressions in lt-tmxproc"
Revision as of 18:02, 27 March 2011
Gintrowicz and Jassem describe an idea for getting more reuse from translation memories, by extending them with regular expressions.
For example, the sample rule:
Rule 1:
 <instance>([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})</instance>
 <source>([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})</source>
 <target>([0-9]{1,2})[\/]([0-9]{1,2})[\/]([0-9]{2,4})</target>
 <orders>
   <order sourceGroup="1" suffix="/" />
   <order sourceGroup="2" suffix="/" />
   <order sourceGroup="3" suffix="" />
 </orders>
takes a Polish date (27.03.2011) and reformats it as an English date (27/03/2011).
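To make the effect of the rule concrete, here is a minimal Python sketch. It assumes that each <order> element means "emit the text captured by sourceGroup, followed by suffix"; that reading is inferred from the example, not taken from the Gintrowicz/Jassem specification. The names SOURCE, ORDERS and apply_rule are hypothetical.

```python
import re

# Source-side pattern copied from Rule 1: day, month and year separated by dots.
SOURCE = re.compile(r"([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})")

# The <orders> element read as (sourceGroup, suffix) pairs.
# Assumption: groups are emitted in this order, each followed by its suffix.
ORDERS = [(1, "/"), (2, "/"), (3, "")]


def apply_rule(text):
    """Rewrite every date matched by SOURCE according to ORDERS."""
    def rebuild(match):
        return "".join(match.group(group) + suffix for group, suffix in ORDERS)
    return SOURCE.sub(rebuild, text)
```

Calling apply_rule("27.03.2011") rebuilds the date from its three captured groups, with "/" in place of each dot; text around the date is left unchanged.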
lttoolbox has support for simple regexes; lt-tmxproc builds on lttoolbox to build a finite-state transducer from TMX files. At present, it includes similar support for simple numbers, by inserting the special symbol <n> in place of the number in the transducer; at runtime, when this symbol is encountered, numbers are copied straight from input to output.
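The <n> mechanism can be illustrated with a small Python sketch. This is not how lt-tmxproc is implemented: a plain dictionary stands in for the finite-state transducer, and the sketch assumes that numbers appear in the same order in source and target. All function names are hypothetical.

```python
import re

NUMBER = re.compile(r"[0-9]+")


def generalise(segment):
    """Replace every number in a segment with the <n> placeholder."""
    return NUMBER.sub("<n>", segment)


def build_memory(pairs):
    """'Compile' (source, target) pairs into a lookup keyed on generalised source.

    A dict stands in for the transducer here; that is a simplification.
    """
    return {generalise(src): generalise(tgt) for src, tgt in pairs}


def translate(memory, text):
    """Look up text; on a hit, copy its numbers into the target's <n> slots."""
    target = memory.get(generalise(text))
    if target is None:
        return None
    numbers = iter(NUMBER.findall(text))
    # Assumption: numbers keep their order; any surplus <n> is left as-is.
    return re.sub("<n>", lambda m: next(numbers, m.group(0)), target)
```

A memory built from the single pair ("Mam 3 koty", "I have 3 cats") then also matches "Mam 15 koty", and the 15 is copied into the English side.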
The idea of this project is to extend lt-tmxproc to include the regular expressions support in lttoolbox.
Because regular expressions are already used in Apertium's dictionaries, it would be desirable to reuse the existing dictionary format, so that regular expressions can be shared between the translator and translation memories. Instead of rules in the Gintrowicz/Jassem format, we would use more complex dictionary entries: the changes between source and target can be given as <p> elements, or even as a <pardef>.
The idea is to extend the TMX compiler to take a precompiled set of regex entries. For each <tu>, check whether any of the regex transducers match the text; if one does, insert the relevant transducer. This would probably only be feasible if entries in the regex "dictionary" are treated like a <section>: each entry gets its own transducer, and these are unified at runtime (see fst_processor.cc in lttoolbox).
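The compile-time check can be sketched in Python as follows. This is only an illustration of the proposed flow, under several assumptions: REGEX_ENTRIES is a hypothetical stand-in for the precompiled regex "dictionary", Python's re module stands in for the regex transducers, and a named marker such as <date> stands in for the inserted transducer.

```python
import re

# Hypothetical regex "dictionary": entry name -> (source pattern, target template).
# Each entry is kept separate, mirroring the one-transducer-per-entry idea.
REGEX_ENTRIES = {
    "date": (
        re.compile(r"([0-9]{1,2})\.([0-9]{1,2})\.([0-9]{2,4})"),
        r"\1/\2/\3",
    ),
}


def compile_tu(source, target):
    """Generalise one translation unit.

    If an entry matches the source and its rendered form occurs in the
    target, both spans are replaced by a marker naming the entry.
    """
    for name, (pattern, template) in REGEX_ENTRIES.items():
        match = pattern.search(source)
        if match is None:
            continue
        rendered = match.expand(template)
        if rendered not in target:
            continue  # the target does not follow this entry's rewrite
        marker = "<" + name + ">"
        source = source.replace(match.group(0), marker, 1)
        target = target.replace(rendered, marker, 1)
    return source, target
```

Units with no matching entry pass through unchanged, so ordinary translation memory segments are unaffected.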