Difference between revisions of "Ideas for Google Summer of Code/Regular expressions in lt-tmxproc"

Revision as of 18:02, 27 March 2011

Gintrowicz and Jassem describe a way to get more reuse out of translation memories by extending them with regular expressions.

For example, the sample rule:

Rule 1:
<instance>([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})</instance>
<source>([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})</source>
<target>([0-9]{1,2})[\/]([0-9]{1,2})[\/]([0-9]{2,4})</target>
<orders>
  <order sourceGroup="1" suffix="/" />
  <order sourceGroup="2" suffix="/" />
  <order sourceGroup="3" suffix="" />
</orders>

takes a Polish date (27.03.2011) and reformats it as an English date (27/03/2011).
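
As a rough illustration of what Rule 1 does, the Python sketch below applies the source pattern and rebuilds the output from the (sourceGroup, suffix) pairs of the <order> elements. The helper name apply_rule and the data layout are assumptions made for this example; they are not part of the Gintrowicz/Jassem format or of lt-tmxproc.

```python
import re

# Source pattern from Rule 1: day.month.year
SOURCE = re.compile(r"([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})")

# (sourceGroup, suffix) pairs, mirroring the <order> elements of Rule 1
ORDERS = [(1, "/"), (2, "/"), (3, "")]


def apply_rule(text):
    """Rewrite every match of SOURCE using ORDERS (hypothetical helper)."""
    def rebuild(match):
        # Emit each captured group followed by its suffix, in rule order.
        return "".join(match.group(group) + suffix for group, suffix in ORDERS)
    return SOURCE.sub(rebuild, text)


print(apply_rule("Spotkanie 27.03.2011"))
```

Text that contains no date passes through unchanged, which is the behaviour one would want from a rule that only generalises part of a segment.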

lttoolbox supports simple regular expressions. lt-tmxproc builds on lttoolbox to compile TMX files into a finite-state transducer. At present it offers comparable support for simple numbers only: the special symbol <n> is inserted into the transducer in place of the number, and at runtime, when this symbol is encountered, the number is copied straight from input to output.
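
The <n> mechanism can be imitated at string level. The sketch below is only an emulation under stated assumptions: lt-tmxproc does this inside a finite-state transducer rather than with string substitution, the pattern [0-9]+ is an assumed definition of "number", and the names abstract_numbers, translate and memory are hypothetical.

```python
import re

# Assumption: a "number" is a run of ASCII digits.
NUMBER = re.compile(r"[0-9]+")


def abstract_numbers(segment):
    """Replace each number with <n>; return the template and the numbers."""
    numbers = NUMBER.findall(segment)
    return NUMBER.sub("<n>", segment), numbers


def translate(segment, memory):
    """Look up the abstracted segment and copy the input numbers to the output."""
    template, numbers = abstract_numbers(segment)
    target = memory.get(template)
    if target is None:
        return None  # no translation unit matches this segment
    remaining = iter(numbers)
    # Fill each <n> in the target with the next number from the input;
    # leave the placeholder in place if the input runs out of numbers.
    return re.sub(r"<n>", lambda m: next(remaining, m.group(0)), target)


# A one-entry translation memory built from "Page 3 of 10" -> "Strona 3 z 10".
memory = {"Page <n> of <n>": "Strona <n> z <n>"}
print(translate("Page 7 of 12", memory))
```

One stored unit then covers every segment that differs from it only in its numbers, which is the kind of reuse the project aims to extend to arbitrary regular expressions.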

The idea of this project is to extend lt-tmxproc so that it can use the regular expression support already present in lttoolbox.

Because regular expressions are already used in Apertium's dictionaries, it would be desirable to reuse the existing dictionary format, so that the same regular expressions can be shared between the translator and translation memories. Instead of rules in the Gintrowicz/Jassem format, we would use more complex dictionary entries: the changes between source and target can be given as <p> elements, or even as <pardef> elements.

The idea is to extend the TMX compiler to take a precompiled set of regex entries. For each <tu>, the compiler checks whether any of the regex transducers match the text and, if one does, inserts the relevant transducer. This would probably only be feasible if entries in the regex "dictionary" are treated like <section> elements: each entry gets its own transducer, and the transducers are unified at runtime (see fst_processor.cc in lttoolbox).
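
The compile-time check described above might look roughly like the following string-level sketch. Everything here is an assumption for illustration: the REGEX_ENTRIES table, the compile_tu helper and the <date> marker are hypothetical, and a real implementation would splice transducers together rather than substitute text.

```python
import re

# Hypothetical regex "dictionary": entry name -> (source pattern, target template)
REGEX_ENTRIES = {
    "date": (
        re.compile(r"([0-9]{1,2})\.([0-9]{1,2})\.([0-9]{2,4})"),
        r"\1/\2/\3",
    ),
}


def compile_tu(source, target):
    """Generalise one translation unit by replacing literal matches with markers."""
    for name, (pattern, template) in REGEX_ENTRIES.items():
        match = pattern.search(source)
        if match is None:
            continue  # this entry does not apply to the unit
        expected = match.expand(template)
        # Only generalise when the target contains the rewritten form too.
        if expected in target:
            marker = "<" + name + ">"
            source = source.replace(match.group(0), marker, 1)
            target = target.replace(expected, marker, 1)
    return source, target


print(compile_tu("Termin: 27.03.2011", "Deadline: 27/03/2011"))
```

Checking that the rewritten form also occurs in the target keeps the sketch from generalising units whose translation does not follow the entry, for example a date that the translator spelled out in words.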

@@ Line 17: / Line 17: @@
 takes a Polish date (27.03.2011) and reformats it as an English date (27/03/2011).
-<code>lttoolbox</code> has support for simple regexes; <code>lt-tmxproc</code> builds on lttoolbox, to build a finite state transducer from TMX files. At present, it includes similar support for similar numbers, by inserting the special symbol &lt;n&gt; in place of the number in the transducer; at runtime, when this symbol is encountered, numbers are copied straight from input to output.
+<code>lttoolbox</code> has support for simple regexes; <code>lt-tmxproc</code> builds on lttoolbox, to build a finite state transducer from TMX files. At present, it includes similar support for simple numbers, by inserting the special symbol &lt;n&gt; in place of the number in the transducer; at runtime, when this symbol is encountered, numbers are copied straight from input to output.
 The idea of this project is to extend lt-tmxproc to include the regular expressions support in lttoolbox.
