Ideas for Google Summer of Code/Regular expressions in lt-tmxproc

[[Category:Ideas_for_Google_Summer_of_Code]]

Latest revision as of 19:55, 24 March 2020

Gintrowicz and Jassem describe an idea for getting more reuse from translation memories, by extending them with regular expressions.

For example, the sample rule:

Rule 1:
<instance>([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})</instance>
<source>([0-9]{1,2})[\.]([0-9]{1,2})[\.]([0-9]{2,4})</source>
<target>([0-9]{1,2})[\/]([0-9]{1,2})[\/]([0-9]{2,4})</target>
<orders>
<order sourceGroup="1" suffix="/" />
<order sourceGroup="2" suffix="/" />
<order sourceGroup="3" suffix="" />
</orders>

takes a Polish date (27.03.2011) and reformats it as an English date (27/03/2011).
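As a rough illustration of what this rule does, here is a minimal Python sketch. It uses Python's `re` module rather than the authors' rule engine, and the names `SOURCE`, `ORDERS` and `apply_rule` are invented for this example; the `<orders>` block is read as "emit each source group followed by its suffix".

```python
import re

# Source pattern from Rule 1: day.month.year separated by literal dots.
SOURCE = re.compile(r"([0-9]{1,2})[.]([0-9]{1,2})[.]([0-9]{2,4})")

# The <orders> block as (sourceGroup, suffix) pairs, in output order.
# This reading of <orders> is an assumption made for the sketch.
ORDERS = [(1, "/"), (2, "/"), (3, "")]


def apply_rule(text):
    """Rewrite every match of SOURCE using the group order and suffixes."""
    def build(match):
        return "".join(match.group(group) + suffix for group, suffix in ORDERS)
    return SOURCE.sub(build, text)


print(apply_rule("Spotkanie 27.03.2011 o 10:00"))
```

Text that does not match the source pattern is left unchanged, so only the date is rewritten.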

lttoolbox has support for simple regexes; lt-tmxproc builds on lttoolbox to construct a finite-state transducer from TMX files. At present, lt-tmxproc has similar support for simple numbers: it inserts the special symbol <n> in place of each number in the transducer, and at runtime, when this symbol is encountered, numbers are copied straight from input to output.
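The number handling can be pictured with the following Python sketch. It is not lt-tmxproc's code: a plain dictionary stands in for the transducer, the function names are hypothetical, and it assumes numbers appear in the same order on both sides of a translation unit.

```python
import re

NUMBER = re.compile(r"[0-9]+")
PLACEHOLDER = "<n>"


def generalise(segment):
    """Replace each number with the placeholder; return (template, numbers)."""
    numbers = NUMBER.findall(segment)
    return NUMBER.sub(PLACEHOLDER, segment), numbers


def build_memory(pairs):
    """'Compile' translation units into placeholder templates (dict stand-in)."""
    memory = {}
    for source, target in pairs:
        memory[generalise(source)[0]] = generalise(target)[0]
    return memory


def translate(memory, segment):
    """Look up the generalised segment and copy its numbers into the target."""
    template, numbers = generalise(segment)
    target = memory.get(template)
    if target is None or target.count(PLACEHOLDER) != len(numbers):
        return None  # no usable translation unit
    remaining = iter(numbers)
    # Each placeholder in the target receives the next number from the input.
    return re.sub(re.escape(PLACEHOLDER), lambda _: next(remaining), target)


memory = build_memory([("Strona 12", "Page 12")])
print(translate(memory, "Strona 47"))
```

A single stored unit then covers any input that differs from it only in its numbers.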

The idea of this project is to extend lt-tmxproc so that it can use the regular expression support already present in lttoolbox.

Because regular expressions are already used in Apertium's dictionaries, it would be desirable to reuse the existing dictionary format, so that regular expressions can be shared between the translator and translation memories. Instead of rules in the Gintrowicz/Jassem format, we would use more elaborate dictionary entries: the changes between source and target can be given as <p> elements, or even as <pardef> paradigms.

The above could be expressed in an Apertium dictionary as:

<e>
  <re>[0-9]?[0-9]</re>
  <p><l>.</l><r>/</r></p>
  <re>[0-9]?[0-9]</re>
  <p><l>.</l><r>/</r></p>
  <re>([0-9][0-9])?[0-9][0-9]</re>
</e>

The idea is to extend the TMX compiler to take a precompiled set of regex entries. For each <tu>, the compiler checks whether any of the regex transducers match the text; if one does, the relevant transducer is inserted at that point. This would probably only be feasible if entries in the regex "dictionary" are treated like <section> elements: each entry gets its own transducer, and these are unified at runtime (see fst_processor.cc in lttoolbox).
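The compile step described above might look roughly like the Python sketch below. It is only an illustration under several assumptions: Python regexes stand in for the precompiled regex transducers, and `RegexEntry`, `DATE` and `compile_tu` are hypothetical names, not part of lttoolbox or lt-tmxproc. The source side of a translation unit is split into literal text and references to matching entries, so each entry stays a separate object that could be joined in at runtime.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexEntry:
    """Stand-in for one precompiled regex 'dictionary' entry."""
    name: str
    source: str  # regex matching the source side
    target: str  # replacement template for the target side


DATE = RegexEntry("date", r"([0-9]{1,2})\.([0-9]{1,2})\.([0-9]{2,4})", r"\1/\2/\3")


def compile_tu(source_text, entries):
    """Split a <tu> source segment into literal strings and entry references."""
    pieces = []
    pos = 0
    while pos < len(source_text):
        best = None  # earliest non-empty match among all entries
        for entry in entries:
            match = re.compile(entry.source).search(source_text, pos)
            if match and match.end() > match.start():
                if best is None or match.start() < best[1].start():
                    best = (entry, match)
        if best is None:
            pieces.append(source_text[pos:])
            break
        entry, match = best
        if match.start() > pos:
            pieces.append(source_text[pos:match.start()])
        pieces.append(entry)  # a reference to the entry, not a copy of it
        pos = match.end()
    return pieces


print(compile_tu("Termin: 27.03.2011.", [DATE]))
```

Storing a reference rather than expanding the regex into every translation unit mirrors the suggestion that each entry keeps its own transducer.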