Difference between revisions of "Lsx module report"
(Blanked the page) |
|||
Line 1: | Line 1: | ||
− | ==Project description== |
||
− | The purpose of this project is to allow Apertium language-pair developers to better translate "separable" or "discontiguous" multiwords. We do this by re-ordering word tokens before translation occurs. For example, "take something out" becomes "take out something" so that "take out" can be translated as a single unit. |
||
− | |||
− | To do this, we use a finite-state transducer. The transducer accepts certain patterns of words (paradigms), such as adj-noun or det-adj-noun, that can separate the parts of a multiword. When a pattern is accepted, the transducer outputs the re-ordered words, so that the multiword can be translated as a unit. |
||
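The re-ordering idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token representation (lemma, tag), the `SEPARABLE` and `ALLOWED_GAPS` tables, and the `reorder` function are all hypothetical, and the real module compiles an .lsx dictionary into a finite-state transducer rather than scanning lists like this.

```python
# Hypothetical illustration only; this is NOT the lsx-proc algorithm or its data format.

# Separable multiwords as (head, particle) pairs (assumed example data).
SEPARABLE = {("take", "out")}

# Tag patterns that are allowed to sit between the two parts (assumed example data).
ALLOWED_GAPS = {("n",), ("prn",), ("adj", "n"), ("det", "n"), ("det", "adj", "n")}


def reorder(tokens):
    """Make separable multiwords contiguous.

    tokens is a list of (lemma, tag) pairs. Returns a new list in which
    'head GAP particle' has been rewritten as 'head particle' followed by GAP.
    """
    out = []
    i = 0
    while i < len(tokens):
        lemma, tag = tokens[i]
        end = None  # index of the particle, if a pattern is accepted
        for head, particle in SEPARABLE:
            if lemma != head:
                continue
            for gap in ALLOWED_GAPS:
                j = i + 1 + len(gap)
                if j >= len(tokens) or tokens[j][0] != particle:
                    continue
                # Accept only if the tags between head and particle match the gap pattern.
                if tuple(t for _, t in tokens[i + 1:j]) == gap:
                    end = j
                    merged = (f"{head} {particle}", tag)
                    break
            if end is not None:
                break
        if end is None:
            out.append(tokens[i])
            i += 1
        else:
            out.append(merged)
            out.extend(tokens[i + 1:end])  # the words that separated the multiword
            i = end + 1
    return out
```

For instance, under these assumptions `reorder([("take", "vblex"), ("something", "prn"), ("out", "adv")])` yields a "take out" token followed by "something", mirroring the "take something out" example. A sequence whose intervening tags match no allowed pattern is left unchanged.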
− | |||
− | ==Work done== |
||
− | * established dictionary format |
||
− | ** <j/>, <t/>, <w/> are supported within pair entries and loop as expected |
||
− | ** see https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable/examples/apertium-eng-spa.eng-spa.lsx for an example dictionary |
||
− | |||
− | * implemented a compiler for the separable-words dictionary and a processor to process tagged input |
||
− | ** see https://svn.code.sf.net/p/apertium/svn/branches/apertium-separable/src/ |
||
− | |||
− | * all spacing, punctuation, and superblanks were preserved |
||
− | |||
− | * support for lexical units joined with "+" (the "plus thing"): |
||
− | <pre> |
||
− | echo "^абай<adj>$ ^бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$" | lsx-proc kaz-kir.autoseq.bin |
||
− | ^абай бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$ |
||
− | </pre> |
||
− | |||
− | * (for language developers: have the language-data writer write it explicitly in the .lsx file) |
||
− | |||
− | * For a full list of commits, see https://apertium.projectjj.com/gsoc2017/irene-tang.html |
||
− | * For further documentation and usage instructions, see [[Lsx_module]] |
||
− | |||
− | ==Future work== |
||
− | * 10:53 firespeaker: pektii: if we offload multiwords from the transducers to lsx, (1) how do we do N N compounds with lsx? (2) how does translation *to* a multiword work? |
||
− | * Recycling dictionaries and/or paradigms? lsx-dictionaries are packaged in language pairs. The eng-spa lsx-dictionary can mostly be reused by eng-cat. Could we make use of the similarity? |
||
− | * Support for language pairs: the module has not yet had extensive beta testing. The following language pairs have packaged the lsx-module: |
||
− | ** eng-cat |
||
− | ** eng-deu (?) |