Difference between revisions of "Apertium separable/report2017"

From Apertium
(Created page with "==Project description== The purpose of this project is to allow Apertium language-pair developers to better translate "seperable" or "discontiguous" multiwords. We do this by ...")
 
m (Irene moved page Lsx module/report2017 to Apertium separable/report2017: Rename page)
 
(4 intermediate revisions by 2 users not shown)
Line 13: Line 13:
   
 
* all spacing, punctuation, and superblanks were preserved
 
 
* support for the "plus thing" (analyses joined with "+" inside a single lexical unit):
 
<pre>
 
echo "^абай<adj>$ ^бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$" | lsx-proc kaz-kir.autoseq.bin
 
^абай бол<v><iv><neg><aor><p2><sg>+ма<qst>$^.<sent>$
 
</pre>
 
   
 
* (for language developers: have the language-data writer write it explicitly in the .lsx file)
 
   
* For a full list of commits, see https://apertium.projectjj.com/gsoc2017/irene-tang.html
+
* For a '''full list of commits''', see https://apertium.projectjj.com/gsoc2017/irene-tang.html
 
* For further documentation and usage instructions, see [[Lsx_module]]
   
 
==Future work==
 
  +
See [[Lsx_module#Future_work]].
* 10:53 firespeaker: pektii: if we offload multiwords from the transducers to lsx, (1) how do we do N N compounds with lsx? (2) how does translation *to* a multiword work?
 
* Recycling dictionaries and/or paradigms? lsx-dictionaries are packaged in language pairs, yet most of the eng-spa lsx-dictionary could be reused by eng-cat. Could we make use of the similarity?
 
* Support for language pairs: the module has not yet had extensive beta testing. The following language pairs have packaged the lsx-module:
 
** eng-cat
 
** eng-deu (?)
 

Latest revision as of 18:36, 15 November 2017

Project description

The purpose of this project is to allow Apertium language-pair developers to better translate "separable" or "discontiguous" multiwords. We do this by re-ordering word tokens before translation occurs. For example, "take something out" becomes "take out something" so that "take out" can be translated as a single unit.

To do this, a finite-state transducer was used. The transducer accepted certain patterns of words (paradigms), such as adj-noun or det-adj-noun, that could come between the parts of a multiword. If the pattern was accepted, the transducer output the re-ordered words so that the multiword could be translated as a unit.
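
The re-ordering idea can be illustrated with a small Python sketch. This is not the module's implementation: the real tool is a compiled finite-state transducer that works on the Apertium stream format, whereas this sketch works on plain (surface form, tag) pairs. The names `GAP_PATTERNS` and `reorder` and the tag inventory are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Set, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)

# Hypothetical gap patterns: tag sequences allowed between the two
# parts of a separable multiword (for example det-adj-noun).
GAP_PATTERNS: Set[Tuple[str, ...]] = {
    ("n",),
    ("prn",),
    ("adj", "n"),
    ("det", "n"),
    ("det", "adj", "n"),
}


def reorder(tokens: Sequence[Token], head: str, particle: str) -> List[Token]:
    """Move `particle` directly after `head` when an accepted pattern separates them."""
    out = list(tokens)
    i = 0
    while i < len(out):
        if out[i][0] == head:
            # Try every gap length that occurs in the pattern set.
            for gap_len in sorted({len(p) for p in GAP_PATTERNS}):
                j = i + 1 + gap_len
                if j < len(out) and out[j][0] == particle:
                    gap_tags = tuple(tag for _, tag in out[i + 1:j])
                    if gap_tags in GAP_PATTERNS:
                        # Pull the particle back so it follows the head.
                        out.insert(i + 1, out.pop(j))
                        i += 1  # skip over the particle we just moved
                        break
        i += 1
    return out


sentence = [("take", "vblex"), ("something", "prn"), ("out", "adv")]
print(" ".join(form for form, _ in reorder(sentence, "take", "out")))
# prints: take out something
```

If the words between the two parts do not match any accepted pattern, the sketch leaves the sentence unchanged, which mirrors the report's description of the transducer acting only on accepted patterns.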

Work done

  • all spacing, punctuation, and superblanks were preserved
  • (for language developers: have the language-data writer write it explicitly in the .lsx file)

Future work

See Lsx_module#Future_work.