User-based chunking

From Apertium

Latest revision as of 21:15, 21 June 2020

Sometimes it would be useful to be able to mark particular segments (chunks) of a text as not for translation, while still giving them an analysis that the rest of the translation pipeline can use.

Some use-cases:

  • Product names, long named entities
    • Microsoft 365 for business is the right choice for your company
    • A very popular film adaptation is "Gone with the Wind".
  • The use of quoted chunks as modifiers
  • Quotations in a second (or third) language
    • And then he came back and said "Bu ne lan?"

Limitations

  • The words in the chunk should be contiguous.

Solutions

User-based, online

1. A user could mark in an interface that a particular span should not be translated; each word in the span would then be prefixed with a symbol (+ in the example below). Sequences of marked words would be merged by apertium-separable or something like it, and the whole chunk would be given either a default tag, a tag determined by a classifier, or a tag determined by rules.

Microsoft 365 for business is the right choice for your company

+Microsoft +365 +for +business is the right choice for your company

after lt-proc | apertium-tagger:
^+Microsoft/Microsoft<np><al>$ ^+365/365<num>$ ^+for/for<pr>$ ^+business/business<n>$ ^is/be<vbser><pres><p3><sg>$ ^the/the<det><def><sp>$ right choice for your company

after apertium-separable:
^+Microsoft 365 for business/Microsoft 365 for business<np><al>$ ^is/be<vbser><pres><p3><sg>$ ^the/the<det><def><sp>$ right choice for your company.

after bidix lookup:
^+Microsoft 365 for business/Microsoft 365 for business<np><al>/Microsoft 365 for business<np><al>$ ^is/be<vbser><pres><p3><sg>/ser<vbser><pres><p3><sg>$ ^the/the<det><def><sp>/el<det><def><GD><ND>$ right choice for your company.

after apertium-transfer:
^+Microsoft 365 for business<np><al>$ ^ser<vbser><pri><p3><sg>$ ^el<det><def><f><sg>$ ... 
 
Microsoft 365 for business és la ...
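
The marking and merging steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `mark_span` and `merge_marked` are hypothetical, the stream format is the simplified `^surface/analysis$` form used in the example, the default tag `<np><al>` is taken from the example, and the merging is a stand-in for whatever apertium-separable (or a similar module) would actually do.

```python
import re

MARK = "+"  # marker symbol, as in the example above

def mark_span(tokens, start, end):
    """Prefix tokens[start:end] with MARK (the span is contiguous by construction)."""
    return [MARK + t if start <= i < end else t for i, t in enumerate(tokens)]

# One lexical unit in the simplified stream format: ^surface/analysis$
LU = re.compile(r"\^([^/$]*)/([^$]*)\$")

def merge_marked(stream, default_tags="<np><al>"):
    """Merge runs of adjacent marked lexical units into one chunk with a default tag."""
    out = []   # pieces of the output stream
    run = []   # surface forms of the current run of marked units
    pos = 0    # end of the previous lexical unit

    def flush():
        if run:
            text = " ".join(run)
            out.append(f"^{MARK}{text}/{text}{default_tags}$")
            run.clear()

    for m in LU.finditer(stream):
        gap = stream[pos:m.start()]   # text between lexical units
        pos = m.end()
        surface = m.group(1)
        if surface.startswith(MARK):
            if run and gap.strip():
                # Non-blank text between marked units ends the current run.
                flush()
                out.append(gap)
            elif not run:
                out.append(gap)
            run.append(surface[len(MARK):])
        else:
            flush()
            out.append(gap)
            out.append(m.group(0))
    flush()
    out.append(stream[pos:])  # keep any trailing unanalysed text
    return "".join(out)
```

For instance, `mark_span("Microsoft 365 for business is the right choice for your company".split(), 0, 4)` marks the first four words, and `merge_marked` applied to the analysed stream collapses the four marked units into a single `^+Microsoft 365 for business/...$` unit. A classifier or rule set could replace the `default_tags` argument.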

Automatic

For quoted text it might be better to have an automatic system that combines language identification with rules. The same technique could be used: words that receive an analysis would have their surface form prefixed with + after running through the module.
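
As a rough sketch of what such a module might look like, the Python below finds double-quoted spans and marks every word in a span that a rule judges to be in another language. Everything here is an assumption made for illustration: the names `looks_foreign` and `mark_quoted` are hypothetical, and the "language identifier" is a toy rule (no common English function words in the span) standing in for a real one.

```python
import re

QUOTED = re.compile(r'"([^"]+)"')

# Toy stand-in for language identification: a few very common English words.
ENGLISH_HINTS = {"the", "and", "of", "is", "with", "a", "to", "in"}

def looks_foreign(text, hints=ENGLISH_HINTS):
    """Rule: a span is 'foreign' if it contains none of the hint words."""
    words = re.findall(r"\w+", text.lower())
    return bool(words) and not any(w in hints for w in words)

def mark_quoted(sentence, is_foreign=looks_foreign, mark="+"):
    """Prefix each word of a foreign-looking quoted span with the marker."""
    def repl(m):
        inner = m.group(1)
        if not is_foreign(inner):
            return m.group(0)  # leave the quotation untouched
        return '"' + " ".join(mark + w for w in inner.split()) + '"'
    return QUOTED.sub(repl, sentence)

print(mark_quoted('And then he came back and said "Bu ne lan?"'))
```

With this toy rule the Turkish quotation is marked, while a quoted English title such as "Gone with the Wind" is left alone; titles of that kind fall under the named-entity use-case and would need a different rule.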


Engine-based, offline

  • It should be possible to discover chunks of this kind using only monolingual corpora, or resources such as Wikipedia.
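
One very simple way to approach this, offered only as a sketch, is to count multiword quoted spans that recur in a monolingual corpus and treat the frequent ones as candidate chunks. The function name `candidate_chunks` and the thresholds `min_count` and `min_words` are hypothetical, and a real system would likely need richer evidence than quotation marks alone.

```python
import re
from collections import Counter

QUOTED = re.compile(r'"([^"]+)"')

def candidate_chunks(lines, min_count=2, min_words=2):
    """Return (span, count) pairs for quoted multiword spans seen at least min_count times."""
    counts = Counter()
    for line in lines:
        for span in QUOTED.findall(line):
            span = span.strip()
            if len(span.split()) >= min_words:
                counts[span] += 1
    return [(s, n) for s, n in counts.most_common() if n >= min_count]

corpus = [
    'She read "Gone with the Wind" twice.',
    'The film "Gone with the Wind" won awards.',
    'He said "hello" quietly.',
]
print(candidate_chunks(corpus))
```

The resulting list could then be reviewed by hand or fed to the marking step as chunks that should not be translated word by word.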