User-based chunking

From Apertium
Revision as of 21:01, 21 June 2020 by Francis Tyers

When translating, it is sometimes useful to be able to mark particular segments (chunks) as not to be translated, while still giving them an analysis for the purposes of translating the rest of the sentence.

Some use-cases:

  • Product names, long named entities
    • Microsoft 365 for business is the right choice for your company
    • A very popular film adaptation is "Gone with the Wind".
  • The use of quoted chunks as modifiers
  • Quotations in a second (or third) language
    • And then he came back and said "Bu ne lan?"

Solutions

User-based, online

1. A user could mark in an interface that they don't want a particular span translated. Each word in the span would then be prefixed with a symbol. Sequences of these prefixed words would be merged by -separable or something like it, and the whole chunk would be given either a default tag, a tag determined by a classifier, or a tag determined by rules.

Microsoft 365 for business is the right choice for your company

+Microsoft +365 +for +business is the right choice for your company

^+Microsoft/Microsoft<np><al>$ ^+365/365<num>$ ^+for/for<pr>$ ^+business/business$ ^is/be<vbser><pres><p3><sg>$ the right choice for your company

^+Microsoft 365 for business/Microsoft 365 for business<np><al>$ ^is/be<vbser><pres><p3><sg>$ the right choice for your company.

...
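
The marking and merging steps in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Apertium or -separable actually does it: the function names `mark_span` and `merge_marked` are hypothetical, the `+` marker and the default `<np><al>` tag are taken from the example, and the regular expression ignores escaped characters in the stream format.

```python
import re

MARK = "+"  # marker symbol, as used in the example above

def mark_span(tokens, start, end, mark=MARK):
    """Prefix tokens[start:end] with the marker, as a user interface might."""
    return [mark + t if start <= i < end else t for i, t in enumerate(tokens)]

# One lexical unit in stream format: ^surface/analysis$
# (simplified: escaped characters are not handled)
LU = re.compile(r"\^([^/$]*)/([^$]*)\$")

def merge_marked(stream, default_tags="<np><al>", mark=MARK):
    """Merge each run of adjacent marked lexical units into a single unit.

    The merged unit reuses the surface forms as its lemma and receives
    default_tags; a classifier or rules could supply the tags instead.
    """
    out = []   # pieces of the output stream
    run = []   # surface forms of the current run of marked units
    pos = 0

    def flush():
        if run:
            text = " ".join(run)
            out.append(f"^{mark}{text}/{text}{default_tags}$")
            run.clear()

    for m in LU.finditer(stream):
        gap = stream[pos:m.start()]
        pos = m.end()
        marked = m.group(1).startswith(mark)
        if marked and run and not gap.strip():
            run.append(m.group(1)[len(mark):])  # extend the current run
            continue
        flush()          # anything else ends the current run
        out.append(gap)
        if marked:
            run.append(m.group(1)[len(mark):])  # start a new run
        else:
            out.append(m.group(0))              # copy unmarked unit as-is
    flush()
    out.append(stream[pos:])
    return "".join(out)

tokens = "Microsoft 365 for business is the right choice".split()
print(" ".join(mark_span(tokens, 0, 4)))
# +Microsoft +365 +for +business is the right choice

stream = ("^+Microsoft/Microsoft<np><al>$ ^+365/365<num>$ ^+for/for<pr>$ "
          "^+business/business$ ^is/be<vbser><pres><p3><sg>$")
print(merge_marked(stream))
# ^+Microsoft 365 for business/Microsoft 365 for business<np><al>$ ^is/be<vbser><pres><p3><sg>$
```

Passing a different `default_tags` value is where a classifier or a rule set could plug in.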


Engine-based, offline
  • Discovering these kinds of chunks should be possible using only monolingual corpora or resources such as Wikipedia.
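
As a rough illustration of what offline discovery might look like, the sketch below collects candidate chunks from a monolingual corpus using two simple heuristics: spans in double quotes, and runs of two or more capitalised (or numeric) tokens. Both heuristics, the function name `candidate_chunks` and the `min_count` threshold are assumptions made for this example; the page does not specify a method.

```python
import re
from collections import Counter

# Heuristic 1 (assumption): anything between straight double quotes
QUOTED = re.compile(r'"([^"]+)"')
# Heuristic 2 (assumption): a capitalised token followed by one or more
# tokens that start with a capital letter or a digit, e.g. "Microsoft 365"
CAP_RUN = re.compile(r"\b[A-Z][\w-]*(?: [A-Z0-9][\w-]*)+")

def candidate_chunks(lines, min_count=2):
    """Return (chunk, count) pairs seen at least min_count times."""
    counts = Counter()
    for line in lines:
        counts.update(m.group(1) for m in QUOTED.finditer(line))
        counts.update(m.group(0) for m in CAP_RUN.finditer(line))
    return [(c, n) for c, n in counts.most_common() if n >= min_count]

corpus = [
    'A very popular film adaptation is "Gone with the Wind".',
    'She watched "Gone with the Wind" again.',
    "Microsoft 365 is popular.",
    "We use Microsoft 365 daily.",
]
for chunk, n in candidate_chunks(corpus):
    print(n, chunk)
```

On a real corpus the frequency threshold would need tuning, and sentence-initial capitals would produce false positives that a stop-list or a comparison against lower-cased frequencies could filter out.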