User-based chunking

From Apertium
Revision as of 21:03, 21 June 2020 by Francis Tyers (talk | contribs)

Sometimes in translation it would be useful to be able to mark particular segments (chunks) as not to be translated, while still giving them an analysis for the purposes of translation.

Some use-cases:

  • Product names, long named entities
    • Microsoft 365 for business is the right choice for your company
    • A very popular film adaptation is "Gone with the Wind".
  • The use of quoted chunks as modifiers
  • Quotations in a second (or third) language
    • And then he came back and said "Bu ne lan?"


User-based, online

1. A user could mark in an interface that they don't want a particular span translated; each word in the span would then be prefixed with a symbol. Sequences of these words would be merged by -separable or something like it, and the whole chunk would be given either a default tag, a tag determined by a classifier, or a tag determined by rules.

Microsoft 365 for business is the right choice for your company

+Microsoft +365 +for +business is the right choice for your company

^+Microsoft/Microsoft<np><al>$ ^+365/365<num>$ ^+for/for<pr>$ ^+business/business<n><sg>$ ^is/be<vbser><pres><p3><sg>$ the right choice for your company

^+Microsoft 365 for business/Microsoft 365 for business<np><al>$ ^is/be<vbser><pres><p3><sg>$ the right choice for your company.
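The marking and merging steps above can be sketched in Python. This is only an illustration, not the behaviour of any existing Apertium module: the function names mark_span and merge_marked are hypothetical, the default tag <np><al> is taken from the example above, and the regular expression assumes a simplified stream format with one analysis per lexical unit and no escaped characters.

```python
import re

# One lexical unit in (simplified) Apertium stream format: ^surface/analysis$
# Assumption: no escaped "/" or "$" inside the unit.
LU = re.compile(r"\^([^/$]*)/([^$]*)\$")


def mark_span(words, start, end, marker="+"):
    """Prefix words[start:end] with the marker (hypothetical helper)."""
    return [marker + w if start <= i < end else w for i, w in enumerate(words)]


def merge_marked(stream, default_tags="<np><al>", marker="+"):
    """Merge runs of adjacent marked lexical units into a single unit.

    The merged unit keeps the marker on its surface form and receives
    default_tags in place of the individual analyses.
    """
    out = []   # pieces of the output stream
    run = []   # surface forms (marker stripped) of the current marked run
    pos = 0    # end offset of the last lexical unit consumed

    def flush():
        # Emit the pending run, if any, as one merged lexical unit.
        if run:
            chunk = " ".join(run)
            out.append(f"^{marker}{chunk}/{chunk}{default_tags}$")
            run.clear()

    for m in LU.finditer(stream):
        gap = stream[pos:m.start()]
        surface = m.group(1)
        marked = surface.startswith(marker)
        if run and marked and gap.strip() == "":
            # Adjacent marked unit: extend the current run.
            run.append(surface[len(marker):])
        else:
            flush()
            out.append(gap)
            if marked:
                run.append(surface[len(marker):])
            else:
                out.append(m.group(0))
        pos = m.end()

    flush()
    out.append(stream[pos:])
    return "".join(out)
```

Calling mark_span on the tokenised sentence with the span covering the first four words gives the +-prefixed form, and merge_marked applied to the analysed stream collapses the four marked units into one. A classifier or rule set could replace the default_tags argument.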


For dealing with quoted text it might be better to have an automatic system that combines language identification with rules. The same technique could be used: after running through the module, words that receive an analysis would have their surface form prefixed with +.
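A rule-based version of this for quoted text might look like the sketch below. It is only one possible reading of the idea: the name mark_quoted is hypothetical, the rule is simply "text between straight double quotes", and language identification is left as a caller-supplied is_foreign function, since no particular identifier is specified here.

```python
import re

# Rule: a quoted span is any text between straight double quotes.
QUOTED = re.compile(r'"([^"]+)"')


def mark_quoted(text, is_foreign=lambda span: True, marker="+"):
    """Prefix every word inside a quoted span with the marker.

    is_foreign stands in for a language identifier (an assumption);
    spans it rejects are left untouched.
    """
    def repl(m):
        span = m.group(1)
        if not is_foreign(span):
            return m.group(0)
        return '"' + " ".join(marker + w for w in span.split()) + '"'

    return QUOTED.sub(repl, text)
```

The marked words could then be merged into a single chunk in the same way as user-marked spans.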

Engine-based, offline
  • Discovering these kinds of chunks should be possible using only monolingual corpora, or resources such as Wikipedia.