User-based chunking

In translation it is sometimes useful to mark particular segments (chunks) as not to be translated, while still giving them an analysis for the purposes of translating the rest of the sentence.

Some use-cases:

Product names, long named entities

- Microsoft 365 for business is the right choice for your company
- A very popular film adaptation is "Gone with the Wind".

The use of quoted chunks as modifiers

Quotations in a second (or third) language

- And then he came back and said "Bu ne lan?"

Solutions

User-based, online

1. A user could mark in an interface that they don't want a particular span translated. Each word in the span would then be prefixed with a symbol (+ in the example below). Sequences of these words would be merged by -separable or something like it, and the whole chunk would be given either a default tag, a tag determined by a classifier, or a tag determined by rules.

Microsoft 365 for business is the right choice for your company

+Microsoft +365 +for +business is the right choice for your company

^+Microsoft/Microsoft<np><al>$ ^+365/365<num>$ ^+for/for<pr>$ ^+business/business$ ^is/be<vbser><pres><p3><sg>$ the right choice for your company

^+Microsoft 365 for business/Microsoft 365 for business<np><al>$ ^is/be<vbser><pres><p3><sg>$ the right choice for your company.

...
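The marking and merging steps above can be sketched in Python. This is only an illustration under stated assumptions: `mark_span` and `merge_marked` are hypothetical helper names, not part of any existing tool; the merged chunk always gets the default tag `<np><al>` (no classifier or rules); and the lemma of the merged unit is simply the joined surface forms, as in the example above.

```python
import re

# One lexical unit in the stream: ^surface/analysis$
LU = re.compile(r"\^([^/$]*)/([^$]*)\$")


def mark_span(tokens, start, end, symbol="+"):
    """Prefix tokens[start:end] with the marker symbol (hypothetical helper)."""
    return [symbol + t if start <= i < end else t for i, t in enumerate(tokens)]


def merge_marked(stream, symbol="+", default_tags="<np><al>"):
    """Merge each run of adjacent marked lexical units into a single unit.

    Assumption: the merged unit keeps the marker, uses the joined surface
    forms as its lemma, and receives a fixed default tag.
    """
    out = []
    run = []  # surface forms (without marker) of the current run
    pos = 0

    def flush():
        if run:
            surface = " ".join(run)
            out.append(f"^{symbol}{surface}/{surface}{default_tags}$")
            run.clear()

    for m in LU.finditer(stream):
        gap = stream[pos:m.start()]  # text between the previous unit and this one
        pos = m.end()
        surface = m.group(1)
        marked = surface.startswith(symbol)
        if marked and run and gap.strip() == "":
            # Extends the current run; the gap becomes the joining space.
            run.append(surface[len(symbol):])
            continue
        flush()
        out.append(gap)
        if marked:
            run.append(surface[len(symbol):])
        else:
            out.append(m.group(0))
    flush()
    out.append(stream[pos:])
    return "".join(out)
```

Calling `mark_span` on the tokenised sentence with the span covering the first four words gives the +-prefixed line, and calling `merge_marked` on the analysed stream collapses the four marked units into one chunk while leaving the unmarked units untouched.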

Automatic

For dealing with quoted text, it might be better to have an automatic system that combines language identification with rules. The same technique could be used: after running through the module, words that receive an analysis would have their surface form prefixed with +.
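A rough sketch of such a module is given here in Python. It is not a real language identifier: as a stand-in, a quoted span is treated as foreign when most of its words are missing from a known-word set for the source language. The function name `mark_foreign_quotes`, the `known_words` set and the `threshold` value are all assumptions made for illustration, and only straight double quotes are handled.

```python
import re

# Assumption: quoted spans are delimited by straight double quotes.
QUOTED = re.compile(r'"([^"]+)"')


def mark_foreign_quotes(sentence, known_words, threshold=0.5, symbol="+"):
    """Prefix the words of quoted spans that look foreign with the marker.

    A span "looks foreign" when the share of its words that are absent
    from known_words exceeds threshold (a crude stand-in for language
    identification).
    """

    def looks_foreign(span):
        words = re.findall(r"\w+", span.lower())
        if not words:
            return False
        unknown = sum(1 for w in words if w not in known_words)
        return unknown / len(words) > threshold

    def repl(match):
        span = match.group(1)
        if not looks_foreign(span):
            return match.group(0)  # leave source-language quotes alone
        marked = " ".join(symbol + w for w in span.split())
        return f'"{marked}"'

    return QUOTED.sub(repl, sentence)
```

With a small English word set, the quoted Turkish span in the earlier use-case would be marked word by word, while a quoted English title such as "Gone with the Wind" would pass through unchanged; titles like that would need the rule-based side of the system rather than language identification.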

Engine-based, offline

Discovering these kinds of things should be possible using only monolingual corpora or resources such as Wikipedia.
