User-based chunking
Revision as of 21:01, 21 June 2020 by Francis Tyers
Sometimes in translation it would be useful to mark particular segments or chunks as not to be translated, while still giving them an analysis that the rest of the translation can use.
Some use-cases:
- Product names, long named entities
  - Microsoft 365 for business is the right choice for your company
  - A very popular film adaptation is "Gone with the Wind".
- The use of quoted chunks as modifiers
- Quotations in a second (or third) language
  - And then he came back and said "Bu ne lan?"
Solutions
- User-based, online
  1. A user could mark in an interface that they don't want a particular span translated; each word in the span would then be prefixed with a symbol. Sequences of these prefixed words would be merged by -separable or something like it. The whole chunk would then be given either a default tag, a tag determined by a classifier, or a tag determined by rules.
     Microsoft 365 for business is the right choice for your company
     +Microsoft +365 +for +business is the right choice for your company
     ^+Microsoft/Microsoft<np><al>$ ^+365/365<num>$ ^+for/for<pr>$ ^+business/business$ ^is/be<vbser><pres><p3><sg>$ the right choice for your company
     ^+Microsoft 365 for business/Microsoft 365 for business<np><al>$ ^is/be<vbser><pres><p3><sg>$ the right choice for your company.
     ...
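The marking and merging steps can be illustrated with a minimal Python sketch. It is not an existing Apertium tool: the function names `mark_span` and `merge_marked` are hypothetical, the `+` prefix and the default `<np><al>` tag are taken from the example above, and the sketch merges the marked words directly rather than first analysing each word and then merging, as the pipeline above does.

```python
MARK = "+"                  # prefix symbol, as in the example above
DEFAULT_TAGS = "<np><al>"   # default tag for a merged chunk, as in the example above


def mark_span(tokens, start, end):
    """Prefix every token in tokens[start:end] with MARK (hypothetical helper)."""
    return [MARK + tok if start <= i < end else tok
            for i, tok in enumerate(tokens)]


def _unit(words, tags):
    """Build one lexical unit whose surface form and lemma are the joined words."""
    surface = " ".join(words)
    return f"^{MARK}{surface}/{surface}{tags}$"


def merge_marked(tokens, tags=DEFAULT_TAGS):
    """Merge each run of consecutive marked tokens into a single lexical unit.

    Unmarked tokens are passed through unchanged. Note that a literal token
    starting with MARK in the input would also be treated as marked.
    """
    out, run = [], []
    for tok in tokens:
        if tok.startswith(MARK):
            run.append(tok[len(MARK):])
            continue
        if run:
            out.append(_unit(run, tags))
            run = []
        out.append(tok)
    if run:  # flush a run that reaches the end of the sentence
        out.append(_unit(run, tags))
    return out


sentence = "Microsoft 365 for business is the right choice for your company"
marked = mark_span(sentence.split(), 0, 4)
print(" ".join(marked))
print(" ".join(merge_marked(marked)))
```

The `tags` parameter is where a classifier or a rule set could supply a different tag for the chunk instead of the default.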
- Engine-based, offline
  - Discovering this kind of chunk should be possible using only monolingual corpora, or resources such as Wikipedia.
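One possible way to approach the offline case, sketched here purely as an assumption, is a greedy longest-match lookup against a set of known multiword names, for example a list of Wikipedia article titles. The function `find_chunks`, the `max_len` limit, and the tiny `titles` set are all hypothetical; how such a list would actually be built from a corpus is left open.

```python
def find_chunks(tokens, titles, max_len=6):
    """Return (start, end) token spans that match an entry in `titles`.

    Scans left to right and prefers the longest match at each position.
    Only spans of two or more tokens are considered.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        upper = min(len(tokens), i + max_len)
        for j in range(upper, i + 1, -1):  # try longest candidates first
            if " ".join(tokens[i:j]) in titles:
                match_end = j
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end))
            i = match_end  # continue after the matched chunk
    return spans


titles = {"Gone with the Wind"}  # stand-in for a list extracted from Wikipedia
tokens = "A very popular film adaptation is Gone with the Wind".split()
print(find_chunks(tokens, titles))
```

The spans returned here could then be marked and merged in the same way as spans chosen by a user.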