User:Khannatanmai/Secondary info apertium stream format

Original Proposal: Modifying the apertium stream format and eliminating dictionary trimming: User:Khannatanmai/GSoC2020Proposal_Trimming

Development of the original proposal: User:Khannatanmai/New_Apertium_stream_format

New Proposal: User:Khannatanmai/Alternate_stream_modification

This page tracks the development of the new proposal for adding secondary information to the Apertium stream format.

Formalism

Instead of putting secondary information inside Lexical Units (LUs), we will put all of it inside word bound blanks. The only information placed inside a Lexical Unit will be global reading IDs. These IDs uniquely identify each reading within a window, so that information inside a word bound blank can refer to specific readings when needed.

Example Output of biltrans:

Earlier format:

^de<pr><sf:del><id:11><W:1.6787>/of<pr><sf:del><id:11><W:5.0984>/from<pr><sf:del><id:11><W:0.0065>$

New format:

^de<pr><!11>/of<pr><!67>/from<pr><!68>$[{sf:del}{sl_ids:11; W:1.6787}{sl_ids:11; tl_ids:67; W:5.0984}{sl_ids:11; tl_ids:68; W:0.0065}]
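
To make the new layout concrete, here is a minimal Python sketch that splits one such unit into its readings, their global reading IDs, and the trailing word bound blank. It is not part of any Apertium tool: the function name split_lexical_unit is hypothetical, the <!N> ID syntax is taken only from the example above, and escaped characters inside readings are assumed not to occur.

```python
import re

# Assumption (from the example above): every reading ends with a global
# reading ID written as <!N>.
READING_ID = re.compile(r"<!(\d+)>$")


def split_lexical_unit(text):
    """Split '^readings$[blank]' into ([(reading, id), ...], blank or None).

    Hypothetical helper for illustration; it does not handle escaped
    '/', '$' or '^' characters inside readings.
    """
    match = re.fullmatch(r"\^(.*?)\$(\[.*\])?", text)
    if match is None:
        raise ValueError("not a lexical unit: " + text)
    readings = []
    for reading in match.group(1).split("/"):
        id_match = READING_ID.search(reading)
        if id_match is None:
            raise ValueError("reading without an ID: " + reading)
        # Keep the reading without its ID tag, plus the ID as an integer.
        readings.append((reading[:id_match.start()], int(id_match.group(1))))
    return readings, match.group(2)


example = ("^de<pr><!11>/of<pr><!67>/from<pr><!68>$"
           "[{sf:del}{sl_ids:11; W:1.6787}]")
readings, blank = split_lexical_unit(example)
for reading, reading_id in readings:
    print(reading_id, reading)
```

The source reading (ID 11) and the two target readings (IDs 67 and 68) come out as separate pairs, which is what lets a block in the word bound blank point at exactly one of them.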

Features

  • A word bound blank is defined by the syntax [{...}].
  • It can contain multiple blocks of information: [{...}{...}{...}].
  • If a block has neither sl_id nor tl_id, it refers to the entire LU. (This could be changed to {sl_id:11; tl_id:67,68} if we want every block in a word bound blank to carry IDs.)
  • sl_id and tl_id can each take multiple IDs (from the source or from the target). This helps deal with many-to-many relationships between the tokens of the two languages.
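
The block structure described in this list can be read with a few lines of Python. This is only a sketch under assumptions: the names parse_word_bound_blank and id_list are hypothetical, fields inside a block are assumed to be separated by ';' with key and value separated by the first ':', and the ID key name is passed in as a parameter rather than fixed.

```python
import re


def parse_word_bound_blank(blank):
    """Parse '[{k:v; k:v}{...}]' into a list of dicts, one per block.

    Hypothetical helper; assumes blocks are not nested and that values
    contain no '{', '}' or ';' characters.
    """
    if not (blank.startswith("[{") and blank.endswith("}]")):
        raise ValueError("not a word bound blank: " + blank)
    blocks = []
    for body in re.findall(r"\{([^{}]*)\}", blank):
        block = {}
        for field in body.split(";"):
            if not field.strip():
                continue  # skip empty fields such as a trailing ';'
            key, _, value = field.strip().partition(":")
            block[key.strip()] = value.strip()
        blocks.append(block)
    return blocks


def id_list(block, key):
    """Return the comma-separated IDs stored under key as a list of ints."""
    return [int(x) for x in block.get(key, "").split(",") if x.strip()]


blocks = parse_word_bound_blank("[{sf:del}{sl_id:11; tl_id:67,68; W:5.0984}]")
for block in blocks:
    sl, tl = id_list(block, "sl_id"), id_list(block, "tl_id")
    scope = "entire LU" if not sl and not tl else "sl=%s tl=%s" % (sl, tl)
    print(scope, block)
```

A block with no IDs is reported as applying to the entire LU, and a block whose tl_id holds several IDs yields a list, which matches the many-to-many case mentioned above.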

Rationale

Uses

Surface form

Preserving Input token IDs

Markup Information

Reading specific weights

Reading specific dependencies

Examples