User:Khannatanmai/Secondary info apertium stream format

Latest revision as of 09:25, 17 July 2020

The secondary tags project was shelved because the need for reading-specific secondary information was not established. The work continued as LU-bound secondary information using wordbound blanks (see User:Khannatanmai/Wordbound_blanks).

Original Proposal: Modifying the apertium stream format and eliminating dictionary trimming: User:Khannatanmai/GSoC2020Proposal_Trimming

Development of the original proposal: User:Khannatanmai/New_Apertium_stream_format

New Proposal: User:Khannatanmai/Alternate_stream_modification

This page follows the development of the new proposal for adding secondary information to the Apertium stream format.

Formalism

Instead of putting secondary information inside Lexical Units (LUs), we put all of it inside word bound blanks. The only information that stays inside a Lexical Unit is a global reading ID per reading. These IDs identify each reading uniquely within a window, so that information inside a word bound blank can refer to specific readings when needed.

Example output of biltrans:

Previously, with secondary tags inside the LU:

^de<pr><sf:del><id:11><W:1.6787>/of<pr><sf:del><id:11><W:5.0984>/from<pr><sf:del><id:11><W:0.0065>$

Now, with reading IDs and a word bound blank:

^de<pr><!11>/of<pr><!67>/from<pr><!68>$[{sf:del}{sl_ids:11; W:1.6787}{sl_ids:11; tl_ids:67; W:5.0984}{sl_ids:11; tl_ids:68; W:0.0065}]
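The reading IDs in the example above can be pulled out of an LU with a few lines of Python. This is only a sketch under stated assumptions: `split_lu` is a hypothetical helper, not part of any Apertium tool, and it ignores the stream format's backslash escapes.

```python
import re

# Hypothetical helper (assumption, not an Apertium API): splits one lexical
# unit written as ^...$[{...}] into its "/"-separated segments and the blank.
LU_RE = re.compile(r"\^(?P<body>.*?)\$(?:\[(?P<blank>\{.*?\})\])?")
ID_RE = re.compile(r"<!(\d+)>")


def split_lu(lu: str):
    """Return (segments, blank); segments is a list of (text, ids)."""
    match = LU_RE.fullmatch(lu.strip())
    if match is None:
        raise ValueError(f"not a lexical unit: {lu!r}")
    segments = []
    # Escaped slashes are not handled in this sketch.
    for segment in match.group("body").split("/"):
        ids = [int(n) for n in ID_RE.findall(segment)]
        # Removing the ID tags leaves the text without the proposed <!N> tags.
        segments.append((ID_RE.sub("", segment), ids))
    return segments, match.group("blank") or ""


if __name__ == "__main__":
    lu = "^de<pr><!11>/of<pr><!67>/from<pr><!68>$[{sf:del}{sl_ids:11; W:1.6787}]"
    for text, ids in split_lu(lu)[0]:
        print(text, ids)
```

A joined reading such as `il<v><tv><!7>+o<mark><!8>` yields two IDs for a single segment, which is why the helper returns a list of IDs rather than one ID.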

Features

  • A word bound blank is defined by the syntax [{...}]
  • It can contain multiple blocks of information: [{...}{...}{...}]
  • If a block of information has no sl_ids or tl_ids, it refers to the entire LU. (This could be changed to, for example, {sl_ids:11; tl_ids:67,68} if every block in a word bound blank should carry IDs.)
  • sl_ids and tl_ids can each take multiple IDs (from the source or from the target). This helps deal with many-to-many relationships between the tokens of the two languages.
  • When LUs merge, a new blank block is added that preserves the order of the original LUs along with their IDs. The secondary information from all those LUs is also added to the word bound blank of the merged LU, each piece in a block carrying the ID of the LU it came from.
  • When an LU is split, its secondary information is duplicated onto all parts, and each part gets the same compound id. The order of the parts within the compound is currently not preserved, but it could be if needed.
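The block syntax described above can be parsed roughly as follows. This is a minimal sketch, not the implementation used in Apertium: the names `parse_blank` and `blocks_for_reading` are hypothetical, and the sketch assumes that field values never contain `;`, `{` or `}`.

```python
import re

BLOCK_RE = re.compile(r"\{([^{}]*)\}")
ID_KEYS = {"sl_ids", "tl_ids"}


def parse_blank(blank: str):
    """Parse '[{k:v; k:v}{...}]' into a list of dicts, one per block."""
    blocks = []
    for raw in BLOCK_RE.findall(blank):
        block = {}
        for field in raw.split(";"):
            field = field.strip()
            if not field:
                continue
            key, _, value = field.partition(":")
            key, value = key.strip(), value.strip()
            if key in ID_KEYS:
                # ID fields may list several readings: tl_ids:67,68
                block[key] = [int(n) for n in value.split(",")]
            else:
                block[key] = value
        blocks.append(block)
    return blocks


def blocks_for_reading(blocks, sl_id):
    """Blocks that apply to one source reading.

    A block without sl_ids and tl_ids refers to the entire LU, so it is
    returned for every reading.
    """
    return [
        b for b in blocks
        if ("sl_ids" not in b and "tl_ids" not in b)
        or sl_id in b.get("sl_ids", [])
    ]


if __name__ == "__main__":
    blank = "[{sf:del}{sl_ids:11; W:1.6787}{sl_ids:11; tl_ids:67; W:5.0984}]"
    for block in parse_blank(blank):
        print(block)
```

Weights are kept as strings here so that the sketch does not change how they are written; a consumer that needs arithmetic could convert them with `float`.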

Rationale

Uses

Surface form

Preserving Input token IDs

Markup Information

Reading specific weights

Reading specific dependencies

Examples

Compounds & Surface Form

Secondary Tags:

^intrastruktuur<n><id:12><sf:intrastruktuurontwikkelingsplan>$ ^ontwikkelings<n><id:13><sf:intrastruktuurontwikkelingsplan>$ ^plan<n><id:14><sf:intrastruktuurontwikkelingsplan>$

Word bound blanks:

^intrastruktuur<n><!12>$[{sf:intrastruktuurontwikkelingsplan}] ^ontwikkelings<n><!13>$[{sf:intrastruktuurontwikkelingsplan}] ^plan<n><!14>$[{sf:intrastruktuurontwikkelingsplan}]


After biltrans:

Secondary Tags:

^intrastruktuur<n><id:12><sf:intrastruktuurontwikkelingsplan>/infrastructure<n><id:12><sf:intrastruktuurontwikkelingsplan>$ ^ontwikkelings<n><id:13><sf:intrastruktuurontwikkelingsplan>/development<n><id:13><sf:intrastruktuurontwikkelingsplan>$ ^plan<n><id:14><sf:intrastruktuurontwikkelingsplan>/plan<n><id:14><sf:intrastruktuurontwikkelingsplan>$

Word bound blanks:

^intrastruktuur<n><!12>/infrastructure<n><!65>$[{sf:intrastruktuurontwikkelingsplan}] ^ontwikkelings<n><!13>/development<n><!66>$[{sf:intrastruktuurontwikkelingsplan}] ^plan<n><!14>/plan<n><!67>$[{sf:intrastruktuurontwikkelingsplan}]

Another example

Original Analyser Output:

^«/«<lquot>$^Ajawxel/Ajawaxel<n>$^,/,<cm>$ ^ri/ri<det>$ ^in/in<prn><pers><p1><sg>/in<prn><pro><p1><sg>$ ^kinwilo/<impf><o_sg3><s_sg1>il<v><tv>+o<mark>$ ^che/chi<pr>+<px3sg>ech<n><rel>$ ^ri/ri<det>$ ^at/at<prn><pers><p2><sg>/at<prn><pro><p2><sg>$^,/,<cm>$ ^at/at<prn><pers><p2><sg>/at<prn><pro><p2><sg>$ ^jun/jun<adj>/jun<n>/jun<num>$ ^qʼaxal/qʼaxal<adj>$ ^utzij/<px3sg>tzij<n>$ ^ri/ri<det>$ ^Dyos/Dyos<n>$

New Analyser Output:

Each reading carries its ID in a special tag inside the LU that FSTs will ignore. Format: <!..>

W: weight

^«/«<lquot><!1>$[{W:0.987}]^Ajawxel/Ajawaxel<n><!2>$[{W:0.587}]^,/,<cm><!3>$[{W:0.953}] ^ri/ri<det><!4>$[{W:0.675}] ^in/in<prn><pers><p1><sg><!5>/in<prn><pro><p1><sg><!6>$[{sl_ids:5; W:0.203}{sl_ids:6; W:0.457}] ^kinwilo/<impf><o_sg3><s_sg1>il<v><tv><!7>+o<mark><!8>$[{sl_ids:7; W:0.621}{sl_ids:8; W:0.760}] ^che/chi<pr><!9>+<px3sg>ech<n><rel><!10>$[{sl_ids:9; W:0.832}{sl_ids:10; W:0.562}] ^ri/ri<det><!11>$[{W:0.675}] ^at/at<prn><pers><p2><sg><!12>/at<prn><pro><p2><sg><!13>$[{sl_ids:12; W:0.709}{sl_ids:13; W:0.302}]^,/,<cm><!14>$[{W:0.953}] ^at/at<prn><pers><p2><sg><!15>/at<prn><pro><p2><sg><!16>$[{sl_ids:15; W:0.709}{sl_ids:16; W:0.302}] ^jun/jun<adj><!17>/jun<n><!18>/jun<num><!19>$[{sl_ids:17; W:0.123}{sl_ids:18; W:0.346}{sl_ids:19; W:0.841}] ^qʼaxal/qʼaxal<adj><!20>$[{W:0.301}] ^utzij/<px3sg>tzij<n><!21>$[{W:0.205}] ^ri/ri<det><!22>$[{W:0.675}] ^Dyos/Dyos<n><!23>$[{W:0.442}]

Tagger Output:

sf: surface form

^«<lquot><!1>$[{W:0.987; sf:«}]^Ajawaxel<n><!2>$[{W:0.587; sf:Ajawxel}]^,<cm><!3>$[{W:0.953; sf:,}] ^ri<det><!4>$[{W:0.675; sf:ri}] ^in<prn><pro><p1><sg><!6>$[{sl_ids:6; W:0.457; sf:in}] ^<impf><o_sg3><s_sg1>il<v><tv><!7>+o<mark><!8>$[{sf:kinwilo}{sl_ids:7; W:0.621}{sl_ids:8; W:0.760}] ^chi<pr><!9>+<px3sg>ech<n><rel><!10>$[{sf:che}{sl_ids:9; W:0.832}{sl_ids:10; W:0.562}] ^ri<det><!11>$[{W:0.675; sf:ri}] ^at<prn><pers><p2><sg><!12>$[{sl_ids:12; W:0.709; sf:at}]^,<cm><!14>$[{W:0.953; sf:,}] ^at<prn><pro><p2><sg><!16>$[{sl_ids:16; W:0.302; sf:at}] ^jun<num><!19>$[{sl_ids:19; W:0.841; sf:jun}] ^qʼaxal<adj><!20>$[{W:0.301; sf:qʼaxal}] ^<px3sg>tzij<n><!21>$[{W:0.205; sf:utzij}] ^ri<det><!22>$[{W:0.675; sf:ri}] ^Dyos<n><!23>$[{W:0.442; sf:Dyos}]

Pretransfer Output:

cmp_id: compound id

^«<lquot><!1>$[{W:0.987; sf:«}]^Ajawaxel<n><!2>$[{W:0.587; sf:Ajawxel}]^,<cm><!3>$[{W:0.953; sf:,}] ^ri<det><!4>$[{W:0.675; sf:ri}] ^in<prn><pro><p1><sg><!6>$[{sl_ids:6; W:0.457; sf:in}] ^<impf><o_sg3><s_sg1>il<v><tv><!7>$[{sl_ids:7; W:0.621; cmp_id:1; sf:kinwilo}]^o<mark><!8>$[{sl_ids:8; W:0.760; cmp_id:1; sf:kinwilo}] ^chi<pr><!9>$[{sl_ids:9; W:0.832; cmp_id:2; sf:che}] ^<px3sg>ech<n><rel><!10>$[{sl_ids:10; W:0.562; cmp_id:2; sf:che}] ^ri<det><!11>$[{W:0.675; sf:ri}] ^at<prn><pers><p2><sg><!12>$[{sl_ids:12; W:0.709; sf:at}]^,<cm><!14>$[{W:0.953; sf:,}] ^at<prn><pro><p2><sg><!16>$[{sl_ids:16; W:0.302; sf:at}] ^jun<num><!19>$[{sl_ids:19; W:0.841; sf:jun}] ^qʼaxal<adj><!20>$[{W:0.301; sf:qʼaxal}] ^<px3sg>tzij<n><!21>$[{W:0.205; sf:utzij}] ^ri<det><!22>$[{W:0.675; sf:ri}] ^Dyos<n><!23>$[{W:0.442; sf:Dyos}]
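The step from the tagger output to the pretransfer output for a joined LU such as kinwilo can be sketched as below. This is an illustration under assumptions, not the pretransfer code: `split_joined_lu` and `render` are hypothetical names, and the blank is assumed to be already parsed into dicts whose `sl_ids` value is a list of integers.

```python
def split_joined_lu(parts, blocks, cmp_id):
    """Split one joined LU into one LU per part.

    parts  -- list of (reading_text, reading_id) in surface order
    blocks -- parsed blank blocks (dicts); sl_ids is a list of ints
    cmp_id -- identifier shared by every part of this compound
    """
    # Blocks without IDs refer to the whole LU and are copied to each part.
    shared = {}
    for block in blocks:
        if "sl_ids" not in block and "tl_ids" not in block:
            shared.update(block)

    result = []
    for text, reading_id in parts:
        merged = {"sl_ids": [reading_id]}
        for block in blocks:
            if reading_id in block.get("sl_ids", []):
                merged.update({k: v for k, v in block.items() if k != "sl_ids"})
        merged["cmp_id"] = cmp_id
        merged.update(shared)
        result.append((text, reading_id, merged))
    return result


def render(text, reading_id, block):
    """Write one LU and its single-block blank back in stream notation."""
    fields = []
    for key, value in block.items():
        if key == "sl_ids":
            value = ",".join(str(n) for n in value)
        fields.append(f"{key}:{value}")
    return "^" + text + f"<!{reading_id}>" + "$[{" + "; ".join(fields) + "}]"


if __name__ == "__main__":
    parts = [("<impf><o_sg3><s_sg1>il<v><tv>", 7), ("o<mark>", 8)]
    blocks = [
        {"sf": "kinwilo"},
        {"sl_ids": [7], "W": "0.621"},
        {"sl_ids": [8], "W": "0.760"},
    ]
    for lu in split_joined_lu(parts, blocks, cmp_id=1):
        print(render(*lu))
```

The parts are emitted in the order given, which matches the note that the compound id itself does not record the order of the parts.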

Separable Output:

mw_part: parts of a multi-word unit

^«<lquot><!1>$[{W:0.987; sf:«}]^Ajawaxel<n><!2>$[{W:0.587; sf:Ajawxel}]^,<cm><!3>$[{W:0.953; sf:,}] ^ri<det><!4>$[{W:0.675; sf:ri}] ^in<prn><pro><p1><sg><!6>$[{sl_ids:6; W:0.457; sf:in}] ^<impf><o_sg3><s_sg1>il<v><tv><!7>$[{sl_ids:7; W:0.621; cmp_id:1; sf:kinwilo}]^o<mark><!8>$[{sl_ids:8; W:0.760; cmp_id:1; sf:kinwilo}] ^chi<pr><!9>$[{sl_ids:9; W:0.832; cmp_id:2; sf:che}] ^<px3sg>ech<n><rel><!10>$[{sl_ids:10; W:0.562; cmp_id:2; sf:che}] ^ri<det><!11>$[{W:0.675; sf:ri}] ^at<prn><pers><p2><sg><!12>$[{sl_ids:12; W:0.709; sf:at}]^,<cm><!14>$[{W:0.953; sf:,}] ^at<prn><pro><p2><sg><!16>$[{sl_ids:16; W:0.302; sf:at}] ^jun<num><!19>$[{sl_ids:19; W:0.841; sf:jun}] ^qʼaxal tzij ri Dyos<n><!24>$[{sl_ids:20; W:0.301; sf:qʼaxal; mw_part:1}{sl_ids:21; W:0.205; sf:utzij; mw_part:2}{sl_ids:22; W:0.675; sf:ri; mw_part:3}{sl_ids:23; W:0.442; sf:Dyos; mw_part:4}]
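Because mw_part records the position of each original LU, a later stage could recover the original order and surface forms of a merged multi-word unit. The following is a small sketch, assuming the blank has already been parsed into dicts; `original_order` is a hypothetical name, and joining the surface forms with single spaces is an assumption, since the original spacing is not stored in the blank.

```python
def original_order(blocks):
    """Return (sl_ids, surface) of a merged LU, ordered by mw_part."""
    parts = [b for b in blocks if "mw_part" in b]
    parts.sort(key=lambda b: int(b["mw_part"]))
    ids = [n for b in parts for n in b.get("sl_ids", [])]
    # Assumption: parts were separated by single spaces in the input.
    surface = " ".join(b["sf"] for b in parts)
    return ids, surface


if __name__ == "__main__":
    # Blocks of the merged LU above, deliberately listed out of order.
    blocks = [
        {"sl_ids": [22], "W": "0.675", "sf": "ri", "mw_part": "3"},
        {"sl_ids": [20], "W": "0.301", "sf": "qʼaxal", "mw_part": "1"},
        {"sl_ids": [23], "W": "0.442", "sf": "Dyos", "mw_part": "4"},
        {"sl_ids": [21], "W": "0.205", "sf": "utzij", "mw_part": "2"},
    ]
    print(original_order(blocks))
```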

Thoughts

  • cmp_id stores the id of the compound, but currently does not store the order of the parts within the compound (the order is assumed to be the ascending order of the reading IDs). mw_part does preserve the order.
  • I'm trying to make all blank blocks follow one general philosophy: each block has sl and tl ids, plus the other information pertaining to those ids.

Possible alternatives for the blank of a merged LU:

^qʼaxal tzij ri Dyos<n><!24>$[{mw_ids:20,21,22,23}{sl_ids:20; sf:qʼaxal}{sl_ids:21; sf:utzij}{sl_ids:22; sf:ri}{sl_ids:23; sf:Dyos}]
^qʼaxal tzij ri Dyos<n><!24>$[{sl_ids:20; sf:qʼaxal}{sl_ids:21; sf:utzij}{sl_ids:22; sf:ri}{sl_ids:23; sf:Dyos}]