Difference between revisions of "User:Khannatanmai/Secondary info apertium stream format"
Khannatanmai (talk | contribs)
Revision as of 19:37, 7 June 2020
Original Proposal: Modifying the apertium stream format and eliminating dictionary trimming: User:Khannatanmai/GSoC2020Proposal_Trimming
Development of the original proposal: User:Khannatanmai/New_Apertium_stream_format
New Proposal: User:Khannatanmai/Alternate_stream_modification
This page will follow the development of the new proposal for adding secondary information in the Apertium stream format.
Formalism
Instead of putting secondary information inside Lexical Units (LUs), we will put all of it inside word bound blanks. The only extra information inside a Lexical Unit will be global reading IDs. These IDs uniquely identify each reading within a window, so that information inside word bound blanks can refer to specific readings when needed.
Example Output of biltrans:
Earlier (secondary tags inside the LU):
^de<pr><sf:del><id:11><W:1.6787>/of<pr><sf:del><id:11><W:5.0984>/from<pr><sf:del><id:11><W:0.0065>$
Will now be:
^de<pr><!11>/of<pr><!67>/from<pr><!68>$[{sf:del}{sl_ids:11; W:1.6787}{sl_ids:11; tl_ids:67; W:5.0984}{sl_ids:11; tl_ids:68; W:0.0065}]
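To make the layout of the blank concrete, here is a minimal Python sketch that splits a word bound blank into its blocks. It is not part of any Apertium tool: the function name `parse_word_bound_blank` is hypothetical, and the sketch assumes that fields are separated by `;`, that keys and values are separated by the first `:`, and that values contain no `;`, `{` or `}`.

```python
import re

def parse_word_bound_blank(blank):
    """Parse '[{k:v; k:v}{...}]' into a list of dicts, one per block.

    Hypothetical helper: assumes values never contain ';', '{' or '}'.
    """
    if not (blank.startswith("[{") and blank.endswith("}]")):
        raise ValueError(f"not a word bound blank: {blank!r}")
    blocks = []
    for body in re.findall(r"\{([^{}]*)\}", blank):
        fields = {}
        for item in body.split(";"):
            if not item.strip():
                continue  # skip empty fields such as a trailing ';'
            key, _, value = item.strip().partition(":")
            if key in ("sl_ids", "tl_ids"):
                # ID fields hold comma-separated lists of reading IDs
                fields[key] = [int(i) for i in value.split(",")]
            else:
                fields[key] = value
        blocks.append(fields)
    return blocks

blank = ("[{sf:del}{sl_ids:11; W:1.6787}"
         "{sl_ids:11; tl_ids:67; W:5.0984}{sl_ids:11; tl_ids:68; W:0.0065}]")
for block in parse_word_bound_blank(blank):
    print(block)
```

Weights are kept as strings here so that the sketch does not have to decide how they should be interpreted.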
Features
- A word bound blank is defined by the syntax: [{...}]
- It can hold multiple blocks of information: [{...}{...}{...}]
- If a block of information has neither sl_ids nor tl_ids, it refers to the entire LU. (This can be changed to {sl_ids:11; tl_ids:67,68} if we want every block in a word bound blank to carry IDs.)
- sl_ids and tl_ids can each take multiple IDs (from the source or from the target). This helps deal with many-to-many relationships between the tokens of the two languages.
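The scoping rule for blocks can be sketched as a small lookup. This is only an illustration under stated assumptions: blocks are represented as Python dicts (with `sl_ids` and `tl_ids` as lists of integers), the function name `blocks_for` is hypothetical, and a block that lists only `sl_ids` is assumed to apply to every target reading of that source reading.

```python
def blocks_for(blocks, sl_id=None, tl_id=None):
    """Return the blocks that apply to the given source/target reading IDs."""
    selected = []
    for block in blocks:
        sl = block.get("sl_ids")
        tl = block.get("tl_ids")
        if sl is None and tl is None:
            selected.append(block)  # no IDs: refers to the entire LU
            continue
        if sl is not None and sl_id not in sl:
            continue  # block is about a different source reading
        if tl is not None and tl_id not in tl:
            continue  # block is about a different target reading
        selected.append(block)
    return selected

# Blocks of the biltrans example, written out as dicts.
blocks = [
    {"sf": "del"},
    {"sl_ids": [11], "W": "1.6787"},
    {"sl_ids": [11], "tl_ids": [67], "W": "5.0984"},
    {"sl_ids": [11], "tl_ids": [68], "W": "0.0065"},
]
print(blocks_for(blocks, sl_id=11, tl_id=68))
```

Because the ID fields are lists, one block can refer to several source and several target readings at once, which is what the many-to-many case needs.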
Rationale
Uses
Surface form
Preserving Input token IDs
Markup Information
Reading specific weights
Reading specific dependencies
Examples
Compounds & Surface Form
Secondary Tags:
^intrastruktuur<n><id:12><sf:intrastruktuurontwikkelingsplan>$ ^ontwikkelings<n><id:13><sf:intrastruktuurontwikkelingsplan>$ ^plan<n><id:14><sf:intrastruktuurontwikkelingsplan>$
Word bound blanks:
^intrastruktuur<n><!12>$[{sf:intrastruktuurontwikkelingsplan}] ^ontwikkelings<n><!13>$[{sf:intrastruktuurontwikkelingsplan}] ^plan<n><!14>$[{sf:intrastruktuurontwikkelingsplan}]
After biltrans:
Secondary Tags:
^intrastruktuur<n><id:12><sf:intrastruktuurontwikkelingsplan>/infrastructure<n><id:12><sf:intrastruktuurontwikkelingsplan>$ ^ontwikkelings<n><id:13><sf:intrastruktuurontwikkelingsplan>/development<n><id:13><sf:intrastruktuurontwikkelingsplan>$ ^plan<n><id:14><sf:intrastruktuurontwikkelingsplan>/plan<n><id:14><sf:intrastruktuurontwikkelingsplan>$
Word bound blanks:
^intrastruktuur<n><!12>/infrastructure<n><!65>$[{sf:intrastruktuurontwikkelingsplan}] ^ontwikkelings<n><!13>/development<n><!66>$[{sf:intrastruktuurontwikkelingsplan}] ^plan<n><!14>/plan<n><!67>$[{sf:intrastruktuurontwikkelingsplan}]
Another one
Original Analyser Output:
^«/«<lquot>$^Ajawxel/Ajawaxel<n>$^,/,<cm>$ ^ri/ri<det>$ ^in/in<prn><pers><p1><sg>/in<prn><pro><p1><sg>$ ^kinwilo/<impf><o_sg3><s_sg1>il<v><tv>+o<mark>$ ^che/chi<pr>+<px3sg>ech<n><rel>$ ^ri/ri<det>$ ^at/at<prn><pers><p2><sg>/at<prn><pro><p2><sg>$^,/,<cm>$ ^at/at<prn><pers><p2><sg>/at<prn><pro><p2><sg>$ ^jun/jun<adj>/jun<n>/jun<num>$ ^qʼaxal/qʼaxal<adj>$ ^utzij/<px3sg>tzij<n>$ ^ri/ri<det>$ ^Dyos/Dyos<n>$
New Analyser Output:
Special tag in the LU that will be ignored by FSTs. Format: <!..>
^«/«<lquot><!1>$^Ajawxel/Ajawaxel<n><!2>$^,/,<cm><!3>$ ^ri/ri<det><!4>$ ^in/in<prn><pers><p1><sg><!5>/in<prn><pro><p1><sg><!6>$ ^kinwilo/<impf><o_sg3><s_sg1>il<v><tv><!7>+o<mark><!8>$ ^che/chi<pr><!9>+<px3sg>ech<n><rel><!10>$ ^ri/ri<det><!11>$ ^at/at<prn><pers><p2><sg><!12>/at<prn><pro><p2><sg><!13>$^,/,<cm><!14>$ ^at/at<prn><pers><p2><sg><!15>/at<prn><pro><p2><sg><!16>$ ^jun/jun<adj><!17>/jun<n><!18>/jun<num><!19>$ ^qʼaxal/qʼaxal<adj><!20>$ ^utzij/<px3sg>tzij<n><!21>$ ^ri/ri<det><!22>$ ^Dyos/Dyos<n><!23>$
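The numbering shown above can be reproduced with a short Python sketch. It is not the analyser's implementation: `add_reading_ids` is a hypothetical name, and the sketch assumes that the stream contains no escaped `^`, `$`, `/` or `+` characters, that readings are separated by `/`, and that the parts of a compound reading are joined by `+`.

```python
import re
from itertools import count

def add_reading_ids(stream, start=1):
    """Append a global <!n> tag to every reading (and every compound part)."""
    ids = count(start)

    def number_lu(match):
        surface, *readings = match.group(1).split("/")
        numbered = []
        for reading in readings:
            # each '+'-joined part of a compound gets its own ID
            parts = [f"{part}<!{next(ids)}>" for part in reading.split("+")]
            numbered.append("+".join(parts))
        return "^" + "/".join([surface] + numbered) + "$"

    # LUs are processed left to right, so IDs increase through the window
    return re.sub(r"\^(.*?)\$", number_lu, stream)

print(add_reading_ids("^ri/ri<det>$ ^jun/jun<adj>/jun<n>/jun<num>$"))
```

Text between LUs (spaces, ordinary blanks) is left untouched, since only the `^...$` spans are rewritten.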
Tagger Output:
sf: surface form
^«<lquot><!1>$[{sf:«}]^Ajawaxel<n><!2>$[{sf:Ajawxel}]^,<cm><!3>$[{sf:,}] ^ri<det><!4>$[{sf:ri}] ^in<prn><pro><p1><sg><!6>$[{sf:in}] ^<impf><o_sg3><s_sg1>il<v><tv><!7>+o<mark><!8>$[{sf:kinwilo}] ^chi<pr><!9>+<px3sg>ech<n><rel><!10>$[{sf:che}] ^ri<det><!11>$[{sf:ri}] ^at<prn><pers><p2><sg><!12>$[{sf:at}]^,<cm><!14>$[{sf:,}] ^at<prn><pro><p2><sg><!16>$[{sf:at}] ^jun<num><!19>$[{sf:jun}] ^qʼaxal<adj><!20>$[{sf:qʼaxal}] ^<px3sg>tzij<n><!21>$[{sf:utzij}] ^ri<det><!22>$[{sf:ri}] ^Dyos<n><!23>$[{sf:Dyos}]
Pretransfer Output:
cmp_id: compound id
^«<lquot><!1>$[{sf:«}]^Ajawaxel<n><!2>$[{sf:Ajawxel}]^,<cm><!3>$[{sf:,}] ^ri<det><!4>$[{sf:ri}] ^in<prn><pro><p1><sg><!6>$[{sf:in}] ^<impf><o_sg3><s_sg1>il<v><tv><!7>$[{cmp_id:1; sf:kinwilo}]^o<mark><!8>$[{cmp_id:1; sf:kinwilo}] ^chi<pr><!9>$[{cmp_id:2; sf:che}] ^<px3sg>ech<n><rel><!10>$[{cmp_id:2; sf:che}] ^ri<det><!11>$[{sf:ri}] ^at<prn><pers><p2><sg><!12>$[{sf:at}]^,<cm><!14>$[{sf:,}] ^at<prn><pro><p2><sg><!16>$[{sf:at}] ^jun<num><!19>$[{sf:jun}] ^qʼaxal<adj><!20>$[{sf:qʼaxal}] ^<px3sg>tzij<n><!21>$[{sf:utzij}] ^ri<det><!22>$[{sf:ri}] ^Dyos<n><!23>$[{sf:Dyos}]
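The compound splitting step can be sketched in Python as follows. This is an illustration only, not the behaviour of apertium-pretransfer: `split_compounds` is a hypothetical name, the sketch handles only LUs followed by a single-block word bound blank, it assumes no escaped characters, and it joins the split parts with a space (the example above shows the parts both with and without a separating space).

```python
import re
from itertools import count

# An LU followed directly by a single-block word bound blank.
LU_WITH_BLANK = re.compile(r"\^([^$]*)\$\[\{([^{}]*)\}\]")

def split_compounds(stream):
    """Split '+'-joined readings into separate LUs sharing one cmp_id."""
    cmp_ids = count(1)

    def split_lu(match):
        reading, info = match.groups()
        parts = reading.split("+")
        if len(parts) == 1:
            return match.group(0)  # not a compound: leave unchanged
        cmp_id = next(cmp_ids)
        # every part keeps the original block and gains the same cmp_id
        return " ".join(f"^{part}$[{{cmp_id:{cmp_id}; {info}}}]" for part in parts)

    return LU_WITH_BLANK.sub(split_lu, stream)

print(split_compounds("^ri<det><!4>$[{sf:ri}] ^chi<pr><!9>+<px3sg>ech<n><rel><!10>$[{sf:che}]"))
```

Copying the surface form onto every part means a later stage can still tell which split LUs came from the same input token by comparing their cmp_id values.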
Separable Output:
mw_id: multiword id
^«<lquot><!1>$[{sf:«}]^Ajawaxel<n><!2>$[{sf:Ajawxel}]^,<cm><!3>$[{sf:,}] ^ri<det><!4>$[{sf:ri}] ^in<prn><pro><p1><sg><!6>$[{sf:in}] ^<impf><o_sg3><s_sg1>il<v><tv><!7>$[{cmp_id:1; sf:kinwilo}]^o<mark><!8>$[{cmp_id:1; sf:kinwilo}] ^chi<pr><!9>$[{cmp_id:2; sf:che}] ^<px3sg>ech<n><rel><!10>$[{cmp_id:2; sf:che}] ^ri<det><!11>$[{sf:ri}] ^at<prn><pers><p2><sg><!12>$[{sf:at}]^,<cm><!14>$[{sf:,}] ^at<prn><pro><p2><sg><!16>$[{sf:at}] ^jun<num><!19>$[{sf:jun}] ^qʼaxal tzij ri Dyos<n><!24>$[{mw_id:24; sl_ids:20,21,22,23}{sl_ids:20; sf:qʼaxal}{sl_ids:21; sf:utzij}{sl_ids:22; sf:ri}{sl_ids:23; sf:Dyos}]