Surface forms in the pipe

From Apertium

[[Category:Ideas]]

Revision as of 08:58, 24 June 2020

Currently the surface form is thrown away after the tagger. It might be handy to keep it until transfer so that we can substitute things unknown to the bidix.

Another use could be to allow surface-form embeddings, like those produced by word2vec, to be used in the tagger and lexical selection modules. Lexical selection could potentially use the surface forms themselves as well.

Idea #1

Input:

Machiavelli took it for granted that would-be leaders naturally aim at glory or honor.

Morph:

^Machiavelli/Machiavelli<np><cog><sg>$ ^took/take<vblex><past>$ ^it/prpers<prn><subj><p3><nt><sg>/prpers<prn><obj><p3><nt><sg>$ ^for/for<cnjadv>/for<pr>$ ^granted/grant<vblex><pp>/grant<vblex><past>$ ^that/that<cnjsub>/that<det><dem><sg>/that<prn><dem><mf><sg>/that<prn><rel><an><mf><sp>$ ^would-be/would-be<adj>$ ^leaders/leader<n><pl>$ ^naturally/naturally<adv>$ ^aim at/aim<vblex><inf># at/aim<vblex><pres># at/aim<vblex><imp># at$ ^glory/glory<n><sg>$ ^or/or<cnjcoo>$ ^honor/honour<vblex><inf>/honour<vblex><pres>/honour<vblex><imp>/honour<n><sg>$^./.<sent>$

Tagger:
^Machiavelli/Machiavelli<np><cog><sg>$ ^took/take<vblex><past>$ ^it/prpers<prn><obj><p3><nt><sg>$ ^for/for<pr>$ ^granted/grant<vblex><pp>$ ^that/that<cnjsub>$ ^would-be/would-be<adj>$ ^leaders/leader<n><pl>$ ^naturally/naturally<adv>$ ^aim at/aim# at<vblex><pres>$ ^glory/glory<n><sg>$ ^or/or<cnjcoo>$ ^honor/honour<n><sg>$^./.<sent>$
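
The tagger step above can be sketched as follows. This is only a minimal illustration, assuming that every lexical unit has the shape ^surface/analysis/...$ and that no escaped '/', '^' or '$' characters occur; the names keep_surface and pick_last are hypothetical, and pick_last merely stands in for a real tagger.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$ (escaping is ignored here)
LU_RE = re.compile(r"\^([^$]*)\$")

def keep_surface(stream, choose):
    """Rewrite each ^surface/a1/a2$ as ^surface/chosen$, keeping the surface form."""
    def rewrite(match):
        surface, *analyses = match.group(1).split("/")
        if not analyses:
            # Nothing to disambiguate: leave the unit untouched
            return match.group(0)
        return "^{}/{}$".format(surface, choose(surface, analyses))
    return LU_RE.sub(rewrite, stream)

# Stand-in for a real tagger: always pick the last analysis
def pick_last(surface, analyses):
    return analyses[-1]

print(keep_surface("^for/for<cnjadv>/for<pr>$ ^glory/glory<n><sg>$", pick_last))
```

Text between lexical units is passed through unchanged, so the only difference from the current tagger output is that the surface form stays in front of the chosen analysis.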

Separable:
^Machiavelli/Machiavelli<np><cog><sg>$ ^took it for granted/take<vblex><past># for granted+prpers<prn><obj><p3><nt><sg>$ ^that/that<cnjsub>$ ^would-be/would-be<adj>$ ^leaders/leader<n><pl>$ ^naturally/naturally<adv>$ ^aim at/aim# at<vblex><pres>$ ^glory/glory<n><sg>$ ^or/or<cnjcoo>$ ^honor/honour<n><sg>$^./.<sent>$

Pretransfer:

Biltrans:
^Machiavelli/Machiavelli<np><cog><sg>/Machiavelli<np><cog>$ ^took it for granted/take# for granted<vblex><past>+prpers<prn><obj><p3><nt><sg>/dar# por hecho<vblex><past>+lo<prn><tn><p3><nt><sg>$ ^that/that<cnjsub>/que<cnjsub>$ ^would-be/would-be<adj>/@would-be<adj>$ ^leaders/leader<n><pl>/@leader<n><pl>$ ^naturally/naturally<adv>/naturalmente<adv>$ ^aim/aim<vblex><inf>/apuntar<vblex><inf>$ ^at/at<pr>/en<pr>$ ^glory/glory<n><sg>/gloria<n><f><sg>$ ^or/or<cnjcoo>/o<cnjcoo>$ ^honour/honour<n><sg>/honor<n><m><sg>$^./.<sent>/.<sent>$ 

Transfer:

^Machiavelli<np><cog>$ ^lo<prn><tn><p3><nt><sg>$ ^dar# por hecho<vblex><ifi><p3><sg>$ ^que<cnjsub>$ ^*would-be$ ^*leaders$ ^naturalmente<adv>$ ^apuntar<vblex><inf>$ ^en<pr>$ ^gloria<n><f><sg>$ ^o<cnjcoo>$ ^honor<n><m><sg>$^.<sent>/.<sent>$ 


Generation (?):

Machiavelli lo dio por hecho que *would-be *leaders apuntar en gloria o honor.

Generation could potentially output something like ^dar# por hecho<vblex><ifi><p3><sg>/dio por hecho$, but how would postgeneration work then? For example, given:

^de<pr>$ ^el<det><def><m><sg>$

Could it be:

Generation:
^de<pr>/de$ ^el<det><def><m><sg>/el$

Postgeneration:
^de<pr>+el<det><def><m><sg>/del$
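
One way this could work is sketched below, assuming generation emits units of the form ^analysis/surface$ and that postgeneration is driven by a table of surface-form pairs. The CONTRACTIONS table and the postgenerate function are hypothetical, and the sketch joins units with a single space rather than preserving the original text between them.

```python
import re

# One generated unit: ^analysis/surface$ (escaping is ignored here)
LU_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")

# Hypothetical contraction table: (surface, surface) -> contracted surface
CONTRACTIONS = {("de", "el"): "del"}

def postgenerate(stream):
    """Merge adjacent units whose surface forms contract, joining analyses with '+'."""
    units = LU_RE.findall(stream)  # list of (analysis, surface) pairs
    out = []
    i = 0
    while i < len(units):
        if i + 1 < len(units):
            key = (units[i][1], units[i + 1][1])
            if key in CONTRACTIONS:
                analysis = units[i][0] + "+" + units[i + 1][0]
                out.append("^{}/{}$".format(analysis, CONTRACTIONS[key]))
                i += 2
                continue
        out.append("^{}/{}$".format(units[i][0], units[i][1]))
        i += 1
    return " ".join(out)

print(postgenerate("^de<pr>/de$ ^el<det><def><m><sg>/el$"))
```

Keeping both analyses in the merged unit means later stages can still see which words the contracted surface form came from.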

Questions

  • This will complicate the code for bidix lookup and for lexical selection (we'll need to support '+' and output the product of the options for each part).
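
The "product of each part" could look roughly like the sketch below. It assumes a toy bidix that maps a full source analysis to a list of target analyses, marks unknown parts with '@' as in the biltrans output above, and splits naively on '+' (a '+' inside a lemma is not handled). BIDIX and lookup_joined are hypothetical names, not part of any existing Apertium tool.

```python
from itertools import product

# Hypothetical toy bidix: source analysis -> list of target analyses
BIDIX = {
    "take# for granted<vblex><past>": ["dar# por hecho<vblex><past>"],
    "prpers<prn><obj><p3><nt><sg>": ["lo<prn><tn><p3><nt><sg>"],
}

def lookup_joined(analysis, bidix):
    """Translate each '+'-joined part and return every combination, rejoined with '+'."""
    parts = analysis.split("+")
    # Unknown parts are kept, marked with '@'
    options = [bidix.get(part, ["@" + part]) for part in parts]
    return ["+".join(combo) for combo in product(*options)]

print(lookup_joined("take# for granted<vblex><past>+prpers<prn><obj><p3><nt><sg>", BIDIX))
```

If one part has two translations and another has one, the result has two entries, which is where the extra complexity for lexical selection comes from.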


Idea #2

If we want to "eliminate trimming" in order to get better context, one option is to keep trimming but pass the surface form instead of the lemma for words not found in the bilingual dictionary.

For example:

Sparti dâ ammiguitati:

"<Sparti>"
	"spartiri" vblex pri p3 sg SELECT:291
	"spàrtiri" vblex pri p3 sg SELECT:291
;	"sparti" adv SELECT:291
;	"spartiri" vblex imp p2 sg REMOVE:227
;	"spartiri" vblex pri p2 sg SELECT:291
;	"spàrtiri" vblex imp p2 sg REMOVE:227
;	"spàrtiri" vblex pri p2 sg SELECT:291
"<dâ>"
	"di" pr
		"lu" det def f sg
"<ammiguitati>"
	"*ammiguitati<n><f><sp>"

This could then also be combined with guessers. The surface form would never change, and could just be output as is at the other end.

It's not like we're going to be able to use the lemma anyway.

E.g. we could have a program in the pipeline (maybe it could be pretransfer, before separable) that reads the bilingual dictionary and reduces analyses.

in:
^ammiguitati/ammiguitati<n><f><sp>$

out:
^ammiguitati/*ammiguitati<n><f><sp>$
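
A minimal sketch of such a program follows. It assumes disambiguated input with exactly one analysis per unit, no '+'-joined analyses, and a bidix reduced to a plain set of known lemmas (a real lookup would also depend on the tags). The names mark_unknown and bidix_lemmas are hypothetical.

```python
import re

# One disambiguated unit: ^surface/analysis$ (escaping is ignored here)
LU_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")

def mark_unknown(stream, bidix_lemmas):
    """Replace the lemma with '*surface' when the lemma is not in the bidix."""
    def rewrite(match):
        surface, analysis = match.groups()
        # The lemma is everything before the first tag
        lemma, sep, tags = analysis.partition("<")
        if lemma in bidix_lemmas:
            return match.group(0)
        return "^{}/*{}{}{}$".format(surface, surface, sep, tags)
    return LU_RE.sub(rewrite, stream)

# 'di' is known to the toy bidix here, 'ammiguitati' is not
print(mark_unknown("^dâ/di<pr>$ ^ammiguitati/ammiguitati<n><f><sp>$", {"di"}))
```

The tags are kept as they are, so later stages still get the morphological information even though the lemma slot now carries the starred surface form.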


But here it has tags, so it's included in the monodix, but with the lemma changed to surface form?
What about CG rules referring to lemmas?