Semantic tagging

From Apertium

Latest revision as of 15:27, 15 June 2020

This page keeps a list of ideas for semantic tagging in Apertium.

Uses

Approaches

  • Giellatekno
  • Grammatical Framework

Data sources

  • WikiData ?
      ◦ https://www.wikidata.org/wiki/Q169
      ◦ https://www.wikidata.org/wiki/Q200266
  • Embeddings ?

Implementation

Bilingual or multilingual

  • Often a word can be disambiguated using its translation in another language, for example the triple (estació, gare, station) defines a building meaning.
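
The idea can be sketched as a lookup over translation triples. This is only an illustration: the `SENSES` table, the tag names and the `tag_sense` helper are hypothetical, and the second triple (estació, saison, season) is an assumed example added to show the contrast, not something taken from an existing Apertium resource.

```python
# Hypothetical sense inventory keyed by (Catalan, French, English) triples.
# The tag names are made up for this sketch.
SENSES = {
    ("estació", "gare", "station"): "building",
    ("estació", "saison", "season"): "time_period",
}


def tag_sense(word, translations):
    """Return the tags of all triples for `word` that agree with the
    translations we happen to know, e.g. {"fr": "gare"}."""
    tags = []
    for (ca, fr, en), tag in SENSES.items():
        if ca != word:
            continue
        known = {"fr": fr, "en": en}
        # Keep the sense only if every supplied translation matches.
        if all(known[lang] == t for lang, t in translations.items()):
            tags.append(tag)
    return tags


print(tag_sense("estació", {"fr": "gare"}))    # one sense left
print(tag_sense("estació", {}))                # still ambiguous
```

With no translation available the word stays ambiguous between both tags; a single translation in another language is enough to pick one.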

Existing examples

SET MangoFruitWords = ("aguacate"i) OR ("albahaca"i) OR ("alimentario"i) OR ("alimenticio"i) OR ("aloe"i)
    OR ("anacardo"i) OR ("ananás"i) OR ("anchoa"i) OR ("arroz"i) OR ("atún"i) OR ("azúcar"i) OR ("banana"i)
    OR ("banano"i) OR ("batido"i) OR ("boniato"i) OR ("brocheta"i) OR ("cacahuete"i) OR ("cacao"i)
    OR ("caramelizar"i) OR ("caramelo"i) OR ("carpaccio"i) OR ("caviar"i) OR ("cereal"i) OR ("chirimoya"i)
    OR ("chocolate"i) OR ("<chutney>"i) OR ("clima"i) OR ("coco"i) OR ("cocotero"i) OR ("codorniz"i)
    OR ("comer"i) OR ("comercial"i) OR ("comida"i) OR ("cosecha"i) OR ("crema"i) OR ("cultivar"i)
    OR ("cultivo"i) OR ("cítrico"i) OR ("dátil"i) OR ("deshidratar"i) OR ("ensalada"i) OR ("exportación"i)
    OR ("foie"i) OR ("fragancia"i) OR ("fresa"i) OR ("fresco"i) OR ("fruta"i) OR ("fruto"i) OR ("gamba"i)
    OR ("gazpacho"i) OR ("guayaba"i) OR ("gustar"i) OR ("helado"i) OR ("hortaliza"i) OR ("ingrediente"i)
    OR ("jamón"i) OR ("jarabe"i) OR ("jardín"i) OR ("jengibre"i) OR ("judía"i) OR ("langosta"i)
    OR ("langostino"i) OR ("lechuga"i) OR ("legumbre"i) OR ("maduro"i) OR ("mandarina"i) OR ("mandioca"i)
    OR ("manzana"i) OR ("maní"i) OR ("maracuyá"i) OR ("maíz"i) OR ("melocotón"i) OR ("melón"i) OR ("mono"i)
    OR ("naranja"i) OR ("naranjo"i) OR ("orquídea"i) OR ("orégano"i) OR ("palma"i) OR ("palmera"i)
    OR ("papaya"i) OR ("parmesano"i) OR ("patata"i) OR ("piscina"i) OR ("piña"i) OR ("plantación"i)
    OR ("plátano"i) OR ("pollo"i) OR ("probar"i) OR ("puré"i) OR ("rodaja"i) OR ("ron"i) OR ("salsa"i)
    OR ("sorbete"i) OR ("sorgo"i) OR ("subsistencia"i) OR ("sésamo"i) OR ("tabaco"i) OR ("tempura"i)
    OR ("tomate"i) OR ("trigo"i) OR ("triturar"i) OR ("tropical"i) OR ("tubérculo"i) OR ("vainilla"i)
    OR ("vinagre"i) OR ("yogur"i) OR ("zumo"i) OR ("África");

SET MangoNotFruitWords = ("acero"i) OR ("azada"i) OR ("levantar"i) OR ("alzar"i) OR ("plata"i) OR ("arpón"i)
    OR ("azote"i) OR ("cuerno"i) OR ("bastón"i) OR ("bolsa"i) OR ("brazo"i) OR ("silla"i) OR ("centímetro"i)
    OR ("cinturón"i) OR ("llave"i) OR ("clavar"i) OR ("cubierto"i) OR ("golpear"i) OR ("cuchillo"i)
    OR ("corazón"i) OR ("cuerda"i) OR ("cuerpo"i) OR ("cocina"i) OR ("cuero"i) OR ("cuchara"i) OR ("corto"i)
    OR ("hacha"i) OR ("herramienta"i) OR ("emplear"i) OR ("empuñar"i) OR ("escoba"i) OR ("espada"i)
    OR ("estirar"i) OR ("apretar"i) OR ("extremo"i) OR ("trabajo"i) OR ("meter"i) OR ("fuego"i) OR ("forma"i)
    OR ("látigo"i) OR ("hoja"i) OR ("madera"i) OR ("cuchillo"i) OR ("girar"i) OR ("escoba"i) OR ("grabar"i)
    OR ("grueso"i) OR ("instrumento"i) OR ("marfil"i) OR ("lanza"i) OR ("lanzar"i) OR ("largo"i) OR ("maza"i)
    OR ("martillo"i) OR ("metal"i) OR ("mover"i) OR ("movimiento"i) OR ("navaja"i) OR ("limpiar"i)
    OR ("sartén"i) OR ("paella"i) OR ("palo"i) OR ("pala"i) OR ("papel"i) OR ("pieza"i) OR ("piedra"i)
    OR ("pequeño"i) OR ("picar"i) OR ("pistola"i) OR ("plástico"i) OR ("plata"i) OR ("pluma"i) OR ("puerta"i)
    OR ("precioso"i) OR ("punta"i) OR ("puñal"i) OR ("cepillo"i) OR ("cepillo"i) OR ("redondo"i) OR ("ropa"i)
    OR ("rueda"i) OR ("sujetar"i) OR ("mesa"i) OR ("atravesar"i) OR ("utilizar"i) OR ("alrededor"i)
    OR ("marfil"i);

SELECT:mango_fruta ("mango_fruta"i) IF (0 ("mango_fruta"i)) (0*/* MangoFruitWords) (NOT 0* MangoNotFruitWords) ;
REMOVE:mango_0 ("mango_fruta"i) IF (0 ("mango"i)) (0 ("mango_fruta"i)) ;
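
As read here, the SELECT rule keeps the "mango_fruta" reading when a word from MangoFruitWords appears in the context and no word from MangoNotFruitWords does, and the REMOVE rule otherwise falls back to the plain "mango" reading. A rough Python analogue of that decision, assuming the context is already available as a list of lemmas and using only a few words from each set (the `choose_mango` helper is hypothetical and not part of CG-3 or Apertium):

```python
# Small subsets of the two sets above, for illustration only.
FRUIT_WORDS = {"fruta", "comer", "zumo", "tropical"}
NOT_FRUIT_WORDS = {"cuchillo", "sartén", "herramienta"}


def choose_mango(context_lemmas):
    """Pick a reading for 'mango' from the lemmas around it."""
    # The "i" suffix in the rules makes matching case-insensitive.
    lemmas = {lemma.lower() for lemma in context_lemmas}
    has_fruit_cue = bool(lemmas & FRUIT_WORDS)
    has_tool_cue = bool(lemmas & NOT_FRUIT_WORDS)
    if has_fruit_cue and not has_tool_cue:
        return "mango_fruta"   # corresponds to SELECT:mango_fruta
    return "mango"             # corresponds to the REMOVE:mango_0 fallback


print(choose_mango(["comer", "un", "maduro"]))
print(choose_mango(["el", "de", "el", "cuchillo"]))
```

Note that a single non-fruit cue blocks the fruit reading even when fruit cues are present, which mirrors the NOT condition in the SELECT rule.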


Ideas and notes


Thanks Xavi for the ideas...

What I've been thinking about is a module that would go after
biltrans and before lexical selection. It would essentially reweight
the possible translations based on a bag of words over a fixed
window of words or "sentences" (delimited with '.').

You could have source and target components, so e.g. you might
say that "fruit" is a semantic field or domain which includes

"mango", "manzana", "plátano", "naranja", ...

in Spanish and

"mango", "taronja", "poma"

in Catalan. These would be in the monolingual pairs. The
module would take both lists and the input

^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$
^mango<n><m><pl>/mànec<n><m><pl>/mango<n><m><pl>$
^y<cnjcoo>/i<cnjcoo>$
^manzana<n><f><pl>/poma<n><f><pl>$

and try to maximise semantic coherence. It could then reweight
the translations, e.g.

^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$
^mango<n><m><pl>/mango<n><m><pl><2.0>/mànec<n><m><pl><0.0>$
^y<cnjcoo>/i<cnjcoo>$
^manzana<n><f><pl>/poma<n><f><pl>$

and pass the result to the lexical selection module, which would
choose the translation with the highest weight.

This would mean a new module, but it would require only minor
changes to the bilingual dictionary and lexical selection, and
wouldn't have any effect on transfer.
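
A minimal Python sketch of such a module, under several assumptions: the whole input is treated as one window, the `FIELDS` table stands in for the source and target lists, and the weight is simply the number of other words in the window, on either side, that belong to the same field. The weight-as-final-tag output follows the example above; none of the names below exist in Apertium.

```python
import re

# Hypothetical semantic fields with source (Spanish) and target (Catalan) lists.
FIELDS = {
    "fruit": {
        "src": {"mango", "manzana", "plátano", "naranja"},
        "tgt": {"mango", "taronja", "poma"},
    },
}

UNIT = re.compile(r"\^(.*?)\$")


def lemma(reading):
    """Return the lemma of a reading such as 'mango<n><m><pl>'."""
    return reading.split("<", 1)[0]


def parse(stream):
    """Split a biltrans-style stream into (source, [targets]) pairs."""
    units = []
    for body in UNIT.findall(stream):
        source, *targets = body.split("/")
        units.append((source, targets))
    return units


def reweight(units):
    """Weight ambiguous translations by field overlap with the other units."""
    result = []
    for i, (source, targets) in enumerate(units):
        if len(targets) < 2:
            # Unambiguous units pass through without a weight.
            result.append((source, [(t, None) for t in targets]))
            continue
        others = [u for j, u in enumerate(units) if j != i]
        ctx_src = [lemma(s) for s, _ in others]
        ctx_tgt = [lemma(t) for _, ts in others for t in ts]
        scored = []
        for t in targets:
            weight = 0.0
            for field in FIELDS.values():
                if lemma(t) in field["tgt"]:
                    weight += sum(c in field["src"] for c in ctx_src)
                    weight += sum(c in field["tgt"] for c in ctx_tgt)
            scored.append((t, weight))
        scored.sort(key=lambda pair: -pair[1])  # highest weight first
        result.append((source, scored))
    return result


def render(units):
    """Serialise units back to the stream, appending weights as a last tag."""
    lines = []
    for source, targets in units:
        parts = [t if w is None else f"{t}<{w:.1f}>" for t, w in targets]
        lines.append("^" + "/".join([source] + parts) + "$")
    return "\n".join(lines)


stream = "\n".join([
    "^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$",
    "^mango<n><m><pl>/mànec<n><m><pl>/mango<n><m><pl>$",
    "^y<cnjcoo>/i<cnjcoo>$",
    "^manzana<n><f><pl>/poma<n><f><pl>$",
])
print(render(reweight(parse(stream))))
```

On the input above, "manzana" on the source side and "poma" on the target side each count once towards the fruit reading of "mango", so it ends up first with weight 2.0 and "mànec" gets 0.0. That the numbers coincide with the example is a consequence of how this sketch counts, not a statement about the intended weighting scheme.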
