Semantic tagging
Revision as of 15:27, 15 June 2020 by Francis Tyers
This page collects ideas for semantic tagging in Apertium.
Uses
Approaches
- Giellatekno
- Grammatical Framework
Data sources
- Embeddings?
Implementation
Bilingual or multilingual
- Often a word can be disambiguated using its translation in another language. For example, the triple (estació, gare, station) identifies the "building" sense of estació, as opposed to its "season" sense.
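The idea in the bullet above can be sketched as a lookup over translation tuples. This is only an illustration, not an existing Apertium component: the `SENSES` table, the sense labels, the language codes and the `senses_for` function are all hypothetical, and the second triple (estació, saison, season) is added here purely to give the lookup something to distinguish.

```python
# Hypothetical sense inventory: each entry maps a set of (language, lemma)
# translations to a sense label. Only the first triple comes from the page;
# the second one and both labels are assumptions made for this sketch.
SENSES = {
    frozenset({("cat", "estació"), ("fra", "gare"), ("eng", "station")}): "building",
    frozenset({("cat", "estació"), ("fra", "saison"), ("eng", "season")}): "time_period",
}


def senses_for(observed):
    """Return the sense labels compatible with the observed (lang, lemma) pairs."""
    observed = set(observed)
    # A triple is compatible when it contains every observed translation.
    return sorted(label for triple, label in SENSES.items() if observed <= triple)


# The Catalan word alone is ambiguous between the two senses.
ambiguous = senses_for([("cat", "estació")])
# Adding the French translation narrows it down to a single sense.
resolved = senses_for([("cat", "estació"), ("fra", "gare")])
```

With only the Catalan lemma, both senses remain possible; once a translation in a second language is known, the set of compatible triples shrinks.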
Existing examples
SET MangoFruitWords = ("aguacate"i) OR ("albahaca"i) OR ("alimentario"i) OR ("alimenticio"i) OR ("aloe"i)
    OR ("anacardo"i) OR ("ananás"i) OR ("anchoa"i) OR ("arroz"i) OR ("atún"i) OR ("azúcar"i)
    OR ("banana"i) OR ("banano"i) OR ("batido"i) OR ("boniato"i) OR ("brocheta"i) OR ("cacahuete"i)
    OR ("cacao"i) OR ("caramelizar"i) OR ("caramelo"i) OR ("carpaccio"i) OR ("caviar"i) OR ("cereal"i)
    OR ("chirimoya"i) OR ("chocolate"i) OR ("<chutney>"i) OR ("clima"i) OR ("coco"i) OR ("cocotero"i)
    OR ("codorniz"i) OR ("comer"i) OR ("comercial"i) OR ("comida"i) OR ("cosecha"i) OR ("crema"i)
    OR ("cultivar"i) OR ("cultivo"i) OR ("cítrico"i) OR ("dátil"i) OR ("deshidratar"i) OR ("ensalada"i)
    OR ("exportación"i) OR ("foie"i) OR ("fragancia"i) OR ("fresa"i) OR ("fresco"i) OR ("fruta"i)
    OR ("fruto"i) OR ("gamba"i) OR ("gazpacho"i) OR ("guayaba"i) OR ("gustar"i) OR ("helado"i)
    OR ("hortaliza"i) OR ("ingrediente"i) OR ("jamón"i) OR ("jarabe"i) OR ("jardín"i) OR ("jengibre"i)
    OR ("judía"i) OR ("langosta"i) OR ("langostino"i) OR ("lechuga"i) OR ("legumbre"i) OR ("maduro"i)
    OR ("mandarina"i) OR ("mandioca"i) OR ("manzana"i) OR ("maní"i) OR ("maracuyá"i) OR ("maíz"i)
    OR ("melocotón"i) OR ("melón"i) OR ("mono"i) OR ("naranja"i) OR ("naranjo"i) OR ("orquídea"i)
    OR ("orégano"i) OR ("palma"i) OR ("palmera"i) OR ("papaya"i) OR ("parmesano"i) OR ("patata"i)
    OR ("piscina"i) OR ("piña"i) OR ("plantación"i) OR ("plátano"i) OR ("pollo"i) OR ("probar"i)
    OR ("puré"i) OR ("rodaja"i) OR ("ron"i) OR ("salsa"i) OR ("sorbete"i) OR ("sorgo"i)
    OR ("subsistencia"i) OR ("sésamo"i) OR ("tabaco"i) OR ("tempura"i) OR ("tomate"i) OR ("trigo"i)
    OR ("triturar"i) OR ("tropical"i) OR ("tubérculo"i) OR ("vainilla"i) OR ("vinagre"i) OR ("yogur"i)
    OR ("zumo"i) OR ("África");

SET MangoNotFruitWords = ("acero"i) OR ("azada"i) OR ("levantar"i) OR ("alzar"i) OR ("plata"i)
    OR ("arpón"i) OR ("azote"i) OR ("cuerno"i) OR ("bastón"i) OR ("bolsa"i) OR ("brazo"i)
    OR ("silla"i) OR ("centímetro"i) OR ("cinturón"i) OR ("llave"i) OR ("clavar"i) OR ("cubierto"i)
    OR ("golpear"i) OR ("cuchillo"i) OR ("corazón"i) OR ("cuerda"i) OR ("cuerpo"i) OR ("cocina"i)
    OR ("cuero"i) OR ("cuchara"i) OR ("corto"i) OR ("hacha"i) OR ("herramienta"i) OR ("emplear"i)
    OR ("empuñar"i) OR ("escoba"i) OR ("espada"i) OR ("estirar"i) OR ("apretar"i) OR ("extremo"i)
    OR ("trabajo"i) OR ("meter"i) OR ("fuego"i) OR ("forma"i) OR ("látigo"i) OR ("hoja"i)
    OR ("madera"i) OR ("cuchillo"i) OR ("girar"i) OR ("escoba"i) OR ("grabar"i) OR ("grueso"i)
    OR ("instrumento"i) OR ("marfil"i) OR ("lanza"i) OR ("lanzar"i) OR ("largo"i) OR ("maza"i)
    OR ("martillo"i) OR ("metal"i) OR ("mover"i) OR ("movimiento"i) OR ("navaja"i) OR ("limpiar"i)
    OR ("sartén"i) OR ("paella"i) OR ("palo"i) OR ("pala"i) OR ("papel"i) OR ("pieza"i)
    OR ("piedra"i) OR ("pequeño"i) OR ("picar"i) OR ("pistola"i) OR ("plástico"i) OR ("plata"i)
    OR ("pluma"i) OR ("puerta"i) OR ("precioso"i) OR ("punta"i) OR ("puñal"i) OR ("cepillo"i)
    OR ("cepillo"i) OR ("redondo"i) OR ("ropa"i) OR ("rueda"i) OR ("sujetar"i) OR ("mesa"i)
    OR ("atravesar"i) OR ("utilizar"i) OR ("alrededor"i) OR ("marfil"i);

SELECT:mango_fruta ("mango_fruta"i) IF (0 ("mango_fruta"i)) (0*/* MangoFruitWords) (NOT 0* MangoNotFruitWords) ;
REMOVE:mango_0 ("mango_fruta"i) IF (0 ("mango"i)) (0 ("mango_fruta"i)) ;
Ideas and notes
Thanks Xavi for the ideas...

What I've been thinking about is a module that would go after biltrans and before lexical selection. It would essentially reweight the possible translations based on a bag of words over a fixed window of words or "sentences" (delimited with '.').

You could have source and target components. For example, you might say that "fruit" is a semantic field or domain which includes "mango", "manzana", "plátano", "naranja", ... in Spanish and "mango", "taronja", "poma" in Catalan. These would be in the monolingual pairs.

The module would take both lists and the input

^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$ ^mango<n><m><pl>/mànec<n><m><pl>/mango<n><m><pl>$ ^y<cnjcoo>/i<cnjcoo>$ ^manzana<n><f><pl>/poma<n><f><pl>$

and try to maximise semantic coherence. It could then reweight the translations, e.g.

^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$ ^mango<n><m><pl>/mango<n><m><pl><2.0>/mànec<n><m><pl><0.0>$ ^y<cnjcoo>/i<cnjcoo>$ ^manzana<n><f><pl>/poma<n><f><pl>$

and pass the result to the lexical selection module, which would choose the translation with the highest weight.

This would mean a new module, but it would require only minor changes to the bilingual dictionary and lexical selection, and wouldn't have any effect on transfer.
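A rough Python sketch of such a reweighting step might look like the following. It is not an existing Apertium module, and everything in it is an assumption made for illustration: the `FIELDS` table, the `reweight` function, the scoring rule (one point for each other word in the window whose source lemma is in the same field, plus one for each whose translation is), and the simplification that the whole input string is one window. It also ignores escaping in the stream format and drops any text between lexical units.

```python
import re

# Hypothetical semantic-field lists; the "fruit" entries are the ones given above.
FIELDS = {
    "fruit": {
        "source": {"mango", "manzana", "plátano", "naranja"},
        "target": {"mango", "taronja", "poma"},
    },
}

# One lexical unit of biltrans output: ^source/target1/target2$
UNIT = re.compile(r"\^(.*?)\$")


def lemma(reading):
    """Strip the tags from a reading such as 'mango<n><m><pl>'."""
    return reading.split("<", 1)[0]


def reweight(stream):
    """Attach a coherence weight to each translation of every ambiguous unit."""
    units = [m.group(1).split("/") for m in UNIT.finditer(stream)]

    def score(index, candidate):
        total = 0.0
        for field in FIELDS.values():
            if lemma(candidate) not in field["target"]:
                continue
            for j, other in enumerate(units):
                if j == index:
                    continue  # a word does not support itself
                if lemma(other[0]) in field["source"]:
                    total += 1.0
                if any(lemma(t) in field["target"] for t in other[1:]):
                    total += 1.0
        return total

    out = []
    for i, (source, *targets) in enumerate(units):
        if len(targets) > 1:  # only ambiguous units get weights
            scored = sorted(((score(i, t), t) for t in targets), key=lambda p: -p[0])
            targets = [f"{t}<{w:.1f}>" for w, t in scored]
        out.append("^" + "/".join([source, *targets]) + "$")
    return " ".join(out)


example = (
    "^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$ "
    "^mango<n><m><pl>/mànec<n><m><pl>/mango<n><m><pl>$ "
    "^y<cnjcoo>/i<cnjcoo>$ ^manzana<n><f><pl>/poma<n><f><pl>$"
)
print(reweight(example))
```

On the example input this particular scoring ranks the "mango" translation above "mànec", because "manzana" and "poma" both belong to the fruit field. A real module would need a window limit, weights that combine sensibly with those already used by lexical selection, and proper handling of the stream format.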