Semantic tagging
This page collects ideas for semantic tagging in Apertium.
Uses
Approaches
- Giellatekno
- Grammatical Framework
Data sources
- Embeddings ?
Implementation
Bilingual or multilingual
- Often a word can be disambiguated using its translation in another language, for example the triple (estació, gare, station) defines a building meaning.
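The idea above can be sketched as a lookup over a small multilingual sense inventory. This is only an illustration: the `SENSES` table, the sense names, the language codes used as keys, and the `senses_for` helper are all hypothetical and do not correspond to any existing Apertium resource.

```python
# Hypothetical sense inventory: each sense lists one lemma per language.
# Catalan "estació" is ambiguous between a building and a season.
SENSES = {
    "station_building": {"cat": "estació", "fra": "gare", "eng": "station"},
    "season_of_year": {"cat": "estació", "fra": "saison", "eng": "season"},
}


def senses_for(observed, senses=SENSES):
    """Return the senses that agree with every observed (language, lemma) pair."""
    return [
        name
        for name, lemmas in senses.items()
        if all(lemmas.get(lang) == lemma for lang, lemma in observed.items())
    ]


# The Catalan word alone is ambiguous; adding the French translation resolves it.
print(senses_for({"cat": "estació"}))
print(senses_for({"cat": "estació", "fra": "gare"}))
```

With only the Catalan lemma, both senses remain; once a translation such as "gare" is known, only the building sense is left.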
Existing examples
SET MangoFruitWords = ("aguacate"i) OR ("albahaca"i) OR ("alimentario"i) OR ("alimenticio"i) OR ("aloe"i)
    OR ("anacardo"i) OR ("ananás"i) OR ("anchoa"i) OR ("arroz"i) OR ("atún"i) OR ("azúcar"i)
    OR ("banana"i) OR ("banano"i) OR ("batido"i) OR ("boniato"i) OR ("brocheta"i) OR ("cacahuete"i)
    OR ("cacao"i) OR ("caramelizar"i) OR ("caramelo"i) OR ("carpaccio"i) OR ("caviar"i) OR ("cereal"i)
    OR ("chirimoya"i) OR ("chocolate"i) OR ("<chutney>"i) OR ("clima"i) OR ("coco"i) OR ("cocotero"i)
    OR ("codorniz"i) OR ("comer"i) OR ("comercial"i) OR ("comida"i) OR ("cosecha"i) OR ("crema"i)
    OR ("cultivar"i) OR ("cultivo"i) OR ("cítrico"i) OR ("dátil"i) OR ("deshidratar"i) OR ("ensalada"i)
    OR ("exportación"i) OR ("foie"i) OR ("fragancia"i) OR ("fresa"i) OR ("fresco"i) OR ("fruta"i)
    OR ("fruto"i) OR ("gamba"i) OR ("gazpacho"i) OR ("guayaba"i) OR ("gustar"i) OR ("helado"i)
    OR ("hortaliza"i) OR ("ingrediente"i) OR ("jamón"i) OR ("jarabe"i) OR ("jardín"i) OR ("jengibre"i)
    OR ("judía"i) OR ("langosta"i) OR ("langostino"i) OR ("lechuga"i) OR ("legumbre"i) OR ("maduro"i)
    OR ("mandarina"i) OR ("mandioca"i) OR ("manzana"i) OR ("maní"i) OR ("maracuyá"i) OR ("maíz"i)
    OR ("melocotón"i) OR ("melón"i) OR ("mono"i) OR ("naranja"i) OR ("naranjo"i) OR ("orquídea"i)
    OR ("orégano"i) OR ("palma"i) OR ("palmera"i) OR ("papaya"i) OR ("parmesano"i) OR ("patata"i)
    OR ("piscina"i) OR ("piña"i) OR ("plantación"i) OR ("plátano"i) OR ("pollo"i) OR ("probar"i)
    OR ("puré"i) OR ("rodaja"i) OR ("ron"i) OR ("salsa"i) OR ("sorbete"i) OR ("sorgo"i)
    OR ("subsistencia"i) OR ("sésamo"i) OR ("tabaco"i) OR ("tempura"i) OR ("tomate"i) OR ("trigo"i)
    OR ("triturar"i) OR ("tropical"i) OR ("tubérculo"i) OR ("vainilla"i) OR ("vinagre"i) OR ("yogur"i)
    OR ("zumo"i) OR ("África");

SET MangoNotFruitWords = ("acero"i) OR ("azada"i) OR ("levantar"i) OR ("alzar"i) OR ("plata"i)
    OR ("arpón"i) OR ("azote"i) OR ("cuerno"i) OR ("bastón"i) OR ("bolsa"i) OR ("brazo"i)
    OR ("silla"i) OR ("centímetro"i) OR ("cinturón"i) OR ("llave"i) OR ("clavar"i) OR ("cubierto"i)
    OR ("golpear"i) OR ("cuchillo"i) OR ("corazón"i) OR ("cuerda"i) OR ("cuerpo"i) OR ("cocina"i)
    OR ("cuero"i) OR ("cuchara"i) OR ("corto"i) OR ("hacha"i) OR ("herramienta"i) OR ("emplear"i)
    OR ("empuñar"i) OR ("escoba"i) OR ("espada"i) OR ("estirar"i) OR ("apretar"i) OR ("extremo"i)
    OR ("trabajo"i) OR ("meter"i) OR ("fuego"i) OR ("forma"i) OR ("látigo"i) OR ("hoja"i)
    OR ("madera"i) OR ("cuchillo"i) OR ("girar"i) OR ("escoba"i) OR ("grabar"i) OR ("grueso"i)
    OR ("instrumento"i) OR ("marfil"i) OR ("lanza"i) OR ("lanzar"i) OR ("largo"i) OR ("maza"i)
    OR ("martillo"i) OR ("metal"i) OR ("mover"i) OR ("movimiento"i) OR ("navaja"i) OR ("limpiar"i)
    OR ("sartén"i) OR ("paella"i) OR ("palo"i) OR ("pala"i) OR ("papel"i) OR ("pieza"i)
    OR ("piedra"i) OR ("pequeño"i) OR ("picar"i) OR ("pistola"i) OR ("plástico"i) OR ("plata"i)
    OR ("pluma"i) OR ("puerta"i) OR ("precioso"i) OR ("punta"i) OR ("puñal"i) OR ("cepillo"i)
    OR ("cepillo"i) OR ("redondo"i) OR ("ropa"i) OR ("rueda"i) OR ("sujetar"i) OR ("mesa"i)
    OR ("atravesar"i) OR ("utilizar"i) OR ("alrededor"i) OR ("marfil"i);

SELECT:mango_fruta ("mango_fruta"i) IF (0 ("mango_fruta"i)) (0*/* MangoFruitWords) (NOT 0* MangoNotFruitWords) ;
REMOVE:mango_0 ("mango_fruta"i) IF (0 ("mango"i)) (0 ("mango_fruta"i)) ;
Ideas and notes
Thanks Xavi for the ideas... What I've been thinking about is a module that would go after biltrans and before lexical selection. It would essentially reweight the possible translations based on a bag of words over a fixed window of words or "sentences" (delimited with '.').

You could have source and target components, so e.g. you might say that "fruit" is a semantic field or domain which includes "mango", "manzana", "plátano", "naranja", ... in Spanish and "mango", "taronja", "poma" in Catalan. These would be in the monolingual pairs.

The module would take both lists and the input

^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$ ^mango<n><m><pl>/mànec<n><m><pl>/mango<n><m><pl>$ ^y<cnjcoo>/i<cnjcoo>$ ^manzana<n><f><pl>/poma<n><f><pl>$

and try to maximise semantic coherence. It could then reweight the translations, so e.g.

^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$ ^mango<n><m><pl>/mango<n><m><pl><2.0>/mànec<n><m><pl><0.0>$ ^y<cnjcoo>/i<cnjcoo>$ ^manzana<n><f><pl>/poma<n><f><pl>$

and pass the result to the lexical selection module, which would choose the translation with the highest weight.

This would mean a new module, but it would require only minor changes to the bilingual dictionary and lexical selection, and wouldn't have any effect on transfer.
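A minimal Python sketch of the reweighting step described above, under several assumptions that are not part of the original idea: the whole input is treated as a single window, a candidate's weight is simply the number of source lemmas in the window that share a semantic field with it, and weights are appended as a trailing `<…>` tag as in the example. The `DOMAINS` table and the `reweight` function are hypothetical names, not an existing Apertium module.

```python
import re

# Hypothetical semantic field; the member lists follow the example in the text.
DOMAINS = {
    "fruit": {
        "source": {"mango", "manzana", "plátano", "naranja"},  # Spanish
        "target": {"mango", "taronja", "poma"},                # Catalan
    },
}

# One lexical unit in the biltrans output: ^source/target1/target2$
UNIT = re.compile(r"\^(.*?)\$")


def lemma(reading):
    """Return the lemma of a reading such as 'mango<n><m><pl>'."""
    return reading.split("<", 1)[0]


def reweight(stream, domains=DOMAINS):
    """Append a weight to every translation of each ambiguous lexical unit."""
    units = [m.group(1).split("/") for m in UNIT.finditer(stream)]
    source_lemmas = [lemma(u[0]) for u in units]

    def score(target_reading):
        # Count the source lemmas in the window that belong to a field
        # which also contains this target lemma.
        total = 0.0
        for field in domains.values():
            if lemma(target_reading) in field["target"]:
                total += sum(1 for s in source_lemmas if s in field["source"])
        return total

    out = []
    for source, *targets in units:
        if len(targets) > 1:  # only ambiguous units are reweighted
            ranked = sorted(targets, key=score, reverse=True)
            targets = [f"{t}<{score(t):.1f}>" for t in ranked]
        out.append("^" + "/".join([source] + targets) + "$")
    return " ".join(out)


stream = (
    "^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$ "
    "^mango<n><m><pl>/mànec<n><m><pl>/mango<n><m><pl>$ "
    "^y<cnjcoo>/i<cnjcoo>$ ^manzana<n><f><pl>/poma<n><f><pl>$"
)
print(reweight(stream))
```

On this input the sketch gives "mango" a weight of 2.0 (both "mango" and "manzana" fall in the fruit field) and "mànec" a weight of 0.0, and leaves unambiguous units untouched. A real module would need a proper windowing strategy and a less crude coherence measure.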