How to get started with lexical selection rules

From Apertium

First approach

Choose your words

Before you start writing lexical selection rules, choose a word in your source language (e.g. English) that has more than one translation in your target language (e.g. Spanish). For example:

  • argument → discusión
  • argument → polémica
  • argument → argumento

Think about context

The words around a word often help us decide how to translate it. For example, a verb might tell us how to translate a noun, or a noun might tell us how to translate an adjective.

If you say "to have an argument", it probably means "discusión", whereas if you say "to accept the argument", you probably want to translate it as "argumento".
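
As a sketch, these two contexts could be written as rules like the following. The verb and determiner tag patterns (vblex.*, det.*) are assumptions about how your tagger analyses these words, and the rules only have an effect if both translations are present in your bilingual dictionary.

	<!-- Sketch: "have a/an argument" → discusión (tag patterns are assumptions) -->
	<rule>
	  <match lemma="have" tags="vblex.*"/>
	  <match tags="det.*"/>
	  <match lemma="argument" tags="n.*">
	    <select lemma="discusión" tags="n.*"/>
	  </match>
	</rule>

	<!-- Sketch: "accept the argument" → argumento (tag patterns are assumptions) -->
	<rule>
	  <match lemma="accept" tags="vblex.*"/>
	  <match tags="det.*"/>
	  <match lemma="argument" tags="n.*">
	    <select lemma="argumento" tags="n.*"/>
	  </match>
	</rule>

Matching the determiner by tag rather than by lemma lets the same rule cover "an argument" and "the argument".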

Think about synonyms and antonyms

Once you have a rule, one way of making it more general is to think of synonyms and antonyms for the context words. For example, if you have the rule:

	<rule>
	  <match lemma="positive" tags="*"/>
	  <match lemma="charge" tags="n.*">
	    <select lemma="carga" tags="n.*"/>
	  </match>
	</rule>

The antonym of "positive" is "negative", so you could add that too:


	<rule>
	  <or>
	    <match lemma="positive" tags="*"/>
	    <match lemma="negative" tags="*"/>
	  </or>
	  <match lemma="charge" tags="n.*">
	    <select lemma="carga" tags="n.*"/>
	  </match>
	</rule>

Think about semantically related words

If you have the rule:

	<rule>
	  <match lemma="wind" tags="*"/>
	  <match lemma="power" tags="n.*">
	    <select lemma="energía" tags="n.*"/>
	  </match>
	</rule>

The translation of "power" as "energía" (instead of the default translation "poder") is likely to apply in more contexts than just after "wind", for example "solar power" (energía solar) or "wave power" (energía olamotriz).
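
One way to sketch this generalisation is to list the context words inside <or>, as in the antonym example earlier. The lemmas "solar" and "wave" are assumed to be what your analyser produces for these words.

	<!-- Sketch: "wind/solar/wave power" → energía (context lemmas are assumptions) -->
	<rule>
	  <or>
	    <match lemma="wind" tags="*"/>
	    <match lemma="solar" tags="*"/>
	    <match lemma="wave" tags="*"/>
	  </or>
	  <match lemma="power" tags="n.*">
	    <select lemma="energía" tags="n.*"/>
	  </match>
	</rule>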

Thinking about these can even lead you to other rules. For example, "electrical power" is not energía eléctrica but potencia eléctrica.
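
A rule for this case might look like the following sketch. It assumes that "potencia" is listed as a translation of "power" in the bilingual dictionary and that the lemma of the context word is "electrical".

	<!-- Sketch: "electrical power" → potencia (lemma and dictionary entry are assumptions) -->
	<rule>
	  <match lemma="electrical" tags="*"/>
	  <match lemma="power" tags="n.*">
	    <select lemma="potencia" tags="n.*"/>
	  </match>
	</rule>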

Look at a concordance

A concordance (or "key word in context") is a set of sentences, each centred on a single word (sometimes called the "key word").


Try a parallel corpus

By looking at a parallel corpus, you can see which contexts go with one translation but not with another.

	$ paste europarl-v6.es-en.en europarl-v6.es-en.es | grep ' power .* potencia '
	Since the Union as a whole is a world-class fishing '''power''' and one of the largest markets for fish produce, ...
	Por ser la Unión en conjunto una '''potencia''' pesquera en el nivel mundial y uno de los mayores mercados de productos pesqueros, ...

Second approach

See also