How to get started with lexical selection rules
This page explains how to start writing lexical selection rules. It documents a few approaches, with example rules in the constraint-based lexical selection format.
First approach
Choose your words
Before you start writing lexical selection rules, choose a word in your source language (e.g. English) that has more than one translation in your target language (e.g. Spanish). For example:
- argument → discusión
- argument → polémica
- argument → argumento
Think about context
The words around a word often help us decide how to translate it. For example, a verb might tell us how to translate a noun, or a noun might tell us how to translate an adjective.
If we say "to have an argument", then the word probably means "discusión", whereas if we say "to accept the argument", then it is probably best translated as "argumento".
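The two contexts above could be written as rules along the following lines. This is only a sketch: it assumes that the determiner between the verb and the noun is a single word that an empty <match/> will accept (as in the "spend" rule at the end of this page), and that tags="*" is good enough for the verb. Check the lemmas and tags against your own analyser output.

```xml
<rule>
  <!-- to have an argument: select "discusión" -->
  <match lemma="have" tags="*"/>
  <match/> <!-- assumption: one word (the determiner) sits in between -->
  <match lemma="argument" tags="n.*">
    <select lemma="discusión" tags="n.*"/>
  </match>
</rule>
<rule>
  <!-- to accept the argument: select "argumento" -->
  <match lemma="accept" tags="*"/>
  <match/> <!-- assumption: one word (the determiner) sits in between -->
  <match lemma="argument" tags="n.*">
    <select lemma="argumento" tags="n.*"/>
  </match>
</rule>
```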
Think about synonyms and antonyms
Once you have a rule, one way of making it more general is to think of synonyms and antonyms for the context words. For example, if you have the rule:
<rule>
  <match lemma="positive" tags="*"/>
  <match lemma="charge" tags="n.*">
    <select lemma="carga" tags="n.*"/>
  </match>
</rule>
The antonym of "positive" is "negative", so you could add that too:
<rule>
  <or>
    <match lemma="positive" tags="*"/>
    <match lemma="negative" tags="*"/>
  </or>
  <match lemma="charge" tags="n.*">
    <select lemma="carga" tags="n.*"/>
  </match>
</rule>
If you have the rule:
<rule>
  <match lemma="wind" tags="*"/>
  <match lemma="power" tags="n.*">
    <select lemma="energía" tags="n.*"/>
  </match>
</rule>
You might notice that "power" is translated as "energía" (instead of the default translation, poder) in more contexts than just after "wind", for example "solar power" (energía solar) and "wave power" (energía olamotriz). This might give, for example:
<rule>
  <or>
    <match lemma="wind" tags="*"/>
    <match lemma="solar" tags="*"/>
    <match lemma="hydro" tags="*"/>
    <match lemma="geothermal" tags="*"/>
    <match lemma="tidal" tags="*"/>
  </or>
  <match lemma="power" tags="n.*">
    <select lemma="energía" tags="n.*"/>
  </match>
</rule>
Thinking about these can lead you to further rules. For example, "electrical power" is not energía eléctrica, it's potencia eléctrica.
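The "electrical power" case could be covered by a rule like the one below. It is a sketch in the same style as the rules above and assumes that the analyser gives "electrical" its own lemma.

```xml
<rule>
  <!-- electrical power: select "potencia", not "energía" -->
  <match lemma="electrical" tags="*"/>
  <match lemma="power" tags="n.*">
    <select lemma="potencia" tags="n.*"/>
  </match>
</rule>
```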
Look at a concordance
A concordance (or "key word in context") is a set of sentences centred on a single word (sometimes called the "key word"). To make a concordance you can use a concordancer (e.g. apertium-concord).
Here is an example from EuroParl:
represent, to the President and to the Governor of Texas, Mr Bush, who has the power to order a stay
We should do everything within our power to force the
On the market, the balance of power between supply and
The scandalous concentration of power in sectors of strategic
The Commission is, in fact, retaining not only the power to
Turning to your questions about the nuclear power stations in
financing required for improving the degree of efficiency and safety of nuclear power stations in certain
Mr President, it is clear that the European Union does not have the power to intervene in the
I therefore feel we must carefully consider the balance of power that we are in
Having spent a lot of time at sea myself I am well aware of the sea's power and destructive force,
The Commission is following with interest the planned construction of a nuclear power plant in Akkuyu, Turkey
siting, construction, commissioning, operation and decommissioning of nuclear power plants in Turkey rests
a serious risk that some idiot will decide that the new geopolitical balance of power in the Caucasus calls
If we see in the development of that nuclear power
If the conclusion is that Turkey is planning to build a nuclear power plant that does not
none of the upheaval would have been caused had we not acted with parliamentary power to press for changes
yet the police are being forced into a position where they will not have the power to resist the terrorist
right to interfere in the formation of a government even though it has assumed power on the basis of
unusual about what is happening in Austria: there has been a changeover of power following democratic
for this Intergovernmental Conference to score a hat trick; that of the power to act, democratic
At the same time of course, we need to create the power to act in order
The European Union must also have the power to act
...
If you build the full concordance yourself, the following sequences are particularly interesting: "nuclear power", "nuclear power station", "nuclear power plant", "parliamentary power", "sea's power", "balance of power", "concentration of power", "power to act", "motive power", "abuse of power", "decision-making power", "power supply", "come into power", "economic power", "power structures", "combined power and heat", "political power".
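A frequent sequence in the concordance can be turned directly into a rule. As a sketch, "nuclear power" (energía nuclear) could get its own rule, or "nuclear" could be added to the <or> list of the energía rule in the previous section. Note that a rule like this would also fire inside "nuclear power station" and "nuclear power plant", which may need separate treatment in your language pair.

```xml
<rule>
  <!-- nuclear power: select "energía" -->
  <match lemma="nuclear" tags="*"/>
  <match lemma="power" tags="n.*">
    <select lemma="energía" tags="n.*"/>
  </match>
</rule>
```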
Try a parallel corpus
By looking at a parallel corpus, you can see which contexts go with one translation but not with another.
$ paste europarl-v6.es-en.en europarl-v6.es-en.es | grep ' power .* potencia '
Since the Union as a whole is a world-class fishing power and one of the largest markets for fish produce, ...
Por ser la Unión en conjunto una potencia pesquera en el nivel mundial y uno de los mayores mercados de productos pesqueros, ...
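The sentence pair found above suggests a rule for "fishing power". The following is a sketch only: it assumes that the analyser gives "fishing" the lemma "fishing" in this context (it might instead be analysed as a form of the verb "fish"), so check the analysis before using it.

```xml
<rule>
  <!-- a world-class fishing power: select "potencia" -->
  <match lemma="fishing" tags="*"/> <!-- assumption: lemma as produced by your analyser -->
  <match lemma="power" tags="n.*">
    <select lemma="potencia" tags="n.*"/>
  </match>
</rule>
```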
Second approach
Another approach is to write rules to fix translation errors that you come across. In order to try this out, take a big text (for example a newspaper article), and run it through the translator.
For example, if we take this article, the translation is pretty bad, but there are some places where lexical selection could improve the picture.
MPs who had spent almost six hours debating the state of the UK economy voted by 213 to 79, a majority of 134.
MPs Quién había gastado casi seis horas debatiendo el estado de la economía de UK votada por 213 a 79, una mayoría de 134.
In English, "spend" can have a number of meanings, among them "to pass time" (pasar) and "to pay money" (gastar). In this case, the context calls for the translation pasar, because the sentence is talking about time spent. So, we might make a rule like the following:
<rule>
  <!-- MPs spent almost six hours debating... -->
  <match lemma="spend" tags="vblex.*">
    <select lemma="pasar" tags="vblex.*"/>
  </match>
  <match/>
  <match/>
  <or>
    <match lemma="minute"/>
    <match lemma="hour"/>
    <match lemma="year"/>
  </or>
</rule>
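The rule above expects exactly two words between "spend" and the time unit ("almost six"). A companion rule for phrases with only one word in between, such as "spent six hours", might look like the sketch below. It assumes that each empty <match/> stands for exactly one word, which is how the rule above uses it.

```xml
<rule>
  <!-- ... spent six hours debating ... -->
  <match lemma="spend" tags="vblex.*">
    <select lemma="pasar" tags="vblex.*"/>
  </match>
  <match/> <!-- one intervening word, e.g. a numeral -->
  <or>
    <match lemma="minute"/>
    <match lemma="hour"/>
    <match lemma="year"/>
  </or>
</rule>
```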