Difference between revisions of "How to get started with lexical selection rules"
Latest revision as of 13:21, 7 December 2015
This page is about how to start writing rules for lexical selection. It documents a few approaches, with example rules in the constraint-based lexical selection format.
First approach
Choose your words
Before you start making lexical selection rules, first choose a word in your source language (e.g. English) which has more than one translation in your target language (e.g. Spanish). For example:
- argument → discusión
- argument → polémica
- argument → argumento
Think about context
The words around our word often help us decide how to translate it, for example, a verb might inform us of how to translate a noun, or a noun might inform us of how to translate an adjective.
If we say "to have an argument", then it probably means "discusión", whereas if we say "to accept the argument", then we probably want to translate it as "argumento".
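A minimal sketch of how these two contexts could be written as rules, in the same format as the other examples on this page. It assumes that the bilingual dictionary offers both discusión and argumento for "argument", that the verbs are analysed with the lemmas "have" and "accept", and that exactly one word (the determiner) stands between verb and noun. The enclosing `<rules>` element is assumed to be the root of the rule file.

```xml
<rules>
  <rule> <!-- to have an argument: discusión -->
    <match lemma="have"/>
    <match/> <!-- any single word, here the determiner -->
    <match lemma="argument" tags="n.*">
      <select lemma="discusión" tags="n.*"/>
    </match>
  </rule>
  <rule> <!-- to accept the argument: argumento -->
    <match lemma="accept"/>
    <match/> <!-- any single word, here the determiner -->
    <match lemma="argument" tags="n.*">
      <select lemma="argumento" tags="n.*"/>
    </match>
  </rule>
</rules>
```

The verb is given without tags so that the sketch does not depend on a particular tagset; a real rule would probably restrict it to verb readings.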
Think about synonyms and antonyms
Once you have a rule, one way of making it more general is to think of synonyms and antonyms for the context words. For example, if you have the rule:
<rule>
  <match lemma="positive" tags="*"/>
  <match lemma="charge" tags="n.*">
    <select lemma="carga" tags="n.*"/>
  </match>
</rule>
You could quite easily think that the antonym of "positive" is "negative", and add that too:
<rule>
  <or>
    <match lemma="positive" tags="*"/>
    <match lemma="negative" tags="*"/>
  </or>
  <match lemma="charge" tags="n.*">
    <select lemma="carga" tags="n.*"/>
  </match>
</rule>
If you have the rule:
<rule>
  <match lemma="wind"/>
  <match lemma="power" tags="n.*">
    <select lemma="energía" tags="n.*"/>
  </match>
</rule>
You might think that the translation of "power" as "energía" (instead of the default translation poder) applies in more contexts than just after "wind", for example "solar power" energía solar or "wave power" energía olamotriz. Extending the rule to other energy sources might give, for example:
<rule>
  <or>
    <match lemma="wind"/>
    <match lemma="solar"/>
    <match lemma="hydro"/>
    <match lemma="geothermal"/>
    <match lemma="tidal"/>
  </or>
  <match lemma="power" tags="n.*">
    <select lemma="energía" tags="n.*"/>
  </match>
</rule>
Thinking about these can also lead you to other rules: for example, "electrical power" is not energía eléctrica, it is potencia eléctrica.
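The "electrical power" case could be covered by a rule of the same shape. This is only a sketch: it assumes that potencia is listed in the bilingual dictionary as a translation of "power" and that the adjective is analysed with the lemma "electrical".

```xml
<rule> <!-- electrical power: potencia eléctrica -->
  <match lemma="electrical"/>
  <match lemma="power" tags="n.*">
    <select lemma="potencia" tags="n.*"/>
  </match>
</rule>
```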
Look at a concordance
A concordance (or "key word in context") is a set of sentences or sentence fragments, each centred on a single word (sometimes called the "key word"). To make a concordance you can use a concordancer (e.g. apertium-concord).
Here is an example from EuroParl:
represent, to the President and to the Governor of Texas, Mr Bush, who has the power to order a stay
We should do everything within our We should do everything within our power to force the
On the market, the balance of On the market, the balance of power between supply and
The scandalous concentration of The scandalous concentration of power in sectors of strategic
fact, retaining not only the The Commission is, in fact, retaining not only the power
to your questions about the nuclear Turning to your questions about the nuclear power stations in
financing required for improving the degree of efficiency and safety of nuclear power stations in certain
have the Mr President, it is clear that the European Union does not have the power to intervene in the
the balance of I therefore feel we must carefully consider the balance of power that we are in
the sea's Having spent a lot of time at sea myself I am well aware of the sea's power and destructive force,
The Commission is following with interest the planned construction of a nuclear power plant in Akkuyu, Turkey
siting, construction, commissioning, operation and decommissioning of nuclear power plants in Turkey rests
a serious risk that some idiot will decide that the new geopolitical balance of power in the Caucasus calls
in the development of that nuclear If we see in the development of that nuclear power
build a nuclear If the conclusion is that Turkey is planning to build a nuclear power plant that does not
none of the upheaval would have been caused had we not acted with parliamentary power to press for changes
yet the police are being forced into a position where they will not have the power to resist the terrorist
right to interfere in the formation of a government even though it has assumed power on the basis of
unusual about what is happening in Austria: there has been a changeover of power following democratic
for this Intergovernmental Conference to score a hat trick; that of the power to act, democratic
course, we need to create the At the same time of course, we need to create the power to act in order
The European Union must also have the The European Union must also have the power to act
...
If you do the concordance yourself, particularly interesting are the sequences: "nuclear power", "nuclear power station", "nuclear power plant", "parliamentary power", "sea's power", "balance of power", "concentration of power", "power to act", "motive power", "abuse of power", "decision-making power", "power supply", "come into power", "economic power", "power structures", "combined power and heat", "political power".
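Each of these sequences is a candidate context for a rule. As a sketch, "nuclear power" (energía nuclear) could be handled like the "wind power" rule above, assuming the adjective is analysed with the lemma "nuclear" and that energía is available as a translation of "power":

```xml
<rule> <!-- nuclear power: energía nuclear -->
  <match lemma="nuclear"/>
  <match lemma="power" tags="n.*">
    <select lemma="energía" tags="n.*"/>
  </match>
</rule>
```

Longer sequences such as "nuclear power station" may be better treated as multiword units than by lexical selection; this rule only covers the two-word context.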
Try a parallel corpus
By looking at a parallel corpus, you can see which contexts occur with one translation but not with another.
$ paste europarl-v6.es-en.en europarl-v6.es-en.es | grep ' power .* potencia '
Since the Union as a whole is a world-class fishing '''power''' and one of the largest markets for fish produce, ... Por ser la Unión en conjunto una '''potencia''' pesquera en el nivel mundial y uno de los mayores mercados de productos pesqueros, ...
Second approach
Another approach is to write rules to fix translation errors that you come across. In order to try this out, take a big text (for example a newspaper article), and run it through the translator.
For example, if we take this article (http://www.guardian.co.uk/politics/2011/dec/07/coalition-government-first-commons-defeat), the translation is pretty bad, but there are some places where lexical selection could improve the picture.
MPs who had spent almost six hours debating the state of the UK economy voted by 213 to 79, a majority of 134.
MPs Quién había gastado casi seis horas debatiendo el estado de la economía de UK votada por 213 a 79, una mayoría de 134.
In English, "spend" can have a number of meanings, among them "to pass time" pasar and "to pay money" gastar. In this case, the context demands the translation pasar, because the sentence is talking about time spent. So, we might make a rule like the following:
<rule> <!-- MPs spent almost six hours debating... -->
  <match lemma="spend" tags="vblex.*">
    <select lemma="pasar" tags="vblex.*"/>
  </match>
  <match/>
  <match/>
  <or>
    <match lemma="minute"/>
    <match lemma="hour"/>
    <match lemma="year"/>
  </or>
</rule>
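Note that each empty `<match/>` stands for exactly one word, so the rule above only fires when two words separate "spend" from the time unit ("spent almost six hours"). A sketch of a companion rule for a one-word gap such as "spent six hours", written under the same assumptions about lemmas and tags as the rule above:

```xml
<rule> <!-- spent six hours: one word between verb and time unit -->
  <match lemma="spend" tags="vblex.*">
    <select lemma="pasar" tags="vblex.*"/>
  </match>
  <match/>
  <or>
    <match lemma="minute"/>
    <match lemma="hour"/>
    <match lemma="year"/>
  </or>
</rule>
```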
Rule weighting
Rule weighting is a bit complicated, because, as the examples in this section show, the weights of all the rules that match are added together for each translation. Suppose you have the following input:
^a<pr>/to<pr>$ ^un<det><ind><f><sg>/a<det><ind><sg>$
^estació<n><f><sg>/station<n><sg>/season<n><sg>$
^llarg<adj><f><sg>/long<adj><sint>$
If you have the following rules:
<rule weight="1.0"><match lemma="estació" tags="n.*"><select lemma="station" tags="n.*"/></match></rule>
<rule weight="1.0"><match lemma="estació" tags="n.*"><select lemma="season" tags="n.*"/></match></rule>
Then both translations with no context have an equal weight. Which one is picked will depend on how the transducer is minimised (i.e. it is effectively non-deterministic). If you want one translation to have a higher weight than another, then you can just adjust the weight:
<rule weight="1.2"><match lemma="estació" tags="n.*"><select lemma="station" tags="n.*"/></match></rule>
<rule weight="0.8"><match lemma="estació" tags="n.*"><select lemma="season" tags="n.*"/></match></rule>
Now "station" will be picked, as 1.2 > 0.8. Now, what if we want to add some context? We can just add another rule:
<rule weight="1.2"><match lemma="estació" tags="n.*"><select lemma="station" tags="n.*"/></match></rule>
<rule weight="0.8"><match lemma="estació" tags="n.*"><select lemma="season" tags="n.*"/></match></rule>
<rule weight="1.0"><match lemma="estació" tags="n.*"><select lemma="season" tags="n.*"/></match>
  <or><match lemma="llarg"/><match lemma="curt"/></or></rule>
So, if we have the sequence above, "a una estació llarga", then the "season" translation will get 0.8 + 1.0 = 1.8 and the "station" translation will get 1.2, meaning that "season" will be picked.
But let's not stop there: the preposition can also give us some information, so let's add another rule:
<rule weight="1.0"><or><match lemma="a"/><match lemma="en"/></or><match/>
  <match lemma="estació" tags="n.*"><select lemma="station" tags="n.*"/></match></rule>
<rule weight="1.2"><match lemma="estació" tags="n.*"><select lemma="station" tags="n.*"/></match></rule>
<rule weight="0.8"><match lemma="estació" tags="n.*"><select lemma="season" tags="n.*"/></match></rule>
<rule weight="1.0"><match lemma="estació" tags="n.*"><select lemma="season" tags="n.*"/></match>
  <or><match lemma="llarg"/><match lemma="curt"/></or></rule>
Now, if we have the sequence "a una estació llarga", we would get:
- a _ estació → station = 1.0
- estació → station = 1.2
- estació → season = 0.8
- estació → season llarga = 1.0
So which translation would win? The "station" translation, because 1.0 + 1.2 > 0.8 + 1.0.
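Spelled out, and assuming (as this example does) that the weights of all matching rules for a translation are simply added, the totals are as follows. The notation w(·) is introduced here only for illustration:

```latex
\begin{align*}
w(\text{station}) &= 1.0 + 1.2 = 2.2 \\
w(\text{season})  &= 0.8 + 1.0 = 1.8 \\
w(\text{station}) &> w(\text{season})
\end{align*}
```

Under the same assumption, if you wanted the adjective context to win over the preposition context, you could raise the weight of the "llarg"/"curt" rule, for instance to 1.5, which would give "season" 0.8 + 1.5 = 2.3 against 2.2 for "station".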
Why do my rules not match?
First off, read the above about rule weighting (lrx-proc will not always pick the longest match; you have to give it a higher weight to make it the preferred one).
Also, note that <match tags="n.*"></match> never matches anything; you have to write it as <match tags="n.*"/> (see https://sourceforge.net/p/apertium/tickets/64/).
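As a purely syntactic illustration (the rule itself is hypothetical and not meant as a sensible selection rule), a context-only match should be written self-closing, while a match that contains a `<select>` keeps its opening and closing tags:

```xml
<rule>
  <!-- context word with no child elements: self-closing form -->
  <match tags="n.*"/>
  <!-- match with a select child: paired tags, as elsewhere on this page -->
  <match lemma="power" tags="n.*">
    <select lemma="energía" tags="n.*"/>
  </match>
</rule>
```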