Lextor

From Apertium
Latest revision as of 03:22, 9 March 2019

This module deals with lexical selection; for more information on the topic, see the main page.

  This module is currently not used and not under active development.

When the Apertium system is used to translate between languages less closely related than those dealt with in the first stages of the engine, lexical selection becomes significant: there are more cases, and more critical ones, in which a source language word can have more than one translation in the target language. For this reason we created a new module, the lexical selection module, which deals with this problem.

Before going into its characteristics, we will see the two ways in which the problem of multiple equivalence (the existence of more than one possible translation in the target language for a source language lexical form) is tackled in Apertium.

On the one hand, there is the situation where there is no big difference in meaning between the multiple equivalents in the target language, and choosing one or the other cannot lead to a translation error. We could say that these equivalents stand in a relation of synonymy or quasi-synonymy. In such a case, the linguist chooses one of the lemmas as the translation (generally the most frequent or usual one) and adds a direction restriction to the other lemmas (with the attributes LR or RL), so that they are translated in the opposite direction but not in the direction where there are multiple equivalents.

On the other hand, there is the case where there is a clear difference in meaning between the multiple equivalents, which can lead to translation errors if the inappropriate lemma is chosen. These are the cases dealt with by the new lexical selection module. The linguist has to encode entries with the attributes slr or srl described in the next section, thus identifying the different translation options; then the lexical selection module, by means of statistical methods, chooses the translation that is most suitable in a given context.

Sometimes it is not easy to decide whether a multiple equivalence situation should be solved in one way or the other. For example, if there is a difference in meaning between two or more lemmas in the target language, but we think that the lexical selection module will not be able to choose the right translation from the context, we follow the first method: choose a fixed translation (the most general one, suitable in the largest number of situations) and add a direction restriction to the remaining translations. In the other cases, we encode the entries so that the decision is left to the lexical selection module.

When we use an Apertium system without a lexical selection module, the only way to add entries with different possible translations is the first one, that is, choosing a single translation and marking the other equivalents with a direction restriction. If bilingual dictionaries with multiple translations, encoded with the attributes slr or srl, are used in a system that has no lexical selection module, a style sheet converts these entries designed for a lexical selection module into entries with direction restrictions LR or RL, so that one of the multiple equivalents (the one chosen as the default entry by the linguist) becomes the fixed translation of the source language lemma.

As examples of bilingual equivalences that should have a direction restriction, we can give the ca-es translation pairs encara – aún or todavía ("still") and sobtat – súbito or repentino ("sudden"), the first of which could be encoded like this:

<e r="LR">
   <p>
      <l>aún<s n="adv"/></l>
      <r>encara<s n="adv"/></r>
   </p>
</e>
<e>
    <p>
      <l>todavía<s n="adv"/></l>
      <r>encara<s n="adv"/></r>
    </p>
</e>
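
The effect of the direction restriction on these two entries can be sketched in Python. This is only an illustration: the tuple representation and the `translate` helper are assumptions made for this example and are not part of Apertium. An entry restricted with LR is used only when translating from left to right, and an unrestricted entry is used in both directions.

```python
# Each entry is (left, right, restriction); restriction is "LR", "RL" or None.
# This representation is hypothetical and only mirrors the two XML entries above.
ENTRIES = [
    ("aún", "encara", "LR"),
    ("todavía", "encara", None),
]


def translate(word, direction, entries=ENTRIES):
    """Return the translations of `word` in the given direction ("LR" or "RL")."""
    results = []
    for left, right, restriction in entries:
        # Skip entries restricted to the other direction.
        if restriction is not None and restriction != direction:
            continue
        if direction == "LR" and left == word:
            results.append(right)
        elif direction == "RL" and right == word:
            results.append(left)
    return results


print(translate("aún", "LR"))      # left to right: both Spanish words reach "encara"
print(translate("todavía", "LR"))
print(translate("encara", "RL"))   # right to left: only the unrestricted entry applies
```

With these entries, each source word has exactly one translation in each direction, which is what the compiled dictionaries require.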

As examples of the second case (multiple equivalents with a big difference in meaning) we have the es-ca pairs hoja – full or fulla ("sheet/leaf") and muñeca – nina or canell ("doll/wrist"), as well as the en-ca examples shown on page X, where it is described how to specify these multiple equivalents in the bilingual dictionary.


The next section describes the pre-processing that must be done on a bilingual dictionary containing more than one translation per entry (whether the system uses a lexical selector or not), and the section "Execution of the lexical selection module" describes how the lexical selector works and how it has to be trained.


Pre-processing of the bilingual dictionaries

Bilingual dictionaries have been modified to allow the specification of more than one translation per entry (see the documentation of the dictionary format to learn how to write such entries); this makes it necessary to pre-process these dictionaries, since the Apertium engine works with compiled dictionaries in which there is only one possible translation for each word.

The pre-processing of dictionaries is done automatically during compilation, so the end user does not need to take any specific action.

Pre-processing without lexical selection module

When bilingual dictionaries with multiple equivalents are used in a system where there is no lexical selection module, the pre-processing is done by applying the style sheet translate-to-default-equivalent.xsl. This style sheet turns dictionaries with multiple translations per entry into dictionaries with only one translation per entry; to do this, it chooses as translation the entry marked as default, and adds a direction restriction (LR or RL as applicable) to the other entries, so that they are only translated in the direction in which there are no multiple equivalents. The style sheet is called from the Makefile.

For example, the result of applying the style sheet to the first three entries shown on page X is the following:

<e>
   <p>
      <l>flat<s n="n"/></l>
      <r>pis<s n="n"/><s n="m"/></r>
   </p>
</e>

<e r="LR">
   <p>
      <l>floor<s n="n"/></l>
      <r>pis<s n="n"/><s n="m"/></r>
   </p>
</e>

<e r="RL">
   <p>
      <l>floor<s n="n"/></l>
      <r>terra<s n="n"/><s n="m"/></r>
   </p>
</e>
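
The reduction performed by the style sheet can be sketched in Python. This is a minimal illustration and not the XSLT itself: the tuple representation, the function name `to_default_equivalents`, and the two dictionaries of defaults are assumptions made for this example. The defaults used here (pis for floor, flat for pis) are inferred from the output shown above.

```python
from collections import defaultdict


def to_default_equivalents(entries, default_lr, default_rl):
    """Reduce (left, right) pairs to entries with at most one translation per direction.

    default_lr maps an ambiguous left word to its default right word;
    default_rl maps an ambiguous right word to its default left word.
    Returns (left, right, restriction) with restriction "LR", "RL" or None.
    """
    by_left = defaultdict(list)
    by_right = defaultdict(list)
    for left, right in entries:
        by_left[left].append(right)
        by_right[right].append(left)

    result = []
    for left, right in entries:
        # An entry stays usable in a direction if it is unambiguous there
        # or if it is the default equivalent for that direction.
        lr_ok = len(by_left[left]) == 1 or default_lr.get(left) == right
        rl_ok = len(by_right[right]) == 1 or default_rl.get(right) == left
        if lr_ok and rl_ok:
            restriction = None
        elif lr_ok:
            restriction = "LR"
        elif rl_ok:
            restriction = "RL"
        else:
            continue  # not the default in either direction: the entry is dropped
        result.append((left, right, restriction))
    return result


entries = [("flat", "pis"), ("floor", "pis"), ("floor", "terra")]
result = to_default_equivalents(entries, {"floor": "pis"}, {"pis": "flat"})
for entry in result:
    print(entry)
```

Under these assumed defaults the sketch yields the same three restrictions as the entries above: none for flat/pis, LR for floor/pis and RL for floor/terra.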

Pre-processing with lexical selection module

If the Apertium system works with a lexical selection module, the bilingual dictionary must be pre-processed in order to obtain:

  • a monolingual dictionary that, for each source language word (for example look) delivers all the possible translation marks or equivalents (look__mirar D and look__semblar); this dictionary will be used by the lexical selection module; and
  • a new bilingual dictionary that, given a word with the lexical selection already done (for example look__semblar) delivers the translation (semblar); this will be the bilingual dictionary to be used in the lexical transfer.
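
The two derived dictionaries can be pictured with a short Python sketch. The input structure and the function name `split_dictionary` are hypothetical; only the `word__translation` marks and the ` D` suffix for the default equivalent are taken from the text above.

```python
def split_dictionary(equivalents):
    """Build the two lookup tables described above.

    equivalents maps a source word to a list of (translation, is_default) pairs.
    Returns (mono, bil): mono maps each word to its translation marks,
    bil maps each translation mark to the actual translation.
    """
    mono = {}
    bil = {}
    for word, options in equivalents.items():
        marks = []
        for translation, is_default in options:
            mark = f"{word}__{translation}"
            bil[mark] = translation
            # The default equivalent carries a trailing " D" in the monolingual table.
            marks.append(mark + " D" if is_default else mark)
        mono[word] = marks
    return mono, bil


mono, bil = split_dictionary({"look": [("mirar", True), ("semblar", False)]})
print(mono["look"])          # translation marks offered to the lexical selector
print(bil["look__semblar"])  # translation delivered by the lexical transfer
```

The point of the split is that the lexical selector only has to choose among marks, while the lexical transfer sees exactly one translation per mark.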

This pre-processing is automatically done by means of the following software during dictionary compilation:

  • apertium-gen-lextormono, which receives three parameters:
    1. the translation direction for which you want to generate the monolingual dictionary used in the lexical selection; lr for the translation left to right, and rl for the translation right to left;
    2. the bilingual dictionary to be pre-processed; and
    3. the file where the output monolingual dictionary has to be written.
  • apertium-gen-lextorbil, which receives three parameters:
    1. the translation direction (lr or rl) for which you want to generate the bilingual dictionary to be used by the lexical transfer module;
    2. the bilingual dictionary to be pre-processed; and
    3. the file where the output bilingual dictionary has to be written.

Execution of the lexical selection module

The module responsible for the lexical selection runs after the part-of-speech tagger and before the structural transfer; therefore, it uses only information from the source language. However, during the training of the module, target language information is also used.

Training

To train the lexical selection module, a corpus in the source language and another one in the target language are required; they do not need to be related. Both corpora must be pre-processed before the training. This pre-processing, which consists of analysing the corpora and performing POS disambiguation, can be done with apertium-preprocess-corpus-lextor.

The training of the module that performs the lexical selection consists of the following tasks:[1]

  1. Obtain the list of words that will be ignored when performing lexical selection (stopwords). This list can be compiled manually or using apertium-gen-stopwords-lextor;
  2. Obtain the list of (source language) words that have more than one translation in the target language, using apertium-gen-wlist-lextor;
  3. Translate to the target language all the words obtained in the previous step, using apertium-gen-wlist-lextor-translation;
  4. Running apertium-lextor --trainwrd and using the target language pre-processed corpus, train a word co-occurrence model for the words obtained in the previous step;
  5. Running apertium-lextor --trainlch and using the source language pre-processed corpus, the dictionaries generated by the programs mentioned in the section "Pre-processing of the bilingual dictionaries" and the word co-occurrence models calculated in the previous step, train a co-occurrence model for each of the translation marks of those words that can have more than one translation in the target language.
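
To give an idea of what a word co-occurrence model is, the following Python sketch counts which words appear near a set of target words in a tokenised corpus, ignoring stopwords. It is a generic illustration only and does not reproduce the actual algorithm of apertium-lextor; the function name, the window size and the toy corpus are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, targets, stopwords, window=2):
    """Count the words that occur within `window` positions of each target word.

    sentences is a list of token lists; stopwords are removed before counting.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = [w for w in sentence if w not in stopwords]
        for i, word in enumerate(words):
            if word not in targets:
                continue
            start = max(0, i - window)
            context = words[start:i] + words[i + 1:i + 1 + window]
            counts[word].update(context)
    return counts


# Toy target language corpus (invented for illustration).
corpus = [
    ["arribar", "a", "la", "ciutat"],
    ["arribar", "a", "casa"],
]
model = cooccurrence_counts(corpus, targets={"arribar"}, stopwords={"a", "la"})
print(model["arribar"])
```

In this toy run, each of the two remaining context words is counted once for the target word.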

Use

The word co-occurrence models calculated for each translation mark as described in the previous section provide the information required to perform lexical selection with information from the context.
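
One simple way to picture selection from context is to score each translation mark by how often the words of the current context co-occurred with it during training, and to fall back to the default mark when the context gives no evidence. The Python sketch below follows that idea. It is an assumption-laden illustration and not the scoring actually used by apertium-lextor; the function name `choose_mark` and all counts are invented.

```python
def choose_mark(context_words, models, default):
    """Pick the translation mark whose co-occurrence counts best match the context.

    models maps each translation mark to a dict of word -> count.
    The default mark is returned when no mark scores above zero.
    """
    best_mark, best_score = default, 0
    for mark, counts in models.items():
        score = sum(counts.get(word, 0) for word in context_words)
        if score > best_score:
            best_mark, best_score = mark, score
    return best_mark


# Invented toy counts for three translation marks of the English verb "get".
models = {
    "get__arribar": {"city": 3, "centre": 2},
    "get__rebre": {"letter": 4},
    "get__aconseguir": {"job": 2},
}
chosen = choose_mark(["to", "the", "city", "centre"], models, default="get__aconseguir")
fallback = choose_mark([], models, default="get__aconseguir")
print(chosen, fallback)
```

With these toy counts the context words city and centre favour get__arribar, and an empty context falls back to the default mark.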

Lexical selection is done by apertium-lextor --lextor; the formats used to communicate with the rest of the modules of the translation engine are:

  • Input: text in the same format as the input for the structural transfer module, that is, text analysed and disambiguated, with invariable queues of multiwords moved before morphological tags.
  • Output: text in the same format, but with the translation mark to be used when executing lexical transfer.

The following example illustrates the input/output formats used by the lexical selector (we have assumed in the example that only the English verb get has more than one translation equivalent in the dictionaries):

  • Source language text (English): To get to the city centre
  • Lexical selector input: ^To<pr>$ ^get<vblex><inf>$ ^to<pr>$ ^the<det><def><sp>$ ^city<n><sg>$ ^centre<n><sg>$
  • Translation marks in the en-ca bilingual dictionary for the verb get: rebre, agafar, arribar, aconseguir D
  • Lexical selector output: ^To<pr>$ ^get__arribar<vblex><inf>$ ^to<pr>$ ^the<det><def><sp>$ ^city<n><sg>$ ^centre<n><sg>$
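
The change between input and output in this example can be reproduced with a few lines of Python. The sketch only rewrites the lemma of each lexical unit in the stream; the choice of translation is passed in as a plain dictionary, which stands in for the decision of the trained models. The function name `apply_marks` and the regular expression are assumptions made for this illustration.

```python
import re

# A lexical unit looks like ^lemma<tag1><tag2>$ in the stream format shown above.
LEXICAL_UNIT = re.compile(r"\^([^<$^]+)((?:<[^>]+>)+)\$")


def apply_marks(stream, choices):
    """Append the chosen translation mark to each lemma listed in `choices`."""
    def rewrite(match):
        lemma, tags = match.group(1), match.group(2)
        if lemma in choices:
            lemma = f"{lemma}__{choices[lemma]}"
        return f"^{lemma}{tags}$"

    return LEXICAL_UNIT.sub(rewrite, stream)


selector_input = (
    "^To<pr>$ ^get<vblex><inf>$ ^to<pr>$ "
    "^the<det><def><sp>$ ^city<n><sg>$ ^centre<n><sg>$"
)
selector_output = apply_marks(selector_input, {"get": "arribar"})
print(selector_output)
```

Only the lexical unit for get changes; all other units pass through untouched, as in the example above.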

Notes

  1. The training of the models used for the lexical selection has been automated in all the packages using it. Furthermore, all the software mentioned has its own UNIX manual page.

