At the moment, Apertium transfer rules are combinations of fixed-length patterns and actions. Conflicts are "solved" by selecting the first rule that matches. The idea of this task is to allow rules to conflict and to use weights to select the most adequate rule for a given input.
- Unlexicalised rule weights
  - Depend only on the category of the input words.
- Lexicalised rule weights
  - Depend on one or more of the lemmas of the input words, e.g.
ID | Rule | Input | Output | Frequency
1 | de → | memoria de traducción | translation memory | 90
2 | de → 's | memoria de traducción | translation's memory | 0
3 | de → of | memoria de traducción | memory of translation | 0
So here we would have something like:
- Rule 1 (x=memoria, y=traducción, weight=1.0)
- Rule 2 (x=memoria, y=traducción, weight=0.0)
- Rule 3 (x=memoria, y=traducción, weight=0.0)
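The weights listed here are consistent with normalising the frequencies over the rules competing for the same lemma pair (90 / 90 = 1.0). A minimal sketch in Python, assuming that normalisation and assuming that unseen lemma pairs fall back to unlexicalised weights. The function names, the fallback strategy and the unlexicalised numbers are all illustrative and not part of Apertium:

```python
def normalise(freqs):
    """Turn raw rule frequencies into weights that sum to 1."""
    total = sum(freqs.values())
    if total == 0:
        # No evidence at all: fall back to a uniform distribution.
        return {rule: 1.0 / len(freqs) for rule in freqs}
    return {rule: count / total for rule, count in freqs.items()}


# Frequencies from the table, keyed by rule ID.
lexicalised = {
    ("memoria", "traducción"): normalise({1: 90, 2: 0, 3: 0}),
}

# Hypothetical unlexicalised weights for the same pattern;
# these numbers are made up for illustration.
unlexicalised = {1: 0.2, 2: 0.3, 3: 0.5}


def best_rule(x, y):
    """Pick the highest-weighted rule for lemmas x and y."""
    weights = lexicalised.get((x, y), unlexicalised)
    return max(weights, key=weights.get)


print(lexicalised[("memoria", "traducción")])  # {1: 1.0, 2: 0.0, 3: 0.0}
print(best_rule("memoria", "traducción"))      # 1
print(best_rule("estado", "cuestión"))         # 3 (from the fallback weights)
```

Keying the table on the lemma pair keeps the lexicalised and unlexicalised cases behind the same lookup, so the rule selector does not need to know which kind of weight it received.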
Example
Transfer rules:
ID | Rule | Input | Output
1 | de → | memoria de traducción | translation memory
2 | de → 's | la hermana de mi novia | my girlfriend's sister
3 | de → of | el estado de la cuestión | the state of the art
Training
- Take a big corpus
- For each sentence:
  - Apply transfer rules
  - For each possible combination of transfer rules
    - Translate the sentence and score it with a language model
  - Each sentence gets a count of 1. This count is shared between the transfer rules.
- Example
Source: La canciller se reúne hoy con el presidente de EE UU para limar asperezas y preparar la cumbre del miércoles con Putin.
Rules | Translation | LM score | Share of count (%)
1 1 | The chancellor gathers today with [the U.S. president] for mend fences and prepare [the Wednesday summit] with Putin. | -74.55 | 0.39
2 1 | The chancellor gathers today with [the U.S.'s president] for mend fences and prepare [the Wednesday summit] with Putin. | -69.51 | 60.71
3 1 | The chancellor gathers today with [the president of the U.S.] for mend fences and prepare [the Wednesday summit] with Putin. | -74.47 | 0.43
1 2 | The chancellor gathers today with [the U.S. president] for mend fences and prepare [the Wednesday's summit] with Putin. | -75.02 | 0.25
2 2 | The chancellor gathers today with [the U.S.'s president] for mend fences and prepare [the Wednesday's summit] with Putin. | -69.98 | 37.94
3 2 | The chancellor gathers today with [the president of the U.S.] for mend fences and prepare [the Wednesday's summit] with Putin. | -74.94 | 0.27
1 3 | The chancellor gathers today with [the U.S. president] for mend fences and prepare [the summit of the Wednesday] with Putin. | -82.88 | 0.0
2 3 | The chancellor gathers today with [the U.S.'s president] for mend fences and prepare [the summit of the Wednesday] with Putin. | -77.84 | 0.01
3 3 | The chancellor gathers today with [the president of the U.S.] for mend fences and prepare [the summit of the Wednesday] with Putin. | -82.80 | 0.0
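The last column of the example is consistent with a softmax over the language-model scores, read as natural-log probabilities and expressed as percentages. The sketch below reproduces it under that assumption; `LM_SCORES` and `share_count` are names chosen here for illustration:

```python
import itertools
import math

# Language-model scores copied from the example, keyed by
# (rule used for "presidente de EE UU", rule used for "cumbre del miércoles").
LM_SCORES = {
    (1, 1): -74.55, (2, 1): -69.51, (3, 1): -74.47,
    (1, 2): -75.02, (2, 2): -69.98, (3, 2): -74.94,
    (1, 3): -82.88, (2, 3): -77.84, (3, 3): -82.80,
}


def share_count(scores):
    """Split one sentence count across rule combinations with a softmax."""
    best = max(scores.values())  # subtract the max for numerical stability
    expd = {combo: math.exp(s - best) for combo, s in scores.items()}
    total = sum(expd.values())
    return {combo: e / total for combo, e in expd.items()}


shares = share_count(LM_SCORES)

# Enumerate every combination of the three rules at the two ambiguous positions.
for combo in itertools.product((1, 2, 3), repeat=2):
    print(combo, f"{LM_SCORES[combo]:.2f}", f"{100 * shares[combo]:.2f}")
```

Because the shares sum to 1, each sentence contributes exactly one count no matter how many rule combinations it allows.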
You can then feed the fractional counts to some supervised machine learning program to get appropriate weights.
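As a rough stand-in for a supervised learner, the simplest possible estimator is a relative frequency: add each combination's share to every rule it uses, then normalise per lemma pair. The sketch below does this for the single example sentence. Crediting each rule with the full share of its combination is one interpretation of "shared between the transfer rules", and the lemma pairs in `POSITIONS` are an assumption made for illustration:

```python
from collections import defaultdict

# Fractional counts (in %) for one sentence, copied from the example.
# Keys are (rule at the first "de", rule at the second "de").
JOINT = {
    (1, 1): 0.39, (2, 1): 60.71, (3, 1): 0.43,
    (1, 2): 0.25, (2, 2): 37.94, (3, 2): 0.27,
    (1, 3): 0.00, (2, 3): 0.01, (3, 3): 0.00,
}
# Lemma pairs at the two ambiguous positions (illustrative assumption).
POSITIONS = [("presidente", "EE UU"), ("cumbre", "miércoles")]


def accumulate(counts, joint, positions):
    """Add each combination's share to every rule that combination uses."""
    for combo, share in joint.items():
        for lemmas, rule in zip(positions, combo):
            counts[lemmas][rule] += share / 100.0


def to_weights(counts):
    """Relative-frequency estimate: normalise the counts per lemma pair."""
    weights = {}
    for lemmas, per_rule in counts.items():
        total = sum(per_rule.values())
        weights[lemmas] = {r: c / total for r, c in per_rule.items()}
    return weights


counts = defaultdict(lambda: defaultdict(float))
accumulate(counts, JOINT, POSITIONS)  # call once per training sentence
weights = to_weights(counts)
for lemmas, per_rule in weights.items():
    print(lemmas, {r: round(w, 4) for r, w in sorted(per_rule.items())})
```

A real learner could replace `to_weights` and use the accumulated counts as weighted training instances, with the lemmas as features and the rule ID as the label.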
Questions
- How to calculate the paths?
  - With optimal coverage, or by just taking the left-to-right longest match (LRLM) and only calculating paths for rules which conflict?
- For lexicalised weights:
  - What is the function assigning cost to each lexical combination of N1 and N2?
  - Could we score a rule at a time, by keeping part fixed?
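For the first question, one way to restrict the path calculation to conflicting rules is to group rules by pattern: only patterns shared by more than one rule create alternative paths. The following is a sketch under that assumption; the rule list, the category names and the function names are hypothetical:

```python
from collections import defaultdict
from math import prod

# Hypothetical rules: (rule ID, pattern as a tuple of category names).
RULES = [
    (1, ("n", "pr", "n")),  # de -> (nothing)
    (2, ("n", "pr", "n")),  # de -> 's
    (3, ("n", "pr", "n")),  # de -> of
    (4, ("det", "n")),
    (5, ("n",)),
]


def conflict_sets(rules):
    """Group rule IDs by pattern; keep only patterns shared by several rules."""
    by_pattern = defaultdict(list)
    for rule_id, pattern in rules:
        by_pattern[pattern].append(rule_id)
    return {p: ids for p, ids in by_pattern.items() if len(ids) > 1}


def count_paths(matched_patterns, conflicts):
    """Number of rule combinations for one LRLM segmentation of a sentence."""
    return prod(len(conflicts.get(p, [None])) for p in matched_patterns)


conflicts = conflict_sets(RULES)
print(conflicts)  # {('n', 'pr', 'n'): [1, 2, 3]}

# Two ambiguous chunks and one unambiguous chunk: 3 * 1 * 3 combinations.
segmentation = [("n", "pr", "n"), ("det", "n"), ("n", "pr", "n")]
print(count_paths(segmentation, conflicts))  # 9
```

The number of paths grows as the product of the conflict-set sizes, which is why limiting enumeration to conflicting rules matters for long sentences.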
Tasks
- Implement in C++ and integrate into Apertium.
Coding challenge
- Write a program (in Python or C++) that reads the XML transfer format patterns and applies them to an input stream, printing out all the possible coverages using left-to-right longest match (so a "det" rule and a "noun" rule won't match "det noun" input if there are "det noun" rules).
- Write a program (in Python or C++) that reads the XML transfer format patterns and applies them to an input stream, printing out all the possible coverages, including alternatives where a combination of shorter rules matches the same input as a longer rule (so a "det" rule and a "noun" rule will be included in the combinations even if there are "det noun" rules).
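A possible starting point for both variants is sketched below. It is heavily simplified: the input is assumed to be already reduced to a list of category names rather than a real Apertium stream, only the `n` attributes of `pattern-item` elements are read from the rules, and words matched by no rule are skipped. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_patterns(xml_text):
    """Return one tuple of category names per <rule> in a transfer file.

    Assumes the layout <rule><pattern><pattern-item n="..."/></pattern></rule>.
    """
    root = ET.fromstring(xml_text)
    return [
        tuple(item.get("n") for item in rule.findall("./pattern/pattern-item"))
        for rule in root.iter("rule")
    ]


def matches_at(patterns, cats, i):
    """Indices of rules whose pattern matches the categories starting at i."""
    return [k for k, p in enumerate(patterns)
            if p and tuple(cats[i:i + len(p)]) == p]


def coverages(patterns, cats, lrlm=False, i=0):
    """Yield coverages as lists of (rule index, start, end).

    With lrlm=True only the longest matches at each position are followed;
    otherwise shorter rules are offered as alternatives too.
    """
    if i >= len(cats):
        yield []
        return
    hits = matches_at(patterns, cats, i)
    if not hits:
        # No rule applies here: skip the word (an assumption of this sketch).
        yield from coverages(patterns, cats, lrlm, i + 1)
        return
    if lrlm:
        longest = max(len(patterns[k]) for k in hits)
        hits = [k for k in hits if len(patterns[k]) == longest]
    for k in hits:
        end = i + len(patterns[k])
        for rest in coverages(patterns, cats, lrlm, end):
            yield [(k, i, end)] + rest


XML = """<transfer><section-rules>
  <rule><pattern><pattern-item n="det"/></pattern></rule>
  <rule><pattern><pattern-item n="noun"/></pattern></rule>
  <rule><pattern><pattern-item n="det"/><pattern-item n="noun"/></pattern></rule>
</section-rules></transfer>"""

patterns = read_patterns(XML)
print(list(coverages(patterns, ["det", "noun"], lrlm=True)))
# [[(2, 0, 2)]]
print(list(coverages(patterns, ["det", "noun"])))
# [[(0, 0, 1), (1, 1, 2)], [(2, 0, 2)]]
```

A full solution would also need to resolve each pattern item against the category definitions in the transfer file and to parse the real input stream format.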
See also