Automatic postediting at GSoC 2018

== Related links ==

[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]

https://github.com/deltamachine/naive-automatic-postediting

== Progress notes ==

==== Data preparation ====

'''Russian - Belarusian'''

<ul>
<li>Mediawiki: 2059 sentences - source, Apertium-translated and human-postedited versions (only bel -> rus)</li>
<li>Tatoeba: 1762 sentences - source, target and Apertium translations in both directions (bel -> rus, rus -> bel)</li>
</ul>

Total number of sentences: 3821.

'''Russian - Ukrainian'''

<ul>
<li>Tatoeba: 6463 sentences - source, target and Apertium translations in both directions (ukr -> rus, rus -> ukr)</li>
<li>OpenSubtitles: 2000 manually filtered and corrected source-target pairs from the OpenSubtitles2018 corpus, preprocessed with bicleaner, plus Apertium translations in both directions (ukr -> rus, rus -> ukr)</li>
</ul>

Total number of sentences: 8463.

==== Code refactoring ====

Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'', were refactored. Both scripts now also run approximately 10 times faster: they collect all subsegments of a sentence into one large file and translate/analyze that whole file at once. Instead of calling Apertium a few times for every subsegment, it is now called only twice (once for translating and once for analyzing) for all subsegments of a sentence.
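
To illustrate the batching idea, here is a minimal sketch (not the actual ''learn_postedits.py'' code) that translates all subsegments of a sentence with a single call to the ''apertium'' command; the pair name and the helper function are assumptions made for the example:

<pre>
import subprocess

def translate_batch(subsegments, pair='bel-rus'):
    """Translate all subsegments with one apertium call instead of one call each.

    Assumes the given Apertium pair is installed and that line boundaries
    are preserved by the pipeline (one subsegment per line).
    """
    batch = '\n'.join(subsegments)
    result = subprocess.run(['apertium', pair], input=batch,
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

# Example: three subsegments translated with a single process start-up.
print(translate_batch(['я', 'я ненавижу', 'я ненавижу спешить']))
</pre>

Morphological analysis of the subsegments can be batched in the same way with a second call.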

==== Operations extraction ====

There were three attempts to extract postediting operations for each language pair, with threshold = 0.8 and -m, -M = (1, 3). The results are not very meaningful: the reason might lie in problems in ''learn_postedits.py'' or in the method itself (but this should be checked carefully).

=== Toolbox ===

The toolbox and a step-by-step guide on how to use it: https://github.com/deltamachine/naive-automatic-postediting/tree/master/toolbox

==== New algorithm for operations extraction ====

Because the old algorithm produced meaningless results, a new algorithm was created. It is based on a custom alignment. It seems that the new code will work well on closely related languages, but I'm not sure about other pairs. The code can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/toolbox/new_learn_postedits_algorithm.py and the rationale here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/toolbox/rationale.md.

==== Classifying operations ====

A script for classifying extracted postedits (https://github.com/deltamachine/naive-automatic-postediting/blob/master/toolbox/extract_types.py) identifies three types of operations: potential monodix/bidix entries (when the pair does not have a translation for a given word), grammar mistakes (when Apertium chooses an incorrect form of a translated word) and other mistakes (for example, a potential lexical selection rule).

How it works (a minimal sketch of this logic follows the steps below):

1) It takes a file with postedit triplets (s, mt, pe).

2) If there is a '*' in mt, the algorithm adds the triplet to the "potential bidix entries" list.

3) If not, the script calculates the following metric:

''x = ((l - d) / l) * 100''

where l is the number of letters in pe and d is the Levenshtein distance between mt and pe.

If 50 <= x < 100, the algorithm adds the triplet to the "grammar mistakes" list.

4) Otherwise the algorithm checks whether mt != pe and, if so, adds the triplet to the "other mistakes" list.
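
A minimal Python sketch of the classification logic above (an illustration only, not the actual ''extract_types.py'' implementation; the helper function and return values are just for the example):

<pre>
def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def classify(s, mt, pe):
    """Return which list a postedit triplet (s, mt, pe) should go to."""
    if '*' in mt:                        # untranslated word marker
        return 'potential bidix entries'
    d = levenshtein(mt, pe)
    x = (len(pe) - d) / len(pe) * 100    # the ((l - d) / l) * 100 metric
    if 50 <= x < 100:
        return 'grammar mistakes'
    if mt != pe:
        return 'other mistakes'
    return None                          # mt already matches pe: nothing to learn

print(classify('я ненавижу', 'я *ненавижу', 'я ненавиджу'))  # -> potential bidix entries
</pre>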

==== Cleaning ====

The postedit extraction algorithm is not perfect and extracts a lot of garbage along with potentially good triplets. The following script was written for cleaning files with postedits: https://github.com/deltamachine/naive-automatic-postediting/blob/master/toolbox/clean_postedits.py. In the first step it tags every part of every triplet using apertium-tagger and drops triplets which contain punctuation. This helps filter out wrong triplets such as (',', '*видець', ',').

Then it calculates the same metric as in the classifying step between s and mt, mt and pe, and s and pe. If every result is >= 30 and the triplet is not from the "other mistakes" list, the algorithm keeps the triplet; otherwise it drops it. This helps filter out wrong alignment cases (a sketch of this filter follows below).
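
A minimal sketch of the second filtering step, reusing the levenshtein() helper from the previous sketch (not the actual ''clean_postedits.py'' code; the tagging and punctuation check are omitted, and exactly how the metric is oriented for each pair is an assumption):

<pre>
def similarity(a, b):
    """The ((l - d) / l) * 100 metric, with l = len(b) and d = Levenshtein(a, b)."""
    return (len(b) - levenshtein(a, b)) / len(b) * 100

def keep_triplet(s, mt, pe, other_mistakes):
    """Keep a triplet only if all pairwise similarities are >= 30
    and it was not classified as an 'other mistake'."""
    if (s, mt, pe) in other_mistakes:
        return False
    return all(similarity(a, b) >= 30
               for a, b in [(s, mt), (mt, pe), (s, pe)])
</pre>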

==== Inserting operations into a language pair: dictionary approach (under development) ====

A few helper scripts were written for inserting operations into a language pair.

===== Monodix/bidix entries =====

New monodix/bidix entries can be created from postedits in the following way:

1. First, ''create_entries_table.py'' takes a file with bidix postedits, splits it into source and target, analyzes both sides using UDPipe (for Belarusian and Ukrainian) or Mystem (for Russian), finds the lemma of every word, replaces the UD/Mystem tags with Apertium ones (a simplified sketch of this mapping follows after this list) and creates a file containing a table with rows of the form "source lemma - source Apertium tag - target lemma - target Apertium tag".

2. After that, the table should be checked manually: UDPipe/Mystem do not always determine the correct lemma for a word.

3. Then ''check_entries.py'' should be run on the created table. This script looks for the given lemmas/pairs of lemmas in the source/target/bidix dictionaries and creates a new table with information for every word.

4. Then the user should again manually edit the table and add a stem and a paradigm for every word which was not found in the dictionaries.

5. The last step is to run ''add_new_entries.py'' on the edited table. This script will create new entries, add them to the dictionaries and compile them.
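
A simplified sketch of the tag-mapping idea from step 1 (the mapping table, function name and example output are illustrations only, not the actual tables used by ''create_entries_table.py''):

<pre>
# Very small illustrative mapping from UD POS tags to Apertium-style tags.
UD_TO_APERTIUM = {'NOUN': 'n', 'VERB': 'vblex', 'ADJ': 'adj', 'ADV': 'adv'}

def entry_row(src_lemma, src_ud_pos, trg_lemma, trg_ud_pos):
    """Build one 'source lemma - source tag - target lemma - target tag' row."""
    return '{} - {} - {} - {}'.format(src_lemma, UD_TO_APERTIUM.get(src_ud_pos, '?'),
                                      trg_lemma, UD_TO_APERTIUM.get(trg_ud_pos, '?'))

# e.g. a bidix postedit where the Ukrainian word was untranslated:
print(entry_row('ненавидеть', 'VERB', 'ненавидіти', 'VERB'))
# -> ненавидеть - vblex - ненавидіти - vblex
</pre>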

==== Inserting operations into a language pair: separate module approach (under development) ====

A script for applying learned postedits to new sentences was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/toolbox/new_apply_postedits.py

It applies postedits to a given test MT file and creates an output file which contains the source (S), Apertium-translated (MT), algorithm-edited (ED) and target (T) sentences in the following format:

S я ненавижу спешить по утрам.

MT я *ненавижу *спешить по ранкам.

ED я ненавиджу поспішати по ранкам.

T я ненавиджу поспішати вранку.

For testing this approach a quick-and-dirty WER checking script was written. It takes the file created by ''new_apply_postedits.py'', collects all MT, ED and T sentences (if there are several ED variants for the same sentence, the first one is chosen) and runs apertium-eval-translator on (MT, T) and (ED, T), as sketched below.
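
A minimal sketch of that evaluation wrapper (not the actual script; it assumes apertium-eval-translator is on PATH and accepts -test and -ref file options, and the intermediate file names are invented for the example):

<pre>
import subprocess

def wer_report(output_file):
    """Split a new_apply_postedits.py-style output file into MT/ED/T files
    and run apertium-eval-translator on (MT, T) and (ED, T)."""
    sents = {'MT': [], 'ED': [], 'T': []}
    with open(output_file, encoding='utf-8') as f:
        for line in f:
            tag, _, text = line.partition(' ')
            # Keep only the first ED variant for each sentence: a second ED
            # for the current sentence would outrun the T lines seen so far.
            if tag == 'ED' and len(sents['ED']) > len(sents['T']):
                continue
            if tag in sents:
                sents[tag].append(text.strip())
    for name, lines in sents.items():
        with open(name + '.txt', 'w', encoding='utf-8') as f:
            f.write('\n'.join(lines) + '\n')
    for test in ('MT', 'ED'):
        # Flags assumed: -test for the translation, -ref for the reference.
        subprocess.run(['apertium-eval-translator',
                        '-test', test + '.txt', '-ref', 'T.txt'])
</pre>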

Here are the results on the test data (applying only the postedits from the "potential bidix entries" list learned on the training data).

{| class="wikitable" style="background-color: white; text-align: center; width: 70%;"
|-
|
|'''bel-rus'''
|'''rus-ukr'''
|-
|'''(MT, T) WER / position-independent WER'''
|42.48% / 38.74%
|47.25% / 40.78%
|-
|'''(ED, T) WER / position-independent WER'''
|40.50% / 36.76%
|44.09% / 37.36%
|}

A small experiment with Spanish - Catalan (again applying only the postedits from the "potential bidix entries" list learned on the training data).

{| class="wikitable" style="background-color: white; text-align: center; width: 70%;"
|-
|
|'''spa-cat'''
|-
|'''(MT, T) WER / position-independent WER'''
|22.49% / 15.03%
|-
|'''(ED, T) WER / position-independent WER'''
|22.44% / 14.98%
|}