Difference between revisions of "Automatic postediting at GSoC 2018"
Deltamachine (talk | contribs) |
== Workplan ==
{|class=wikitable
|-
! Week !! Dates !! To do
|-
| 1 || 14th May - 20th May || <s>Find and download the needed Russian - Ukrainian and Russian - Belarusian corpora, write scripts for preprocessing the data.</s>
|-
| 2 || 21st May - 27th May || <s>Learn to use bicleaner (https://github.com/sortiz/bicleaner), train a ru-uk classifier, preprocess the OpenSubtitles corpora, filter out loose translations.</s>
|-
| 3 || 28th May - 3rd June || <s>Continue to prepare the Russian - Ukrainian parallel corpus from OpenSubtitles, refactor the old apply_postedits.py code, make the old code work faster.</s>
|-
| 4 || 4th June - 10th June || Work on the old code, start to extract triplets.
|-
! '''First evaluation, 11th June - 15th June''' !! colspan="2" align=left |
|-
| 5 || 11th June - 17th June ||
|-
| 6 || 18th June - 24th June ||
|-
| 7 || 25th June - 1st July ||
|-
| 8 || 2nd July - 8th July ||
|-
! '''Second evaluation, 9th July - 13th July''' !! colspan="2" align=left |
|-
| 9 || 9th July - 15th July ||
|-
| 10 || 16th July - 22nd July ||
|-
| 11 || 23rd July - 29th July ||
|-
| 12 || 30th July - 5th August ||
|-
! '''Final evaluation, 6th August - 14th August''' !! colspan="2" align=left |
|-
|}
Revision as of 09:27, 9 August 2018
Related links
https://github.com/deltamachine/naive-automatic-postediting
Progress notes
Data preparation
Russian - Belarusian
- Mediawiki: 2059 sentences (source, Apertium translation and human-postedited translation; bel -> rus only)
- Tatoeba: 1762 sentences (source, target and Apertium translations in both directions: bel -> rus, rus -> bel)
Total number of sentences: 3821.
Russian - Ukrainian
- Tatoeba: 6463 sentences (source, target and Apertium translations in both directions: ukr -> rus, rus -> ukr)
- OpenSubtitles: 2000 manually filtered and corrected source - target pairs from the OpenSubtitles2018 corpus, preprocessed with bicleaner, plus Apertium translations in both directions (ukr -> rus, rus -> ukr)
Total number of sentences: 8463.
Code refactoring
Two old scripts, learn_postedits.py and apply_postedits.py, were refactored. Both scripts now run approximately 10 times faster, because they collect all subsegments in one large file and translate/analyze the whole file at once. Instead of calling Apertium several times for every subsegment, the scripts now call it only twice (once for translating and once for analyzing) for all subsegments of a sentence.
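The batching idea can be sketched roughly as follows. This is only an illustration, not the code from the repository: the function names, the one-subsegment-per-line layout and the `bel-rus` default are assumptions, and it presumes that the translator keeps line boundaries intact.

```python
import subprocess
from typing import Callable, List


def apertium_translate(text: str, pair: str = "bel-rus") -> str:
    """Hypothetical wrapper: pipe `text` through the `apertium` CLI once."""
    result = subprocess.run(
        ["apertium", pair],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


def translate_batched(subsegments: List[str],
                      translate: Callable[[str], str]) -> List[str]:
    """Translate all subsegments with a single call to `translate`."""
    if not subsegments:
        return []
    # Assumption: one subsegment per line, no newlines inside a subsegment.
    batch = "\n".join(subsegments)
    lines = translate(batch).split("\n")
    if len(lines) != len(subsegments):
        raise ValueError("translator changed the number of lines")
    return lines
```

Passing the translator in as a parameter keeps the batching logic testable without Apertium installed; in real use it would be called as `translate_batched(subsegments, apertium_translate)`.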
Operations extraction
There were three attempts to extract postediting operations for each language pair, with threshold = 0.8 and -m, -M = (1, 3). The results are not very meaningful: the reason might lie in problems in learn_postedits.py and in the method itself, but this still needs to be checked carefully.
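To make the (1, 3) setting concrete, here is a small sketch of subsegment enumeration. It assumes that -m and -M are the minimum and maximum subsegment length in words, which these notes do not state explicitly; the function name and the whitespace tokenization are also illustrative choices, not taken from learn_postedits.py.

```python
from typing import List


def subsegments(sentence: str, min_len: int = 1, max_len: int = 3) -> List[str]:
    """List all contiguous word sequences of min_len..max_len words."""
    words = sentence.split()  # assumption: plain whitespace tokenization
    result = []
    for start in range(len(words)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(words):
                break  # longer spans from this start would run past the end
            result.append(" ".join(words[start:end]))
    return result
```

Under this reading, a three-word sentence yields six subsegments: three single words, two word pairs and the full sentence.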
New algorithm for operations extraction
Because the old algorithm gave meaningless results, a new algorithm was created. It is based on a custom alignment algorithm. It seems that the new code will work reasonably well on closely related languages, but I'm not sure about others. The code and rationale can be found here: https://github.com/deltamachine/naive-automatic-postediting/tree/master/new_alg
Classifying operations
A script for classifying extracted postedits (https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/extract_types.py) identifies three types of operations: potential monodix/bidix entries (when a pair doesn't have a translation for a given word), grammar mistakes (when Apertium chooses an incorrect form of the translated word) and other mistakes (for example, a potential lexical selection rule).
How it works:
1) It takes a file with postedit triplets (s, mt, pe).
2) If there is a '*' in mt, the algorithm adds the triplet to the "potential bidix entries" list.
3) If not, the script calculates the following metric:
- letters = number of letters in pe
- distance = Levenshtein distance between mt and pe
((letters - distance) / letters) * 100
If 50 <= this number < 100, the algorithm adds the triplet to the "grammar mistakes" list.
4) Otherwise, the algorithm checks whether mt != pe and, if so, adds the triplet to the "other mistakes" list.
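The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code of extract_types.py: the function names are made up, "number of letters" is taken to be the character length of pe, and a plain dynamic-programming Levenshtein distance is used.

```python
from typing import Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(mt: str, pe: str) -> float:
    """((letters - distance) / letters) * 100, with letters = len(pe)."""
    letters = len(pe)  # assumption: every character counts as a letter
    if letters == 0:
        return 0.0
    return (letters - levenshtein(mt, pe)) / letters * 100


def classify(triplet: Tuple[str, str, str]) -> Optional[str]:
    """Return the list name for a (s, mt, pe) triplet, or None if mt == pe."""
    s, mt, pe = triplet
    if "*" in mt:
        return "potential bidix entries"
    if 50 <= similarity(mt, pe) < 100:
        return "grammar mistakes"
    if mt != pe:
        return "other mistakes"
    return None
```

For example, for mt = 'дома' and pe = 'дому' the distance is 1 and letters is 4, so the metric is 75 and the triplet lands in "grammar mistakes".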
Cleaning
The postedit extraction algorithm is not perfect and extracts a lot of garbage along with potentially good triplets. The following script was written for cleaning files with postedits: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/clean_postedits.py. In the first step, it tags every part of every triplet using apertium-tagger and then drops triplets which contain punctuation. This helps to filter out wrong triplets such as (',', '*видець', ',').
Then it calculates the same metric as in the classifying step for three pairs: s and mt, mt and pe, s and pe. If every result is >= 30 and the triplet is not from the "other mistakes" list, the algorithm keeps the triplet; otherwise it drops it.
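A rough sketch of this filter is given below. It is not the code of clean_postedits.py: instead of apertium-tagger it uses a cruder Unicode-category check for punctuation (which would also drop hyphenated words), it ignores a leading '*' because that marks unknown words in Apertium output, and it assumes that for a pair (a, b) the metric uses the length of b as "letters". All names are illustrative.

```python
import unicodedata
from typing import Set, Tuple

Triplet = Tuple[str, str, str]


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Same metric as in the classifying step; letters = len(b) (assumed)."""
    if not b:
        return 0.0
    return (len(b) - levenshtein(a, b)) / len(b) * 100


def contains_punctuation(part: str) -> bool:
    """Stand-in for the tagger-based check: any Unicode punctuation char."""
    for token in part.split():
        core = token.lstrip("*")  # '*' only marks an unknown word
        if any(unicodedata.category(ch).startswith("P") for ch in core):
            return True
    return False


def keep_triplet(triplet: Triplet, other_mistakes: Set[Triplet],
                 threshold: float = 30) -> bool:
    """Keep a triplet only if it passes all cleaning checks."""
    s, mt, pe = triplet
    if any(contains_punctuation(part) for part in triplet):
        return False
    if triplet in other_mistakes:
        return False
    pairs = [(s, mt), (mt, pe), (s, pe)]
    return all(similarity(a, b) >= threshold for a, b in pairs)
```

With this sketch, the garbage triplet (',', '*видець', ',') is rejected at the punctuation check, before any similarity is computed.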