Automatic postediting at GSoC 2018
Revision as of 18:34, 9 June 2018
Related links
https://github.com/deltamachine/naive-automatic-postediting
Workplan
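Week | Dates | To do |
---|---|---|
1 | 14th May - 20th May | <s>Find and download needed Russian - Ukrainian and Russian - Belarusian corpora, write scripts for preprocessing the data.</s> |
2 | 21st May - 27th May | <s>Learn to use bicleaner (https://github.com/sortiz/bicleaner), train ru-uk classifier, preprocess OpenSubtitles corpora, filter out loose translations.</s> |
3 | 28th May - 3rd June | <s>Continue to prepare Russian - Ukrainian parallel corpus from OpenSubtitles, refactor the old apply_postedits.py code, make the old code work faster.</s> |
4 | 4th June - 10th June | Work on the old code, start to extract triplets. |
First evaluation, 11th June - 15th June | | |
5 | 11th June - 17th June | |
6 | 18th June - 24th June | |
7 | 25th June - 1st July | |
8 | 2nd July - 8th July | |
Second evaluation, 9th July - 13th July | | |
9 | 9th July - 15th July | |
10 | 16th July - 22nd July | |
11 | 23rd July - 29th July | |
12 | 30th July - 5th August | |
Final evaluation, 6th August - 14th August | | |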
Progress notes
Data preparation
Russian - Belarusian
- Mediawiki: 2059 sentences (source, Apertium translation, and human-postedited translation; bel -> rus only)
- Tatoeba: 1762 sentences (source, target, and Apertium translations in both directions: bel -> rus, rus -> bel)
Total number of sentences: 3821.
Russian - Ukrainian
- Tatoeba: 6463 sentences (source, target, and Apertium translations in both directions: ukr -> rus, rus -> ukr)
- OpenSubtitles: 2000 manually filtered and corrected source-target pairs from the OpenSubtitles2018 corpus, preprocessed with bicleaner, plus Apertium translations in both directions (ukr -> rus, rus -> ukr)
Total number of sentences: 8463.
Code refactoring
Two old scripts, learn_postedits.py and apply_postedits.py, were refactored. Both now run approximately 10 times faster. Previously, Apertium was called several times for every subsegment. Now the scripts collect the subsegments into one large file and pass the whole file to Apertium, so it is called only twice (once for translating and once for analyzing) for all subsegments of a sentence.
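The batching idea behind this speed-up can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function names `run_apertium` and `process_batch` are hypothetical, the `bel-rus` mode name is an assumption about the installed language pair, and the sketch assumes Apertium keeps one output line per input line.

```python
import subprocess
from typing import Callable, List


def run_apertium(text: str, mode: str = "bel-rus") -> str:
    """Pipe text through the apertium command once and return its output.

    The mode name is an assumption; use whichever mode is installed locally.
    """
    result = subprocess.run(
        ["apertium", mode],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def process_batch(
    subsegments: List[str],
    runner: Callable[[str], str] = run_apertium,
) -> List[str]:
    """Send all subsegments to the runner in a single call.

    Each subsegment is placed on its own line so that the output lines
    can be matched back to the inputs by position.
    """
    if not subsegments:
        return []
    # Newlines inside a subsegment would break the line-by-line alignment.
    cleaned = [s.replace("\n", " ").strip() for s in subsegments]
    output = runner("\n".join(cleaned) + "\n")
    # Drop only the final newline so empty trailing results are preserved.
    if output.endswith("\n"):
        output = output[:-1]
    lines = output.split("\n")
    if len(lines) != len(cleaned):
        raise ValueError(
            f"expected {len(cleaned)} output lines, got {len(lines)}"
        )
    return lines
```

Passing the runner as a parameter keeps the batching logic testable without Apertium installed. The same function could be reused for the analysis step by supplying a runner that invokes a morphological analysis mode instead of a translation mode.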