Automatic postediting at GSoC 2018

From Apertium

Related links

Idea description

Proposal for GSoC 2018

https://github.com/deltamachine/naive-automatic-postediting

Workplan

Week Dates To do
1 14th May - 20th May Find and download the required Russian - Ukrainian and Russian - Belarusian corpora; write scripts for preprocessing the data.
2 21st May - 27th May Learn to use bicleaner (https://github.com/sortiz/bicleaner), train a ru-uk classifier, preprocess the OpenSubtitles corpora, filter out loose translations.
3 28th May - 3rd June Continue preparing the Russian - Ukrainian parallel corpus from OpenSubtitles, refactor the old apply_postedits.py code, make the old code run faster.
4 4th June - 10th June Work on the old code, start extracting triplets.
First evaluation, 11th June - 15th June
5 11th June - 17th June
6 18th June - 24th June
7 25th June - 1st July
8 2nd July - 8th July
Second evaluation, 9th July - 13th July
9 9th July - 15th July
10 16th July - 22nd July
11 23rd July - 29th July
12 30th July - 5th August
Final evaluation, 6th August - 14th August


Progress notes

Data preparation

Russian - Belarusian

  • MediaWiki: 2059 sentences: source, Apertium translation and human-postedited translation (bel -> rus only)
  • Tatoeba: 1762 sentences: source, target and Apertium translations in both directions (bel -> rus, rus -> bel)

Total number of sentences: 3821.

Russian - Ukrainian

  • Tatoeba: 6463 sentences: source, target and Apertium translations in both directions (ukr -> rus, rus -> ukr)
  • OpenSubtitles: 2000 manually filtered and corrected source - target pairs from the OpenSubtitles2018 corpus, preprocessed with bicleaner, plus Apertium translations in both directions (ukr -> rus, rus -> ukr)

Total number of sentences: 8463.

Code refactoring

Two old scripts, learn_postedits.py and apply_postedits.py, were refactored. Both scripts now run approximately 10 times faster: they collect all subsegments in one large file and translate/analyze the whole file at once. Instead of calling Apertium several times for every subsegment, it is now called only twice (once for translating and once for analyzing) for all subsegments of a sentence.
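The batching idea can be illustrated with a minimal sketch. It is not the code from the repository: the function names run_apertium and translate_batch, the bel-rus pair name and the one-subsegment-per-line layout are assumptions made for illustration. The sketch only shows how many per-subsegment calls can be replaced by a single call.

```python
import subprocess
from typing import Callable, List


def run_apertium(text: str, pair: str = "bel-rus") -> str:
    """Send text through the `apertium` command in a single process call.

    Assumption: Apertium and the given language pair are installed locally.
    """
    result = subprocess.run(
        ["apertium", pair],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def translate_batch(
    subsegments: List[str],
    translate: Callable[[str], str] = run_apertium,
) -> List[str]:
    """Translate all subsegments with one call instead of one call each."""
    if not subsegments:
        return []
    # One subsegment per line, so the output can be split back afterwards.
    batch = "\n".join(subsegments)
    output_lines = translate(batch).rstrip("\n").split("\n")
    # Assumption: the translator keeps one output line per input line.
    if len(output_lines) != len(subsegments):
        raise ValueError("translator changed the number of lines")
    return [line.strip() for line in output_lines]
```

Passing the translator in as a parameter makes it possible to try the batching logic with a stand-in function (for example str.upper) on a machine without Apertium.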

Operations extraction

There were three attempts to extract postediting operations for each language pair: with threshold = 0.8 and -m, -M = (1, 3). The results are not very meaningful. The cause may lie in problems in learn_postedits.py or in the method itself, but this still needs to be checked carefully.
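To make the parameters more concrete, here is a rough sketch under stated assumptions: -m and -M are taken to be the minimum and maximum subsegment length in words, and the threshold is taken to be a similarity cutoff between two sentences. How learn_postedits.py actually interprets these options is not stated on this page. The similarity measure below (difflib's ratio over word lists) is a stand-in, and all function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def subsegments(sentence: str, min_len: int = 1, max_len: int = 3) -> List[str]:
    """Return all contiguous word spans of min_len..max_len words."""
    words = sentence.split()
    spans = []
    for length in range(min_len, max_len + 1):
        for start in range(len(words) - length + 1):
            spans.append(" ".join(words[start:start + length]))
    return spans


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1] (difflib ratio, a stand-in measure)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def passes_threshold(first: str, second: str, threshold: float = 0.8) -> bool:
    """Keep a sentence pair only if it is similar enough (assumed meaning)."""
    return similarity(first, second) >= threshold
```

With (1, 3), a three-word sentence yields six subsegments: three single words, two two-word spans and the full sentence.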