Automatic postediting at GSoC 2018



Revision as of 18:34, 9 June 2018

== Related links ==

* Idea description
* Proposal for GSoC 2018
* https://github.com/deltamachine/naive-automatic-postediting

== Workplan ==

{| class="wikitable"
|-
! Week !! Dates !! To do
|-
| 1 || 14th May — 20th May || <s>Find and download the needed Russian - Ukrainian and Russian - Belarusian corpora, write scripts for preprocessing the data.</s>
|-
| 2 || 21st May — 27th May || <s>Learn to use bicleaner (https://github.com/sortiz/bicleaner), train a ru-uk classifier, preprocess the OpenSubtitles corpora, filter out loose translations.</s>
|-
| 3 || 28th May — 3rd June || <s>Continue to prepare the Russian - Ukrainian parallel corpus from OpenSubtitles, refactor the old apply_postedits.py code, make the old code work faster.</s>
|-
| 4 || 4th June — 10th June || Work on the old code, start to extract triplets.
|-
! colspan="3" | First evaluation, 11th June — 15th June
|-
| 5 || 11th June — 17th June ||
|-
| 6 || 18th June — 24th June ||
|-
| 7 || 25th June — 1st July ||
|-
| 8 || 2nd July — 8th July ||
|-
! colspan="3" | Second evaluation, 9th July — 13th July
|-
| 9 || 9th July — 15th July ||
|-
| 10 || 16th July — 22nd July ||
|-
| 11 || 23rd July — 29th July ||
|-
| 12 || 30th July — 5th August ||
|-
! colspan="3" | Final evaluation, 6th August — 14th August
|}


== Progress notes ==

==== Data preparation ====

'''Russian - Belarusian'''

<ul>
<li>Mediawiki: 2059 sentences - source, Apertium translated and postedited by humans (only bel -> rus)</li>
<li>Tatoeba: 1762 sentences - source, target and both ways Apertium translated (bel -> rus, rus -> bel)</li>
</ul>

Total number of sentences: 3821.

'''Russian - Ukrainian'''

<ul>
<li>Tatoeba: 6463 sentences - source, target and both ways Apertium translated (ukr -> rus, rus -> ukr)</li>
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from the OpenSubtitles2018 corpus, preprocessed with bicleaner, plus both ways Apertium translations (ukr -> rus, rus -> ukr)</li>
</ul>

Total number of sentences: 8463.
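The workplan's next step is extracting triplets from data like the above: a source sentence, its Apertium translation, and the postedited/reference target. A minimal loader for such a file might look like the sketch below; the tab-separated, one-sentence-per-line layout is an assumption for illustration, not the project's actual file format.

```python
# Hypothetical loader for a corpus file with one sentence per line,
# tab-separated as: source \t apertium_mt \t postedited_target.
# The tab-separated layout is an assumption, not the project's real format.

def load_triplets(lines):
    """Return (source, mt, postedited) tuples, skipping malformed lines."""
    triplets = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:  # keep only well-formed triplet lines
            triplets.append(tuple(parts))
    return triplets

sample = [
    "kot sidit\tkit sydyt\tkit sydyt'\n",
    "bad line without tabs\n",
]
print(load_triplets(sample))  # one valid triplet, malformed line dropped
```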

==== Code refactoring ====

Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'', were refactored. Both scripts now also run roughly 10 times faster: instead of calling Apertium separately for every subsegment, each script collects all subsegments of a sentence in one large file and translates/analyzes the whole file, so Apertium is called only twice per sentence (once for translating and once for analyzing).
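The batching optimisation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual ''learn_postedits.py''/''apply_postedits.py'' code: ''translate_batch'' is a hypothetical stand-in for one external Apertium invocation (e.g. piping a file through an Apertium mode), and the call counter simply records how many such invocations each approach would make.

```python
# Minimal sketch of the batching optimisation: translate_batch stands in
# for a single external Apertium call; it is a stub, not real translation.

def translate_batch(lines, call_counter):
    """Translate many subsegments in one (simulated) Apertium call."""
    call_counter.append(1)                     # one external invocation
    return [line.upper() for line in lines]    # stand-in for translation

def translate_each(subsegments, call_counter):
    """Old approach: one Apertium call per subsegment."""
    return [translate_batch([s], call_counter)[0] for s in subsegments]

def translate_all(subsegments, call_counter):
    """New approach: collect all subsegments and translate the batch once."""
    return translate_batch(subsegments, call_counter)

subsegments = ["kot", "sidit", "na okne"]

old_calls, new_calls = [], []
assert translate_each(subsegments, old_calls) == translate_all(subsegments, new_calls)
print(len(old_calls), len(new_calls))  # 3 1  (3 calls vs 1 call)
```

The same output is produced either way; the saving comes purely from replacing per-subsegment process invocations with one batched call, which is where the reported ~10x speed-up would come from.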