== Related links ==
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]

[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]

https://github.com/deltamachine/naive-automatic-postediting
== Progress notes ==
==== Data preparation ====
'''Russian - Belarusian'''

<ul>
<li>MediaWiki: 2059 sentences: source, Apertium translation and human postedit (bel -> rus only)</li>
<li>Tatoeba: 1762 sentences: source, target and Apertium translations in both directions (bel -> rus, rus -> bel)</li>
</ul>

Total number of sentences: 3821.

'''Russian - Ukrainian'''

<ul>
<li>Tatoeba: 6463 sentences: source, target and Apertium translations in both directions (ukr -> rus, rus -> ukr)</li>
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from the OpenSubtitles2018 corpus, preprocessed with bicleaner, plus Apertium translations in both directions (ukr -> rus, rus -> ukr).</li>
</ul>

Total number of sentences: 8463.
==== Code refactoring ====
Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'', were refactored. Both now run approximately 10 times faster: the scripts collect all subsegments into one large file and translate/analyze the whole file at once. Instead of calling Apertium several times for every subsegment, it is now called only twice (once for translation and once for analysis) for all subsegments of a sentence.
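A minimal sketch of the batching idea (the function name and the language-pair argument are illustrative assumptions, not the actual ''learn_postedits.py'' code):

<pre>
# Sketch: translate all subsegments with a single Apertium call instead of
# one call per subsegment. Names and the language pair are just examples.
import subprocess

def translate_subsegments(subsegments, pair='bel-rus'):
    text = '\n'.join(subsegments)
    # the 'apertium' wrapper reads plain text from stdin and writes the
    # translation to stdout, so one process handles the whole batch
    result = subprocess.run(['apertium', pair], input=text,
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
</pre>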

==== Operations extraction ====
There were three attempts to extract postediting operations for each language pair, with threshold = 0.8 and -m, -M = (1, 3).
The results are not very meaningful: the cause may lie in problems in ''learn_postedits.py'' or in the method itself (this still needs to be checked carefully).

=== Toolbox ===
The toolbox and a step-by-step guide on how to use it: https://github.com/deltamachine/naive-automatic-postediting/tree/master/toolbox

==== New algorithm for operations extraction ====
Because the old algorithm produced meaningless results, a new algorithm was created. It is based on a custom alignment. The new code seems to work reasonably well on closely related languages, but I'm not sure about others. The code can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/toolbox/new_learn_postedits_algorithm.py and the rationale can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/toolbox/rationale.md.
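For illustration only (this is not the project's algorithm; see rationale.md for that), a toy word-level alignment for closely related languages could simply pair each MT word with the most string-similar postedited word:

<pre>
# Toy illustration of a similarity-based word alignment; the real
# new_learn_postedits_algorithm.py works differently (see rationale.md).
from difflib import SequenceMatcher

def align(mt_words, pe_words):
    pairs = []
    for mt in mt_words:
        best = max(pe_words, key=lambda pe: SequenceMatcher(None, mt, pe).ratio())
        pairs.append((mt, best))
    return pairs
</pre>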

==== Classifying operations ====
A script for classifying extracted postedits (https://github.com/deltamachine/naive-automatic-postediting/blob/master/toolbox/extract_types.py) identifies three types of operations: potential monodix/bidix entries (when the pair does not have a translation for a given word), grammar mistakes (when Apertium chooses an incorrect form of a translated word) and other mistakes (which may indicate, for example, a potential lexical selection rule).

How it works (a code sketch follows the steps):

1) It takes a file with postedit triplets (s, mt, pe).

2) If there is a '*' in mt, the algorithm adds the triplet to the "potential bidix entries" list.

3) If not, the script calculates the following metric:

''x = ((l - d) / l) * 100''

where l is the number of letters in pe and d is the Levenshtein distance between mt and pe.

If 50 <= x < 100, the algorithm adds the triplet to the "grammar mistakes" list.

4) Otherwise the algorithm checks whether mt != pe and, if so, adds the triplet to the "other mistakes" list.
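A minimal sketch of this classification rule (the Levenshtein helper is a plain reimplementation; the actual ''extract_types.py'' may differ in details):

<pre>
# Sketch of the classification rule described above.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def classify(s, mt, pe):
    if '*' in mt:
        return 'potential bidix entries'
    l = len(pe)
    d = levenshtein(mt, pe)
    x = ((l - d) / l) * 100 if l else 0
    if 50 <= x < 100:
        return 'grammar mistakes'
    if mt != pe:
        return 'other mistakes'
    return None  # mt == pe: nothing to learn from this triplet
</pre>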

==== Cleaning ====
The postedit extraction algorithm is not perfect and extracts a lot of garbage along with potentially good triplets. The following script was written to clean files with postedits: https://github.com/deltamachine/naive-automatic-postediting/blob/master/toolbox/clean_postedits.py. In the first step it tags every part of every triplet with apertium-tagger and drops triplets that contain punctuation. This helps filter out wrong triplets such as (',', '*видець', ',').

Then it calculates the same metric as in the classifying step between s and mt, mt and pe, and s and pe. If every result is >= 30 and the triplet is not from the "other mistakes" list, the algorithm keeps the triplet; otherwise it drops it. This helps filter out incorrectly aligned cases.
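A sketch of this second filtering step (it assumes the python-Levenshtein package for the edit distance and takes l as the length of the second string in each pair; the actual ''clean_postedits.py'' may compute it slightly differently):

<pre>
# Sketch of the similarity filter; assumes python-Levenshtein is installed.
import Levenshtein

def similarity(a, b):
    l = len(b)
    return ((l - Levenshtein.distance(a, b)) / l) * 100 if l else 0

def keep_triplet(s, mt, pe, other_mistakes):
    scores = (similarity(s, mt), similarity(mt, pe), similarity(s, pe))
    return all(x >= 30 for x in scores) and (s, mt, pe) not in other_mistakes
</pre>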

==== Inserting operations into a language pair: dictionary approach (under development) ====
For inserting operations into a language pair, a few helper scripts were written.

===== Monodix/bidix entries =====
New monodix/bidix entries can be created from postedits in the following way (a small sketch of the resulting entry format follows the list):

1. First, ''create_entries_table.py'' takes a file with bidix postedits, splits it into source and target, analyzes both sides with UDPipe (for Belarusian and Ukrainian) or Mystem (for Russian), finds the lemma of every word, replaces the UD/Mystem tags with Apertium ones and creates a file containing a table with rows of the form "source lemma - source Apertium tag - target lemma - target Apertium tag".

2. After that, the table should be checked manually: UDPipe/Mystem do not always determine the correct lemma for a word.

3. Then ''check_entries.py'' should be run on the created table. This script looks for the given lemmas/pairs of lemmas in the source/target/bidix dictionaries and creates a new table with information for every word.

4. Then the user should again manually edit the table and add a stem and a paradigm for every word that was not found in the dictionaries.

5. The last step is to run ''add_new_entries.py'' on the edited table. This script creates the new entries, adds them to the dictionaries and compiles them.
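As an illustration of what the generated entries could look like, here is a minimal sketch of turning one table row into a bidix entry string; the exact output of ''add_new_entries.py'' may differ, and the example lemmas and tag are made up:

<pre>
# Sketch: build a bidix entry string from one row of the entries table.
def bidix_entry(src_lemma, src_tag, trg_lemma, trg_tag):
    return (f'<e><p><l>{src_lemma}<s n="{src_tag}"/></l>'
            f'<r>{trg_lemma}<s n="{trg_tag}"/></r></p></e>')

# bidix_entry('кніга', 'n', 'книга', 'n') ->
# '<e><p><l>кніга<s n="n"/></l><r>книга<s n="n"/></r></p></e>'
</pre>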

==== Inserting operations into a language pair: separate module approach (under development) ====
A script for applying learned postedits to new sentences was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/toolbox/new_apply_postedits.py

It applies postedits to a given test MT file and creates an output file which contains the source (S), Apertium-translated (MT), algorithm-edited (ED) and target (T) sentences in the following format:


S я ненавижу спешить по утрам.

MT я *ненавижу *спешить по ранкам.

ED я ненавиджу поспішати по ранкам.

T я ненавиджу поспішати вранку.

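A minimal sketch of the replacement idea (the data format here, a list of (mt_fragment, pe_fragment) pairs, is a simplifying assumption; ''new_apply_postedits.py'' itself is more involved):

<pre>
# Sketch: apply learned postedits as plain substring replacements.
def apply_postedits(sentence, postedits):
    for mt_fragment, pe_fragment in postedits:
        sentence = sentence.replace(mt_fragment, pe_fragment)
    return sentence

# apply_postedits('я *ненавижу *спешить по ранкам.',
#                 [('*ненавижу', 'ненавиджу'), ('*спешить', 'поспішати')])
# -> 'я ненавиджу поспішати по ранкам.'
</pre>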
For testing this approach, a quick-and-dirty WER checking script was written. It takes the file created by ''new_apply_postedits.py'', collects all MT, ED and T sentences (if there are several ED lines for the same sentence, the first one is chosen) and runs apertium-eval-translator on (MT, T) and (ED, T).
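A sketch of that evaluation step (assuming the usual -test/-ref options of apertium-eval-translator and one S/MT/ED/T block per sentence, as in the example above):

<pre>
# Sketch: collect MT, ED and T lines and run apertium-eval-translator twice.
import subprocess

def evaluate(output_file):
    lines = {'MT': [], 'ED': [], 'T': []}
    last_tag = None
    for line in open(output_file, encoding='utf-8'):
        tag, _, text = line.partition(' ')
        if tag == 'ED' and last_tag == 'ED':
            continue  # several ED lines for one sentence: keep only the first
        if tag in lines:
            lines[tag].append(text.strip())
        if tag in ('S', 'MT', 'ED', 'T'):
            last_tag = tag
    for name in ('MT', 'ED', 'T'):
        with open(name + '.txt', 'w', encoding='utf-8') as f:
            f.write('\n'.join(lines[name]) + '\n')
    for name in ('MT', 'ED'):
        subprocess.run(['apertium-eval-translator',
                        '-test', name + '.txt', '-ref', 'T.txt'])
</pre>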

Here are the results on the test data (applying only the postedits from the "potential bidix entries" list learned on the training data).

{| class = "wikitable" style = "background-color: white; text-align: center; width: 70%;"
|-
|
|'''bel-rus'''
|'''rus-ukr'''
|-
|'''(MT, T) WER / position-independent WER'''
|42.48% / 38.74%
|47.25% / 40.78%
|-
|'''(ED, T) WER / position-independent WER'''
|40.50% / 36.76%
|44.09% / 37.36%
|-
|}


A small experiment with Spanish - Catalan (applying only the postedits from the "potential bidix entries" list learned on the training data).

{| class = "wikitable" style = "background-color: white; text-align: center; width: 70%;"
|-
|
|'''spa-cat'''
|-
|'''(MT, T) WER / position-independent WER'''
|22.49% / 15.03%
|-
|'''(ED, T) WER / position-independent WER'''
|22.44% / 14.98%
|-
|}
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Progress notes ==<br />
==== Data preparation ====<br />
'''Russian - Belarusian''' <br />
<br />
<ul><br />
<li>Mediawiki: 2059 sentences - source, Apertium translated and postedited by humans (only bel -> rus)</li><br />
<li>Tatoeba: 1762 sentences: source, target and both ways Apertium translated (bel -> rus, rus -> bel)</li><br />
</ul><br />
<br />
Total amount of sentences: 3821.<br />
<br />
'''Russian - Ukranian'''<br />
<br />
<ul><br />
<li>Tatoeba: 6463 sentences - source, target and both ways Apertium translated (ukr -> rus, rus -> ukr)</li><br />
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus, rus -> ukr).</li><br />
</ul> <br />
<br />
Total amount of sentences: 8463.<br />
==== Code refactoring ====<br />
Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'' were refactored. Also now both scripts work approximately 10 times faster: now scripts collect all subsegments in one large file and translate/analyze the whole file. Instead of calling Apertium few times for every subsegment, now it is called only two times (for translating and for analyzing) for all subsegments of a sentence.<br />
<br />
==== Operations extraction ====<br />
There were three attempts to extract postediting operations for each language pair: with threshold = 0.8 and -m, -M = (1, 3).<br />
In fact, results are not very meaningful: the reason might lie in problems in ''learn_postedits.py'' and in the method itself (but it should be checked carefully).<br />
<br />
=== Toolbox ===<br />
The toolbox and step-to-step guide about how to use it: https://github.com/deltamachine/naive-automatic-postediting/tree/master/new_alg<br />
<br />
==== New algorithm for operations extraction ====<br />
Because of meaningless results of using the old algorithm, the new algorithm was created. It is based on the custom alignment. It seems that the new code will work okay on close-related languages, but I'm not sure about others. The code can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/new_learn_postedits_algorithm.py and the rationale can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/rationale.md.<br />
<br />
==== Classifying operations ====<br />
A script for classifying extracted postedits (https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/extract_types.py) indentifies three types of operations: potential monodix/bidix entries (when a pair doesn't have a translation for a given word), grammar mistakes (when Apertium chooses incorrect form of translated word) and other mistakes (it can be, for example, a potential lexical selection rule).<br />
<br />
How it works:<br />
<br />
1) It takes file with postedit triplets (s, mt, pe).<br />
<br />
2) If here is '*' in mt, algorithm adds triplet to "potential bidix entries" list.<br />
<br />
3) If not, the script calculates the following metric:<br />
<br />
''x = ((l - d) / l) * 100''<br />
<br />
where l = number of letters in pe and d = Levenshtein distance betweeen mt and pe.<br />
<br />
If 50 <= x < 100, the algorithm adds triplet to "grammar mistakes" list.<br />
<br />
4) Otherwise the algorithm checks, if mt != pe, and if not, adds triplet to "other mistakes" list.<br />
<br />
==== Cleaning ====<br />
The extracting postedits algorithm is not perfect and extracts a lot of garbage along with potentially good triplets. For cleaning files with postedits a following script was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/clean_postedits.py. On the first step it tags every part of every triplet using apertium-tagger and then drops triplets which contain punctuation. It helps filter out wrong triplets like, for example, (',', '*видець', ',').<br />
<br />
Then it calculates the same metric as in classifying step between s and mt, mt and pe, s and pe. If every result >= 30 and triplet is not from "other mistakes" list, the algorithm saves this triplet, if not - drops it. It helps filter out wrong alignment cases.<br />
<br />
==== Inserting operations into a language pair: dictionary approach (under development) ====<br />
For inserting operations into a language pair, a few helper scripts were written.<br />
<br />
===== Monodix/bidix entries =====<br />
New monodix/bidix entries can be created from postedits in the following way:<br />
<br />
1. Firstly, ''create_entries_table.py'' takes a file with bidix postedits, splits it in source and target, then analyzes both sides using UDPipe (in case of Belarusian and Ukranian) or Mystem (in case of Russian), finds a lemma of every word, replaces UD/Mystem tages with Apertium ones and then create a file, which contains table with rows "source lemma - source Apertium tag - target lemma - target Apertium tag".<br />
<br />
2. After that, the table should be manually checked: UDPipe/Mystem not always determine a correct/lemma for a word.<br />
<br />
3. Then ''check_entries.py'' should be run on the created table. This script looks for the given lemmas/pair of lemmas in source/target/bidix dictionaries and creates a new table with information for every word.<br />
<br />
4. Then user should again manually edit the table and add a stem and a paradigm for every word, which was not found in dictionaries.<br />
<br />
5. The last step is to run ''add_new_entries.py'' on the edited table. This script will create new antries, add them to the dictionaries and compile them.<br />
<br />
==== Inserting operations into a language pair: separate module approach (under development) ====<br />
A script for applying learned postedits to new sentences was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/new_apply_postedits.py<br />
<br />
It applies postedits to a given test MT file and creates an output file which contains source (S), Apertium translated (MT), edited by algorithm (ED) and target sentences (T) in a following format:<br />
<br />
<br />
S я ненавижу спешить по утрам.<br />
<br />
MT я *ненавижу *спешить по ранкам.<br />
<br />
ED я ненавиджу поспішати по ранкам.<br />
<br />
T я ненавиджу поспішати вранку.<br />
<br />
<br />
For testing this approach a fast-and-dirty WER checking script was written. It takes file which was created by ''new_apply_postedits.py'', collects all MT, ED and T sentences (in case of few ED for the same sentence, the first one is chosen) and runs apertium-eval-translator on (MT, T) and (ED, T).<br />
<br />
Here are the results for test data (with applying only the postedits from "potential bidix entries" list learned on train data).<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 70%;"<br />
|-<br />
|<br />
|'''bel-rus'''<br />
|'''rus-ukr'''<br />
|-<br />
|'''(MT, T) WER / position-independent WER'''<br />
|42.48% / 38.74%<br />
|47.25% / 40.78%<br />
|-<br />
|'''(ED, T) WER / position independent WER'''<br />
|40.50% / 36.76%<br />
|44.09 / 37.36%<br />
|-<br />
|}<br />
<br />
<br />
Little experiment with Spanish - Catalan (with applying only the postedits from "potential bidix entries" list learned on train data).<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 70%;"<br />
|-<br />
|<br />
|'''spa-cat'''<br />
|-<br />
|'''(MT, T) WER / position-independent WER'''<br />
|22.49% / 15.03%<br />
|-<br />
|'''(ED, T) WER / position independent WER'''<br />
|22.44% / 14.98%<br />
|-<br />
|}</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67530Automatic postediting at GSoC 20182018-08-14T11:27:05Z<p>Deltamachine: /* Inserting operations into a language pair: separate module approach */</p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Progress notes ==<br />
==== Data preparation ====<br />
'''Russian - Belarusian''' <br />
<br />
<ul><br />
<li>Mediawiki: 2059 sentences - source, Apertium translated and postedited by humans (only bel -> rus)</li><br />
<li>Tatoeba: 1762 sentences: source, target and both ways Apertium translated (bel -> rus, rus -> bel)</li><br />
</ul><br />
<br />
Total amount of sentences: 3821.<br />
<br />
'''Russian - Ukranian'''<br />
<br />
<ul><br />
<li>Tatoeba: 6463 sentences - source, target and both ways Apertium translated (ukr -> rus, rus -> ukr)</li><br />
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus, rus -> ukr).</li><br />
</ul> <br />
<br />
Total amount of sentences: 8463.<br />
==== Code refactoring ====<br />
Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'' were refactored. Also now both scripts work approximately 10 times faster: now scripts collect all subsegments in one large file and translate/analyze the whole file. Instead of calling Apertium few times for every subsegment, now it is called only two times (for translating and for analyzing) for all subsegments of a sentence.<br />
<br />
==== Operations extraction ====<br />
There were three attempts to extract postediting operations for each language pair: with threshold = 0.8 and -m, -M = (1, 3).<br />
In fact, results are not very meaningful: the reason might lie in problems in ''learn_postedits.py'' and in the method itself (but it should be checked carefully).<br />
<br />
==== New algorithm for operations extraction ====<br />
Because of meaningless results of using the old algorithm, the new algorithm was created. It is based on the custom alignment. It seems that the new code will work okay on close-related languages, but I'm not sure about others. The code can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/new_learn_postedits_algorithm.py and the rationale can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/rationale.md.<br />
<br />
==== Classifying operations ====<br />
A script for classifying extracted postedits (https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/extract_types.py) indentifies three types of operations: potential monodix/bidix entries (when a pair doesn't have a translation for a given word), grammar mistakes (when Apertium chooses incorrect form of translated word) and other mistakes (it can be, for example, a potential lexical selection rule).<br />
<br />
How it works:<br />
<br />
1) It takes file with postedit triplets (s, mt, pe).<br />
<br />
2) If here is '*' in mt, algorithm adds triplet to "potential bidix entries" list.<br />
<br />
3) If not, the script calculates the following metric:<br />
<br />
''x = ((l - d) / l) * 100''<br />
<br />
where l = number of letters in pe and d = Levenshtein distance betweeen mt and pe.<br />
<br />
If 50 <= x < 100, the algorithm adds triplet to "grammar mistakes" list.<br />
<br />
4) Otherwise the algorithm checks, if mt != pe, and if not, adds triplet to "other mistakes" list.<br />
<br />
==== Cleaning ====<br />
The extracting postedits algorithm is not perfect and extracts a lot of garbage along with potentially good triplets. For cleaning files with postedits a following script was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/clean_postedits.py. On the first step it tags every part of every triplet using apertium-tagger and then drops triplets which contain punctuation. It helps filter out wrong triplets like, for example, (',', '*видець', ',').<br />
<br />
Then it calculates the same metric as in classifying step between s and mt, mt and pe, s and pe. If every result >= 30 and triplet is not from "other mistakes" list, the algorithm saves this triplet, if not - drops it. It helps filter out wrong alignment cases.<br />
<br />
==== Inserting operations into a language pair: dictionary approach ====<br />
For inserting operations into a language pair, a few helper scripts were written.<br />
<br />
===== Monodix/bidix entries =====<br />
New monodix/bidix entries can be created from postedits in the following way:<br />
<br />
1. Firstly, ''create_entries_table.py'' takes a file with bidix postedits, splits it in source and target, then analyzes both sides using UDPipe (in case of Belarusian and Ukranian) or Mystem (in case of Russian), finds a lemma of every word, replaces UD/Mystem tages with Apertium ones and then create a file, which contains table with rows "source lemma - source Apertium tag - target lemma - target Apertium tag".<br />
<br />
2. After that, the table should be manually checked: UDPipe/Mystem not always determine a correct/lemma for a word.<br />
<br />
3. Then ''check_entries.py'' should be run on the created table. This script looks for the given lemmas/pair of lemmas in source/target/bidix dictionaries and creates a new table with information for every word.<br />
<br />
4. Then user should again manually edit the table and add a stem and a paradigm for every word, which was not found in dictionaries.<br />
<br />
5. The last step is to run ''add_new_entries.py'' on the edited table. This script will create new antries, add them to the dictionaries and compile them.<br />
<br />
==== Inserting operations into a language pair: separate module approach ====<br />
A script for applying learned postedits to new sentences was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/new_apply_postedits.py<br />
<br />
It applies postedits to a given test MT file and creates an output file which contains source (S), Apertium translated (MT), edited by algorithm (ED) and target sentences (T) in a following format:<br />
<br />
<br />
S я ненавижу спешить по утрам.<br />
<br />
MT я *ненавижу *спешить по ранкам.<br />
<br />
ED я ненавиджу поспішати по ранкам.<br />
<br />
T я ненавиджу поспішати вранку.<br />
<br />
<br />
For testing this approach a fast-and-dirty WER checking script was written. It takes file which was created by ''new_apply_postedits.py'', collects all MT, ED and T sentences (in case of few ED for the same sentence, the first one is chosen) and runs apertium-eval-translator on (MT, T) and (ED, T).<br />
<br />
Here are the results for test data (with applying only the postedits from "potential bidix entries" list learned on train data).<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 70%;"<br />
|-<br />
|<br />
|'''bel-rus'''<br />
|'''rus-ukr'''<br />
|-<br />
|'''(MT, T) WER / position-independent WER'''<br />
|42.48% / 38.74%<br />
|47.25% / 40.78%<br />
|-<br />
|'''(ED, T) WER / position independent WER'''<br />
|40.50% / 36.76%<br />
|44.09 / 37.36%<br />
|-<br />
|}<br />
<br />
<br />
Little experiment with Spanish - Catalan (with applying only the postedits from "potential bidix entries" list learned on train data).<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 70%;"<br />
|-<br />
|<br />
|'''spa-cat'''<br />
|-<br />
|'''(MT, T) WER / position-independent WER'''<br />
|22.49% / 15.03%<br />
|-<br />
|'''(ED, T) WER / position independent WER'''<br />
|22.44% / 14.98%<br />
|-<br />
|}</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67481Automatic postediting at GSoC 20182018-08-12T13:24:29Z<p>Deltamachine: </p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Progress notes ==<br />
==== Data preparation ====<br />
'''Russian - Belarusian''' <br />
<br />
<ul><br />
<li>Mediawiki: 2059 sentences - source, Apertium translated and postedited by humans (only bel -> rus)</li><br />
<li>Tatoeba: 1762 sentences: source, target and both ways Apertium translated (bel -> rus, rus -> bel)</li><br />
</ul><br />
<br />
Total amount of sentences: 3821.<br />
<br />
'''Russian - Ukranian'''<br />
<br />
<ul><br />
<li>Tatoeba: 6463 sentences - source, target and both ways Apertium translated (ukr -> rus, rus -> ukr)</li><br />
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus, rus -> ukr).</li><br />
</ul> <br />
<br />
Total amount of sentences: 8463.<br />
==== Code refactoring ====<br />
Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'' were refactored. Also now both scripts work approximately 10 times faster: now scripts collect all subsegments in one large file and translate/analyze the whole file. Instead of calling Apertium few times for every subsegment, now it is called only two times (for translating and for analyzing) for all subsegments of a sentence.<br />
<br />
==== Operations extraction ====<br />
There were three attempts to extract postediting operations for each language pair: with threshold = 0.8 and -m, -M = (1, 3).<br />
In fact, results are not very meaningful: the reason might lie in problems in ''learn_postedits.py'' and in the method itself (but it should be checked carefully).<br />
<br />
==== New algorithm for operations extraction ====<br />
Because of meaningless results of using the old algorithm, the new algorithm was created. It is based on the custom alignment. It seems that the new code will work okay on close-related languages, but I'm not sure about others. The code can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/new_learn_postedits_algorithm.py and the rationale can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/rationale.md.<br />
<br />
==== Classifying operations ====<br />
A script for classifying extracted postedits (https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/extract_types.py) indentifies three types of operations: potential monodix/bidix entries (when a pair doesn't have a translation for a given word), grammar mistakes (when Apertium chooses incorrect form of translated word) and other mistakes (it can be, for example, a potential lexical selection rule).<br />
<br />
How it works:<br />
<br />
1) It takes file with postedit triplets (s, mt, pe).<br />
<br />
2) If here is '*' in mt, algorithm adds triplet to "potential bidix entries" list.<br />
<br />
3) If not, the script calculates the following metric:<br />
<br />
''x = ((l - d) / l) * 100''<br />
<br />
where l = number of letters in pe and d = Levenshtein distance betweeen mt and pe.<br />
<br />
If 50 <= x < 100, the algorithm adds triplet to "grammar mistakes" list.<br />
<br />
4) Otherwise the algorithm checks, if mt != pe, and if not, adds triplet to "other mistakes" list.<br />
<br />
==== Cleaning ====<br />
The extracting postedits algorithm is not perfect and extracts a lot of garbage along with potentially good triplets. For cleaning files with postedits a following script was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/clean_postedits.py. On the first step it tags every part of every triplet using apertium-tagger and then drops triplets which contain punctuation. It helps filter out wrong triplets like, for example, (',', '*видець', ',').<br />
<br />
Then it calculates the same metric as in classifying step between s and mt, mt and pe, s and pe. If every result >= 30 and triplet is not from "other mistakes" list, the algorithm saves this triplet, if not - drops it. It helps filter out wrong alignment cases.<br />
<br />
==== Inserting operations into a language pair: dictionary approach ====<br />
For inserting operations into a language pair, a few helper scripts were written.<br />
<br />
===== Monodix/bidix entries =====<br />
New monodix/bidix entries can be created from postedits in the following way:<br />
<br />
1. Firstly, ''create_entries_table.py'' takes a file with bidix postedits, splits it in source and target, then analyzes both sides using UDPipe (in case of Belarusian and Ukranian) or Mystem (in case of Russian), finds a lemma of every word, replaces UD/Mystem tages with Apertium ones and then create a file, which contains table with rows "source lemma - source Apertium tag - target lemma - target Apertium tag".<br />
<br />
2. After that, the table should be manually checked: UDPipe/Mystem not always determine a correct/lemma for a word.<br />
<br />
3. Then ''check_entries.py'' should be run on the created table. This script looks for the given lemmas/pair of lemmas in source/target/bidix dictionaries and creates a new table with information for every word.<br />
<br />
4. Then user should again manually edit the table and add a stem and a paradigm for every word, which was not found in dictionaries.<br />
<br />
5. The last step is to run ''add_new_entries.py'' on the edited table. This script will create new antries, add them to the dictionaries and compile them.<br />
<br />
==== Inserting operations into a language pair: separate module approach ====<br />
A script for applying learned postedits to new sentences was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/new_apply_postedits.py<br />
<br />
It applies postedits to a given test MT file and creates an output file which contains source (S), Apertium translated (MT), edited by algorithm (ED) and target sentences (T) in a following format:<br />
<br />
<br />
S я ненавижу спешить по утрам.<br />
<br />
MT я *ненавижу *спешить по ранкам.<br />
<br />
ED я ненавиджу поспішати по ранкам.<br />
<br />
T я ненавиджу поспішати вранку.<br />
<br />
<br />
For testing this approach a fast-and-dirty WER checking script was written. It takes file which was created by ''new_apply_postedits.py'', collects all MT, ED and T sentences (in case of few ED for the same sentence, the first one is chosen) and runs apertium-eval-translator on (MT, T) and (ED, T).<br />
<br />
Here are the results for test data (with applying only the postedits from "potential bidix entries" list learned on train data).<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 70%;"<br />
|-<br />
|<br />
|'''bel-rus'''<br />
|'''rus-ukr'''<br />
|-<br />
|'''(MT, T) WER / position-independent WER'''<br />
|42.48% / 38.74%<br />
|47.25% / 40.78%<br />
|-<br />
|'''(ED, T) WER / position independent WER'''<br />
|40.50% / 36.76%<br />
|44.09 / 37.36%<br />
|-<br />
|}</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67480Automatic postediting at GSoC 20182018-08-12T12:29:43Z<p>Deltamachine: </p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Progress notes ==<br />
==== Data preparation ====<br />
'''Russian - Belarusian''' <br />
<br />
<ul><br />
<li>Mediawiki: 2059 sentences - source, Apertium translated and postedited by humans (only bel -> rus)</li><br />
<li>Tatoeba: 1762 sentences: source, target and both ways Apertium translated (bel -> rus, rus -> bel)</li><br />
</ul><br />
<br />
Total amount of sentences: 3821.<br />
<br />
'''Russian - Ukranian'''<br />
<br />
<ul><br />
<li>Tatoeba: 6463 sentences - source, target and both ways Apertium translated (ukr -> rus, rus -> ukr)</li><br />
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus, rus -> ukr).</li><br />
</ul> <br />
<br />
Total amount of sentences: 8463.<br />
==== Code refactoring ====<br />
Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'' were refactored. Also now both scripts work approximately 10 times faster: now scripts collect all subsegments in one large file and translate/analyze the whole file. Instead of calling Apertium few times for every subsegment, now it is called only two times (for translating and for analyzing) for all subsegments of a sentence.<br />
<br />
==== Operations extraction ====<br />
There were three attempts to extract postediting operations for each language pair: with threshold = 0.8 and -m, -M = (1, 3).<br />
In fact, results are not very meaningful: the reason might lie in problems in ''learn_postedits.py'' and in the method itself (but it should be checked carefully).<br />
<br />
==== New algorithm for operations extraction ====<br />
Because of meaningless results of using the old algorithm, the new algorithm was created. It is based on the custom alignment. It seems that the new code will work okay on close-related languages, but I'm not sure about others. The code can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/new_learn_postedits_algorithm.py and the rationale can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/rationale.md.<br />
<br />
==== Classifying operations ====<br />
A script for classifying extracted postedits (https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/extract_types.py) indentifies three types of operations: potential monodix/bidix entries (when a pair doesn't have a translation for a given word), grammar mistakes (when Apertium chooses incorrect form of translated word) and other mistakes (it can be, for example, a potential lexical selection rule).<br />
<br />
How it works:<br />
<br />
1) It takes file with postedit triplets (s, mt, pe).<br />
<br />
2) If here is '*' in mt, algorithm adds triplet to "potential bidix entries" list.<br />
<br />
3) If not, the script calculates the following metric:<br />
<br />
''x = ((l - d) / l) * 100''<br />
<br />
where l = number of letters in pe and d = Levenshtein distance betweeen mt and pe.<br />
<br />
If 50 <= x < 100, the algorithm adds triplet to "grammar mistakes" list.<br />
<br />
4) Otherwise the algorithm checks, if mt != pe, and if not, adds triplet to "other mistakes" list.<br />
<br />
==== Cleaning ====<br />
The extracting postedits algorithm is not perfect and extracts a lot of garbage along with potentially good triplets. For cleaning files with postedits a following script was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/clean_postedits.py. On the first step it tags every part of every triplet using apertium-tagger and then drops triplets which contain punctuation. It helps filter out wrong triplets like, for example, (',', '*видець', ',').<br />
<br />
Then it calculates the same metric as in classifying step between s and mt, mt and pe, s and pe. If every result >= 30 and triplet is not from "other mistakes" list, the algorithm saves this triplet, if not - drops it. It helps filter out wrong alignment cases.<br />
<br />
==== Inserting operations into a language pair: dictionary approach ====<br />
For inserting operations into a language pair, a few helper scripts were written.<br />
<br />
===== Monodix/bidix entries =====<br />
New monodix/bidix entries can be created from postedits in the following way:<br />
<br />
1. Firstly, ''create_entries_table.py'' takes a file with bidix postedits, splits it in source and target, then analyzes both sides using UDPipe (in case of Belarusian and Ukranian) or Mystem (in case of Russian), finds a lemma of every word, replaces UD/Mystem tages with Apertium ones and then create a file, which contains table with rows "source lemma - source Apertium tag - target lemma - target Apertium tag".<br />
<br />
2. After that, the table should be manually checked: UDPipe/Mystem not always determine a correct/lemma for a word.<br />
<br />
3. Then ''check_entries.py'' should be run on the created table. This script looks for the given lemmas/pair of lemmas in source/target/bidix dictionaries and creates a new table with information for every word.<br />
<br />
4. Then user should again manually edit the table and add a stem and a paradigm for every word, which was not found in dictionaries.<br />
<br />
5. The last step is to run ''add_new_entries.py'' on the edited table. This script will create new antries, add them to the dictionaries and compile them.<br />
<br />
<br />
==== Inserting operations into a language pair: separate module approach ====<br />
A script for applying learned postedits to new sentences was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/new_apply_postedits.py<br />
<br />
It applies postedits to a given test MT file and creates an output file which contains source (S), Apertium translated (MT), edited by algorithm (ED) and target sentences (T) in a following format:<br />
<br />
<br />
S я ненавижу спешить по утрам.<br />
<br />
MT я *ненавижу *спешить по ранкам.<br />
<br />
ED я ненавиджу поспішати по ранкам.<br />
<br />
T я ненавиджу поспішати вранку.<br />
<br />
<br />
For testing this approach a fast-and-dirty WER checking script was written. It takes file which was created by ''new_apply_postedits.py'', collects all MT, ED and T sentences (in case of few ED for the same sentence, the first one is chosen) and runs apertium-eval-translator on (MT, T) and (ED, T).<br />
<br />
Here are the results for test data (with applying only the postedits from "potential bidix entries" list learned on train data).<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 70%;"<br />
|-<br />
|<br />
|'''bel-rus'''<br />
|'''rus-ukr'''<br />
|-<br />
|'''(MT, T) WER / position-independent WER'''<br />
|42.48% / 38.74%<br />
|47.25% / 40.78%<br />
|-<br />
|'''(ED, T) WER / position independent WER'''<br />
|40.50% / 36.76%<br />
|44.09 / 37.36%<br />
|-<br />
|}</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67434Automatic postediting at GSoC 20182018-08-10T20:45:40Z<p>Deltamachine: /* Inserting operations into a language pair */</p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Progress notes ==<br />
==== Data preparation ====<br />
'''Russian - Belarusian''' <br />
<br />
<ul><br />
<li>Mediawiki: 2059 sentences - source, Apertium translated and postedited by humans (only bel -> rus)</li><br />
<li>Tatoeba: 1762 sentences: source, target and both ways Apertium translated (bel -> rus, rus -> bel)</li><br />
</ul><br />
<br />
Total amount of sentences: 3821.<br />
<br />
'''Russian - Ukranian'''<br />
<br />
<ul><br />
<li>Tatoeba: 6463 sentences - source, target and both ways Apertium translated (ukr -> rus, rus -> ukr)</li><br />
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus, rus -> ukr).</li><br />
</ul> <br />
<br />
Total amount of sentences: 8463.<br />
==== Code refactoring ====<br />
Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'' were refactored. Also now both scripts work approximately 10 times faster: now scripts collect all subsegments in one large file and translate/analyze the whole file. Instead of calling Apertium few times for every subsegment, now it is called only two times (for translating and for analyzing) for all subsegments of a sentence.<br />
<br />
==== Operations extraction ====<br />
There were three attempts to extract postediting operations for each language pair: with threshold = 0.8 and -m, -M = (1, 3).<br />
In fact, results are not very meaningful: the reason might lie in problems in ''learn_postedits.py'' and in the method itself (but it should be checked carefully).<br />
<br />
==== New algorithm for operations extraction ====<br />
Because of meaningless results of using the old algorithm, the new algorithm was created. It is based on the custom alignment. It seems that the new code will work okay on close-related languages, but I'm not sure about others. The code can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/new_learn_postedits_algorithm.py and the rationale can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/rationale.md.<br />
<br />
==== Classifying operations ====<br />
A script for classifying extracted postedits (https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/extract_types.py) indentifies three types of operations: potential monodix/bidix entries (when a pair doesn't have a translation for a given word), grammar mistakes (when Apertium chooses incorrect form of translated word) and other mistakes (it can be, for example, a potential lexical selection rule).<br />
<br />
How it works:<br />
<br />
1) It takes file with postedit triplets (s, mt, pe).<br />
<br />
2) If here is '*' in mt, algorithm adds triplet to "potential bidix entries" list.<br />
<br />
3) If not, the script calculates the following metric:<br />
<br />
''x = ((l - d) / l) * 100''<br />
<br />
where l = number of letters in pe and d = Levenshtein distance betweeen mt and pe.<br />
<br />
If 50 <= x < 100, the algorithm adds triplet to "grammar mistakes" list.<br />
<br />
4) Otherwise the algorithm checks, if mt != pe, and if not, adds triplet to "other mistakes" list.<br />
<br />
==== Cleaning ====<br />
The extracting postedits algorithm is not perfect and extracts a lot of garbage along with potentially good triplets. For cleaning files with postedits a following script was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/clean_postedits.py. On the first step it tags every part of every triplet using apertium-tagger and then drops triplets which contain punctuation. It helps filter out wrong triplets like, for example, (',', '*видець', ',').<br />
<br />
Then it calculates the same metric as in classifying step between s and mt, mt and pe, s and pe. If every result >= 30 and triplet is not from "other mistakes" list, the algorithm saves this triplet, if not - drops it. It helps filter out wrong alignment cases.<br />
<br />
==== Inserting operations into a language pair ====<br />
For inserting operations into a language pair, a few helper scripts were written.<br />
<br />
===== Monodix/bidix entries =====<br />
New monodix/bidix entries can be created from postedits in the following way:<br />
<br />
1. Firstly, ''create_entries_table.py'' takes a file with bidix postedits, splits it in source and target, then analyzes both sides using UDPipe (in case of Belarusian and Ukranian) or Mystem (in case of Russian), finds a lemma of every word, replaces UD/Mystem tages with Apertium ones and then create a file, which contains table with rows "source lemma - source Apertium tag - target lemma - target Apertium tag".<br />
<br />
2. After that, the table should be manually checked: UDPipe/Mystem not always determine a correct/lemma for a word.<br />
<br />
3. Then ''check_entries.py'' should be run on the created table. This script looks for the given lemmas/pair of lemmas in source/target/bidix dictionaries and creates a new table with information for every word.<br />
<br />
4. Then user should again manually edit the table and add a stem and a paradigm for every word, which was not found in dictionaries.<br />
<br />
5. The last step is to run ''add_new_entries.py'' on the edited table. This script will create new antries, add them to the dictionaries and compile them.<br />
<br />
==== Evaluation ====</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67409Automatic postediting at GSoC 20182018-08-09T20:28:12Z<p>Deltamachine: /* Cleaning */</p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Progress notes ==<br />
==== Data preparation ====<br />
'''Russian - Belarusian''' <br />
<br />
<ul><br />
<li>Mediawiki: 2059 sentences - source, Apertium translated and postedited by humans (only bel -> rus)</li><br />
<li>Tatoeba: 1762 sentences: source, target and both ways Apertium translated (bel -> rus, rus -> bel)</li><br />
</ul><br />
<br />
Total amount of sentences: 3821.<br />
<br />
'''Russian - Ukranian'''<br />
<br />
<ul><br />
<li>Tatoeba: 6463 sentences - source, target and both ways Apertium translated (ukr -> rus, rus -> ukr)</li><br />
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus, rus -> ukr).</li><br />
</ul> <br />
<br />
Total amount of sentences: 8463.<br />
==== Code refactoring ====<br />
Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'' were refactored. Also now both scripts work approximately 10 times faster: now scripts collect all subsegments in one large file and translate/analyze the whole file. Instead of calling Apertium few times for every subsegment, now it is called only two times (for translating and for analyzing) for all subsegments of a sentence.<br />
<br />
==== Operations extraction ====<br />
There were three attempts to extract postediting operations for each language pair: with threshold = 0.8 and -m, -M = (1, 3).<br />
In fact, results are not very meaningful: the reason might lie in problems in ''learn_postedits.py'' and in the method itself (but it should be checked carefully).<br />
<br />
==== New algorithm for operations extraction ====<br />
Because of meaningless results of using the old algorithm, the new algorithm was created. It is based on the custom alignment. It seems that the new code will work okay on close-related languages, but I'm not sure about others. The code can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/new_learn_postedits_algorithm.py and the rationale can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/rationale.md.<br />
<br />
==== Classifying operations ====<br />
A script for classifying extracted postedits (https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/extract_types.py) indentifies three types of operations: potential monodix/bidix entries (when a pair doesn't have a translation for a given word), grammar mistakes (when Apertium chooses incorrect form of translated word) and other mistakes (it can be, for example, a potential lexical selection rule).<br />
<br />
How it works:<br />
<br />
1) It takes file with postedit triplets (s, mt, pe).<br />
<br />
2) If here is '*' in mt, algorithm adds triplet to "potential bidix entries" list.<br />
<br />
3) If not, the script calculates the following metric:<br />
<br />
''x = ((l - d) / l) * 100''<br />
<br />
where l = number of letters in pe and d = Levenshtein distance betweeen mt and pe.<br />
<br />
If 50 <= x < 100, the algorithm adds triplet to "grammar mistakes" list.<br />
<br />
4) Otherwise the algorithm checks, if mt != pe, and if not, adds triplet to "other mistakes" list.<br />
<br />
==== Cleaning ====<br />
The extracting postedits algorithm is not perfect and extracts a lot of garbage along with potentially good triplets. For cleaning files with postedits a following script was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/clean_postedits.py. On the first step it tags every part of every triplet using apertium-tagger and then drops triplets which contain punctuation. It helps filter out wrong triplets like, for example, (',', '*видець', ',').<br />
<br />
Then it calculates the same metric as in classifying step between s and mt, mt and pe, s and pe. If every result >= 30 and triplet is not from "other mistakes" list, the algorithm saves this triplet, if not - drops it. It helps filter out wrong alignment cases.<br />
<br />
==== Inserting operations into a language pair ====<br />
<br />
==== Evaluation ====</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67408Automatic postediting at GSoC 20182018-08-09T20:27:01Z<p>Deltamachine: /* New algorithm for operations extraction */</p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Progress notes ==<br />
==== Data preparation ====<br />
'''Russian - Belarusian''' <br />
<br />
<ul><br />
<li>Mediawiki: 2059 sentences - source, Apertium translated and postedited by humans (only bel -> rus)</li><br />
<li>Tatoeba: 1762 sentences: source, target and both ways Apertium translated (bel -> rus, rus -> bel)</li><br />
</ul><br />
<br />
Total amount of sentences: 3821.<br />
<br />
'''Russian - Ukranian'''<br />
<br />
<ul><br />
<li>Tatoeba: 6463 sentences - source, target and both ways Apertium translated (ukr -> rus, rus -> ukr)</li><br />
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus, rus -> ukr).</li><br />
</ul> <br />
<br />
Total amount of sentences: 8463.<br />
==== Code refactoring ====<br />
Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'' were refactored. Also now both scripts work approximately 10 times faster: now scripts collect all subsegments in one large file and translate/analyze the whole file. Instead of calling Apertium few times for every subsegment, now it is called only two times (for translating and for analyzing) for all subsegments of a sentence.<br />
<br />
==== Operations extraction ====<br />
There were three attempts to extract postediting operations for each language pair: with threshold = 0.8 and -m, -M = (1, 3).<br />
In fact, results are not very meaningful: the reason might lie in problems in ''learn_postedits.py'' and in the method itself (but it should be checked carefully).<br />
<br />
==== New algorithm for operations extraction ====<br />
Because of meaningless results of using the old algorithm, the new algorithm was created. It is based on the custom alignment. It seems that the new code will work okay on close-related languages, but I'm not sure about others. The code can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/new_learn_postedits_algorithm.py and the rationale can be found here: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/rationale.md.<br />
<br />
==== Classifying operations ====<br />
A script for classifying extracted postedits (https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/extract_types.py) indentifies three types of operations: potential monodix/bidix entries (when a pair doesn't have a translation for a given word), grammar mistakes (when Apertium chooses incorrect form of translated word) and other mistakes (it can be, for example, a potential lexical selection rule).<br />
<br />
How it works:<br />
<br />
1) It takes file with postedit triplets (s, mt, pe).<br />
<br />
2) If here is '*' in mt, algorithm adds triplet to "potential bidix entries" list.<br />
<br />
3) If not, the script calculates the following metric:<br />
<br />
''x = ((l - d) / l) * 100''<br />
<br />
where l = number of letters in pe and d = Levenshtein distance betweeen mt and pe.<br />
<br />
If 50 <= x < 100, the algorithm adds triplet to "grammar mistakes" list.<br />
<br />
4) Otherwise the algorithm checks, if mt != pe, and if not, adds triplet to "other mistakes" list.<br />
<br />
==== Cleaning ====<br />
The extracting postedits algorithm is not perfect and extracts a lot of garbage along with potentially good triplets. For cleaning files with postedits a following script was written: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/clean_postedits.py. On the first step it tags every part of every triplet using apertium-tagger and then drops triplets which contain punctuation. It helps filter out wrong triplets like, for example, (',', '*видець', ',').<br />
<br />
Then it calculates the same metric as in classifying step between s and mt, mt and pe, s and pe. If every result >= 30 and triplet is not from "other mistakes" list, the algorithm saves this triplet, if not - drops it.<br />
<br />
==== Inserting operations into a language pair ====<br />
<br />
==== Evaluation ====</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67407Automatic postediting at GSoC 20182018-08-09T18:09:34Z<p>Deltamachine: /* Classifying operations */</p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Progress notes ==<br />
==== Data preparation ====<br />
'''Russian - Belarusian''' <br />
<br />
<ul><br />
<li>Mediawiki: 2059 sentences - source, Apertium translated and postedited by humans (only bel -> rus)</li><br />
<li>Tatoeba: 1762 sentences: source, target and both ways Apertium translated (bel -> rus, rus -> bel)</li><br />
</ul><br />
<br />
Total amount of sentences: 3821.<br />
<br />
'''Russian - Ukranian'''<br />
<br />
<ul><br />
<li>Tatoeba: 6463 sentences - source, target and both ways Apertium translated (ukr -> rus, rus -> ukr)</li><br />
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus, rus -> ukr).</li><br />
</ul> <br />
<br />
Total amount of sentences: 8463.<br />
==== Code refactoring ====<br />
Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'' were refactored. Also now both scripts work approximately 10 times faster: now scripts collect all subsegments in one large file and translate/analyze the whole file. Instead of calling Apertium few times for every subsegment, now it is called only two times (for translating and for analyzing) for all subsegments of a sentence.<br />
<br />
==== Operations extraction ====<br />
There were three attempts to extract postediting operations for each language pair: with threshold = 0.8 and -m, -M = (1, 3).<br />
In fact, results are not very meaningful: the reason might lie in problems in ''learn_postedits.py'' and in the method itself (but it should be checked carefully).<br />
<br />
==== New algorithm for operations extraction ====<br />
Because of meaningless results of using the old algorithm, the new algorithm was created. It is based on the custom alignment algorithm. It seems that the new code will work okay on close-related languages, but I'm not sure about others. The code and rationale can be found here: https://github.com/deltamachine/naive-automatic-postediting/tree/master/new_alg<br />
<br />
==== Classifying operations ====<br />
A script for classifying extracted postedits (https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/extract_types.py) identifies three types of operations: potential monodix/bidix entries (when the pair does not have a translation for a given word), grammar mistakes (when Apertium chooses an incorrect form of the translated word) and other mistakes (for example, a potential lexical selection rule).<br />
<br />
How it works:<br />
<br />
1) It takes a file with postedit triplets (s, mt, pe).<br />
<br />
2) If there is a '*' in mt, the algorithm adds the triplet to the "potential bidix entries" list.<br />
<br />
3) If not, the script calculates the following metric:<br />
<br />
''x = ((l - d) / l) * 100''<br />
<br />
where l = number of letters in pe and d = Levenshtein distance between mt and pe.<br />
<br />
If 50 <= x < 100, the algorithm adds the triplet to the "grammar mistakes" list.<br />
<br />
4) Otherwise the algorithm checks whether mt differs from pe and, if it does, adds the triplet to the "other mistakes" list.<br />
<br />
==== Cleaning ====<br />
The postedit extraction algorithm is not perfect and extracts a lot of garbage along with potentially good triplets. The following script was written for cleaning files with postedits: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/clean_postedits.py. In the first step it tags every part of every triplet using apertium-tagger and then drops triplets which contain punctuation. This helps filter out wrong triplets such as (',', '*видець', ',').<br />
<br />
Then it calculates the same metric as in the classifying step between s and mt, mt and pe, and s and pe. If every result is >= 30 and the triplet is not from the "other mistakes" list, the algorithm keeps the triplet; otherwise it drops it.<br />
<br />
==== Inserting operations into a language pair ====<br />
<br />
==== Evaluation ====</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67399Automatic postediting at GSoC 20182018-08-09T09:27:12Z<p>Deltamachine: </p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Progress notes ==<br />
==== Data preparation ====<br />
'''Russian - Belarusian''' <br />
<br />
<ul><br />
<li>Mediawiki: 2059 sentences - source, Apertium translated and postedited by humans (only bel -> rus)</li><br />
<li>Tatoeba: 1762 sentences: source, target and both ways Apertium translated (bel -> rus, rus -> bel)</li><br />
</ul><br />
<br />
Total amount of sentences: 3821.<br />
<br />
'''Russian - Ukrainian'''<br />
<br />
<ul><br />
<li>Tatoeba: 6463 sentences - source, target and both ways Apertium translated (ukr -> rus, rus -> ukr)</li><br />
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus, rus -> ukr).</li><br />
</ul> <br />
<br />
Total amount of sentences: 8463.<br />
==== Code refactoring ====<br />
Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'', were refactored. Both scripts now also run approximately 10 times faster: they collect all subsegments in one large file and translate/analyze the whole file at once. Instead of calling Apertium a few times for every subsegment, it is now called only twice (once for translating and once for analyzing) for all subsegments of a sentence.<br />
<br />
==== Operations extraction ====<br />
There were three attempts to extract postediting operations for each language pair, with threshold = 0.8 and (-m, -M) = (1, 3).<br />
The results are not very meaningful: the cause might lie in problems in ''learn_postedits.py'' or in the method itself (this should be checked carefully).<br />
<br />
==== New algorithm for operations extraction ====<br />
Because the old algorithm produced meaningless results, a new algorithm was created. It is based on a custom alignment algorithm. It seems that the new code will work reasonably well on closely related languages, but I am not sure about others. The code and rationale can be found here: https://github.com/deltamachine/naive-automatic-postediting/tree/master/new_alg<br />
<br />
==== Classifying operations ====<br />
A script for classifying extracted postedits (https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/extract_types.py) identifies three types of operations: potential monodix/bidix entries (when the pair does not have a translation for a given word), grammar mistakes (when Apertium chooses an incorrect form of the translated word) and other mistakes (for example, a potential lexical selection rule).<br />
<br />
How it works:<br />
<br />
1) It takes a file with postedit triplets (s, mt, pe).<br />
2) If there is a '*' in mt, the algorithm adds the triplet to the "potential bidix entries" list.<br />
3) If not, the script calculates the following metric:<br />
<br />
* letters = number of letters in pe<br />
* distance = Levenshtein distance between mt and pe<br />
<br />
((letters - distance) / letters) * 100<br />
<br />
If 50 <= this number < 100, the algorithm adds the triplet to the "grammar mistakes" list.<br />
<br />
4) Else the algorithm checks whether mt differs from pe and, if it does, adds the triplet to the "other mistakes" list.<br />
<br />
==== Cleaning ====<br />
The postedit extraction algorithm is not perfect and extracts a lot of garbage along with potentially good triplets. The following script was written for cleaning files with postedits: https://github.com/deltamachine/naive-automatic-postediting/blob/master/new_alg/clean_postedits.py. In the first step it tags every part of every triplet using apertium-tagger and then drops triplets which contain punctuation. This helps filter out wrong triplets such as (',', '*видець', ',').<br />
<br />
Then it calculates the same metric as in the classifying step between s and mt, mt and pe, and s and pe. If every result is >= 30 and the triplet is not from the "other mistakes" list, the algorithm keeps the triplet; otherwise it drops it.<br />
<br />
==== Inserting operations into a language pair ====<br />
<br />
==== Evaluation ====</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67143Automatic postediting at GSoC 20182018-06-11T23:33:59Z<p>Deltamachine: /* Progress notes */</p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Workplan ==<br />
<br />
{|class=wikitable<br />
|-<br />
! Week !! Dates !! To do <br />
|-<br />
| 1 || 14th May — 20th May || <s>Find and download needed Russian - Ukrainian and Russian - Belarusian corpora, write scripts for preprocessing the data.</s><br />
<br />
|-<br />
| 2 || 21st May - 27th May || <s> Learn to use bicleaner (https://github.com/sortiz/bicleaner), train ru-uk classifier, preprocess OpenSubtitles corpora, filter out loose translations. </s> <br />
|-<br />
| 3 || 28th May — 3rd June || <s> Continue to prepare the Russian - Ukrainian parallel corpus from OpenSubtitles, refactor the old apply_postedits.py code, make the old code work faster. </s><br />
<br />
|-<br />
| 4 || 4th June — 10th June || Work on the old code, start to extract triplets.<br />
<br />
|-<br />
! '''First evaluation, 11th June - 15th June''' !! colspan="2" align=left | <br />
<br />
|-<br />
| 5 || 11th June — 17th June ||<br />
<br />
|-<br />
| 6 || 18th June — 24th June || <br />
<br />
|-<br />
| 7 || 25th June — 1st July || <br />
<br />
|-<br />
| 8 || 2nd July — 8th July || <br />
<br />
|-<br />
!'''Second evaluation, 9th July - 13th July''' || colspan="2" align=left | <br />
<br />
|-<br />
| 9 || 9th July — 15th July || <br />
<br />
|-<br />
| 10 || 16th July — 22nd July ||<br />
<br />
|-<br />
| 11 || 23rd July — 29th July || <br />
<br />
|-<br />
| 12 || 30th July — 5th August || <br />
<br />
|-<br />
!'''Final evaluation, 6th August - 14th August''' || colspan="2" align=left | <br />
|-<br />
|}<br />
<br />
<br />
<br />
== Progress notes ==<br />
==== Data preparation ====<br />
'''Russian - Belarusian''' <br />
<br />
<ul><br />
<li>Mediawiki: 2059 sentences - source, Apertium translated and postedited by humans (only bel -> rus)</li><br />
<li>Tatoeba: 1762 sentences: source, target and both ways Apertium translated (bel -> rus, rus -> bel)</li><br />
</ul><br />
<br />
Total amount of sentences: 3821.<br />
<br />
'''Russian - Ukrainian'''<br />
<br />
<ul><br />
<li>Tatoeba: 6463 sentences - source, target and both ways Apertium translated (ukr -> rus, rus -> ukr)</li><br />
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus, rus -> ukr).</li><br />
</ul> <br />
<br />
Total amount of sentences: 8463.<br />
<br />
==== Code refactoring ====<br />
Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'', were refactored. Both scripts now also run approximately 10 times faster: they collect all subsegments in one large file and translate/analyze the whole file at once. Instead of calling Apertium a few times for every subsegment, it is now called only twice (once for translating and once for analyzing) for all subsegments of a sentence.<br />
<br />
==== Operations extraction ====<br />
There were three attempts to extract postediting operations for each language pair, with threshold = 0.8 and (-m, -M) = (1, 3).<br />
The results are not very meaningful: the cause might lie in problems in ''learn_postedits.py'' or in the method itself (this should be checked carefully).</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67125Automatic postediting at GSoC 20182018-06-09T18:34:58Z<p>Deltamachine: </p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Workplan ==<br />
<br />
{|class=wikitable<br />
|-<br />
! Week !! Dates !! To do <br />
|-<br />
| 1 || 14th May — 20th May || <s>Find and download needed Russian - Ukrainian and Russian - Belarusian corpora, write scripts for preprocessing the data.</s><br />
<br />
|-<br />
| 2 || 21st May - 27th May || <s> Learn to use bicleaner (https://github.com/sortiz/bicleaner), train ru-uk classifier, preprocess OpenSubtitles corpora, filter out loose translations. </s> <br />
|-<br />
| 3 || 28th May — 3rd June || <s> Continue to prepare the Russian - Ukrainian parallel corpus from OpenSubtitles, refactor the old apply_postedits.py code, make the old code work faster. </s><br />
<br />
|-<br />
| 4 || 4th June — 10th June || Work on the old code, start to extract triplets.<br />
<br />
|-<br />
! '''First evaluation, 11th June - 15th June''' !! colspan="2" align=left | <br />
<br />
|-<br />
| 5 || 11th June — 17th June ||<br />
<br />
|-<br />
| 6 || 18th June — 24th June || <br />
<br />
|-<br />
| 7 || 25th June — 1st July || <br />
<br />
|-<br />
| 8 || 2nd July — 8th July || <br />
<br />
|-<br />
!'''Second evaluation, 9th July - 13th July''' || colspan="2" align=left | <br />
<br />
|-<br />
| 9 || 9th July — 15th July || <br />
<br />
|-<br />
| 10 || 16th July — 22nd July ||<br />
<br />
|-<br />
| 11 || 23rd July — 29th July || <br />
<br />
|-<br />
| 12 || 30th July — 5th August || <br />
<br />
|-<br />
!'''Final evaluation, 6th August - 14th August''' || colspan="2" align=left | <br />
|-<br />
|}<br />
<br />
<br />
<br />
== Progress notes ==<br />
==== Data preparation ====<br />
'''Russian - Belarusian''' <br />
<br />
<ul><br />
<li>Mediawiki: 2059 sentences - source, Apertium translated and postedited by humans (only bel -> rus)</li><br />
<li>Tatoeba: 1762 sentences: source, target and both ways Apertium translated (bel -> rus, rus -> bel)</li><br />
</ul><br />
<br />
Total amount of sentences: 3821.<br />
<br />
'''Russian - Ukrainian'''<br />
<br />
<ul><br />
<li>Tatoeba: 6463 sentences - source, target and both ways Apertium translated (ukr -> rus, rus -> ukr)</li><br />
<li>OpenSubtitles: 2000 manually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus, rus -> ukr).</li><br />
</ul> <br />
<br />
Total amount of sentences: 8463.<br />
<br />
==== Code refactoring ====<br />
Two old scripts, ''learn_postedits.py'' and ''apply_postedits.py'', were refactored. Both scripts now also run approximately 10 times faster: they collect all subsegments in one large file and translate/analyze the whole file at once. Instead of calling Apertium a few times for every subsegment, it is now called only twice (once for translating and once for analyzing) for all subsegments of a sentence.</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67090Automatic postediting at GSoC 20182018-05-28T17:06:27Z<p>Deltamachine: /* Related links */</p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Workplan ==<br />
<br />
{|class=wikitable<br />
|-<br />
! Week !! Dates !! To do <br />
|-<br />
| 1 || 14th May — 20th May || <br />
<br />
|-<br />
| 2 || 21st May - 27th May ||<br />
<br />
|-<br />
| 3 || 28th May — 3rd June ||<br />
<br />
|-<br />
| 4 || 4th June — 10th June || <br />
<br />
|-<br />
! '''First evaluation, 11th June - 15th June''' !! colspan="2" align=left | <br />
<br />
|-<br />
| 5 || 11th June — 17th June ||<br />
<br />
|-<br />
| 6 || 18th June — 24th June || <br />
<br />
|-<br />
| 7 || 25th June — 1st July || <br />
<br />
|-<br />
| 8 || 2nd July — 8th July || <br />
<br />
|-<br />
!'''Second evaluation, 9th July - 13th July''' || colspan="2" align=left | <br />
<br />
|-<br />
| 9 || 9th July — 15th July || <br />
<br />
|-<br />
| 10 || 16th July — 22nd July ||<br />
<br />
|-<br />
| 11 || 23rd July — 29th July || <br />
<br />
|-<br />
| 12 || 30th July — 5th August || <br />
<br />
|-<br />
!'''Final evaluation, 6th August - 14th August''' || colspan="2" align=left | <br />
|-<br />
|}<br />
<br />
<br />
<br />
== Progress notes ==</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67089Automatic postediting at GSoC 20182018-05-28T17:04:48Z<p>Deltamachine: /* Related links */</p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/apertium/apertium-weights-learner/tree/629b48b306116565bc1d748c298bc28b41506f63<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
== Workplan ==<br />
<br />
{|class=wikitable<br />
|-<br />
! Week !! Dates !! To do <br />
|-<br />
| 1 || 14th May — 20th May || <br />
<br />
|-<br />
| 2 || 21st May - 27th May ||<br />
<br />
|-<br />
| 3 || 28th May — 3rd June ||<br />
<br />
|-<br />
| 4 || 4th June — 10th June || <br />
<br />
|-<br />
! '''First evaluation, 11th June - 15th June''' !! colspan="2" align=left | <br />
<br />
|-<br />
| 5 || 11th June — 17th June ||<br />
<br />
|-<br />
| 6 || 18th June — 24th June || <br />
<br />
|-<br />
| 7 || 25th June — 1st July || <br />
<br />
|-<br />
| 8 || 2nd July — 8th July || <br />
<br />
|-<br />
!'''Second evaluation, 9th July - 13th July''' || colspan="2" align=left | <br />
<br />
|-<br />
| 9 || 9th July — 15th July || <br />
<br />
|-<br />
| 10 || 16th July — 22nd July ||<br />
<br />
|-<br />
| 11 || 23rd July — 29th July || <br />
<br />
|-<br />
| 12 || 30th July — 5th August || <br />
<br />
|-<br />
!'''Final evaluation, 6th August - 14th August''' || colspan="2" align=left | <br />
|-<br />
|}<br />
<br />
<br />
<br />
== Progress notes ==</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Automatic_postediting_at_GSoC_2018&diff=67088Automatic postediting at GSoC 20182018-05-28T17:04:32Z<p>Deltamachine: Created page with "== Related links == Idea description Proposal for GSoC 2016 https://github.com/a..."</p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/automatic-postediting|Idea description]]<br />
<br />
[[User:Deltamachine/proposal2018|Proposal for GSoC 2018]]<br />
<br />
https://github.com/apertium/apertium-weights-learner/tree/629b48b306116565bc1d748c298bc28b41506f63<br />
<br />
https://github.com/deltamachine/naive-automatic-postediting<br />
<br />
<br />
== Workplan ==<br />
<br />
{|class=wikitable<br />
|-<br />
! Week !! Dates !! To do <br />
|-<br />
| 1 || 14th May — 20th May || <br />
<br />
|-<br />
| 2 || 21st May - 27th May ||<br />
<br />
|-<br />
| 3 || 28th May — 3rd June ||<br />
<br />
|-<br />
| 4 || 4th June — 10th June || <br />
<br />
|-<br />
! '''First evaluation, 11th June - 15th June''' !! colspan="2" align=left | <br />
<br />
|-<br />
| 5 || 11th June — 17th June ||<br />
<br />
|-<br />
| 6 || 18th June — 24th June || <br />
<br />
|-<br />
| 7 || 25th June — 1st July || <br />
<br />
|-<br />
| 8 || 2nd July — 8th July || <br />
<br />
|-<br />
!'''Second evaluation, 9th July - 13th July''' || colspan="2" align=left | <br />
<br />
|-<br />
| 9 || 9th July — 15th July || <br />
<br />
|-<br />
| 10 || 16th July — 22nd July ||<br />
<br />
|-<br />
| 11 || 23rd July — 29th July || <br />
<br />
|-<br />
| 12 || 30th July — 5th August || <br />
<br />
|-<br />
!'''Final evaluation, 6th August - 14th August''' || colspan="2" align=left | <br />
|-<br />
|}<br />
<br />
<br />
<br />
== Progress notes ==</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Weighted_transfer_rules&diff=66778Weighted transfer rules2018-04-17T14:43:27Z<p>Deltamachine: /* Expertiment */</p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/Weighted_transfer_rules|Idea description]]<br />
<br />
[[Weighted_transfer_rules_at_GSoC_2016|Nikita Medyankin's project at GSoC 2016]]<br />
<br />
https://github.com/apertium/apertium-weights-learner/tree/629b48b306116565bc1d748c298bc28b41506f63<br />
<br />
https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/<br />
<br />
== Fixes ==<br />
Nikita's code should work okay now. To run it, download apertium-weights-learner from https://github.com/apertium/apertium-weights-learner/tree/experimental, English - Spanish language pair with ambiguous rules from https://github.com/apertium/apertium-en-es/tree/ambiguous-rules and Apertium core with modified transfer module from https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/apertium/.<br />
<br />
== Coverages ==<br />
The number of all possible coverages was calculated 100 times for 100 random sentences for 5 language pairs. <br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|'''language pair'''<br />
|'''corpus'''<br />
|'''mean number of coverages'''<br />
|-<br />
|English - Spanish<br />
|Tatoeba<br />
|3.72<br />
|-<br />
|English - Spanish<br />
|Europarl<br />
|194.35<br />
|-<br />
|Spanish - Catalan<br />
|Tatoeba<br />
|2.94<br />
|-<br />
|Spanish - Catalan<br />
|Europarl<br />
|53.04<br />
|-<br />
|Basque - Spanish<br />
|Tatoeba<br />
|9.19<br />
|-<br />
|Swedish - Norwegian<br />
|Europarl<br />
|488.57<br />
|-<br />
|Crimean Tatar - Turkish<br />
|Crimean Tatar Wikipedia<br />
|3.12<br />
|-<br />
|}<br />
<br />
== Experiment ==<br />
The sample file new-software-sample.txt contains three selected lines with 'new software' and 'this new software' patterns, each of which triggers a pair of ambiguous rules from the apertium-en-es.en-es.t1x file, namely ['adj-nom', 'adj-nom-ns'] and ['det-adj-nom', 'det-adj-nom-ns']. Speaking informally, these rules are used to transfer sequences of (adjective, noun) and (determiner, adjective, noun). The first rule in each ambiguous pair specifies that the translations of the adjective and the noun are to be swapped, which is usual for Spanish, hence these rules are specified before their '-ns' counterparts, indicating that they are the default rules. The second rule in each ambiguous pair specifies that the translations of the adjective and the noun are not to be swapped, which sometimes happens and depends on the lexical units involved.<br />
<br />
The contents of new-software-sample.txt look like the following:<br />
<br />
<pre><br />
Mr Stephen said the council had agreed to consider new software which would make the test more difficult.<br />
What's Next: Simonyi's new software writes its own code<br />
This new software makes it easier to get a movie done quickly, though harder to get it done well.<br />
</pre><br />
<br />
The contents of the unpruned w1x file without generalizing patterns should look like the following:<br />
<br />
<pre><br />
<?xml version='1.0' encoding='UTF-8'?><br />
<transfer-weights><br />
<rule-group><br />
<rule comment="REGLA: ADJ NOM no-swap-version" id="1" md5="64121bebaee1b179cfc0002db6b06fc3"><br />
<pattern weight="1.625228556310039"><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="1.625228556310039"><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="1.625228556310039"><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="1.625228556310039"><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
</rule><br />
<rule comment="REGLA: ADJ NOM" id="2" md5="8eed4b8aee5567fcfebc0de7698f4bdb"><br />
<pattern weight="0.3747714436899609"><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.3747714436899609"><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.3747714436899609"><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.3747714436899609"><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
</rule><br />
</rule-group><br />
<rule-group><br />
<rule comment="REGLA: DET ADJ NOM no-swap-version" id="3" md5="05d8b437ee595c7d0c992c5ae066a199"><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
</rule><br />
<rule comment="REGLA: DET ADJ NOM" id="4" md5="87fb69c4cd8792f06e0b51c6fd79f127"><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
</rule><br />
</rule-group><br />
</transfer-weights><br />
</pre><br />
<br />
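A toy illustration of how these pattern weights are used to choose between the ambiguous rules (a simplification with hard-coded values, not the modified transfer module itself): for a matched pattern, the rule in the rule group with the highest weight for that pattern wins.<br />
<br />
<pre><br />
# Hypothetical, hard-coded excerpt of the weights above for the pattern<br />
# ('new' adj.sint, 'software' n.sg); rule ids follow the t1x rule names.<br />
weights = {<br />
    'adj-nom-ns': 1.625228556310039,<br />
    'adj-nom':    0.3747714436899609,<br />
}<br />
<br />
def best_rule(pattern_weights):<br />
    # pick the rule whose weight for the matched pattern is highest<br />
    return max(pattern_weights, key=pattern_weights.get)<br />
<br />
print(best_rule(weights))  # -> 'adj-nom-ns', i.e. the no-swap version<br />
</pre><br />
<br />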
This would mean that the 'no-swap' versions of both rules are preferred for each pattern, which tells the transfer module that the translations of 'new' and 'software' should not be swapped (as specified in the '-ns' versions of both rules), since in Spanish the adjective 'nuevo' is usually put before the noun, unlike most adjectives, which are put after it.</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Weighted_transfer_rules&diff=66777Weighted transfer rules2018-04-17T14:15:34Z<p>Deltamachine: </p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/Weighted_transfer_rules|Idea description]]<br />
<br />
[[Weighted_transfer_rules_at_GSoC_2016|Nikita Medyankin's project at GSoC 2016]]<br />
<br />
https://github.com/apertium/apertium-weights-learner/tree/629b48b306116565bc1d748c298bc28b41506f63<br />
<br />
https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/<br />
<br />
== Fixes ==<br />
Nikita's code should work okay now. To run it, download apertium-weights-learner from https://github.com/apertium/apertium-weights-learner/tree/experimental, English - Spanish language pair with ambiguous rules from https://github.com/apertium/apertium-en-es/tree/ambiguous-rules and Apertium core with modified transfer module from https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/apertium/.<br />
<br />
== Coverages ==<br />
The number of all possible coverages was calculated 100 times for 100 random sentences for 5 language pairs. <br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|'''language pair'''<br />
|'''corpus'''<br />
|'''mean number of coverages'''<br />
|-<br />
|English - Spanish<br />
|Tatoeba<br />
|3.72<br />
|-<br />
|English - Spanish<br />
|Europarl<br />
|194.35<br />
|-<br />
|Spanish - Catalan<br />
|Tatoeba<br />
|2.94<br />
|-<br />
|Spanish - Catalan<br />
|Europarl<br />
|53.04<br />
|-<br />
|Basque - Spanish<br />
|Tatoeba<br />
|9.19<br />
|-<br />
|Swedish - Norwegian<br />
|Europarl<br />
|488.57<br />
|-<br />
|Crimean Tatar - Turkish<br />
|Crimean Tatar Wikipedia<br />
|3.12<br />
|-<br />
|}<br />
<br />
== Experiment ==<br />
The sample file new-software-sample.txt contains three selected lines with 'new software' and 'this new software' patterns, each of which triggers a pair of ambiguous rules from the apertium-en-es.en-es.t1x file, namely ['adj-nom', 'adj-nom-ns'] and ['det-adj-nom', 'det-adj-nom-ns']. Speaking informally, these rules are used to transfer sequences of (adjective, noun) and (determiner, adjective, noun). The first rule in each ambiguous pair specifies that the translations of the adjective and the noun are to be swapped, which is usual for Spanish, hence these rules are specified before their '-ns' counterparts, indicating that they are the default rules. The second rule in each ambiguous pair specifies that the translations of the adjective and the noun are not to be swapped, which sometimes happens and depends on the lexical units involved.<br />
<br />
The contents of the unpruned w1x file without generalizing patterns should look like the following:<br />
<br />
<pre><br />
<?xml version='1.0' encoding='UTF-8'?><br />
<transfer-weights><br />
<rule-group><br />
<rule comment="REGLA: ADJ NOM no-swap-version" id="1" md5="64121bebaee1b179cfc0002db6b06fc3"><br />
<pattern weight="1.625228556310039"><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="1.625228556310039"><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="1.625228556310039"><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="1.625228556310039"><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
</rule><br />
<rule comment="REGLA: ADJ NOM" id="2" md5="8eed4b8aee5567fcfebc0de7698f4bdb"><br />
<pattern weight="0.3747714436899609"><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.3747714436899609"><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.3747714436899609"><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.3747714436899609"><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
</rule><br />
</rule-group><br />
<rule-group><br />
<rule comment="REGLA: DET ADJ NOM no-swap-version" id="3" md5="05d8b437ee595c7d0c992c5ae066a199"><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9376183345269524"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.9844006834162787"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
</rule><br />
<rule comment="REGLA: DET ADJ NOM" id="4" md5="87fb69c4cd8792f06e0b51c6fd79f127"><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item lemma="code" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.06238166547304746"><br />
<pattern-item lemma="its" tags="det.pos.sp"/><br />
<pattern-item lemma="own" tags="adj"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item tags="n.sg"/><br />
</pattern><br />
<pattern weight="0.0155993165837215"><br />
<pattern-item lemma="this" tags="det.dem.sg"/><br />
<pattern-item lemma="new" tags="adj.sint"/><br />
<pattern-item lemma="software" tags="n.sg"/><br />
</pattern><br />
</rule><br />
</rule-group><br />
</transfer-weights><br />
</pre></div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Weighted_transfer_rules&diff=66771Weighted transfer rules2018-04-13T11:57:34Z<p>Deltamachine: Created page with "== Related links == Idea description [[Weighted_transfer_rules_at_GSoC_2016|Nikita Medyankin's project at GSoC 201..."</p>
<hr />
<div>== Related links ==<br />
[[Ideas_for_Google_Summer_of_Code/Weighted_transfer_rules|Idea description]]<br />
<br />
[[Weighted_transfer_rules_at_GSoC_2016|Nikita Medyankin's project at GSoC 2016]]<br />
<br />
https://github.com/apertium/apertium-weights-learner/tree/629b48b306116565bc1d748c298bc28b41506f63<br />
<br />
https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/<br />
<br />
== Fixes ==<br />
Nikita's code should work okay now. To run it, download apertium-weights-learner from https://github.com/apertium/apertium-weights-learner/tree/experimental, English - Spanish language pair with ambiguous rules from https://github.com/apertium/apertium-en-es/tree/ambiguous-rules and Apertium core with modified transfer module from https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/apertium/.<br />
<br />
== Coverages ==<br />
The number of all possible coverages was calculated 100 times for 100 random sentences for 5 language pairs. <br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|'''language pair'''<br />
|'''corpus'''<br />
|'''mean number of coverages'''<br />
|-<br />
|English - Spanish<br />
|Tatoeba<br />
|3.72<br />
|-<br />
|English - Spanish<br />
|Europarl<br />
|194.35<br />
|-<br />
|Spanish - Catalan<br />
|Tatoeba<br />
|2.94<br />
|-<br />
|Spanish - Catalan<br />
|Europarl<br />
|53.04<br />
|-<br />
|Basque - Spanish<br />
|Tatoeba<br />
|9.19<br />
|-<br />
|Swedish - Norwegian<br />
|Europarl<br />
|488.57<br />
|-<br />
|Crimean Tatar - Turkish<br />
|Crimean Tatar Wikipedia<br />
|3.12<br />
|-<br />
|}</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66554User:Deltamachine/proposal20182018-03-26T09:19:18Z<p>Deltamachine: /* Which of the published tasks are you interested in? What do you plan to do? */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow/Yekaterinburg, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''Github:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3 (Moscow) / UTC+5 (Yekaterinburg)</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor's student at the Faculty of Linguistics of the National Research University «Higher School of Economics» (NRU HSE)</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two favourite fields of study are linguistics and programming, and machine translation combines them in the most interesting way. Working with a machine translation system will allow me to learn more about different languages and their structures, explore modern approaches to machine translation, and see what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in Google Summer of Code 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
I also participated in Google Code-In 2017 as an Apertium mentor, and it was great too. So I am very interested in contributing further to Apertium.<br />
<br />
This organisation works on things which are very interesting to me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages, because supporting those languages is very important.<br />
<br />
The Apertium community is also very friendly and open to new members; people here are always ready to help you. This encourages me to keep working with them.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of the lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in substr(S), mt in substr(MT(S)), pe in substr(PE(MT(S))) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences, and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on an edit distance algorithm.</p> <p>'''-''' there will surely be words and sentences in other languages (translations, links and so on, typical Wikipedia content), which will make our data noisy.</p><p>'''-''' sometimes posteditors change the contents of a paragraph to make the article better: they split original sentences, add new information, etc. But these cases could probably be filtered.</p><br />
<br />
* Using parallel corpora and the Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are likely to contain less noise.</p> <p>'''-''' the target side might be very different from the Apertium-translated one, especially for long and complicated sentences.</p><br />
<br />
The question of which approach to choose is open to discussion. I think we might experiment with both approaches, or even mix different types of data, and see whether there is any difference.<br />
<br />
<u>About language pairs</u><br />
<br />
I would like to work with languages I know or can at least more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. Also, those are closely related languages, so there will be fewer problems with alignment.<br />
<br />
The problem with working on bel-rus and ukr-rus is the comparatively small amount of postedited data and of more or less suitable parallel corpora. I'm still looking for data, but the current situation looks like this:<br />
<br />
''' bel-rus '''<br />
* Russian -> Belarusian Mediawiki corpus of Apertium translated and postedited data = 1895 sentences.<br />
* Tatoeba parallel corpus = about 1800 sentences<br />
* A few specific parallel corpora like KDE and GNOME<br />
<br />
''' ukr-rus'''<br />
* Russian -> Ukrainian Mediawiki corpus of Apertium translated and postedited data = 60 sentences.<br />
* Tatoeba parallel corpus = about 6500 sentences<br />
* OpenSubtitles2016 parallel corpus = about 400000 sentences (might contain free translations)<br />
* OpenSubtitles2018 parallel corpus = about 600000 sentences (might contain free translations)<br />
* A few specific parallel corpora like KDE and GNOME<br />
<br />
But the methods I'm going to develop are not going to be tied to a particular language pair. We might choose another language pair or start our experiments with a small amount of data. The question is open to discussion.<br />
<br />
==== Improving of existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is the low speed of learn_postedits.py and apply_postedits.py. Repeated Apertium calls and a large maximum source/target segment length make the process slow. This makes extracting triplets from a big corpus very hard, while using a small corpus is just useless. <br />
<br />
I have added a cache function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): now, instead of calling Apertium every time the program needs to translate/analyze a subsegment, it first checks whether this subsegment is already stored in a database. If it is, it takes it from there; if not, it calls Apertium and then adds the new information to the database.<br />
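<br />
A minimal sketch of this caching idea (the table layout and pair name are assumptions, not the exact ''explain2_cache.py'' schema): look the subsegment up first, and only call Apertium on a cache miss.<br />
<br />
<pre><br />
import sqlite3, subprocess<br />
<br />
conn = sqlite3.connect('cache.db')<br />
conn.execute('CREATE TABLE IF NOT EXISTS cache (src TEXT PRIMARY KEY, tgt TEXT)')<br />
<br />
def translate(subsegment, pair='bel-rus'):<br />
    row = conn.execute('SELECT tgt FROM cache WHERE src = ?', (subsegment,)).fetchone()<br />
    if row:<br />
        return row[0]  # cache hit: no Apertium call needed<br />
    out = subprocess.run(['apertium', pair], input=subsegment, text=True,<br />
                         capture_output=True, check=True).stdout.strip()<br />
    conn.execute('INSERT OR REPLACE INTO cache VALUES (?, ?)', (subsegment, out))<br />
    conn.commit()<br />
    return out<br />
</pre><br />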
<br />
Results of running two versions of learn_postedits.py on 100 sentences corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clear that caching really saves time. For 100 sentences the difference is not huge, because there is not much information stored in the database yet, but the difference should become much more important when running the code on 50000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Searching for other ways of improving the speed.<br />
<br />
2. There might be problems with the current alignment method, because the extracted postediting operations look pretty strange even when they were extracted with a high fuzzy match threshold. The alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done.<br />
<br />
==== Searching for extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big training set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of the original Apertium translation on a regular basis.<br />
<br />
A language model score might be a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
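<br />
A sketch of the language-model criterion (kenlm and the model path are assumptions): a postediting operation is only considered an improvement if applying it does not lower the language model score of the sentence.<br />
<br />
<pre><br />
import kenlm<br />
<br />
lm = kenlm.Model('ru.binary')  # hypothetical pre-trained Russian LM<br />
<br />
def improves(original_mt, edited_mt):<br />
    # higher (less negative) log-probability means a more fluent sentence<br />
    return lm.score(edited_mt) > lm.score(original_mt)<br />
</pre><br />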
<br />
==== Classifying of successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the set of triplets (s, mt, pe) from O such that s = mt.<br />
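<br />
A sketch of the bidix-candidate filter just described (''operations'' is assumed to be a list of (s, mt, pe) triplets): when the MT output left the source word unchanged, the pair probably lacks a dictionary entry for it.<br />
<br />
<pre><br />
def bidix_candidates(operations):<br />
    # s untranslated in mt -> (s, pe) is a candidate bidix entry<br />
    return [(s, pe) for s, mt, pe in operations if s == mt]<br />
</pre><br />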
<br />
But it obviously won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. These might be scripts which automatically create monodix/bidix entries or write rules based on the given data and its type and insert them into the dictionaries. It might also be a new module in the pipeline. The final decision will depend on the results of the previous stage.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries and reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes that can be easily fixed.<br />
<br />
For example, Apertium translates the Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") into Russian as ''"Кто-то здесь размаўляе по-русски?"'', when the correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can be easily fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' in Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (it has many mistakes, but we are looking only at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, the methods developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Taking an online statistics course to refresh my knowledge<br />
* Working on the old code<br />
<br />
=== Community bonding period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Discussing questions about data types and language pairs to work with<br />
* Looking for suitable data<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' collecting and parsing the data, doing preprocessing, if needed, improving the existing code</li><br />
<li>'''Week 2:''' improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)</li><br />
<li>'''Week 3:''' making experiments with data, extracting triplets</li><br />
<li>'''Week 4:''' searching of extracted postediting operations that actually improve the quality of translation</li><br />
<li>'''Deliverable #1, June 11 - 15'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 6:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 7:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 8:''' studying and classifying of successful postediting operations</li><br />
<li>'''Deliverable #2, July 9 - 13'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 10:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Final evaluation, August 6 - 14'''</li><br />
<li>'''Project completed:''' The toolbox for automatic improvement of lexical component of a language pair.</li><br />
</ul><br />
<br />
I am also going to write short notes about the work process on my project page throughout the summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with Apertium system I can work on the project during the community bonding period. After exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into training and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
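For illustration, the splitting step performed by ''parse_ct_json.py'' boils down to something like the following sketch (simplified and written under my own assumptions, not the script's actual code):<br />
<br />
<pre><br />
import random<br />
<br />
def split_corpus(triplets, test_size, seed=42):<br />
    """Shuffle (source, mt, postedited) triplets and split them into<br />
    a training set and a test set of the requested size."""<br />
    random.seed(seed)<br />
    shuffled = list(triplets)<br />
    random.shuffle(shuffled)<br />
    return shuffled[test_size:], shuffled[:test_size]<br />
</pre><br />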
<br />
In addition, I have added a cache function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py).<br />
<br />
I have also refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. The new version, ''cleaned_learn_postedits.py'', is stored in the repository linked above.<br />
<br />
[[Category:GSoC 2018 student proposals|Deltamachine]]</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66469User:Deltamachine/proposal20182018-03-24T13:30:32Z<p>Deltamachine: </p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow/Yekaterinburg, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''Github:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3 (Moscow) / UTC+5 (Yekaterinburg)</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor's student at the Linguistics Faculty of the National Research University «Higher School of Economics» (NRU HSE).</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two favourite fields of study are linguistics and programming, and machine translation combines them in the most interesting way. Working with a machine translation system will allow me to learn more about different languages and their structures, about modern approaches to machine translation, and about the results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in Google Summer of Code 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in Google Code-In 2017 as Apertium mentor and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting for me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages, because supporting those languages is very important.<br />
<br />
Also, the Apertium community is very friendly and open to new members; people here are always ready to help you. This encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of the lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in substr(S), mt in substr(MT(S)), pe in substr(PE(MT(S))) <br />
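For concreteness, an operation triplet can be thought of as a simple Python tuple (this is only an illustration of the notation above, not code from the toolbox):<br />
<br />
<pre><br />
from collections import namedtuple<br />
<br />
# One extracted postediting operation: a subsegment of S, the corresponding<br />
# subsegment of MT(S) and the corresponding subsegment of PE(MT(S)).<br />
Operation = namedtuple('Operation', ['s', 'mt', 'pe'])<br />
<br />
o = Operation(s='кубак', mt='кубок', pe='чашку')<br />
</pre><br />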
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on edit distance algorithm. </p> <p>'''-''' also there will surely be words and sentences in other languages (like translations, links and so on, typical Wikipedia content) and it will make our data noisy.</p><br />
<br />
* Using parallel corpora and Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those, which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The question of which approach to choose is open to discussion. I think we might experiment with both approaches, or even mix different types of data, and see if there is any difference.<br />
<br />
<u>About language pairs</u><br />
<br />
I would like to work with languages I know or at least can more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. Also, these are closely related languages, so there will be fewer problems with alignment.<br />
<br />
The problem with bel-rus and ukr-rus is the comparatively small amount of postedited data and of more or less suitable parallel corpora. I'm still looking for data, but the current situation looks like this:<br />
<br />
''' bel-rus '''<br />
* Russian -> Belarusian Mediawiki corpus of Apertium translated and postedited data = 1895 sentences.<br />
* Tatoeba parallel corpus = about 1800 sentences<br />
* A few specific parallel corpora like KDE and GNOME<br />
<br />
''' ukr-rus'''<br />
* Russian -> Ukrainian Mediawiki corpus of Apertium translated and postedited data = 60 sentences.<br />
* Tatoeba parallel corpus = about 6500 sentences<br />
* OpenSubtitles2016 parallel corpus = about 400000 sentences (might contain free translations)<br />
* OpenSubtitles2018 parallel corpus = about 600000 sentences (might contain free translations)<br />
* A few specific parallel corpora like KDE and GNOME<br />
<br />
But the methods I'm going to develop are not going to be tied to a language pair. We might choose another language pair or start our experiments with a small amount of data. The question is discussable.<br />
<br />
==== Improving of existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is the low speed of learn_postedits.py and apply_postedits.py. Repeated Apertium calls and a large maximum source/target segment length make the process slow. This makes extracting triplets from a big corpus very hard, while using a small corpus is simply not useful.<br />
<br />
I have added a cache function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): now, instead of calling Apertium every time the program needs to translate/analyze a subsegment, it first checks whether the subsegment is already stored in a database. If it is, it takes it from there; if not, it calls Apertium and then adds the new information to the database.<br />
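The idea behind the cache can be sketched like this (a simplified illustration using sqlite3; the actual ''explain2_cache.py'' may differ in details such as the table layout, the language pair name and the way Apertium is invoked):<br />
<br />
<pre><br />
import sqlite3<br />
import subprocess<br />
<br />
conn = sqlite3.connect('cache.db')<br />
conn.execute('CREATE TABLE IF NOT EXISTS cache (subsegment TEXT PRIMARY KEY, translation TEXT)')<br />
<br />
def cached_translate(subsegment, pair='bel-rus'):<br />
    """Return a cached Apertium translation, calling Apertium only on a cache miss."""<br />
    row = conn.execute('SELECT translation FROM cache WHERE subsegment = ?',<br />
                       (subsegment,)).fetchone()<br />
    if row is not None:<br />
        return row[0]<br />
    translation = subprocess.run(['apertium', pair], input=subsegment,<br />
                                 capture_output=True, text=True).stdout.strip()<br />
    conn.execute('INSERT OR REPLACE INTO cache VALUES (?, ?)', (subsegment, translation))<br />
    conn.commit()<br />
    return translation<br />
</pre><br />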
<br />
Results of running two versions of learn_postedits.py on a 100-sentence corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clear that caching saves time. For 100 sentences the difference is not huge, because not much information is stored in the database yet, but the difference should become much more important when the code is run on 50,000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Searching for other ways of improving the speed.<br />
<br />
2. There might be problems with the current alignment method, because the extracted postediting operations look rather strange even when they were extracted with a high fuzzy match threshold. The alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done.<br />
<br />
==== Searching for extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big training set using different parameters, extract potential postediting operations, apply them to the test set and find those that consistently improve the quality of the original Apertium translation.<br />
<br />
A language model score might serve as a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
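One possible way to operationalise this criterion, assuming a KenLM model trained on the target language (the model file name is a placeholder):<br />
<br />
<pre><br />
import kenlm<br />
<br />
model = kenlm.Model('target_language.lm.bin')  # placeholder path to a target-language model<br />
<br />
def is_improvement(mt_sentence, postedited_sentence):<br />
    """Treat a postedit as helpful if the edited sentence gets a higher<br />
    log-probability from the target-language model than the raw MT output."""<br />
    return model.score(postedited_sentence) > model.score(mt_sentence)<br />
</pre><br />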
<br />
==== Classifying of successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the set of triplets (s, mt, pe) from O such that s = mt.<br />
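In code, that selection is essentially a one-line filter (a sketch only; real candidates would also need lemmatisation and frequency thresholds):<br />
<br />
<pre><br />
def bidix_candidates(operations):<br />
    """Keep triplets where the source subsegment survived translation<br />
    unchanged, i.e. Apertium apparently had no entry for it."""<br />
    return [(s, pe) for (s, mt, pe) in operations if s == mt]<br />
</pre><br />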
<br />
But it obviously won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. These might be scripts which automatically create monodix/bidix entries or write rules based on the given data and its type and insert them into the dictionaries, or it might be a new module in the pipeline. The final decision will depend on the results of the previous stage.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries and reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes that can be easily fixed.<br />
<br />
For example, Apertium translates the Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") into Russian as ''"Кто-то здесь размаўляе по-русски?"'', while the correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can be easily fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' into Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (this translation has several mistakes, but we are only looking at "кубок" here).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe that there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Taking an online statistics course to refresh my knowledge<br />
* Working on the old code<br />
<br />
=== Community bonding period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Discussing questions about data types and language pairs to work with<br />
* Looking for suitable data<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' collecting and parsing the data, doing preprocessing, if needed, improving the existing code</li><br />
<li>'''Week 2:''' improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)</li><br />
<li>'''Week 3:''' making experiments with data, extracting triplets</li><br />
<li>'''Week 4:''' searching of extracted postediting operations that actually improve the quality of translation</li><br />
<li>'''Deliverable #1, June 11 - 15'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 6:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 7:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 8:''' studying and classifying of successful postediting operations</li><br />
<li>'''Deliverable #2, July 9 - 13'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 10:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Final evaluation, August 6 - 14'''</li><br />
<li>'''Project completed:''' The toolbox for automatic improvement of lexical component of a language pair.</li><br />
</ul><br />
<br />
Also I am going to write short notes about work process on the page of my project during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with Apertium system I can work on the project during the community bonding period. After exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into training and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
In addition, I have added a cache function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py).<br />
<br />
I have also refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. The new version, ''cleaned_learn_postedits.py'', is stored in the repository linked above.<br />
<br />
[[Category:GSoC 2018 student proposals|Deltamachine]]</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66468User:Deltamachine/proposal20182018-03-24T13:23:00Z<p>Deltamachine: /* Contact information */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow/Yekaterinburg, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''Github:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3 (Moscow) / UTC+5 (Yekaterinburg)</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor student of Linguistics Faculty in National Research University «Higher School of Economics» (NRU HSE)</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two most favourite fields of studies are linguistics and programming and machine translation combines these fields in the most interesting way. Working with machine translation system will allow me to learn more new things about different languages, their structures and different modern approaches in machine translation and to know what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in Google Summer of Code 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in Google Code-In 2017 as Apertium mentor and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting for me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages because it is very important for saving such languages.<br />
<br />
Also Apertium community is very friendly and open to new members, people here are always ready to help you. It encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on an edit distance algorithm. </p> <p>'''-''' also there will surely be words and sentences in other languages (like translations, links and so on, typical Wikipedia content) and it will make our data noisy.</p><br />
<br />
* Using parallel corpora and Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those, which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The question of choosing an approach is pretty discussable. I think we might experiment with both approaches or even mix different types of data and see if there any difference.<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or at least can more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. Also, these are closely related languages, so there will be fewer problems with alignment.<br />
<br />
The problem with working with bel-rus and ukr-rus is the comparably small amount of postedited data and more or less suitable parallel corpora. I'm still looking for data, but current situation looks like this:<br />
<br />
''' bel-rus '''<br />
* Russian -> Belarusian Mediawiki corpus of Apertium translated and postedited data = 1895 sentences.<br />
* Tatoeba parallel corpus = about 1800 sentences<br />
* A few specific parallel corpora like KDE and GNOME<br />
<br />
''' ukr-rus'''<br />
* Russian -> Ukrainian Mediawiki corpus of Apertium translated and postedited data = 60 sentences.<br />
* Tatoeba parallel corpus = about 6500 sentences<br />
* OpenSubtitles2016 parallel corpus = about 400000 sentences (might contain loose translations)<br />
* OpenSubtitles2018 parallel corpus = about 600000 sentences (might contain loose translations)<br />
* A few specific parallel corpora like KDE and GNOME<br />
<br />
But the methods I'm going to develop are not going to be tied to a language pair. We might choose another language pair or start our experiments with a small amount of data. The question is discussable.<br />
<br />
==== Improving of existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is a pretty low speed of learn_postedits.py and apply_postedits.py. Repeating Apertium calls and big size of maximum source/target segment length make the process pretty slow. It makes extracting triplets from a big corpus very hard and using a small corpus is just useless. <br />
<br />
I have added a cache function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): now, instead of calling Apertium every time the program needs to translate/analyze a subsegment, it first checks whether the subsegment is already stored in a database. If it is, it takes it from there; if not, it calls Apertium and then adds the new information to the database.<br />
<br />
Results of running two versions of learn_postedits.py on 100 sentences corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clearly seen that caching really saves time. For 100 sentences the difference is not very huge, because there is not much information stored in database yet, but it seems that the difference would be very important if we run the code on 50000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Searching for other ways of improving the speed.<br />
<br />
2. There might be problems with current alignment method because currently postediting operations look pretty strange even if they were extracted with high fuzzy match threshold. Alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done.<br />
<br />
==== Searching for extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big train set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of original Apertium translation on a regular basis.<br />
<br />
A language model score might serve as a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
<br />
==== Classifying of successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the set of triplets (s, mt, pe) from O such that s = mt.<br />
<br />
But it obviously won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. It might be some scripts which automatically create monodix/bidix entries or write rules based on a given data and its type and insert it in dictionaries. Also it might be a new module in a pipeline. The final decision will depend on the results of the previous stage.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries and reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes that can be easily fixed.<br />
<br />
For example, Apertium translates Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") in Russian as ''"Кто-то здесь размаўляе по-русски?"'' when a correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can be easily fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' in Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (it has many mistakes, but we are looking only at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe that there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Taking an online statistics course to refresh my knowledge<br />
* Working on the old code<br />
<br />
=== Community bonding period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Discussing questions about data types and language pairs to work with<br />
* Looking for suitable data<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' collecting and parsing the data, doing preprocessing, if needed, improving the existing code</li><br />
<li>'''Week 2:''' improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)</li><br />
<li>'''Week 3:''' making experiments with data, extracting triplets</li><br />
<li>'''Week 4:''' searching of extracted postediting operations that actually improve the quality of translation</li><br />
<li>'''Deliverable #1, June 11 - 15'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 6:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 7:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 8:''' studying and classifying of successful postediting operations</li><br />
<li>'''Deliverable #2, July 9 - 13'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 10:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Final evaluation, August 6 - 14'''</li><br />
<li>'''Project completed:''' The toolbox for automatic improvement of lexical component of a language pair.</li><br />
</ul><br />
<br />
Also I am going to write short notes about work process on the page of my project during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with Apertium system I can work on the project during the community bonding period. After exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses Mediawiki JSON file and splits the whole corpus on train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
In addition, I have added cache function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py):<br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. A new version is stored in a given folder in ''cleaned_learn_postedits.py''<br />
<br />
[[Category:GSoC 2018 student proposals|Deltamachine]]</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66460User:Deltamachine/proposal20182018-03-23T17:43:50Z<p>Deltamachine: /* Choosing a language pair(s) to experiment with and collecting/processing data. */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor student of Linguistics Faculty in National Research University «Higher School of Economics» (NRU HSE)</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two most favourite fields of studies are linguistics and programming and machine translation combines these fields in the most interesting way. Working with machine translation system will allow me to learn more new things about different languages, their structures and different modern approaches in machine translation and to know what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in Google Summer of Code 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in Google Code-In 2017 as Apertium mentor and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting for me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages because it is very important for saving such languages.<br />
<br />
Also Apertium community is very friendly and open to new members, people here are always ready to help you. It encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on an edit distance algorithm. </p> <p>'''-''' also there will surely be words and sentences in other languages (like translations, links and so on, typical Wikipedia content) and it will make our data noisy.</p><br />
<br />
* Using parallel corpora and Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those, which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The question of choosing an approach is pretty discussable. I think we might experiment with both approaches or even mix different types of data and see if there any difference.<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or at least can more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. Also, these are closely related languages, so there will be fewer problems with alignment.<br />
<br />
The problem with working with bel-rus and ukr-rus is the comparably small amount of postedited data and more or less suitable parallel corpora. I'm still looking for data, but current situation looks like this:<br />
<br />
''' bel-rus '''<br />
* Russian -> Belarusian Mediawiki corpus of Apertium translated and postedited data = 1895 sentences.<br />
* Tatoeba parallel corpus = about 1800 sentences<br />
* A few specific parallel corpora like KDE and GNOME<br />
<br />
''' ukr-rus'''<br />
* Russian -> Ukrainian Mediawiki corpus of Apertium translated and postedited data = 60 sentences.<br />
* Tatoeba parallel corpus = about 6500 sentences<br />
* OpenSubtitles2016 parallel corpus = about 400000 sentences (might contain loose translations)<br />
* OpenSubtitles2018 parallel corpus = about 600000 sentences (might contain loose translations)<br />
* A few specific parallel corpora like KDE and GNOME<br />
<br />
But the methods I'm going to develop are not going to be tied to a language pair. We might choose another language pair or start our experiments with a small amount of data. The question is discussable.<br />
<br />
==== Improving of existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is a pretty low speed of learn_postedits.py and apply_postedits.py. Repeating Apertium calls and big size of maximum source/target segment length make the process pretty slow. It makes extracting triplets from a big corpus very hard and using a small corpus is just useless. <br />
<br />
I have added a cache function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): now, instead of calling Apertium every time the program needs to translate/analyze a subsegment, it first checks whether the subsegment is already stored in a database. If it is, it takes it from there; if not, it calls Apertium and then adds the new information to the database.<br />
<br />
Results of running two versions of learn_postedits.py on 100 sentences corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clearly seen that caching really saves time. For 100 sentences the difference is not very huge, because there is not much information stored in database yet, but it seems that the difference would be very important if we run the code on 50000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Searching for other ways of improving the speed.<br />
<br />
2. There might be problems with current alignment method because currently postediting operations look pretty strange even if they were extracted with high fuzzy match threshold. Alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done.<br />
<br />
==== Searching for extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big train set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of original Apertium translation on a regular basis.<br />
<br />
A language model score might serve as a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
<br />
==== Classifying of successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the set of triplets (s, mt, pe) from O such that s = mt.<br />
<br />
But it obviously won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. It might be some scripts which automatically create monodix/bidix entries or write rules based on a given data and its type and insert it in dictionaries. Also it might be a new module in a pipeline. The final decision will depend on the results of the previous stage.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries and reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes that can be easily fixed.<br />
<br />
For example, Apertium translates Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") in Russian as ''"Кто-то здесь размаўляе по-русски?"'' when a correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can be easily fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' in Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (it has many mistakes, but we are looking only at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe that there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Taking an online statistics course to refresh my knowledge<br />
* Working on the old code<br />
<br />
=== Community bonding period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Discussing questions about data types and language pairs to work with<br />
* Looking for suitable data<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' collecting and parsing the data, doing preprocessing, if needed, improving the existing code</li><br />
<li>'''Week 2:''' improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)</li><br />
<li>'''Week 3:''' making experiments with data, extracting triplets</li><br />
<li>'''Week 4:''' searching of extracted postediting operations that actually improve the quality of translation</li><br />
<li>'''Deliverable #1, June 11 - 15'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 6:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 7:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 8:''' studying and classifying of successful postediting operations</li><br />
<li>'''Deliverable #2, July 9 - 13'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 10:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Final evaluation, August 6 - 14'''</li><br />
<li>'''Project completed:''' The toolbox for automatic improvement of lexical component of a language pair.</li><br />
</ul><br />
<br />
Also I am going to write short notes about work process on the page of my project during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with Apertium system I can work on the project during the community bonding period. After exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses Mediawiki JSON file and splits the whole corpus on train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
In addition, I have added cache function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py):<br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. A new version is stored in a given folder in ''cleaned_learn_postedits.py''<br />
<br />
[[Category:GSoC 2018 student proposals|Deltamachine]]</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66447User:Deltamachine/proposal20182018-03-23T10:19:33Z<p>Deltamachine: </p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor student of Linguistics Faculty in National Research University «Higher School of Economics» (NRU HSE)</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two favourite fields of study are linguistics and programming, and machine translation combines them in the most interesting way. Working with a machine translation system will allow me to learn more about different languages and their structures, to study modern approaches to machine translation, and to see what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in Google Summer of Code 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in Google Code-In 2017 as Apertium mentor and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting for me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages because it is very important for saving such languages.<br />
<br />
Also, the Apertium community is very friendly and open to new members; people here are always ready to help you. This encourages me to keep working with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or can at least more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. Also, those are closely related languages, so there will be fewer problems with alignment.<br />
<br />
But the methods I'm going to develop are not going to be tied to a language pair.<br />
<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation, because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences, and these factors can affect the quality of the extracted triplets, because the current method of extracting them is built on an edit distance algorithm.</p> <p>'''-''' there will also surely be words and sentences in other languages (translations, links and other typical Wikipedia content), and this will add noise to our data.</p><br />
<br />
* Using parallel corpora and Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those, which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The question of which approach to choose is open to discussion. I think we might experiment with both approaches, or even mix different types of data, and see if there is any difference.<br />
<br />
==== Improving the existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is the low speed of learn_postedits.py and apply_postedits.py. Repeated Apertium calls and a large maximum source/target segment length make the process quite slow. This makes extracting triplets from a big corpus very hard, while using a small corpus is simply not useful. <br />
<br />
I have added a caching function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): now, instead of calling Apertium every time the program needs to translate/analyze a subsegment, it first checks whether this subsegment is already stored in a database. If it is, the cached result is taken from there; if not, the program calls Apertium and then adds the new information to the database.<br />
<br />
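The idea, as a minimal sketch (the table layout and the ''translate'' helper are simplified illustrations, not the actual code of explain2_cache.py; the language pair is just an example):<br />
<br />
<pre><br />
import sqlite3<br />
import subprocess<br />
<br />
conn = sqlite3.connect('cache.db')<br />
conn.execute('CREATE TABLE IF NOT EXISTS cache (segment TEXT PRIMARY KEY, translation TEXT)')<br />
<br />
def translate(segment, pair='bel-rus'):<br />
    """Translate a subsegment with Apertium, reusing cached results when possible."""<br />
    row = conn.execute('SELECT translation FROM cache WHERE segment = ?', (segment,)).fetchone()<br />
    if row is not None:<br />
        return row[0]<br />
    result = subprocess.run(['apertium', pair], input=segment,<br />
                            capture_output=True, text=True).stdout.strip()<br />
    conn.execute('INSERT INTO cache VALUES (?, ?)', (segment, result))<br />
    conn.commit()<br />
    return result<br />
</pre><br />
<br />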
Results of running the two versions of learn_postedits.py on a 100-sentence corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clear that caching saves time. For 100 sentences the difference is not huge, because not much information is stored in the database yet, but the difference should become substantial when the code is run on 50,000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Searching for other ways of improving the speed.<br />
<br />
2. There might be problems with the current alignment method, because the extracted postediting operations currently look rather strange even with a high fuzzy-match threshold. The alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done.<br />
<br />
==== Searching for extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big train set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of original Apertium translation on a regular basis.<br />
<br />
A language model score might be a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
<br />
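For example, with a KenLM model trained on target-language text (the model file name below is hypothetical), an operation could be kept as a candidate only if it raises the language model score of the sentences it is applied to; a minimal sketch:<br />
<br />
<pre><br />
import kenlm<br />
<br />
model = kenlm.Model('rus.arpa')  # hypothetical target-language model<br />
<br />
def improves(mt_sentence, postedited_sentence):<br />
    """True if the postedited sentence scores higher than the raw MT output."""<br />
    return (model.score(postedited_sentence, bos=True, eos=True) ><br />
            model.score(mt_sentence, bos=True, eos=True))<br />
</pre><br />
<br />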
==== Classifying successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the triplets (s, mt, pe) from O such that s = mt, i.e. cases where Apertium has left the source word untranslated.<br />
<br />
But it obviously won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
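A minimal sketch of the s = mt filter described above, assuming the operations are plain (s, mt, pe) tuples of single words (the function name is illustrative only):<br />
<br />
<pre><br />
def bidix_candidates(operations):<br />
    """Keep triplets where Apertium left the source word untranslated (s == mt)."""<br />
    return [(s, pe) for (s, mt, pe) in operations if s.lower() == mt.lower()]<br />
<br />
# [('размаўляе', 'размаўляе', 'говорит')] -> [('размаўляе', 'говорит')]<br />
</pre><br />
<br />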
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. These might be scripts which automatically create monodix/bidix entries or write rules based on the given data and its type and insert them into the dictionaries. It might also be a new module in the pipeline. The final decision will depend on the results of the previous stage.<br />
<br />
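As a sketch of what such a script could emit, a candidate pair can be turned into a bidix entry skeleton (the part-of-speech tag would still have to come from the morphological analysis; this is only an illustration, not the planned implementation):<br />
<br />
<pre><br />
def make_bidix_entry(source_lemma, target_lemma, pos='vblex'):<br />
    """Build an Apertium bidix <e> element for a source/target lemma pair."""<br />
    return ('<e><p><l>%s<s n="%s"/></l>'<br />
            '<r>%s<s n="%s"/></r></p></e>' % (source_lemma, pos, target_lemma, pos))<br />
<br />
print(make_bidix_entry('размаўлять', 'говорить'))<br />
</pre><br />
<br />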
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries and reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes that can easily be fixed.<br />
<br />
For example, Apertium translates Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") in Russian as ''"Кто-то здесь размаўляе по-русски?"'' when a correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can be easily fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' in Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (it has many mistakes, but we are looking only at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Taking an online statistics course to refresh my knowledge<br />
* Working on the old code<br />
<br />
=== Community bonding period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Discussing questions about data types and language pairs to work with<br />
* Looking for suitable data<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' collecting and parsing the data, doing preprocessing, if needed, improving the existing code</li><br />
<li>'''Week 2:''' improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)</li><br />
<li>'''Week 3:''' making experiments with data, extracting triplets</li><br />
<li>'''Week 4:''' searching for extracted postediting operations that actually improve the quality of translation</li><br />
<li>'''Deliverable #1, June 11 - 15'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' studying and classifying successful postediting operations</li><br />
<li>'''Week 6:''' studying and classifying successful postediting operations</li><br />
<li>'''Week 7:''' studying and classifying successful postediting operations</li><br />
<li>'''Week 8:''' studying and classifying successful postediting operations</li><br />
<li>'''Deliverable #2, July 9 - 13'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 10:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Final evaluation, August 6 - 14'''</li><br />
<li>'''Project completed:''' The toolbox for automatic improvement of lexical component of a language pair.</li><br />
</ul><br />
<br />
Also, I am going to write short notes about the work process on my project page throughout the summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with the Apertium system, I can work on the project during the community bonding period. After the exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
In addition, I have added a caching function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py).<br />
<br />
Also, I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. The new version is stored in the repository above as ''cleaned_learn_postedits.py''.<br />
<br />
[[Category:GSoC 2018 student proposals|Deltamachine]]</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine&diff=66444User:Deltamachine2018-03-23T09:56:59Z<p>Deltamachine: /* GSoC */</p>
<hr />
<div><br />
== Contact info ==<br />
<br />
<p>'''Name:''' Anna Kondratjeva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''VK:''' http://vk.com/anya_archer</p><br />
<p>'''Github:''' http://github.com/deltamachine</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== GSoC ==<br />
<br />
[http://wiki.apertium.org/wiki/User:Deltamachine/proposal2017 My proposal to Google Summer of Code 2017]<br />
<br />
[http://wiki.apertium.org/wiki/User:Deltamachine/proposal2018 My proposal to Google Summer of Code 2018]</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2017&diff=66443User:Deltamachine/proposal20172018-03-23T09:56:34Z<p>Deltamachine: Created page with "== Contact information == <p>'''Name:''' Anna Kondrateva</p> <p>'''Location:''' Moscow, Russia</p> <p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p> <p>'''Phone number:''' +792..."</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a second-year bachelor's student at the Faculty of Linguistics of the National Research University «Higher School of Economics» (NRU HSE).</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Introduction to Data Analysis</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
</ul><br />
<p>'''Technical skills:''' Python (advanced), HTML, CSS, Flask, Django, SQLite (familiar)</p><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I am deeply interested in machine translation, because it combines my two favourite fields of study - linguistics and programming. As a computational linguist, I would like to know how machine translation systems are built, how they work with language material and how we can improve the results of their work. So, on the one hand, I can learn a lot of new things about the structures of different languages while working with a machine translation system like Apertium. On the other hand, I can significantly improve my coding skills, learn more about natural language processing and create something great and useful.<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
There are three main reasons why I want to work with Apertium:<br />
<p>1. Apertium works with a lot of minority languages, which is great, because this is quite unusual for a machine translation system. There are a lot of systems which can translate from English to German well enough, but there are very few which can translate, for example, from Kazakh to Tatar. Apertium is one of those few systems, and I believe it does a very important job.</p><br />
<p>2. Apertium does rule-based machine translation, which is unusual too. As a linguist I am very curious to learn more about this approach, because rule-based translation requires close work with language structure and a large amount of language data.</p><br />
<p>3. Apertium community is very friendly, helpful, responsive and open to new members, which is very attractive.</p><br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to implement a shallow syntactic function labeller.<br />
<br />
The first idea was to take an annotated corpus (dependency treebank in UD format) and calculate the table "surface form - label - frequency", then take a test corpus, assign the most frequent label from the table for each token in it and calculate the accuracy score. All materials and scripts with descriptions are available in "Coding challenge" section.<br />
<br />
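A minimal sketch of this baseline (CoNLL-U handling is simplified; column 2 is the surface form and column 8 the dependency relation):<br />
<br />
<pre><br />
from collections import Counter, defaultdict<br />
<br />
def most_frequent_labels(train_conllu):<br />
    """Build a surface form -> most frequent label table from a CoNLL-U file."""<br />
    counts = defaultdict(Counter)<br />
    with open(train_conllu, encoding='utf-8') as f:<br />
        for line in f:<br />
            cols = line.strip().split('\t')<br />
            if len(cols) == 10 and cols[0].isdigit():<br />
                counts[cols[1].lower()][cols[7]] += 1  # FORM -> DEPREL<br />
    return {form: labels.most_common(1)[0][0] for form, labels in counts.items()}<br />
</pre><br />
<br />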
It turned out that this approach gives acceptable results (for example, the accuracy score was 0.8 for Russian, 0.68 for English, 0.75 for Spanish and Finnish, 0.76 for Basque), but we can definitely reach better ones. <br />
<br />
So, the next idea is to use machine learning methods for creating a better prototype of shallow syntactic function labeller.<br />
<br />
'''A brief concept:''' <br />
The shallow syntactic function labeller takes a string in Apertium stream format, parses it into a sequence of morphological tags and gives it to a classifier. The classifier is a sequence-to-sequence model trained on prepared datasets, which were made from parsed syntax-labelled corpora (for instance, UD-treebanks).<br />
<br />
The dataset for the encoder contains sequences of morphological tags, and the dataset for the decoder contains sequences of labels; in both cases one sequence corresponds to one sentence. UD tags in the datasets are replaced with suitable tags from the Apertium tagset. Here is an example of this transformation:<br />
<br />
{| class = "wikitable" style = "background-color: white;"<br />
|-<br />
|'''UD'''<br />
|'''Apertium'''<br />
|-<br />
|NOUN<br />
|n<br />
|-<br />
|AUX<br />
|vaux<br />
|-<br />
|INTJ<br />
|ij<br />
|}<br />
<br />
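A sketch of this replacement step, covering only the three tags from the table above (the real mapping would have to cover both tagsets in full):<br />
<br />
<pre><br />
UD_TO_APERTIUM = {'NOUN': 'n', 'AUX': 'vaux', 'INTJ': 'ij'}<br />
<br />
def convert(ud_tags):<br />
    """Replace UD POS tags with Apertium tags, lowercasing any unmapped tag."""<br />
    return [UD_TO_APERTIUM.get(tag, tag.lower()) for tag in ud_tags]<br />
<br />
print(convert(['NOUN', 'AUX', 'INTJ']))  # ['n', 'vaux', 'ij']<br />
</pre><br />
<br />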
A separate model will be created for each language; moreover, the models will be trained only on sequences of morphological tags, not, for example, on tokens + tags. This means the models should not be too heavy, so they will not slow down the entire workflow.<br />
<br />
The classifier analyzes the given sequence of morphological tags, gives a sequence of labels as an output and the labeller applies these labels to the original string. The result could look like this: <br />
<blockquote><br />
^vino<n><m><sg>$ => ^vino<n><m><sg>'''<nsubj>'''$<br />
</blockquote><br />
<br />
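A sketch of this final step, assuming one reading per lexical unit and one label per unit (the stream-format handling is simplified):<br />
<br />
<pre><br />
import re<br />
<br />
def apply_labels(line, labels):<br />
    """Append a syntactic label to each ^lemma<tags>$ unit of an Apertium stream line."""<br />
    units = re.findall(r'\^[^$]*\$', line)<br />
    for unit, label in zip(units, labels):<br />
        line = line.replace(unit, unit[:-1] + '<%s>$' % label, 1)<br />
    return line<br />
<br />
print(apply_labels('^vino<n><m><sg>$', ['nsubj']))  # ^vino<n><m><sg><nsubj>$<br />
</pre><br />
<br />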
So, at the end of the work there will be:<br />
<ul><br />
<li>The labeller itself, which parses the string, restores a model for a needed language from a file, gives a sequence of tags to the model, gets a sequence of labels as an output and applies these labels to the original string</li><br />
<li>Files with trained models, which are saved in a suitable format (it could be, for example, JSON)</li><br />
</ul><br />
<br />
The task could be done with Tensorflow, but we may need a library that is less complex and has a simpler runtime. The idea can also be realised with Keras (it has a seq2seq add-on) and Theano as a backend; these libraries are not as heavy as Tensorflow, which is usually used for creating sequence-to-sequence models, so the labeller should work reasonably fast. Moreover, a Keras/Theano model can be run on regular hardware.<br />
<br />
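A minimal sketch of such an encoder-decoder in Keras (the dimensions are placeholders and the inputs are assumed to be one-hot encoded; this is only an outline of the planned model, not its final form):<br />
<br />
<pre><br />
from keras.models import Model<br />
from keras.layers import Input, LSTM, Dense<br />
<br />
num_tags, num_labels, latent_dim = 60, 40, 128  # placeholder sizes<br />
<br />
# encoder: reads a sequence of morphological tags<br />
encoder_inputs = Input(shape=(None, num_tags))<br />
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)<br />
<br />
# decoder: predicts the sequence of syntactic function labels<br />
decoder_inputs = Input(shape=(None, num_labels))<br />
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)<br />
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])<br />
decoder_outputs = Dense(num_labels, activation='softmax')(decoder_outputs)<br />
<br />
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)<br />
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')<br />
</pre><br />
<br />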
'''Integration into Apertium pipeline:'''<br />
As it has been said on the page of Google Summer of Code ideas, the main task is only to create a tool, which could be adapted and used for some language pairs after Google Summer of Code.<br />
<br />
However, it seems that we are able to test this approach during the summer work. We may adapt the labeller for North Sámi - Norwegian Bokmål language pair, which already works with syntactic labelling, and then measure, how well it works. In the North Sámi - Norwegian Bokmål pipeline morphological disambiguation and syntax labelling were run as one CG module. Now they are split into two different parts (mor.rlx.bin and syn.rlx.bin), so we can try to replace the syntax labelling part with our machine-learned module and then test it.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
Adding the shallow function labeller on top of the approaches that are already present (HMM part-of-speech tagging, constraint grammar, pattern-based syntactic transfer) should help to handle some existing problems in translating between languages that are not closely related and belong to different types. <br />
<br />
A few examples of such problems:<br />
<ul><br />
<li>When you are working with an ergative language, it may be useful to know, if an absolutive is subject or object. Here is an example from Basque - English:<br />
<br />
{| class = "wikitable" style = "background-color: white;"<br />
|-<br />
|'''Basque'''<br />
|'''Current Apertium translation'''<br />
|'''English'''<br />
|-<br />
|Otsoa'''<abs><nsubj>''' etorri da<br />
|The wolf he has come<br />
|The wolf has come<br />
|-<br />
|Ehiztariak'''<erg><nsubj>''' otsoa'''<abs><obj>''' harrapatu du<br />
|The hunter the wolf he has caught<br />
|The hunter has caught the wolf // The wolf was caught by the hunter<br />
|-<br />
|}<br />
<br />
If we knew the syntactic function of the word in the absolutive case, we could change the word order in the translated English sentence and get a better translation.<br />
<br />
</li> <br />
<li>There may be cases like classical Russian example "Мать любит дочь", which equally could mean "Mother loves daughter" or "Daughter loves mother". Machine translation systems always prefer the first variant, but due to the comparably free word order in Russian the meaning actually depends on syntactic functions of words.</li><br />
<li>Also there are a lot of cases when case translation is ambiguous and it could be really helpful for disambiguation to know the syntactic function of the word. Russian dative can be translated with English dative or with English nominative, but the choice depends on the syntactic function of the word. For example, "дай '''мне<dat><iobj>''' ручку" should be translated as "give '''me''' the pen", but "что '''мне<dat><obl>''' делать" should be translated as "what should '''I''' do". </li><br />
</ul><br />
<br />
So, shallow function labelling is a good way to reach better translation quality for "ergative - nominative", "synthetic - analytic" and "(comparatively) free word order - strict word order" language pairs. In my opinion, a shallow syntactic function labeller trained on corpus data is a simpler and more effective way to label sentences than a rule-based approach, because writing a good enough list of rules for determining the syntactic function of a word seems to be almost impossible even for one language.<br />
<br />
Also I believe that the shallow function labelling stage can help to make the chunking stage of translation easier and more accurate.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, the shallow syntactic function labeller, as a part of Apertium system, will help to improve the quality of translation for many language pairs.<br />
<br />
Secondly, there are currently not too many projects about using machine learning methods for shallow syntactic function labelling, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
<ul><br />
<li>Getting closer with Apertium and its tools, reading documentation</li><br />
<li>Setting up Linux and getting used to it</li><br />
<li>Learning more about machine learning, looking for more researches about sequence-to-sequence models</li><br />
<li>Learning more about UD/VISL treebanks and tagsets and North Sámi syntax-labelled corpus</li><br />
</ul><br />
<br />
=== Community bonding period ===<br />
<ul><br />
<li>Choosing the language pairs the shallow function labeller will work with. Currently I am thinking about Basque, English, Russian/Finnish, maybe Spanish, but this needs to be discussed. <br />
Also I will create a module for North Sámi → Norwegian Bokmål language pair, which already uses syntactic labelling, in order to evaluate the quality of the prototype.</li><br />
<li>Choosing the most suitable Python ML library</li><br />
<li>Thinking about how to integrate the classifier into North Sámi → Norwegian Bokmål pipeline<br />
<li>Learning more about possible problems, especially about discrepancies between all needed tagsets</li><br />
</ul><br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: preparing the data (includes a lot of thinking)====<br />
<p></p><br />
<li>'''Week 1:''' writing a script for parsing UD-treebanks</li><br />
<li>'''Week 2:''' writing a script for parsing North Sámi syntax-labelled corpus</li><br />
<li>'''Week 3:''' comparing UD and Apertium tagsets, writing a script for replacing UD tags with suitable Apertium tags, writing scripts for handling other possible discrepancies between all needed tagsets</li><br />
<li>'''Week 4:''' creating datasets (in a few possible variants), writing a script for parsing a string in Apertium stream format into a sequence of morphological tags</li><br />
<li>'''Deliverable #1, June 26 - 30'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: building the classifier ====<br />
<p></p><br />
<li>'''Week 5:''' building the model</li><br />
<li>'''Week 6:''' training the classifier, evaluating the quality of the prototype</li><br />
<li>'''Week 7:''' further training, working on improvements of the model</li><br />
<li>'''Week 8:''' final testing, writing a script, which applies labels to the original string in Apertium stream format</li><br />
<li>'''Deliverable #2, July 24 - 28'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: testing the labeller on North Sámi → Norwegian Bokmål language pair ====<br />
<p></p><br />
<li>'''Week 9:''' collecting all parts of the labeller together, adding machine-learned module instead of the syntax labelling part of CG module</li><br />
<li>'''Week 10:''' adding machine-learned module instead of the syntax labelling part of CG module</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Project completed:''' the prototype shallow syntactic function labeller, which is able to label sentences well enough and works with several languages.</li><br />
</ul><br />
<br />
Also I am going to write short notes about work process on the page of my project during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But I will try to take as many exams as possible in advance, in May, so this may change.<br />
After that I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/wannabe_hackerman</p><br />
<br />
<ul><br />
<li>''flatten_conllu.py:'' A script that takes a dependency treebank in UD format and "flattens" it, that is, applies the following transformations:</li><br />
<ul><br />
<li>Words with the @conj relation take the label of their head</li><br />
<li>Words with the @parataxis relation take the label of their head</li><br />
</ul><br />
<br />
<li>''calculate_accuracy_index.py:'' A script that does the following:</li><br />
<ul><br />
<li>Takes -train.conllu file and calculates the table: surface_form - label - frequency</li><br />
<li>Takes -dev.conllu file and for each token assigns the most frequent label from the table</li><br />
<li>Calculates the accuracy index</li><br />
</ul><br />
<br />
<li>''label_asf.py:'' A script that takes a sentence in Apertium stream format and for each surface form applies the most frequent label from the labelled corpus.<br />
</li><br />
</ul><br />
<br />
<br />
[[Category:GSoC 2017 Student Proposals|Deltamachine]]</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal&diff=66442User:Deltamachine/proposal2018-03-23T09:56:07Z<p>Deltamachine: Blanked the page</p>
<hr />
<div></div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66441User:Deltamachine/proposal20182018-03-23T09:22:03Z<p>Deltamachine: /* Coding challenge */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor's student at the Faculty of Linguistics of the National Research University «Higher School of Economics» (NRU HSE).</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two most favourite fields of studies are linguistics and programming and machine translation combines these fields in the most interesting way. Working with machine translation system will allow me to learn more new things about different languages, their structures and different modern approaches in machine translation and to know what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in Google Summer of Code 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in Google Code-In 2017 as Apertium mentor and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting for me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages because it is very important for saving such languages.<br />
<br />
Also Apertium community is very friendly and open to new members, people here are always ready to help you. It encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or at least can more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there are still a lot of work to be done. Also those are closely related languages so there will be less problems with alignment.<br />
<br />
But the methods I'm going to develop are not going to be tied to a language pair.<br />
<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on edit distance algorithm. </p> <p>'''-''' also there will surely be words and sentences in other languages (like translations, links and so on, typical Wikipedia content) and it will noise our data.</p><br />
<br />
* Using parallel corpora and Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those, which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The question of choosing an approach is pretty discussable. I think we might experiment with both approaches or even mix different types of data and see if there any difference.<br />
<br />
==== Improving the existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is a pretty low speed of learn_postedits.py and apply_postedits.py. Repeating Apertium calls and big size of maximum source/target segment length make the process pretty slow. It makes extracting triplets from a big corpus very hard and using a small corpus is just useless. <br />
<br />
I have added cache function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): now instead of calling Apertium every time the program needs to translate/analyze any subsegment, it firstly checks, is this subsegment already stored in a database. If yes, it takes it from here, if no, it calls Apertium and then adds new information to the database.<br />
<br />
Results of running two versions of learn_postedits.py on 100 sentences corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clearly seen that caching really saves time. For 100 sentences the difference is not very huge, because there is not much information stored in database yet, but it seems that the difference would be very important if we run the code on 50000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Searching for other ways of improving the speed.<br />
<br />
2. There might be problems with current alignment method because currently postediting operations look pretty strange even if they were extracted with high fuzzy match threshold. Alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done.<br />
<br />
==== Searching for extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big train set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of original Apertium translation on a regular basis.<br />
<br />
A language model score might be a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
<br />
==== Classifying successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose a set of triplets from O such that s = mt for all (s, pe) from O.<br />
<br />
But it obviously won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. It might be some scripts which automatically create monodix/bidix entries or write rules based on a given data and its type and insert it in dictionaries. Also it might be a new module in a pipeline. The final decision will depend on the results of the previous stage.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries and reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes that can easily be fixed.<br />
<br />
For example, Apertium translates Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") in Russian as ''"Кто-то здесь размаўляе по-русски?"'' when a correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can be easily fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' in Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (it has many mistakes, but we are looking only at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Taking an online statistics course to refresh my knowledge<br />
* Working on the old code<br />
<br />
=== Community bonding period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Discussing questions about data types and language pairs to work with<br />
* Looking for suitable data<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' collecting and parsing the data, doing preprocessing, if needed, improving the existing code</li><br />
<li>'''Week 2:''' improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)</li><br />
<li>'''Week 3:''' making experiments with data, extracting triplets</li><br />
<li>'''Week 4:''' searching for extracted postediting operations that actually improve the quality of translation</li><br />
<li>'''Deliverable #1, June 11 - 15'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' studying and classifying successful postediting operations</li><br />
<li>'''Week 6:''' studying and classifying successful postediting operations</li><br />
<li>'''Week 7:''' studying and classifying successful postediting operations</li><br />
<li>'''Week 8:''' studying and classifying successful postediting operations</li><br />
<li>'''Deliverable #2, July 9 - 13'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 10:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Final evaluation, August 6 - 14'''</li><br />
<li>'''Project completed:''' The toolbox for automatic improvement of lexical component of a language pair.</li><br />
</ul><br />
<br />
Also, I am going to write short notes about the work process on my project page throughout the summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with Apertium system I can work on the project during the community bonding period. After exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses Mediawiki JSON file and splits the whole corpus on train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
In addition, I have added a caching function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py).<br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. A new version is stored in a given folder in ''cleaned_learn_postedits.py''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66440User:Deltamachine/proposal20182018-03-23T09:20:27Z<p>Deltamachine: /* Improving of existing methods */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor's student at the Faculty of Linguistics of the National Research University «Higher School of Economics» (NRU HSE).</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two most favourite fields of studies are linguistics and programming and machine translation combines these fields in the most interesting way. Working with machine translation system will allow me to learn more new things about different languages, their structures and different modern approaches in machine translation and to know what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in Google Summer of Code 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in Google Code-In 2017 as Apertium mentor and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting for me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages because it is very important for saving such languages.<br />
<br />
Also Apertium community is very friendly and open to new members, people here are always ready to help you. It encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or at least can more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there are still a lot of work to be done. Also those are closely related languages so there will be less problems with alignment.<br />
<br />
But the methods I'm going to develop are not going to be tied to a language pair.<br />
<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on edit distance algorithm. </p> <p>'''-''' also there will surely be words and sentences in other languages (like translations, links and so on, typical Wikipedia content) and it will noise our data.</p><br />
<br />
* Using parallel corpora and Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those, which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The question of choosing an approach is pretty discussable. I think we might experiment with both approaches or even mix different types of data and see if there any difference.<br />
<br />
==== Improving the existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is a pretty low speed of learn_postedits.py and apply_postedits.py. Repeating Apertium calls and big size of maximum source/target segment length make the process pretty slow. It makes extracting triplets from a big corpus very hard and using a small corpus is just useless. <br />
<br />
I have added cache function to learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): now instead of calling Apertium every time the program needs to translate/analyze any subsegment, it firstly checks, is this subsegment already stored in a database. If yes, it takes it from here, if no, it calls Apertium and then adds new information to the database.<br />
<br />
Results of running two versions of learn_postedits.py on 100 sentences corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clearly seen that caching really saves time. For 100 sentences the difference is not very huge, because there is not much information stored in database yet, but it seems that the difference would be very important if we run the code on 50000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Searching for other ways of improving the speed.<br />
<br />
2. There might be problems with current alignment method because currently postediting operations look pretty strange even if they were extracted with high fuzzy match threshold. Alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done.<br />
<br />
==== Searching for extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big train set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of original Apertium translation on a regular basis.<br />
<br />
A language model score might be a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
<br />
==== Classifying successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose a set of triplets from O such that s = mt for all (s, pe) from O.<br />
<br />
But it obviously won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. These might be scripts which automatically create monodix/bidix entries or write rules based on the given data and its type and insert them into the dictionaries, or a new module in the pipeline. The final decision will depend on the results of the previous stage.<br />
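<br />
As a very small illustration of what such a tool could output, the function below builds an Apertium bidix entry element for a candidate pair; the tag string it takes is an assumption, and a real tool would also have to pick the right place in the .dix file and the right paradigms:<br />
<br />
<pre><br />
def bidix_entry(source_lemma, target_lemma, tags='n'):<br />
    # Builds an Apertium bidix entry for a candidate pair, e.g.<br />
    #   <e><p><l>размаўляць<s n="vblex"/></l><r>говорить<s n="vblex"/></r></p></e><br />
    # The tags are an assumption: a real tool would need the correct<br />
    # part-of-speech information for both sides.<br />
    tag_elems = ''.join('<s n="%s"/>' % t for t in tags.split('.'))<br />
    return ('<e><p><l>%s%s</l><r>%s%s</r></p></e>'<br />
            % (source_lemma, tag_elems, target_lemma, tag_elems))<br />
</pre><br />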
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries while reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes that can be easily fixed.<br />
<br />
For example, Apertium translates the Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") into Russian as ''"Кто-то здесь размаўляе по-русски?"'', whereas a correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can easily be fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as the English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' into Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (this translation has many mistakes, but we are only looking at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying the postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the words these sentences have in common and notice that "выпiть" appears right before "кубак" in every one of them) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe that there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, the methods developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Taking an online statistics course to refresh my knowledge<br />
* Working on the old code<br />
<br />
=== Community bonding period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Discussing questions about data types and language pairs to work with<br />
* Looking for a suitable data<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' collecting and parsing the data, doing preprocessing, if needed, improving the existing code</li><br />
<li>'''Week 2:''' improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)</li><br />
<li>'''Week 3:''' making experiments with data, extracting triplets</li><br />
<li>'''Week 4:''' searching of extracted postediting operations that actually improve the quality of translation</li><br />
<li>'''Deliverable #1, June 11 - 15'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 6:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 7:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 8:''' studying and classifying of successful postediting operations</li><br />
<li>'''Deliverable #2, July 9 - 13'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 10:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Final evaluation, August 6 - 14'''</li><br />
<li>'''Project completed:''' The toolbox for automatic improvement of lexical component of a language pair.</li><br />
</ul><br />
<br />
Also, I am going to write short notes about the work process on my project page throughout the summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so until then I will be able to work only 20-25 hours per week. But since I am already familiar with the Apertium system, I can work on the project during the community bonding period. After the exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. A new version is stored in a given folder in ''cleaned_learn_postedits.py''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66439User:Deltamachine/proposal20182018-03-23T09:19:34Z<p>Deltamachine: /* Reasons why Google and Apertium should sponsor it */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor's student at the Faculty of Linguistics of the National Research University «Higher School of Economics» (NRU HSE)</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two favourite fields of study are linguistics and programming, and machine translation combines them in the most interesting way. Working with a machine translation system will allow me to learn more about different languages and their structures, as well as about modern approaches to machine translation, and to see what results such systems can achieve. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in Google Summer of Code 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also, I have participated in Google Code-In 2017 as an Apertium mentor and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting to me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages, because this is very important for preserving them.<br />
<br />
Also, the Apertium community is very friendly and open to new members; people here are always ready to help. This encourages me to keep working with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of the lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or at least can more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. Also, those are closely related languages, so there will be fewer problems with alignment.<br />
<br />
But the methods I'm going to develop are not going to be tied to a language pair.<br />
<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences, and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on an edit distance algorithm. </p> <p>'''-''' also, there will surely be words and sentences in other languages (translations, links and so on, typical Wikipedia content), and this will add noise to our data.</p><br />
<br />
* Using parallel corpora and the Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are likely to contain less noise.</p> <p>'''-''' the target side might be very different from the Apertium-translated one, especially for long and complicated sentences.</p><br />
<br />
The question of which approach to choose is still open to discussion. I think we might experiment with both approaches, or even mix different types of data, and see whether there is any difference.<br />
<br />
==== Improving of existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is the low speed of learn_postedits.py and apply_postedits.py. Repeated Apertium calls and a large maximum source/target segment length make the process slow. This makes extracting triplets from a big corpus very hard, while a small corpus gives too little data to be useful. <br />
<br />
I have implemented a caching function for learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): instead of calling Apertium every time the program needs to translate/analyze a subsegment, it first checks whether this subsegment is already stored in a database. If it is, the stored result is reused; if not, Apertium is called and the new information is added to the database.<br />
<br />
Results of running two versions of learn_postedits.py on 100 sentences corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clear that caching saves time. For 100 sentences the difference is not very big, because there is not much information stored in the database yet, but the difference should become much more significant when the code is run on 50000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Searching for other ways of improving the speed.<br />
<br />
2. There might be problems with the current alignment method, because the extracted postediting operations look rather strange even when a high fuzzy-match threshold is used. The alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done. <br />
<br />
==== Searching for extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big train set using different parameters, extract potential postediting operations, apply them to the test set and find those that consistently improve the quality of the original Apertium translation.<br />
<br />
A language model score might serve as a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
<br />
==== Classifying of successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the set of triplets (s, mt, pe) from O such that s = mt, i.e. cases where a word was left untranslated.<br />
<br />
But it obviously won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. These might be scripts which automatically create monodix/bidix entries or write rules based on the given data and its type and insert them into the dictionaries, or a new module in the pipeline. The final decision will depend on the results of the previous stage.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries while reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes that can be easily fixed.<br />
<br />
For example, Apertium translates Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") in Russian as ''"Кто-то здесь размаўляе по-русски?"'' when a correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can be easily fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' in Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (it has many mistakes, but we are looking only at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe that there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, the methods developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Taking an online statistics course to refresh my knowledge<br />
* Working on the old code<br />
<br />
=== Community bonding period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Discussing questions about data types and language pairs to work with<br />
* Looking for a suitable data<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' collecting and parsing the data, doing preprocessing, if needed, improving the existing code</li><br />
<li>'''Week 2:''' improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)</li><br />
<li>'''Week 3:''' making experiments with data, extracting triplets</li><br />
<li>'''Week 4:''' searching of extracted postediting operations that actually improve the quality of translation</li><br />
<li>'''Deliverable #1, June 11 - 15'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 6:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 7:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 8:''' studying and classifying of successful postediting operations</li><br />
<li>'''Deliverable #2, July 9 - 13'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 10:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Final evaluation, August 6 - 14'''</li><br />
<li>'''Project completed:''' The toolbox for automatic improvement of lexical component of a language pair.</li><br />
</ul><br />
<br />
Also, I am going to write short notes about the work process on my project page throughout the summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so until then I will be able to work only 20-25 hours per week. But since I am already familiar with the Apertium system, I can work on the project during the community bonding period. After the exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. A new version is stored in a given folder in ''cleaned_learn_postedits.py''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66437User:Deltamachine/proposal20182018-03-23T09:16:52Z<p>Deltamachine: /* Work stages */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor student of Linguistics Faculty in National Research University «Higher School of Economics» (NRU HSE)</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two most favourite fields of studies are linguistics and programming and machine translation combines these fields in the most interesting way. Working with machine translation system will allow me to learn more new things about different languages, their structures and different modern approaches in machine translation and to know what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in Google Summer of Code 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in Google Code-In 2017 as Apertium mentor and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting for me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages because it is very important for saving such languages.<br />
<br />
Also Apertium community is very friendly and open to new members, people here are always ready to help you. It encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or at least can more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there are still a lot of work to be done. Also those are closely related languages so there will be less problems with alignment.<br />
<br />
But the methods I'm going to develop are not going to be tied to a language pair.<br />
<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on edit distance algorithm. </p> <p>'''-''' also there will surely be words and sentences in other languages (like translations, links and so on, typical Wikipedia content) and it will noise our data.</p><br />
<br />
* Using parallel corpora and Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those, which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The question of which approach to choose is still open to discussion. I think we might experiment with both approaches, or even mix different types of data, and see whether there is any difference.<br />
<br />
==== Improving of existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is a pretty low speed of learn_postedits.py and apply_postedits.py. Repeating Apertium calls and big size of maximum source/target segment length make the process pretty slow. It makes extracting triplets from a big corpus very hard and using a small corpus is just useless. <br />
<br />
I have implemented cache function for learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): now instead of calling Apertium every time the program needs to translate/analyze any subsegment, it firstly checks, is this subsegment already stored in a database. If yes, it takes it from here, if no, it calls Apertium and then adds new information to the database.<br />
<br />
Results of running two versions of learn_postedits.py on 100 sentences corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clearly seen that caching really saves time. For 100 sentences the difference is not very huge, because there is not much information stored in database yet, but it seems that the difference would be very important if we run the code on 50000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Searching for other ways of improving the speed.<br />
<br />
2. There might be problems with current alignment method because currently postediting operations look pretty strange even if they were extracted with high fuzzy match threshold. Alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done. <br />
<br />
==== Searching for extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big train set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of original Apertium translation on a regular basis.<br />
<br />
A language model score might serve as a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
<br />
==== Classifying of successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the set of triplets (s, mt, pe) from O such that s = mt, i.e. cases where a word was left untranslated.<br />
<br />
But it obviously won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. It might be some scripts which automatically create monodix/bidix entries or write rules based on a given data and its type and insert it in dictionaries. Also it might be a new module in a pipeline. The final decision will depend on the results of the previous stage.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries while reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes which can be easily fixed.<br />
<br />
For example, Apertium translates Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") in Russian as ''"Кто-то здесь размаўляе по-русски?"'' when a correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can be easily fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' in Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (it has many mistakes, but we are looking only at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe that there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Taking an online statistics course to refresh my knowledge<br />
* Working on the old code<br />
<br />
=== Community bonding period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Discussing questions about data types and language pairs to work with<br />
* Looking for a suitable data<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' collecting and parsing the data, doing preprocessing, if needed, improving the existing code</li><br />
<li>'''Week 2:''' improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)</li><br />
<li>'''Week 3:''' making experiments with data, extracting triplets</li><br />
<li>'''Week 4:''' searching of extracted postediting operations that actually improve the quality of translation</li><br />
<li>'''Deliverable #1, June 11 - 15'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 6:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 7:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 8:''' studying and classifying of successful postediting operations</li><br />
<li>'''Deliverable #2, July 9 - 13'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 10:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Final evaluation, August 6 - 14'''</li><br />
<li>'''Project completed:''' The toolbox for automatic improvement of lexical component of a language pair.</li><br />
</ul><br />
<br />
Also I am going to write short notes about work process on the page of my project during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with Apertium system I can work on the project during the community bonding period. After exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses Mediawiki JSON file and splits the whole corpus on train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. A new version is stored in a given folder in ''cleaned_learn_postedits.py''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66436User:Deltamachine/proposal20182018-03-23T09:06:58Z<p>Deltamachine: /* Why is it that you are interested in Apertium? */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor student of Linguistics Faculty in National Research University «Higher School of Economics» (NRU HSE)</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two most favourite fields of studies are linguistics and programming and machine translation combines these fields in the most interesting way. Working with machine translation system will allow me to learn more new things about different languages, their structures and different modern approaches in machine translation and to know what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in Google Summer of Code 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in Google Code-In 2017 as Apertium mentor and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting for me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages because it is very important for saving such languages.<br />
<br />
Also Apertium community is very friendly and open to new members, people here are always ready to help you. It encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or at least can more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. But the methods I'm going to develop won't be tied to a language pair.<br />
<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on edit distance algorithm. </p> <p>'''-''' also there will surely be words and sentences in other languages (like translations, links and so on, typical Wikipedia content) and it will noise our data.</p><br />
<br />
* Using parallel corpora and Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those, which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The question of which approach to choose is still open to discussion. I think we might experiment with both approaches, or even mix different types of data, and see whether there is any difference.<br />
<br />
==== Improving of existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is a pretty low speed of learn_postedits.py and apply_postedits.py. Repeating Apertium calls and big size of maximum source/target segment length make the process pretty slow. It makes extracting triplets from a big corpus very hard and using a small corpus is just useless. <br />
<br />
I have implemented cache function for learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): now instead of calling Apertium every time the program needs to translate/analyze any subsegment, it firstly checks, is this subsegment already stored in a database. If yes, it takes it from here, if no, it calls Apertium and then adds new information to the database.<br />
<br />
Results of running two versions of learn_postedits.py on 100 sentences corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clearly seen that caching really saves time. For 100 sentences the difference is not very huge, because there is not much information stored in database yet, but it seems that the difference would be very important if we run the code on 50000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Search for other ways we can improve the speed.<br />
<br />
2. There might be problems with current alignment method because currently postediting operations look pretty strange even if they were extracted with high fuzzy match threshold. Alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done. <br />
<br />
==== Search of extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big train set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of original Apertium translation on a regular basis.<br />
<br />
A language model score might serve as a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
<br />
==== Classifying of successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the set of triplets (s, mt, pe) from O such that s = mt, i.e. cases where a word was left untranslated.<br />
But it won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. It might be some scripts which automatically create monodix/bidix entries or write rules based on a given data and its type and insert it in dictionaries.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries while reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes which can be easily fixed.<br />
<br />
For example, Apertium translates Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") in Russian as ''"Кто-то здесь размаўляе по-русски?"'' when a correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can be easily fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' in Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (it has many mistakes, but we are looking only at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe that there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Taking an online statistics course to refresh my knowledge<br />
* Working on the old code<br />
<br />
=== Community bonding period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Discussing questions about data types and language pairs to work with<br />
* Looking for a suitable data<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' collecting and parsing the data, doing preprocessing, if needed, improving the existing code</li><br />
<li>'''Week 2:''' improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)</li><br />
<li>'''Week 3:''' making experiments with data, extracting triplets</li><br />
<li>'''Week 4:''' searching of extracted postediting operations that actually improve the quality of translation</li><br />
<li>'''Deliverable #1, June 11 - 15'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 6:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 7:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 8:''' studying and classifying of successful postediting operations</li><br />
<li>'''Deliverable #2, July 9 - 13'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 10:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Final evaluation, August 6 - 14'''</li><br />
<li>'''Project completed:''' The toolbox for automatic improvement of lexical component of a language pair.</li><br />
</ul><br />
<br />
Also I am going to write short notes about work process on the page of my project during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with Apertium system I can work on the project during the community bonding period. After exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses Mediawiki JSON file and splits the whole corpus on train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. A new version is stored in a given folder in ''cleaned_learn_postedits.py''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66435User:Deltamachine/proposal20182018-03-23T09:04:38Z<p>Deltamachine: /* Work period */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor student of Linguistics Faculty in National Research University «Higher School of Economics» (NRU HSE)</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two most favourite fields of studies are linguistics and programming and machine translation combines these fields in the most interesting way. Working with machine translation system will allow me to learn more new things about different languages, their structures and different modern approaches in machine translation and to know what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I participated in GSoC 2017 with Apertium and it was a great experience: I successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them further.<br />
I also participated in GCI 2017 as a mentor for Apertium, and that was great too. So I am very interested in continuing to contribute to Apertium.<br />
<br />
This organisation works on things which are very interesting to me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages, because that work is very important and, at the same time, far from mainstream.<br />
<br />
The Apertium community is also very friendly and open to new members; people here are always ready to help you, and that encourages me to keep working with them.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of the lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or can at least more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. In any case, the methods I'm going to develop won't be tied to a particular language pair.<br />
<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is directly based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences, which can affect the quality of the extracted triplets, because the current extraction method is built on an edit distance algorithm. </p> <p>'''-''' there will also surely be words and sentences in other languages (translations, links and other typical Wikipedia content), which will add noise to our data.</p><br />
<br />
* Using parallel corpora and an Apertium translation of the source side. <p>'''+''' in parallel corpora, especially those other than Europarl, sentences are usually not very long or complicated and contain fairly common words and phrases (Tatoeba is a good example). This is not a rule, but, in my opinion, parallel corpora are still less domain-specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are likely to contain less noise.</p> <p>'''-''' the target side might be very different from the Apertium-translated one, especially for long and complicated sentences.</p><br />
<br />
The choice of approach is open to discussion. I think we might experiment with both approaches, or even mix different types of data, and see if there is any difference.<br />
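<br />
As an illustration of the second approach, here is a minimal sketch of how (s, mt, pe) triplets could be built from a parallel corpus by translating the source side with the Apertium command-line tool. The mode name ''bel-rus'' is only an example, and calling Apertium once per sentence is slow, so this is for clarity rather than efficiency.<br />
<br />
<pre><br />
import subprocess<br />
<br />
def apertium_translate(text, mode):<br />
    # Calls the installed apertium command-line tool; `mode` (e.g. 'bel-rus') must be installed locally<br />
    result = subprocess.run(['apertium', mode], input=text,<br />
                            capture_output=True, text=True, check=True)<br />
    return result.stdout.strip()<br />
<br />
def build_triplets(source_sentences, reference_sentences, mode='bel-rus'):<br />
    # The reference side of the parallel corpus plays the role of the postedit (pe)<br />
    triplets = []<br />
    for s, pe in zip(source_sentences, reference_sentences):<br />
        mt = apertium_translate(s, mode)<br />
        triplets.append((s, mt, pe))<br />
    return triplets<br />
</pre><br />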
<br />
==== Improving the existing methods ====<br />
<br />
Some tools for learning and applying postediting operations have already been created (https://github.com/mlforcada/naive-automatic-postediting). However, they need to be improved in several ways.<br />
<br />
1. The main problem is the low speed of learn_postedits.py and apply_postedits.py. Repeated Apertium calls and a large maximum source/target segment length make the process slow, which makes extracting triplets from a big corpus very hard, while using a small corpus is simply useless. <br />
<br />
I have implemented a caching function for learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): instead of calling Apertium every time the program needs to translate/analyze a subsegment, it first checks whether this subsegment is already stored in a database. If it is, the result is taken from there; if not, the program calls Apertium and then adds the new information to the database.<br />
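<br />
The idea is roughly the following (a simplified sketch, not the code in explain2_cache.py; the ''translate'' callable that actually runs Apertium is assumed to exist):<br />
<br />
<pre><br />
import sqlite3<br />
<br />
class TranslationCache:<br />
    # Minimal sketch: look a subsegment up in SQLite before calling Apertium,<br />
    # and store the result after the first call<br />
    def __init__(self, path='cache.db'):<br />
        self.conn = sqlite3.connect(path)<br />
        self.conn.execute('CREATE TABLE IF NOT EXISTS cache '<br />
                          '(subsegment TEXT PRIMARY KEY, translation TEXT)')<br />
<br />
    def get(self, subsegment):<br />
        row = self.conn.execute('SELECT translation FROM cache WHERE subsegment = ?',<br />
                                (subsegment,)).fetchone()<br />
        return row[0] if row else None<br />
<br />
    def put(self, subsegment, translation):<br />
        self.conn.execute('INSERT OR REPLACE INTO cache VALUES (?, ?)',<br />
                          (subsegment, translation))<br />
        self.conn.commit()<br />
<br />
def translate_cached(subsegment, cache, translate):<br />
    cached = cache.get(subsegment)<br />
    if cached is not None:<br />
        return cached  # cache hit: no Apertium call needed<br />
    translation = translate(subsegment)<br />
    cache.put(subsegment, translation)<br />
    return translation<br />
</pre><br />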
<br />
Results of running the two versions of learn_postedits.py on a 100-sentence corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
Caching clearly saves time. For 100 sentences the difference is not huge, because not much information is stored in the database yet, but the gap should become much more significant when the code is run on 50,000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Search for other ways we can improve the speed.<br />
<br />
2. There might be problems with the current alignment method: at the moment the extracted postediting operations look pretty strange even with a high fuzzy-match threshold. The alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done. <br />
<br />
==== Search of extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big train set using different parameters, extract potential postediting operations, apply them to the test set and find those that consistently improve the quality of the original Apertium translation.<br />
<br />
A language model score might serve as a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods; a rough sketch of this idea is given below.<br />
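<br />
One possible (and deliberately simple) way to estimate the safety of an operation is the share of test sentences it improves according to the language model. The ''apply_op'' and ''lm_score'' helpers below are assumptions, not existing functions:<br />
<br />
<pre><br />
def operation_safety(operation, test_sentences, apply_op, lm_score):<br />
    # An operation is "safe" if it improves the language-model score<br />
    # for most of the test sentences it actually applies to<br />
    improved, applied = 0, 0<br />
    for mt in test_sentences:<br />
        edited = apply_op(mt, operation)<br />
        if edited == mt:<br />
            continue  # the operation did not apply to this sentence<br />
        applied += 1<br />
        if lm_score(edited) > lm_score(mt):<br />
            improved += 1<br />
    return improved / applied if applied else 0.0<br />
</pre><br />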
<br />
==== Classifying successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the triplets (s, mt, pe) from O such that s = mt, i.e. cases where the word was left untranslated.<br />
But this won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those. A small sketch of this kind of classification is given below.<br />
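<br />
For illustration, a minimal sketch of this classification step, assuming the extracted operations are available as (s, mt, pe) tuples (this is not the real classifier, which would also need the morphological analyses):<br />
<br />
<pre><br />
def split_by_type(operations):<br />
    # If the MT side equals the source side, the word was probably left<br />
    # untranslated, so it is a monodix/bidix-entry candidate; everything<br />
    # else needs a closer look at the context (e.g. lexical selection)<br />
    bidix_candidates, needs_context = [], []<br />
    for s, mt, pe in operations:<br />
        if s == mt:<br />
            bidix_candidates.append((s, mt, pe))<br />
        else:<br />
            needs_context.append((s, mt, pe))<br />
    return bidix_candidates, needs_context<br />
</pre><br />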
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting the useful information into a language pair. These might be scripts which automatically create monodix/bidix entries or write rules, depending on the given data and its type, and insert them into the dictionaries.<br />
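<br />
For example, a bidix-entry generator could start as simply as the sketch below. It assumes the lemma pair and a part-of-speech tag are already known; real entries often need more tags and paradigm information, so this is only a starting point:<br />
<br />
<pre><br />
def bidix_entry(left_lemma, right_lemma, pos='n'):<br />
    # Builds a minimal Apertium bidix <e> entry for a lemma pair;<br />
    # the part-of-speech tag goes into the <s n="..."/> elements<br />
    return ('<e><p><l>{left}<s n="{pos}"/></l>'<br />
            '<r>{right}<s n="{pos}"/></r></p></e>').format(<br />
        left=left_lemma, right=right_lemma, pos=pos)<br />
<br />
# Illustrative lemmas only (Belarusian -> Russian verb pair)<br />
print(bidix_entry('размаўляць', 'говорить', pos='vblex'))<br />
</pre><br />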
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries and reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes which can be easily fixed.<br />
<br />
For example, Apertium translates the Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") into Russian as ''"Кто-то здесь размаўляе по-русски?"'', while the correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўляць". But this can easily be fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' in Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (it has many mistakes, but we are looking only at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe that there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Taking an online statistics course to refresh my knowledge<br />
* Working on the old code<br />
<br />
=== Community bonding period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Discussing questions about data types and language pairs to work with<br />
* Looking for suitable data<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' collecting and parsing the data, doing preprocessing, if needed, improving the existing code</li><br />
<li>'''Week 2:''' improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)</li><br />
<li>'''Week 3:''' making experiments with data, extracting triplets</li><br />
<li>'''Week 4:''' searching of extracted postediting operations that actually improve the quality of translation</li><br />
<li>'''Deliverable #1, June 11 - 15'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 6:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 7:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 8:''' studying and classifying of successful postediting operations</li><br />
<li>'''Deliverable #2, July 9 - 13'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 10:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Final evaluation, August 6 - 14'''</li><br />
<li>'''Project completed:''' The toolbox for automatic improvement of lexical component of a language pair.</li><br />
</ul><br />
<br />
I am also going to write short notes about the work process on my project page throughout the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with the Apertium system, I can work on the project during the community bonding period. After the exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into train and test sets of a given size (see the sketch after this list).</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores the sentences that were processed with postediting rules against a language model.</li><br />
</ul><br />
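<br />
A minimal sketch of the splitting step only (the real ''parse_ct_json.py'' also parses the Mediawiki JSON dump, which is omitted here; the triplets in the usage line are placeholders):<br />
<br />
<pre><br />
import random<br />
<br />
def split_corpus(sentences, test_size, seed=0):<br />
    # Shuffle the corpus reproducibly and cut off a test set of the requested size<br />
    random.seed(seed)<br />
    shuffled = list(sentences)<br />
    random.shuffle(shuffled)<br />
    return shuffled[test_size:], shuffled[:test_size]<br />
<br />
# Placeholder triplets, just to show the call<br />
train, test = split_corpus([('s1', 'mt1', 'pe1'), ('s2', 'mt2', 'pe2')], test_size=1)<br />
</pre><br />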
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. A new version is stored in a given folder in ''cleaned_learn_postedits.py''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66419User:Deltamachine/proposal20182018-03-22T22:03:46Z<p>Deltamachine: /* Work plan */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor student of Linguistics Faculty in National Research University «Higher School of Economics» (NRU HSE)</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two most favourite fields of studies are linguistics and programming and machine translation combines these fields in the most interesting way. Working with machine translation system will allow me to learn more new things about different languages, their structures and different modern approaches in machine translation and to know what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in GSoC 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in GCI 2017 as a mentor for Apertium and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting for me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages because it is very important and is not a mainstream at the same time.<br />
<br />
Also Apertium community is very friendly and open to new members, people here are always ready to help you. It encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or can at least more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. In any case, the methods I'm going to develop won't be tied to a particular language pair.<br />
<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on edit distance algorithm. </p> <p>'''-''' also there will surely be words and sentences in other languages (like translations, links and so on, typical Wikipedia content) and it will noise our data.</p><br />
<br />
* Using parallel corpora and Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those, which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The choice of approach is open to discussion. I think we might experiment with both approaches, or even mix different types of data, and see if there is any difference.<br />
<br />
==== Improving of existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is a pretty low speed of learn_postedits.py and apply_postedits.py. Repeating Apertium calls and big size of maximum source/target segment length make the process pretty slow. It makes extracting triplets from a big corpus very hard and using a small corpus is just useless. <br />
<br />
I have implemented a caching function for learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): instead of calling Apertium every time the program needs to translate/analyze a subsegment, it first checks whether this subsegment is already stored in a database. If it is, the result is taken from there; if not, the program calls Apertium and then adds the new information to the database.<br />
<br />
Results of running two versions of learn_postedits.py on 100 sentences corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clearly seen that caching really saves time. For 100 sentences the difference is not very huge, because there is not much information stored in database yet, but it seems that the difference would be very important if we run the code on 50000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Search for other ways we can improve the speed.<br />
<br />
2. There might be problems with current alignment method because currently postediting operations look pretty strange even if they were extracted with high fuzzy match threshold. Alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done. <br />
<br />
==== Search of extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big train set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of original Apertium translation on a regular basis.<br />
<br />
A language model score might serve as a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
<br />
==== Classifying of successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the triplets (s, mt, pe) from O such that s = mt, i.e. cases where the word was left untranslated.<br />
But this won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. It might be some scripts which automatically create monodix/bidix entries or write rules based on a given data and its type and insert it in dictionaries.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries and reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes which can be easily fixed.<br />
<br />
For example, Apertium translates Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") in Russian as ''"Кто-то здесь размаўляе по-русски?"'' when a correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can be easily fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' in Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (it has many mistakes, but we are looking only at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe that there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Taking an online statistics course to refresh my knowledge<br />
* Working on the old code<br />
<br />
=== Community bonding period ===<br />
* Learning more about the structure of Apertium dictionaries and tools<br />
* Discussing questions about data types and language pairs to work with<br />
* Looking for a suitable data<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' collecting and parsing the data, doing preprocessing, if needed, improving the existing code</li><br />
<li>'''Week 2:''' improving the existing code, making experiments with data, extracting triplets (it can take a lot of time)</li><br />
<li>'''Week 3:''' making experiments with data, extracting triplets</li><br />
<li>'''Week 4:''' searching of extracted postediting operations that actually improve the quality of translation</li><br />
<li>'''Deliverable #1, June 26 - 30'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 6:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 7:''' studying and classifying of successful postediting operations</li><br />
<li>'''Week 8:''' studying and classifying of successful postediting operations</li><br />
<li>'''Deliverable #2, July 24 - 28'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 10:''' writing tools for inserting extracted information in a language pair</li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Project completed:''' </li><br />
</ul><br />
<br />
Also I am going to write short notes about work process on the page of my project during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with Apertium system I can work on the project during the community bonding period. After exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores the sentences that were processed with postediting rules against a language model.</li><br />
</ul><br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. A new version is stored in a given folder in ''cleaned_learn_postedits.py''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66417User:Deltamachine/proposal20182018-03-22T21:46:16Z<p>Deltamachine: /* Reasons why Google and Apertium should sponsor it */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor student of Linguistics Faculty in National Research University «Higher School of Economics» (NRU HSE)</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two most favourite fields of studies are linguistics and programming and machine translation combines these fields in the most interesting way. Working with machine translation system will allow me to learn more new things about different languages, their structures and different modern approaches in machine translation and to know what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in GSoC 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in GCI 2017 as a mentor for Apertium and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting for me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages because it is very important and is not a mainstream at the same time.<br />
<br />
Also Apertium community is very friendly and open to new members, people here are always ready to help you. It encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or can at least more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. In any case, the methods I'm going to develop won't be tied to a particular language pair.<br />
<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on edit distance algorithm. </p> <p>'''-''' also there will surely be words and sentences in other languages (like translations, links and so on, typical Wikipedia content) and it will noise our data.</p><br />
<br />
* Using parallel corpora and Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those, which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The choice of approach is open to discussion. I think we might experiment with both approaches, or even mix different types of data, and see if there is any difference.<br />
<br />
==== Improving of existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is a pretty low speed of learn_postedits.py and apply_postedits.py. Repeating Apertium calls and big size of maximum source/target segment length make the process pretty slow. It makes extracting triplets from a big corpus very hard and using a small corpus is just useless. <br />
<br />
I have implemented a caching function for learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): instead of calling Apertium every time the program needs to translate/analyze a subsegment, it first checks whether this subsegment is already stored in a database. If it is, the result is taken from there; if not, the program calls Apertium and then adds the new information to the database.<br />
<br />
Results of running two versions of learn_postedits.py on 100 sentences corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clearly seen that caching really saves time. For 100 sentences the difference is not very huge, because there is not much information stored in database yet, but it seems that the difference would be very important if we run the code on 50000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Search for other ways we can improve the speed.<br />
<br />
2. There might be problems with current alignment method because currently postediting operations look pretty strange even if they were extracted with high fuzzy match threshold. Alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done. <br />
<br />
==== Search of extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big train set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of original Apertium translation on a regular basis.<br />
<br />
A language model score might serve as a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
<br />
==== Classifying of successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the triplets (s, mt, pe) from O such that s = mt, i.e. cases where the word was left untranslated.<br />
But this won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. It might be some scripts which automatically create monodix/bidix entries or write rules based on a given data and its type and insert it in dictionaries.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries and reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes which can be easily fixed.<br />
<br />
For example, Apertium translates Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") in Russian as ''"Кто-то здесь размаўляе по-русски?"'' when a correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can be easily fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' in Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (it has many mistakes, but we are looking only at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a lexical selection rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe that there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
<br />
=== Community bonding period ===<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' </li><br />
<li>'''Week 2:''' </li><br />
<li>'''Week 3:''' </li><br />
<li>'''Week 4:''' </li><br />
<li>'''Deliverable #1, June 26 - 30'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' </li><br />
<li>'''Week 6:''' </li><br />
<li>'''Week 7:''' </li><br />
<li>'''Week 8:''' </li><br />
<li>'''Deliverable #2, July 24 - 28'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' </li><br />
<li>'''Week 10:''' </li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Project completed:''' </li><br />
</ul><br />
<br />
Also I am going to write short notes about work process on the page of my project during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with Apertium system I can work on the project during the community bonding period. After exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores the sentences that were processed with postediting rules against a language model.</li><br />
</ul><br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. A new version is stored in a given folder in ''cleaned_learn_postedits.py''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66416User:Deltamachine/proposal20182018-03-22T21:44:40Z<p>Deltamachine: /* Reasons why Google and Apertium should sponsor it */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor student of Linguistics Faculty in National Research University «Higher School of Economics» (NRU HSE)</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two most favourite fields of studies are linguistics and programming and machine translation combines these fields in the most interesting way. Working with machine translation system will allow me to learn more new things about different languages, their structures and different modern approaches in machine translation and to know what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in GSoC 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in GCI 2017 as a mentor for Apertium and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting for me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages because it is very important and is not a mainstream at the same time.<br />
<br />
Also Apertium community is very friendly and open to new members, people here are always ready to help you. It encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or can at least more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. In any case, the methods I'm going to develop won't be tied to a particular language pair.<br />
<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on edit distance algorithm. </p> <p>'''-''' also there will surely be words and sentences in other languages (like translations, links and so on, typical Wikipedia content) and it will noise our data.</p><br />
<br />
* Using parallel corpora and Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those, which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The choice of approach is open to discussion. I think we might experiment with both approaches, or even mix different types of data, and see if there is any difference.<br />
<br />
==== Improving of existing methods ====<br />
<br />
Some tools for learning and applying postediting operations were already created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is a pretty low speed of learn_postedits.py and apply_postedits.py. Repeating Apertium calls and big size of maximum source/target segment length make the process pretty slow. It makes extracting triplets from a big corpus very hard and using a small corpus is just useless. <br />
<br />
I have implemented a caching function for learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): instead of calling Apertium every time the program needs to translate/analyze a subsegment, it first checks whether this subsegment is already stored in a database. If it is, the result is taken from there; if not, the program calls Apertium and then adds the new information to the database.<br />
<br />
Results of running two versions of learn_postedits.py on 100 sentences corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clearly seen that caching really saves time. For 100 sentences the difference is not very huge, because there is not much information stored in database yet, but it seems that the difference would be very important if we run the code on 50000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Search for other ways we can improve the speed.<br />
<br />
2. There might be problems with current alignment method because currently postediting operations look pretty strange even if they were extracted with high fuzzy match threshold. Alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done. <br />
<br />
==== Search of extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big train set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of original Apertium translation on a regular basis.<br />
<br />
A language model score might serve as a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
<br />
==== Classifying of successful postediting operations ====<br />
<br />
After finding safe postediting operations we should classify them in some way to find out how we might insert this information in a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the triplets (s, mt, pe) from O such that s = mt, i.e. cases where the word was left untranslated.<br />
But this won't be enough for identifying potential lexical selection rules: we should carefully look at the context to find those.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. It might be some scripts which automatically create monodix/bidix entries or write rules based on a given data and its type and insert it in dictionaries.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
This toolbox might become a great way of improving language pairs by filling gaps in dictionaries and reducing the amount of human work at the same time. Even the released Apertium pairs are not perfect and sometimes make mistakes which can be easily fixed.<br />
<br />
For example, Apertium translates Belarusian sentence ''"Нехта тут размаўляе па-руску?"'' ("Does somebody here speak Russian?") in Russian as ''"Кто-то здесь размаўляе по-русски?"'' when a correct translation would be ''"Кто-то здесь говорит по-русски?"''. The problem with this example is obvious: Apertium doesn't know the word "размаўлять". But it can be easily fixed with the methods described above (sections 5.2.3, 5.2.4).<br />
<br />
Another example: in Belarusian the word "кубак" behaves the same way as English "cup": it might appear in contexts like ''"Аня выпіла кубак малака"'' ("Anya has drunk a cup of milk") and in contexts like ''"Кубак Нямеччыны па футболе"'' ("German football cup"). Apertium translates ''"Аня выпіла кубак малака"'' in Russian as ''"Аня выпила кубок молока"'' and ''"Кубак Нямеччыны па футболе"'' as ''"Кубок Нямеччыны па футболе"'' (it has many mistakes, but we are looking only at "кубок" now).<br />
<br />
The second translation of the word "кубак" is correct (though the correct translation of the whole sentence should look like ''"Кубок Германии по футболу"''), but the first one looks strange: it should be ''"Аня выпила чашку/кружку молока"'' instead. In this case we could find sentences which are improved by applying a postediting operation o = ("кубак", "кубок", "чашку"), study the context (for example, look at the coinciding words in such sentences and find out that there is a word "выпiть" that appears right before "кубак" in every sentence) and then extract and write a rule like this:<br />
<br />
<pre><br />
<rule><br />
<match lemma="выпiть" tags="*"/><br />
<match lemma="кубак" tags="n.*"><br />
<select lemma="чашка" tags="n.*"/><br />
</match><br />
</rule><br />
</pre><br />
<br />
These are just a few examples. I believe that there are many more ways to use postediting information to improve a language pair.<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
<br />
=== Community bonding period ===<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' </li><br />
<li>'''Week 2:''' </li><br />
<li>'''Week 3:''' </li><br />
<li>'''Week 4:''' </li><br />
<li>'''Deliverable #1, June 26 - 30'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' </li><br />
<li>'''Week 6:''' </li><br />
<li>'''Week 7:''' </li><br />
<li>'''Week 8:''' </li><br />
<li>'''Deliverable #2, July 24 - 28'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' </li><br />
<li>'''Week 10:''' </li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Project completed:''' </li><br />
</ul><br />
<br />
Also, I am going to write short notes about the work process on my project page during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with the Apertium system, I can work on the project during the community bonding period. After my exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores the sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. The new version is stored in the repository linked above as ''cleaned_learn_postedits.py''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66413User:Deltamachine/proposal20182018-03-22T17:54:52Z<p>Deltamachine: /* Which of the published tasks are you interested in? What do you plan to do? */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor's student at the Linguistics Faculty of the National Research University «Higher School of Economics» (NRU HSE).</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two favourite fields of study are linguistics and programming, and machine translation combines these fields in the most interesting way. Working with a machine translation system will allow me to learn more about different languages, their structures and modern approaches in machine translation, and to see what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in GSoC 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in GCI 2017 as a mentor for Apertium and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting to me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages, because this is very important and at the same time not mainstream.<br />
<br />
Also, the Apertium community is very friendly and open to new members; people here are always ready to help you. That encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of the lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or at least can more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. But the methods I'm going to develop won't be tied to a language pair.<br />
<br />
<u>About collecting and processing data</u><br />
<br />
There might be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences, and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on an edit distance algorithm. </p> <p>'''-''' also there will surely be words and sentences in other languages (like translations, links and so on, typical Wikipedia content), and this will add noise to our data.</p><br />
<br />
* Using parallel corpora and the Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from the Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The question of which approach to choose is open to discussion. I think we might experiment with both approaches, or even mix different types of data, and see if there is any difference.<br />
<br />
==== Improving existing methods ====<br />
<br />
Some tools for learning and applying postediting operations have already been created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is the rather low speed of learn_postedits.py and apply_postedits.py. Repeated Apertium calls and a large maximum source/target segment length make the process pretty slow. This makes extracting triplets from a big corpus very hard, while using a small corpus is just useless. <br />
<br />
I have implemented a caching function for learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): now, instead of calling Apertium every time the program needs to translate/analyze a subsegment, it first checks whether this subsegment is already stored in a database. If it is, it takes it from there; if not, it calls Apertium and then adds the new information to the database.<br />
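<br />
A minimal sketch of the caching idea (the table layout, the helper functions and the bel-rus pair are illustrative; the actual explain2_cache.py is organised differently):<br />
<br />
<pre><br />
import sqlite3<br />
import subprocess<br />
<br />
def apertium_translate(text, pair='bel-rus'):<br />
    # Pipe a subsegment through the Apertium command-line tool.<br />
    result = subprocess.run(['apertium', pair], input=text,<br />
                            capture_output=True, text=True)<br />
    return result.stdout.strip()<br />
<br />
def translate_cached(text, conn, pair='bel-rus'):<br />
    # Use the cached translation if we already have one; otherwise call Apertium and store it.<br />
    row = conn.execute('SELECT mt FROM cache WHERE src = ? AND pair = ?',<br />
                       (text, pair)).fetchone()<br />
    if row:<br />
        return row[0]<br />
    mt = apertium_translate(text, pair)<br />
    conn.execute('INSERT INTO cache (src, pair, mt) VALUES (?, ?, ?)', (text, pair, mt))<br />
    conn.commit()<br />
    return mt<br />
<br />
conn = sqlite3.connect('subsegment_cache.db')<br />
conn.execute('CREATE TABLE IF NOT EXISTS cache (src TEXT, pair TEXT, mt TEXT)')<br />
print(translate_cached('кубак малака', conn))<br />
</pre><br />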
<br />
Results of running the two versions of learn_postedits.py on a 100-sentence corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clear that caching really saves time. For 100 sentences the difference is not huge, because there is not much information stored in the database yet, but the difference should become much more significant if we run the code on 50,000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Search for other ways we can improve the speed.<br />
<br />
2. There might be problems with the current alignment method, because currently postediting operations look pretty strange even when they were extracted with a high fuzzy match threshold. The alignment implementation should be checked carefully.<br />
<br />
3. Some old code refactoring needs to be done. <br />
<br />
==== Searching for extracted postediting operations that improve the quality of translation ====<br />
<br />
The next step is to process a big training set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of the original Apertium translation on a regular basis.<br />
<br />
A language model score might be a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
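<br />
A minimal sketch of such a language-model check, assuming a KenLM model trained on target-language text (the model path and the sentences are placeholders; the actual criterion and the safety estimation would be more involved):<br />
<br />
<pre><br />
import kenlm  # Python bindings for KenLM<br />
<br />
model = kenlm.Model('target_language.arpa')  # placeholder path to a trained LM<br />
<br />
def is_improvement(mt_sentence, postedited_sentence):<br />
    # Treat a postedit as an improvement if the LM scores it higher than the raw MT output.<br />
    return model.score(postedited_sentence) > model.score(mt_sentence)<br />
<br />
print(is_improvement('Кто-то здесь размаўляе по-русски?',<br />
                     'Кто-то здесь говорит по-русски?'))<br />
</pre><br />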
<br />
==== Classifying successful postediting operations ====<br />
<br />
After finding safe postediting operations, we should classify them in some way to find out how we might insert this information into a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the set of triplets (s, mt, pe) from O such that s = mt, i.e. the cases where Apertium left the word untranslated.<br />
But this won't be enough for identifying potential lexical selection rules: we should look carefully at the context to find those.<br />
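<br />
A minimal sketch of that filtering step, assuming the triplets are already loaded as Python tuples (the example triplets are illustrative and do not come from a real extraction run):<br />
<br />
<pre><br />
triplets = [<br />
    ('размаўляе', 'размаўляе', 'говорит'),  # s == mt: Apertium left the word untranslated<br />
    ('кубак', 'кубок', 'чашку'),            # s != mt: some other kind of mistake<br />
]<br />
<br />
# Potential monodix/bidix entries: the source word survived translation unchanged.<br />
bidix_candidates = [(s, pe) for s, mt, pe in triplets if s == mt]<br />
print(bidix_candidates)  # [('размаўляе', 'говорит')]<br />
</pre><br />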
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. These might be scripts which automatically create monodix/bidix entries or write rules based on the given data and its type and insert them into the dictionaries.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, the methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
<br />
=== Community bonding period ===<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' </li><br />
<li>'''Week 2:''' </li><br />
<li>'''Week 3:''' </li><br />
<li>'''Week 4:''' </li><br />
<li>'''Deliverable #1, June 26 - 30'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' </li><br />
<li>'''Week 6:''' </li><br />
<li>'''Week 7:''' </li><br />
<li>'''Week 8:''' </li><br />
<li>'''Deliverable #2, July 24 - 28'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' </li><br />
<li>'''Week 10:''' </li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Project completed:''' </li><br />
</ul><br />
<br />
Also, I am going to write short notes about the work process on my project page during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with the Apertium system, I can work on the project during the community bonding period. After my exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores the sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. The new version is stored in the repository linked above as ''cleaned_learn_postedits.py''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66412User:Deltamachine/proposal20182018-03-22T17:46:46Z<p>Deltamachine: /* Which of the published tasks are you interested in? What do you plan to do? */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor's student at the Linguistics Faculty of the National Research University «Higher School of Economics» (NRU HSE).</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two favourite fields of study are linguistics and programming, and machine translation combines these fields in the most interesting way. Working with a machine translation system will allow me to learn more about different languages, their structures and modern approaches in machine translation, and to see what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in GSoC 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in GCI 2017 as a mentor for Apertium and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting to me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages, because this is very important and at the same time not mainstream.<br />
<br />
Also, the Apertium community is very friendly and open to new members; people here are always ready to help you. That encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
The purpose of this proposal is to create a toolbox for automatic improvement of the lexical component of a language pair.<br />
<br />
=== Definitions ===<br />
* S: source sentence<br />
* MT: machine translation system (Apertium in our case)<br />
* MT(S): machine translation of S<br />
* PE(MT(S)): post-editing of the machine translation of S<br />
* O(s, mt, pe): set of extracted postediting operations where s in S, mt in MT(S), pe in PE(MT(S)) <br />
<br />
=== Work stages ===<br />
<br />
==== Choosing a language pair(s) to experiment with and collecting/processing data. ====<br />
<br />
<u>About language pair</u><br />
<br />
I would like to work with languages I know or at least can more or less understand. Since I'm a native Russian speaker, it seems to be a good idea to work with bel-rus and ukr-rus. Though these pairs are already released, there is still a lot of work to be done. But the methods I'm going to develop won't be tied to a language pair.<br />
<br />
<u>About collecting and processing data</u><br />
<br />
There can be two approaches.<br />
<br />
* Using Mediawiki JSON postediting data from https://dumps.wikimedia.org/other/contenttranslation/. <p>'''+''' the target (postedited) side is very close to the given machine translation because it is basically based on it.</p> <p>'''-''' Mediawiki articles often contain very long and very specific sentences, and these factors can affect the quality of extracted triplets, because the current method of extracting them is built on an edit distance algorithm. </p> <p>'''-''' also there will surely be words and sentences in other languages (like translations, links and so on, typical Wikipedia content), and this will add noise to our data.</p><br />
<br />
* Using parallel corpora and the Apertium translation of the source side. <p>'''+''' in parallel corpora, especially in those which are not Europarl, sentences are usually not very long and complicated and contain pretty common words and phrases (Tatoeba might be a good example). This is not a rule, but, in my opinion, parallel corpora are still less specific than Mediawiki articles.</p> <p>'''+''' parallel corpora are more likely to contain less noise.</p> <p>'''-''' the target side might be very different from the Apertium-translated one, especially if we talk about long and complicated sentences.</p><br />
<br />
The question of which approach to choose is open to discussion. I think that we might experiment with both approaches, or even mix the different types of data, and see if there is any difference.<br />
<br />
==== Improving existing methods ====<br />
<br />
Some tools for learning and applying postediting operations have already been created (https://github.com/mlforcada/naive-automatic-postediting). However, they might need to be improved in some ways.<br />
<br />
1. The main problem is the rather low speed of learn_postedits.py and apply_postedits.py. Repeated Apertium calls and a large maximum source/target segment length make the process pretty slow. This makes extracting triplets from a big corpus very hard, while using a small corpus is just useless. <br />
<br />
I have implemented a caching function for learn_postedits.py (https://github.com/deltamachine/naive-automatic-postediting/blob/master/lib/explain2_cache.py): now, instead of calling Apertium every time the program needs to translate/analyze a subsegment, it first checks whether this subsegment is already stored in a database. If it is, it takes it from there; if not, it calls Apertium and then adds the new information to the database.<br />
<br />
Results of running the two versions of learn_postedits.py on a 100-sentence corpus with different parameters:<br />
<br />
{| class = "wikitable" style = "background-color: white; text-align: center; width: 50%;"<br />
|-<br />
|<br />
|'''with caching'''<br />
|'''without caching'''<br />
|-<br />
|'''-m 1 -M 1'''<br />
|2m55s<br />
|3m41s<br />
|-<br />
|'''-m 2 -M 2'''<br />
|6m50s<br />
|7m08s<br />
|-<br />
|'''-m 3 -M 3'''<br />
|8m25s<br />
|10m25s<br />
|-<br />
|'''-m 4 -M 4'''<br />
|11m45s<br />
|13m18s<br />
|-<br />
|}<br />
<br />
It is clear that caching really saves time. For 100 sentences the difference is not huge, because there is not much information stored in the database yet, but the difference should become much more significant if we run the code on 50,000 sentences.<br />
<br />
<u>What else needs to be done:</u><br />
<br />
* Caching function for apply_postedits.py<br />
<br />
* Search for other ways we can improve the speed.<br />
<br />
2. Some code refactoring needs to be done. <br />
<br />
==== Searching for extracted postediting operations which improve the quality of translation ====<br />
<br />
The next step is to process a big training set using different parameters, extract potential postediting operations, apply them to the test set and find those that improve the quality of the original Apertium translation on a regular basis.<br />
<br />
A language model score might be a criterion of quality improvement. The safety of a postediting operation might be determined by statistical methods.<br />
<br />
==== Classifying successful postediting operations ====<br />
<br />
After finding safe postediting operations, we should classify them in some way to find out how we might insert this information into a language pair.<br />
<br />
There might be a few types:<br />
<br />
* Monodix/bidix entries<br />
<br />
* Lexical selection rules<br />
<br />
* Transfer rules (?)<br />
<br />
* and so on<br />
<br />
For example, to identify potential bidix entries, we might choose the set of triplets (s, mt, pe) from O such that s = mt, i.e. the cases where Apertium left the word untranslated.<br />
<br />
==== Creating tools for inserting useful information into a language pair ====<br />
<br />
The last step is to create tools for inserting useful information into a language pair. These might be scripts which automatically create monodix/bidix entries or write rules based on the given data and its type and insert them into the dictionaries.<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, the methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
<br />
=== Community bonding period ===<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' </li><br />
<li>'''Week 2:''' </li><br />
<li>'''Week 3:''' </li><br />
<li>'''Week 4:''' </li><br />
<li>'''Deliverable #1, June 26 - 30'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' </li><br />
<li>'''Week 6:''' </li><br />
<li>'''Week 7:''' </li><br />
<li>'''Week 8:''' </li><br />
<li>'''Deliverable #2, July 24 - 28'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' </li><br />
<li>'''Week 10:''' </li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Project completed:''' </li><br />
</ul><br />
<br />
Also, I am going to write short notes about the work process on my project page during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with the Apertium system, I can work on the project during the community bonding period. After my exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores the sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. The new version is stored in the repository linked above as ''cleaned_learn_postedits.py''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66360User:Deltamachine/proposal20182018-03-20T15:02:24Z<p>Deltamachine: /* Coding challenge */</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor's student at the Linguistics Faculty of the National Research University «Higher School of Economics» (NRU HSE).</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two favourite fields of study are linguistics and programming, and machine translation combines these fields in the most interesting way. Working with a machine translation system will allow me to learn more about different languages, their structures and modern approaches in machine translation, and to see what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in GSoC 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in GCI 2017 as a mentor for Apertium and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting to me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages, because this is very important and at the same time not mainstream.<br />
<br />
Also, the Apertium community is very friendly and open to new members; people here are always ready to help you. That encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, the methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
<br />
=== Community bonding period ===<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' </li><br />
<li>'''Week 2:''' </li><br />
<li>'''Week 3:''' </li><br />
<li>'''Week 4:''' </li><br />
<li>'''Deliverable #1, June 26 - 30'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' </li><br />
<li>'''Week 6:''' </li><br />
<li>'''Week 7:''' </li><br />
<li>'''Week 8:''' </li><br />
<li>'''Deliverable #2, July 24 - 28'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' </li><br />
<li>'''Week 10:''' </li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Project completed:''' </li><br />
</ul><br />
<br />
Also, I am going to write short notes about the work process on my project page during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with the Apertium system, I can work on the project during the community bonding period. After my exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores the sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. The new version is stored in the repository linked above as ''cleaned_learn_postedits.py''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66356User:Deltamachine/proposal20182018-03-20T12:41:36Z<p>Deltamachine: </p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor's student at the Linguistics Faculty of the National Research University «Higher School of Economics» (NRU HSE).</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two favourite fields of study are linguistics and programming, and machine translation combines these fields in the most interesting way. Working with a machine translation system will allow me to learn more about different languages, their structures and modern approaches in machine translation, and to see what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in GSoC 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in GCI 2017 as a mentor for Apertium and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting to me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages, because this is very important and at the same time not mainstream.<br />
<br />
Also, the Apertium community is very friendly and open to new members; people here are always ready to help you. That encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, the methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
<br />
=== Community bonding period ===<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' </li><br />
<li>'''Week 2:''' </li><br />
<li>'''Week 3:''' </li><br />
<li>'''Week 4:''' </li><br />
<li>'''Deliverable #1, June 26 - 30'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' </li><br />
<li>'''Week 6:''' </li><br />
<li>'''Week 7:''' </li><br />
<li>'''Week 8:''' </li><br />
<li>'''Deliverable #2, July 24 - 28'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' </li><br />
<li>'''Week 10:''' </li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Project completed:''' </li><br />
</ul><br />
<br />
Also, I am going to write short notes about the work process on my project page during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with the Apertium system, I can work on the project during the community bonding period. After my exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==<br />
<p>https://github.com/deltamachine/naive-automatic-postediting</p><br />
<br />
<ul><br />
<li>''parse_ct_json.py:'' A script that parses a Mediawiki JSON file and splits the whole corpus into train and test sets of a given size.</li><br />
<li>''estimate_changes.py:'' A script that takes a file generated by ''apply_postedits.py'' and scores the sentences which were processed with postediting rules on a language model.</li><br />
</ul><br />
<br />
Also I have refactored and documented the old code from https://github.com/mlforcada/naive-automatic-postediting/blob/master/learn_postedits.py. The new version is stored in the repository linked above as ''cleaned_learn_postedits.py''<br />
<br />
''cleaned_learn_postedits.py'' was run on an English - Spanish training set of 500 sentences. The list of learned potential postediting operations is stored in ''postediting_operations.txt''. Then I applied these operations to the test set of 100 sentences. The results are stored in ''pe_sentences.txt''. After that I scored these results on a language model using ''estimate_changes.py''; the scores are stored in ''scores.txt''.</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66275User:Deltamachine/proposal20182018-03-14T08:22:57Z<p>Deltamachine: </p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor's student at the Linguistics Faculty of the National Research University «Higher School of Economics» (NRU HSE).</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
I'm a computational linguist and I'm in love with NLP and every field close to it. My two favourite fields of study are linguistics and programming, and machine translation combines these fields in the most interesting way. Working with a machine translation system will allow me to learn more about different languages, their structures and modern approaches in machine translation, and to see what results we can get with the help of such systems. This is very exciting!<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in GSoC 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. My programming and NLP skills became much better and I want to develop them more.<br />
Also I have participated in GCI 2017 as a mentor for Apertium and it was great too. So I am very interested in further contributing to Apertium.<br />
<br />
This organisation works on things which are very interesting to me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. I love that Apertium works with minority languages, because this is very important and at the same time not mainstream.<br />
<br />
Also, the Apertium community is very friendly and open to new members; people here are always ready to help you. That encourages me to work with these people.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
== A description of how and who it will benefit in society ==<br />
Firstly, the methods which will be developed during this project will help to improve the quality of translation for many language pairs and reduce the amount of human work.<br />
<br />
Secondly, there are currently very few papers about using postedits to improve an RBMT system, so my work will contribute to learning more about this approach.<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
<br />
=== Community bonding period ===<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' </li><br />
<li>'''Week 2:''' </li><br />
<li>'''Week 3:''' </li><br />
<li>'''Week 4:''' </li><br />
<li>'''Deliverable #1, June 26 - 30'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' </li><br />
<li>'''Week 6:''' </li><br />
<li>'''Week 7:''' </li><br />
<li>'''Week 8:''' </li><br />
<li>'''Deliverable #2, July 24 - 28'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' </li><br />
<li>'''Week 10:''' </li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Project completed:''' </li><br />
</ul><br />
<br />
Also, I am going to write short notes about the work process on my project page during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with the Apertium system, I can work on the project during the community bonding period. After my exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine/proposal2018&diff=66221User:Deltamachine/proposal20182018-03-11T07:31:12Z<p>Deltamachine: Created page with "== Contact information == <p>'''Name:''' Anna Kondrateva</p> <p>'''Location:''' Moscow, Russia</p> <p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p> <p>'''Phone number:''' +792..."</p>
<hr />
<div>== Contact information ==<br />
<p>'''Name:''' Anna Kondrateva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''SourceForge:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== Skills and experience ==<br />
<p>I am a third-year bachelor's student at the Linguistics Faculty of the National Research University «Higher School of Economics» (NRU HSE).</p><br />
<p>'''Main university courses:'''</p><br />
<ul><br />
<li>Programming (Python, R)</li><br />
<li>Computer Tools for Linguistic Research</li><br />
<li>Theory of Language (Phonetics, Morphology, Syntax, Semantics)</li><br />
<li>Language Diversity and Typology</li><br />
<li>Machine Learning</li><br />
<li>Math (Discrete Math, Linear Algebra and Calculus, Probability Theory and Mathematical Statistics)</li><br />
<li>Theory of Algorithms</li><br />
<li>Databases</li><br />
</ul><br />
<p>'''Technical skills:'''</p><br />
<ul><br />
<li>Programming languages: Python, R, Javascript</li><br />
<li>Web design: HTML, CSS </li><br />
<li>Frameworks: Flask, Django</li><br />
<li>Databases: SQLite, PostgreSQL, MySQL</li><br />
</ul><br />
<p>'''Projects and experience:''' http://github.com/deltamachine</p><br />
<p>'''Languages:''' Russian (native), English, German</p><br />
<br />
== Why is it you are interested in machine translation? ==<br />
<br />
== Why is it that you are interested in Apertium? ==<br />
I have participated in GSoC 2017 with Apertium and it was a great experience. I have successfully finished my project, learned a lot of new things and had a lot of fun. Also I have participated in GCI 2017 as a mentor for Apertium and it was great too. So I am very interested in contributing to Apertium.<br />
<br />
The Apertium community is very friendly and open to new members; people here are always ready to help you. Also, this organisation works on things which are very interesting to me as a computational linguist: (rule-based) machine translation, minority languages, NLP and so on. <br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I would like to work on [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/automatic-postediting improving language pairs by mining MediaWiki Content Translation postedits].<br />
<br />
== Reasons why Google and Apertium should sponsor it ==<br />
<br />
== A description of how and who it will benefit in society ==<br />
<br />
== Work plan ==<br />
<br />
=== Post application period ===<br />
<br />
=== Community bonding period ===<br />
<br />
=== Work period ===<br />
<ul><br />
==== Part 1, weeks 1-4: ====<br />
<p></p><br />
<li>'''Week 1:''' </li><br />
<li>'''Week 2:''' </li><br />
<li>'''Week 3:''' </li><br />
<li>'''Week 4:''' </li><br />
<li>'''Deliverable #1, June 26 - 30'''</li><br />
<p></p><br />
<br />
==== Part 2, weeks 5-8: ====<br />
<p></p><br />
<li>'''Week 5:''' </li><br />
<li>'''Week 6:''' </li><br />
<li>'''Week 7:''' </li><br />
<li>'''Week 8:''' </li><br />
<li>'''Deliverable #2, July 24 - 28'''</li><br />
<p></p><br />
<br />
==== Part 3, weeks 9-12: ====<br />
<p></p><br />
<li>'''Week 9:''' </li><br />
<li>'''Week 10:''' </li><br />
<li>'''Week 11:''' testing, fixing bugs</li><br />
<li>'''Week 12:''' cleaning up the code, writing documentation</li><br />
<li>'''Project completed:''' </li><br />
</ul><br />
<br />
Also, I am going to write short notes about the work process on my project page during the whole summer.<br />
<br />
== Non-Summer-of-Code plans you have for the Summer ==<br />
I have exams at the university until the third week of June, so I will be able to work only 20-25 hours per week. But since I am already familiar with the Apertium system, I can work on the project during the community bonding period. After my exams I will be able to work full time and spend 45-50 hours per week on the task.<br />
<br />
== Coding challenge ==</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine&diff=66220User:Deltamachine2018-03-11T07:17:55Z<p>Deltamachine: /* GSoC */</p>
<hr />
<div><br />
== Contact info ==<br />
<br />
<p>'''Name:''' Anna Kondratjeva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''VK:''' http://vk.com/anya_archer</p><br />
<p>'''Github:''' http://github.com/deltamachine</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== GSoC ==<br />
<br />
[http://wiki.apertium.org/wiki/User:Deltamachine/proposal My proposal to Google Summer of Code 2017]<br />
<br />
[http://wiki.apertium.org/wiki/User:Deltamachine/proposal2018 My proposal to Google Summer of Code 2018]</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=User:Deltamachine&diff=66219User:Deltamachine2018-03-11T07:17:41Z<p>Deltamachine: </p>
<hr />
<div><br />
== Contact info ==<br />
<br />
<p>'''Name:''' Anna Kondratjeva</p><br />
<p>'''Location:''' Moscow, Russia</p><br />
<p>'''E-mail:''' an-an-kondratjeva@yandex.ru</p><br />
<p>'''Phone number:''' +79250374221</p><br />
<p>'''VK:''' http://vk.com/anya_archer</p><br />
<p>'''Github:''' http://github.com/deltamachine</p><br />
<p>'''IRC:''' deltamachine</p><br />
<p>'''Timezone:''' UTC+3</p><br />
<br />
== GSoC ==<br />
<br />
[http://wiki.apertium.org/wiki/User:Deltamachine/proposal My proposal to Google Summer of Code 2017]<br />
[http://wiki.apertium.org/wiki/User:Deltamachine/proposal2018 My proposal to Google Summer of Code 2018]</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Task_ideas_for_Google_Code-in&diff=64734Task ideas for Google Code-in2017-11-15T10:50:51Z<p>Deltamachine: </p>
<hr />
<div>{{TOCD}}<br />
This is the task ideas page for [https://developers.google.com/open-source/gci/ Google Code-in], here you can find ideas on interesting tasks that will improve your knowledge of Apertium and help you get into the world of open-source development.<br />
<br />
The people column lists the people you should get in contact with to request further information. All tasks are estimated to take an experienced developer a maximum of 2 hours, however:<br />
<br />
# '''this does not include time taken to [[Minimal installation from SVN|install]] / set up apertium (and relevant tools)'''.<br />
# this is the time an experienced developer is expected to take; you may find that you spend more time on the task because of the learning curve. <br />
<br />
<!--Если ты не понимаешь английский язык или предпочитаешь работать над русским языком или другими языками России, смотри: [[Task ideas for Google Code-in/Russian]]--><br />
'''Categories:'''<br />
<br />
* {{sc|code}}: Tasks related to writing or refactoring code<br />
* {{sc|documentation}}: Tasks related to creating/editing documents and helping others learn more<br />
* {{sc|research}}: Tasks related to community management, outreach/marketing, or studying problems and recommending solutions<br />
* {{sc|quality}}: Tasks related to testing and ensuring code is of high quality.<br />
* {{sc|interface}}: Tasks related to user experience research or user interface design and interaction<br />
<br />
'''Clarification of "multiple task" types'''<br />
* multi = number of students who can do a given task<br />
* dup = number of times a student can do the same task<br />
<br />
You can find descriptions of some of the mentors [[List_of_Apertium_mentors | here]].<br />
<br />
==Task ideas==<br />
<table class="sortable wikitable" style="display: none"><br />
<!-- THE TASKS NEED TO BE HIDDEN FOR NOW,<br />
but feel free to remove style="display: none" to preview changes to this page.<br />
Just remember to put it back before saving<br />
JNW 2017-10-30<br />
--><br />
<tr><th>type</th><th>title</th><th>description</th><th>tags</th><th>mentors</th><th>bgnr?</th><th>multi?</th><th>duplicates</th></tr><br />
{{Taskidea<br />
|type=research<br />
|title=Document resources for a language<br />
|description=Document resources for a language without resources already documented on the Apertium wiki. [[Task ideas for Google Code-in/Documentation of resources|read more...]]<br />
|tags=wiki, languages<br />
|mentors=Jonathan, Vin, Xavivars, Marc Riera<br />
|multi=40<br />
|beginner=yes<br />
}}{{Taskidea<br />
|type=research<br />
|title=Write a contrastive grammar<br />
|description=Document 6 differences between two (preferably related) languages and where they would need to be addressed in the [[Apertium pipeline]] (morph analysis, transfer, etc). Use a grammar book/resource for inspiration. Each difference should have no fewer than 3 examples. Put your work on the Apertium wiki under [[Language1_and_Language2/Contrastive_grammar]]. See [[Farsi_and_English/Pending_tests]] for an example of a contrastive grammar that a previous GCI student made.<br />
|mentors=Vin, Jonathan, Fran, mlforcada<br />
|tags=wiki, languages<br />
|beginner=yes<br />
|multi=40<br />
}}<br />
{{Taskidea|type=interface|mentors=Fran, Masha, Jonathan<br />
|tags=annotation, annotatrix<br />
|title=Nicely laid out interface for ud-annotatrix <br />
|description=Design an HTML layout for the annotatrix tool that makes best use of the space and functions nicely<br />
at different screen resolutions.<br />
}}<br />
{{Taskidea|type=interface|mentors=Fran, Masha, Jonathan, Vin<br />
|tags=annotation, annotatrix, css<br />
|title=Come up with a CSS style for annotatrix<br />
|description=<br />
}}<br />
{{Taskidea|type=code|mentors=Fran, Masha, Jonathan, Vin<br />
|tags=annotation, annotatrix, javascript, dependencies<br />
|title=SDparse to CoNLL-U converter in JavaScript<br />
|description=SDparse is a format for describing dependency trees; relations are written as relation(head, dependency). CoNLL-U is another<br />
format for describing dependency trees. Make a converter between the two formats. You will probably need to learn more about the specifics of these formats. The GitHub issue is [https://github.com/jonorthwash/ud-annotatrix/issues/88 here].<br />
}}<br />
{{Taskidea|type=quality|mentors=Fran, Masha, Vin<br />
|tags=annotation, annotatrix<br />
|title=Write a test for the format converters in annotatrix<br />
|description=<br />
|multi=yes<br />
}}<br />
{{Taskidea|type=code|mentors=Fran, Masha, Jonathan<br />
|tags=annotation, annotatrix, javascript<br />
|title=Write a function to detect invalid trees in the UD annotatrix software and advise the user about it<br />
|description=It is possible to detect invalid trees (such as those that have cycles). We would like to write a function to detect those kinds of trees and advise the user. The GitHub issue is [https://github.com/jonorthwash/ud-annotatrix/issues/96 here].<br />
}}<br />
{{Taskidea|type=documentation|mentors=Fran, Masha, Jonathan, Vin<br />
|tags=annotation, annotatrix, dependencies<br />
|title=Write a tutorial on how to use annotatrix to annotate a dependency tree<br />
|description=Give step-by-step instructions for annotating a dependency tree with Annotatrix. Make sure you include all the possibilities in the app, for example tokenisation options.<br />
}}<br />
{{Taskidea|type=documentation|mentors=Fran, Masha, Vin<br />
|tags=annotation, annotatrix, video, dependencies<br />
|title=Make a video tutorial on annotating a dependency tree using the [https://github.com/jonorthwash/ud-annotatrix/ UD annotatrix software].<br />
|description=Give step-by-step instructions for annotating a dependency tree with Annotatrix. Make sure you include all the possibilities available in the app, for example tokenisation options.<br />
}}<br />
{{Taskidea|type=quality|mentors=Masha|tags=xml, dictionaries, svn<br />
|title=Merge two versions of the Polish morphological dictionary<br />
|description=At some point in the past, someone deleted a lot of entries from the Polish morphological dictionary, and unfortunately we didn't notice at the time and have since added stuff to it. The objective of this task is to take the last<br />
version before the mass deletion and the current version and merge them.<br />
Getting list of the changes: <br />
$ svn diff --old apertium-pol.pol.dix@73196 --new apertium-pol.pol.dix@73199 > changes.diff<br />
}}<br />
{{Taskidea|type=quality|mentors=fotonzade, Jonathan, Xavivars, Marc Riera, mlforcada<br />
|tags=xml, dictionaries, svn<br />
|title=Add 200 new entries to a bidix to language pair %AAA%-%BBB%<br />
|description=Our translation systems require large lexicons so as to provide production-quality coverage of any input data. This task requires the student to add 200 new words to a bidirectional dictionary.<br />
|multi=yes<br />
|bgnr=yes<br />
}}<br />
{{Taskidea|type=quality|mentors=fotonzade, Jonathan, Xavivars, Marc Riera, mlforcada<br />
|tags=xml, dictionaries, svn<br />
|title=Add 500 new entries to the bidix of language pair %AAA%-%BBB%<br />
|description=Our translation systems require large lexicons so as to provide production-quality coverage of any input data. This task requires the student to add 500 new words to a bidirectional dictionary.<br />
|multi=yes<br />
}}<br />
{{Taskidea|type=quality|mentors=fotonzade, Xavivars, Marc Riera, mlforcada<br />
|tags=disambiguation, svn<br />
|title=Disambiguate 500 tokens of text in %AAA%<br />
|description=Run some text through a morphological analyser and disambiguate the output. Contact the mentor beforehand to approve the choice of language and text.<br />
|multi=yes<br />
}}<br />
{{Taskidea|type=code|mentors=Fran, Katya|tags=morphology, languages, finite-state, fst<br />
|title=Use apertium-init to start a new morphological analyser for %AAA%<br />
|description=Use apertium-init to start a new morphological analyser (for a language we don't already <br />
have, e.g. %AAA%) and add 100 words.<br />
|multi=yes<br />
}}<br />
{{Taskidea<br />
|type=documentation<br />
|mentors=Jonathan, Flammie<br />
|title=add comments to .dix file symbol definitions<br />
|tags=dix<br />
}}<br />
{{Taskidea<br />
|type=documentation<br />
|mentors=Jonathan<br />
|title=find symbols that aren't on the list of symbols page<br />
|description=Go through symbol definitions in Apertium dictionaries in svn (.lexc and .dix format), and document any symbols you don't find on the [[List of symbols]] page. This task is fulfilled by adding at least one class of related symbols (e.g., xyz_*) or one major symbol (e.g., abc), along with notes about what it means.<br />
|tags=wiki,lexc,dix<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|title=conllu parser and searching<br />
|description=Write a script (preferably in python3) that will parse files in conllu format, and perform basic searches, such as "find a node that has an nsubj relation to another node that has a noun POS" or "find all nodes with a cop label and a past feature". A rough sketch follows this entry.<br />
|tags=python, dependencies<br />
|mentors=Jonathan, Fran, Wei En, Anna<br />
}}<br />
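A minimal sketch of the parsing plus one example query, assuming plain CoNLL-U on stdin (the field names and the query shown are illustrative only):<br />
<pre>
#!/usr/bin/env python3
# Sketch: parse CoNLL-U from stdin and print nodes with a 'cop' deprel and a Tense=Past feature.
import sys

FIELDS = ['id', 'form', 'lemma', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc']

def read_sentences(stream):
    sent = []
    for line in stream:
        line = line.rstrip('\n')
        if not line:
            if sent:
                yield sent
            sent = []
        elif not line.startswith('#'):
            cols = line.split('\t')
            if len(cols) == 10 and cols[0].isdigit():  # skip multiword tokens and empty nodes
                sent.append(dict(zip(FIELDS, cols)))
    if sent:
        yield sent

def has_feat(node, feat):
    return feat in node['feats'].split('|')

if __name__ == '__main__':
    for sent in read_sentences(sys.stdin):
        by_id = {n['id']: n for n in sent}
        for node in sent:
            if node['deprel'] == 'cop' and has_feat(node, 'Tense=Past'):
                head = by_id.get(node['head'], {}).get('form', 'ROOT')
                print(node['form'], '<-cop-', head)
</pre>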
{{Taskidea<br />
|type=code<br />
|title=group and count possible lemmas output by guesser<br />
|mentors=Jonathan, Fran, Wei En<br />
|description=Currently a "guesser" version of Apertium transducers can output a list of possible analyses for unknown forms. Develop a new pipleine, preferably with shell scripts or python, that uses a guesser on all unknown forms in a corpus, and takes the list of all possible analyses, and output a hit count of the most common combinations of lemma and POS tag.<br />
|tags=guesser, transducers, shellscripts<br />
}}<br />
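A rough sketch of the counting step only, assuming the guesser's analyses arrive on stdin in Apertium stream format (units like ^form/analysis1/analysis2$, where each analysis is a lemma followed by tags); extracting the unknown forms and running the guesser itself are left out:<br />
<pre>
#!/usr/bin/env python3
# Sketch: count (lemma, first tag) combinations in guesser output read from stdin.
# Assumes Apertium stream format units like: ^form/lemma1<n><nom>/lemma2<v><inf>$
import re
import sys
from collections import Counter

UNIT = re.compile(r'\^([^/$]+)/([^$]+)\$')
counts = Counter()

for match in UNIT.finditer(sys.stdin.read()):
    for analysis in match.group(2).split('/'):
        m = re.match(r'([^<]+)<([^>]+)>', analysis)
        if m:
            lemma, pos = m.groups()  # the first tag is taken as the POS
            counts[(lemma, pos)] += 1

for (lemma, pos), n in counts.most_common(20):
    print(f'{n}\t{lemma}<{pos}>')
</pre>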
{{Taskidea<br />
|type=code<br />
|title=vim mode/tools for annotating dependency corpora in CG3 format<br />
|mentors=Jonathan, Fran<br />
|description=includes formatting, syntax highlighting, navigation, adding/removing nodes, updating node numbers, etc.<br />
|tags=vim, dependencies, CG3<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|title=vim mode/tools for annotating dependency corpora in CoNLL-U format<br />
|mentors=Jonathan, Fran<br />
|description=includes formatting, syntax highlighting, navigation, adding/removing nodes, updating node numbers, etc.<br />
|tags=vim, dependencies, conllu<br />
}}{{Taskidea<br />
|type=quality<br />
|title=figure out one-to-many bug in the [[lsx module]]<br />
|mentors=Jonathan, Fran, Wei En, Irene<br />
|description=There is a bug in the [[lsx module]] referred to as the [http://wiki.apertium.org/wiki/Lsx_module#The_one-to-many_bug one-to-many bug] because lsx-proc will not convert one form to many given an appropriately compiled transducer. Your job is to figure out why this happens and fix it.<br />
|tags=C++, transducers, lsx<br />
}}{{Taskidea<br />
|type=code<br />
|title=add an option for reverse compiling to the [[lsx module]]<br />
|mentors=Jonathan, Fran, Wei En, Irene, Xavivars<br />
|description=this should be simple as it can just leverage the existing lttoolbox options for left-right / right-left compiling<br />
|tags=C++, transducers, lsx<br />
}}{{Taskidea<br />
|type=quality, code<br />
|title=clean up lsx-comp<br />
|mentors=Jonathan, Fran, Wei En, Irene, Xavivars<br />
|description=remove extraneous functions from lsx-comp and clean up the code<br />
|tags=C++, transducers, lsx<br />
}}{{Taskidea<br />
|type=quality, code<br />
|title=clean up lsx-proc<br />
|mentors=Jonathan, Fran, Wei En, Irene, Xavivars<br />
|description=remove extraneous functions from lsx-proc and clean up the code<br />
|tags=C++, transducers, lsx<br />
}}{{Taskidea<br />
|type=documentation<br />
|title=document usage of the lsx module<br />
|mentors= Irene<br />
|description= document which language pairs have included the lsx module in their packages, which have beta-tested the lsx module, and which are good candidates for including support for lsx. Add your findings to [[Lsx_module/supported_languages | this wiki page]].<br />
|tags=C++, transducers, lsx<br />
|beginner=yes<br />
}}{{Taskidea<br />
|type=quality<br />
|title=beta testing the lsx-module<br />
|mentors=Jonathan, Fran, Wei En, Irene<br />
|description= [[Lsx_module#Creating_the_lsx-dictionary|create an lsx dictionary]] for any relevant and existing language pair that doesn't yet support it, adding 10-30 entries to it. Thoroughly test to make sure the output is as expected. Report bugs/non-supported features and add them to [[Lsx_module#Future_work| future work]]. Document your tested language pair by listing it under [[Lsx_module#Beta_testing]] and on [[Lsx_module/supported_languages | this wiki page]].<br />
|tags=C++, transducers, lsx<br />
|multi=yes<br />
|dup=yes<br />
}}{{Taskidea<br />
|type=code<br />
|title=fix an lsx bug / add an lsx feature<br />
|mentors=Jonathan, Fran, Wei En, Irene<br />
|description= if you've done the above task (beta testing the lsx-module) and discovered any bugs or unsupported features, fix them.<br />
|tags=C++, transducers, lsx<br />
|multi=yes<br />
|dup=yes<br />
}}{{Taskidea<br />
|type=code<br />
|title=script to test coverage over wikipedia corpus<br />
|mentors=Jonathan, Wei En, Shardul<br />
|description=Write a script (in python or ruby) that in one mode checks out a specified language module to a given directory, compiles it (or updates it if it already exists), and then gets the most recent nightly Wikipedia archive for that language and runs coverage over it (as much in RAM as possible). In another mode, it compiles the language pair in a docker instance that it then disposes of after successfully running coverage. Scripts exist in Apertium already for finding where a Wikipedia is, extracting a Wikipedia archive into a text file, and running coverage.<br />
|tags=python, ruby, wikipedia<br />
}}{{Taskidea<br />
|type=quality,code<br />
|tags=issues<br />
|title=fix any open ticket<br />
|description=Fix any open ticket in any of our issue trackers: [https://sourceforge.net/p/apertium/tickets/ main], [https://github.com/goavki/apertium-html-tools/issues html-tools], [https://github.com/goavki/phenny/issues begiak]. When you claim this task, let your mentor know which issue you plan to work on.<br />
|mentors=Jonathan, Wei En, Sushain, Shardul<br />
|multi=25<br />
|dup=10<br />
}}<br />
{{Taskidea<br />
|type=quality,code<br />
|title=make html-tools do better on Chrome's audit<br />
|tags=javascript, html, css, web<br />
|description=Currently, apertium.org and generally any [https://github.com/goavki/apertium-html-tools html-tools] installation fails lots of Chrome audit tests. As many as possible should be fixed. Ones that require substantial work should be filed as tickets and measures should be taken to prevent problems from reappearing (e.g. a test or linter rule). More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/201 #201]) and asynchronous discussion should occur there.<br />
|mentors=Jonathan, Sushain, Shardul<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=upgrade html-tools to Bootstrap 4<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] uses Bootstrap 3.x. Bootstrap 4 beta is out and we can upgrade (hopefully)! If an upgrade is not possible, you should document why it's not and ensure that it's easy to upgrade when the blockers are removed. More information may be available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/200 #200]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Shardul<br />
|bgnr=yes<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=display API endpoint on sandbox<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] has an "APy" mode where users can easily test out the API. However, it doesn't display the actual URL of the API endpoint and it would be nice to show that to the user. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/147 #147]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Shardul<br />
|bgnr=yes<br />
}}<br />
{{Taskidea<br />
|type=code,quality,research<br />
|title=set up a testing framework for html-tools<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] has no tests (sad!). This task requires researching what solutions there are for testing jQuery based web applications and putting one into place with a couple tests as a proof of concept. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/116 #116]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Shardul<br />
}}<br />
{{Taskidea<br />
|type=code,research<br />
|title=make html-tools automatically download translated files in Safari, IE, etc.<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] is capable of translating files. However, this translation does not always result in the file immediately being downloaded to the user on all browsers. It would be awesome if it did! This task requires researching what solutions there are, evaluating them against each other and it may result in a conclusion that it just isn't possible (yet). More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/97 #97]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Unhammer, Shardul<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=make html-tools fail more gracefully when API is down<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] relies on an API endpoint to translate documents, files, etc. However, when this API is down the interface also breaks! This task requires fixing this breakage. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/207 #207]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Shardul<br />
|bgnr=yes<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=make html-tools properly align text in mixed RTL/LTR contexts<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] is capable of displaying results/allowing input for RTL languages in an LTR context (e.g. we're translating Arabic in an English website). However, this doesn't always look exactly how it should look, i.e. things are not aligned correctly. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/49 #49]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Shardul<br />
|bgnr=yes<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=de-conflict the 'make a suggestion' interface in html-tools<br />
|tags=javascript, html, css, web<br />
|description=There has been much demand for [https://github.com/goavki/apertium-html-tools html-tools] to support an interface for users making suggestions regarding e.g. incorrect translations (c.f. Google translate). An interface was designed for this purpose. However, since it has been a while since anyone touched it, the code now conflicts with the current master branch. This task requires de-conflicting this [https://github.com/goavki/apertium-html-tools/pull/74 branch] with master and providing screenshot/video(s) of the interface to show that it functions. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/74 #74]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Shardul<br />
}}<br />
{{Taskidea<br />
|type=code,quality<br />
|title=make html-tools capable of translating itself<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] supports website translation. However, if asked to translate itself, weird things happen and the interface does not properly load. This task requires figuring out the root problem and correcting the fault. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/203 #203]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Shardul<br />
|bgnr=yes<br />
}}<br />
{{Taskidea<br />
|type=interface<br />
|title=create mock-ups for variant support in html-tools<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] supports translation using language variants. However, we do not have first-class style/interface support for it. This task requires speaking with mentors/reading existing discussion to understand the problem and then produce design mockups for a solution. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/82 #82]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Fran, Shardul, Xavivars<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=refine the html-tools dictionary interface<br />
|tags=javascript, html, css, web<br />
|description=Significant progress has been made towards providing a dictionary-style interface within [https://github.com/goavki/apertium-html-tools html-tools]. This task requires refining the existing [https://github.com/goavki/apertium-html-tools/pull/184 PR] by de-conflicting it with master and resolving the interface concerns discussed [https://github.com/goavki/apertium-html-tools/pull/184#issuecomment-323597780 here]. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/105 #105]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Xavivars<br />
}}<br />
{{Taskidea<br />
|type=code,quality,interface<br />
|title=eliminate inline styles from html-tools<br />
|tags=html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] has inline styles. These are not very maintainable and widely considered as bad style. This task requires surveying the uses, removing all of them in a clean manner, i.e. semantically, and re-enabling the linter rule that will prevent them going forward. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/114 #114]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Shardul, Xavivars<br />
|bgnr=yes<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=refine the html-tools spell checking interface<br />
|tags=html, css, web<br />
|description=Spell checking is a feature that would greatly benefit [https://github.com/goavki/apertium-html-tools html-tools]. Significant effort has been put towards implementing an effective interface to provide spelling suggestions to users (this [https://github.com/goavki/apertium-html-tools/pull/176 PR] contains the current progress). This task requires solving the problems highlighted in the code review on the PR and fixing any other bugs uncovered in conversations with the mentors. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/12 #12]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan<br />
}}<br />
{{Taskidea<br />
|type=quality<br />
|title=find an apertium module not developed in svn and import it<br />
|description=Find an Apertium module developed elsewhere (e.g., github) released under a compatible open license, and import it into [http://wiki.apertium.org/wiki/SVN Apertium's svn], being sure to attribute any authors (in an AUTHORS file) and keeping the original license. One place to look for such modules might be among the [https://wikis.swarthmore.edu/ling073/Category:Sp17_FinalProjects final projects] in a recent Computational Linguistics course.<br />
|mentors=Jonathan, Wei En<br />
|multi=10<br />
|dup=2<br />
}}{{Taskidea<br />
|type=code<br />
|title=add an incubator mode to the wikipedia scraper<br />
|tags=wikipedia, python<br />
|description=Add a mode to scrape a Wikipedia in incubator (e.g., [https://incubator.wikimedia.org/wiki/Wp/inh/Main_page the Ingush incubator]) to the [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py WikiExtractor] script<br />
|mentors=Jonathan, Wei En<br />
}}{{Taskidea<br />
|type=code,interface<br />
|title=add a translation mode interface to the geriaoueg plugin for firefox<br />
|description=Fork the [https://github.com/vigneshv59/geriaoueg-firefox geriaoueg firefox plugin] and add an interface for translation mode. It doesn't have to translate at this point, but it should communicate with the server (as it currently does) to load available languages.<br />
|tags=javascript<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=code, interface<br />
|title=add a translation mode interface to the geriaoueg plugin for chrome<br />
|description=Fork the [https://github.com/vigneshv59/geriaoueg-chrome geriaoueg chrome plugin] and add an interface for translation mode. It doesn't have to translate at this point, but it should communicate with the server (as it currently does) to load available languages.<br />
|tags=javascript<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=quality<br />
|title=update bidix included in apertium-init<br />
|description=There are some issues with the bidix currently included in [https://github.com/goavki/bootstrap/ apertium-init]: the alphabet should be empty (or non-existent?) and the "sg" tags shouldn't be in the example entries. It would also be good to have entries in two different languages, especially ones with incompatible POS sub-categories (e.g. casa{{tag|n}}{{tag|f}}). There is [https://github.com/goavki/bootstrap/issues/24 a github issue for this task].<br />
|tags=python, xml, dix<br />
|beginner=yes<br />
|mentors=Jonathan, Sushain<br />
}}{{Taskidea<br />
|type=code<br />
|title=apertium-init support for more features in hfst modules<br />
|description=Add optional support to hfst modules for enabling spelling modules, an extra twoc module for morphotactic constraints, and spellrelax. You'll want to figure out how to integrate this into the Makefile template. There is [https://github.com/goavki/bootstrap/issues/23 a github issue for this task].<br />
|tags=python, xml, Makefile<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=code, quality<br />
|title=make apertium-init README files show only relevant dictionary file<br />
|description=Currently in [https://github.com/goavki/bootstrap/ apertium-init], the README files for HFST modules show the "dix" file in the list of files, and it's likely that lttoolbox modules show "hfst" files in their README too. Check this and make it so that READMEs for these two types of monolingual modules display only the right dictionary files. There is [https://github.com/goavki/bootstrap/issues/26 a github issue for this task].<br />
|tags=python, xml, Makefile<br />
|mentors=Jonathan, Sushain<br />
}}{{Taskidea<br />
|type=code, quality<br />
|title=Write a script to add glosses to a monolingual dictionary from a bilingual dictionary<br />
|description=Write a script that matches bilingual dictionary entries (in dix format) to monolingual dictionary entries in one of the languages (in [[Apertium-specific conventions for lexc|lexc]] format) and adds glosses from the other side of the bilingual dictionary if not already there. The script should combine glosses into one when there's more than one in the bilingual dictionary. Some level of user control might be justified, from simply defaulting to a dry run unless otherwise specified, to controls for adding to versus replacing versus leaving alone existing glosses, and the like. A [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/inject-words-from-bidix-to-lexc.py prototype of this script] is available in SVN, though it's buggy and doesn't fully work—so this task may just end up being to debug it and make it work as intended. A good test case might be the [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-eng-kaz/apertium-eng-kaz.eng-kaz.dix English-Kazakh bilingual dictionary] and the [http://svn.code.sf.net/p/apertium/svn/languages/apertium-kaz/apertium-kaz.kaz.lexc Kazakh monolingual dictionary].<br />
|tags=python, lexc, dix, xml<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=code<br />
|title=Write a script to deduplicate and/or sort individual lexc lexica.<br />
|description=The lexc format is a way to specify a monolingual dictionary that gets compiled into a transducer: see [[Apertium-specific conventions for lexc]] and [[Lttoolbox and lexc#lexc]]. A single lexc file may contain quite a few individual lexicons of stems, e.g. for nouns, verbs, prepositions, etc. Write a script (in python or ruby) that reads a specified lexicon, and based on which option the user specifies, identifies and removes duplicates from the lexicon, and/or sorts the entries in the lexicon. Be sure to make a dry-run (i.e., do not actually make the changes) the default, and add different levels of debugging (such as displaying a number of duplicates versus printing each duplicate). Also consider allowing for different criteria for matching duplicates: e.g., whether or not the comment matches too. There are two scripts that parse lexc files already that would be a good point to start from: [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/lexccounter.py lexccounter.py] and [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/inject-words-from-bidix-to-lexc.py inject-words-from-bidix-to-lexc.py] (not fully functional).<br />
|tags=python, ruby, lexc<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=quality, interface<br />
|title=Interface improvement for Apertium Globe Viewer<br />
|description=The [https://github.com/jonorthwash/Apertium-Global-PairViewer Apertium Globe Viewer] is a tool to visualise the translation pairs that Apertium currently offers, similar to the [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/pairviewer/apertium.html apertium pair viewer]. Choose any [https://wikis.swarthmore.edu/ling073/User:Cpillsb1/Final_project interface or usability issue] listed in the tool's documentation in consultation with your mentor, file an [https://github.com/jonorthwash/Apertium-Global-PairViewer/issues issue], and fix it.<br />
|tags=javascript, maps<br />
|multi=3<br />
|dup=5<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=quality, code<br />
|title=Separate geographic and module data for Apertium Globe Viewer<br />
|description=The [https://github.com/jonorthwash/Apertium-Global-PairViewer Apertium Globe Viewer] is a tool to visualise the translation pairs that Apertium currently offers, similar to the [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/pairviewer/apertium.html apertium pair viewer]. Currently, geographic data for languages and pairs (latitude, longitude) is stored with the size of the dictionary, etc. Find a way to separate this data into distinct files (named sensibly), and at the same time make it possible to specify only the points for each language and not the endpoints for the arcs for language pairs (those should be trivial to generate dynamically).<br />
|tags=javascript, json<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=quality, code<br />
|title=Scraper of information needed for Apertium visualisers<br />
|description=There are currently three prototype visualisers for the translation pairs Apertium offers: [https://github.com/jonorthwash/Apertium-Global-PairViewer Apertium Globe Viewer] and [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/pairviewer/apertium.html apertium pair viewer] and [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/family-visualizations/ language family visualisation tool]. They all rely on data about Apertium linguistic modules, and that data has to be scraped. There are scripts that do different pieces of this already, but they are not unified: [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/wiki-tools/dixTable.py queries svn], [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/family-visualizations/overtime.rb queries svn revisions], [http://wiki.apertium.org/wiki/The_Right_Way_to_count_dix_stems counting bidix stems]. Evaluate how well these scripts work, and attempt to make them output data that will be compatible with all viewers (and/or modify the viewers to make sure they are compatible with the general output format).<br />
|tags=python, json, scrapers<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=quality<br />
|title=fix pairviewer's 2- and 3-letter code conflation problems<br />
|description=[[pairviewer]] doesn't always conflate languages that have two codes. E.g. sv/swe, nb/nob, de/deu, da/dan, uk/ukr, et/est, nl/nld, he/heb, ar/ara, eus/eu are each two separate nodes, but should instead each be collapsed into one node. Figure out why this isn't happening and fix it. Also, implement an algorithm to generate 2-to-3-letter mappings for available languages based on having the identical language name in languages.json instead of loading the huge list from codes.json; try to make this as processor- and memory-efficient as possible. <br />
|tags=javascript<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=quality, code<br />
|title=split nor into nob and nno in pairviewer<br />
|description=Currently in [[pairviewer]], nor is displayed as a language separately from nob and nno. However, the nor pair actually consists of both an nob and an nno component. Figure out a way for pairviewer (or pairsOut.py / get_all_lang_pairs.py) to detect this split. So instead of having swe-nor, there would be swe-nob and swe-nno displayed (connected seamlessly with other nob-* and nno-* pairs), though the paths between the nodes would each still give information about the swe-nor pair. Implement a solution, trying to make sure it's future-proof (i.e., will work with similar sorts of things in the future).<br />
|mentors=Jonathan, Fran, Unhammer<br />
|tags=javascript<br />
}}{{Taskidea<br />
|type=quality, code<br />
|title=add support to pairviewer for regional and alternate orthographic modes<br />
|description=Currently in [[pairviewer]], there is no way to detect or display modes like zh_TW. Add support to pairsOut.py / get_all_lang_pairs.py to detect pairs containing abbreviations like this, as well as alternate orthographic modes in pairs (e.g. uzb_Latn and uzb_Cyrl). Also, figure out a way to display these nicely in the pairviewer's front-end. Get creative. I can imagine something like zh_CN and zh_TW nodes that are in some fixed relation to zho (think Mickey Mouse configuration?). Run some ideas by your mentor and implement what's decided on.<br />
|mentors=Jonathan, Fran<br />
|tags=javascript<br />
}}{{Taskidea<br />
|type=code<br />
|title=Extend visualisation of pairs involving a language in language family visualisation tool<br />
|description=The [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/family-visualizations/ language family visualisation tool] currently has a visualisation of all pairs involving the language. Extend this to include pairs that involve those languages, and so on, until there are no more pairs. This should result in a graph of quite a few languages, with the current language in the middle. Note that if language x is the center, and there are x-y and x-z pairs, but also a y-z pair, this should display the y-z pair with a link, not with an extra z and y node each, connected to the original y and z nodes, respectively. The best way to do this may involve some sort of filtering of the data.<br />
|mentors=Jonathan<br />
|tags=javascript<br />
}}{{Taskidea<br />
|type=code<br />
|title=Scrape Crimean Tatar Quran translation from a website<br />
|description=Bible and Quran translations often serve as a parallel corpus useful for solving NLP tasks because both texts are available in many languages. Your goal in this task is to write a program in the language of your choice which scrapes the Quran translation in the Crimean Tatar language available on the following website: http://crimean.org/islam/koran/dizen-qurtnezir/. You can adapt the scraper described on the [[Writing a scraper]] page or write your own from scratch. The output should be plain text in Tanzil format ('text with aya numbers'). You can see examples of that format on the http://tanzil.net/trans/ page. When scraping, please be polite and request data at a reasonable rate.<br />
|mentors=Ilnar, Jonathan, fotonzade<br />
|tags=scraper<br />
}}{{Taskidea<br />
|type=code<br />
|title=Scrape Quran translations from a website<br />
|description=Bible and Quran translations often serve as a parallel corpus useful for solving NLP tasks because both texts are available in many languages. Your goal in this task is to write a program in the language of your choice which scrapes the Quran translations available on the following website: http://www.quran-ebook.com/. You can adapt the scraper described on the [[Writing a scraper]] page or write your own from scratch. The output should be plain text in Tanzil format ('text with aya numbers'). You can see examples of that format on the http://tanzil.net/trans/ page. Before starting, check whether the translation is already available on the Tanzil project's page (no need to re-scrape those, but you should use them to test the output of your program). Although the format of the translations seems to be the same and thus your program is expected to work for all of them, the translations we are most interested in are the following: [http://www.quran-ebook.com/azerbaijan_version2/1.html Azerbaijani version 2], [http://www.quran-ebook.com/bashkir_version/index_ba.html Bashkir], [http://www.quran-ebook.com/chechen_version/index_cech.html Chechen], [http://www.quran-ebook.com/karachayevo_version/index_krc.html Karachay] and [http://www.quran-ebook.com/kyrgyzstan_version/index_kg.html Kyrgyz]. When scraping, please be polite and request data at a reasonable rate.<br />
|mentors=Ilnar, Jonathan, fotonzade<br />
|tags=scraper<br />
}}{{Taskidea<br />
|type=documentation<br />
|title=Unified documentation on Apertium visualisers<br />
|description=There are currently three prototype visualisers for the translation pairs Apertium offers: [https://github.com/jonorthwash/Apertium-Global-PairViewer Apertium Globe Viewer] and [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/pairviewer/apertium.html apertium pair viewer] and [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/family-visualizations/ language family visualisation tool]. Make a page on the Apertium wiki that showcases these three visualisers and links to further documentation on each. If documentation for any of them is available somewhere other than the Apertium wiki, then (assuming compatible licenses) integrate it into the Apertium wiki, with a link back to the original.<br />
|mentors=Jonathan<br />
|tags=wiki, visualisers<br />
}}{{Taskidea|type=research|mentors=Jonathan<br />
|title=Investigate FST backends for Swype-type input<br />
|description=Investigate what options exist for implementing an FST (of the sort used in Apertium [[spell checking]]) for auto-correction into an existing open source Swype-type input method on Android. You don't need to do any coding, but you should determine what would need to be done to add an FST backend into the software. Write up your findings on the Apertium wiki.<br />
|mentors=Jonathan<br />
|tags=spelling,android<br />
}}{{Taskidea|type=research|mentors=Jonathan<br />
|title=tesseract interface for apertium languages<br />
|description=Find out what it would take to integrate apertium or voikkospell into tesseract. Thoroughly document the available options on the wiki. <br />
|tags=spelling,ocr<br />
}}{{Taskidea<br />
|type=documentation<br />
|mentors=Jonathan, Shardul<br />
|title=Integrate documentation of the Apertium deformatter/reformatter into system architecture page<br />
|description=Integrate documentation of the Apertium deformatter and reformatter into the wiki page on the [[Apertium system architecture]].<br />
|tags=wiki, architecture<br />
}}{{Taskidea<br />
|type=documentation<br />
|mentors=Jonathan, Shardul<br />
|title=Document a full example through the Apertium pipeline<br />
|description=Come up with an example sentence that could hypothetically rely on each stage of the [[Apertium pipeline]], and show the input and output of each stage under the [[Apertium_system_architecture#Example_translation_at_each_stage|Example translation at each stage]] section on the Apertium wiki.<br />
|tags=wiki, architecture<br />
}}{{Taskidea<br />
|type=documentation<br />
|mentors=Jonathan, Shardul<br />
|title=Create a visual overview of structural transfer rules<br />
|description=Based on an [https://wikis.swarthmore.edu/ling073/Structural_transfer existing overview of Apertium structural transfer rules], come up with a visual presentation of transfer rules that shows what parts of a set of rules correspond to which changes in input and output, and also which definitions are used where in the rules. Get creative—you can do this all in any format easily viewed across platforms, especially as a webpage using modern effects like those offered by d3 or similar.<br />
|tags=wiki, architecture, visualisations, transfer<br />
}}{{Taskidea<br />
|type=documentation<br />
|mentors=Jonathan<br />
|title=Complete the Linguistic Data chart on Apertium system architecture wiki page<br />
|description=With the assistance of the Apertium community (our [[IRC]] channel) and the resources available on the Apertium wiki, fill in the remaining cells of the table in the "Linguistic data" section of the [[Apertium system architecture]] wiki page.<br />
|tags=wiki, architecture<br />
|beginner=yes<br />
}}{{Taskidea<br />
|type=research<br />
|mentors=Fran<br />
|title=Do a literature review on anaphora resolution<br />
|description=Anaphora resolution (see the [[anaphora resolution|wiki page]]) is the task of determining, for a pronoun or other referring item, what it refers to. Do a literature review and write up common methods with their success rates.<br />
|tags=anaphora, rbmt, engine<br />
|beginner=<br />
}}{{Taskidea<br />
|type=research<br />
|mentors=Fran<br />
|title=Write up grammatical tables for a grammar of a language that Apertium doesn't have an analyser for<br />
|description=Many descriptive grammars have useful tables that can be used for building morphological analysers. Unfortunately they are in Google Books or in paper and not easily processable by machine. The objective is to find a grammar of a language for which Apertium doesn't have a morphological analyser and write up the tables on a Wiki page.<br />
|tags=grammar, books, data-entry<br />
|beginner=<br />
}}{{Taskidea<br />
|type=research<br />
|mentors=Fran, Xavivars<br />
|title=Phrasebooks and frequency<br />
|description=Apertium is quite terrible in general with phrasebook style sentences in most languages. Try translating "what's up" from English to Spanish. The objective of this task is to look for phrasebook/filler type sentences/utterances in parallel corpora of film subtitles and on the internet and order them by frequency/generality. Frequency is the number of times you see the utterance; generality is how many different places you see it in. A rough sketch of this counting follows this entry. <br />
|tags=phrasebook, translation<br />
|beginner=<br />
}}<br />
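A rough sketch of the frequency/generality counting, assuming the utterances have already been extracted one per line into several plain-text files (file names and the cut-off are placeholders only):<br />
<pre>
#!/usr/bin/env python3
# Sketch: count how often each utterance occurs (frequency) and in how many
# different files it occurs (generality). Usage: python3 phrasebook_counts.py *.txt
import sys
from collections import Counter, defaultdict

freq = Counter()
sources = defaultdict(set)

for path in sys.argv[1:]:
    with open(path, encoding='utf-8') as f:
        for line in f:
            utt = line.strip().lower()
            if utt:
                freq[utt] += 1
                sources[utt].add(path)

# sort by generality first, then by raw frequency
for utt in sorted(freq, key=lambda u: (len(sources[u]), freq[u]), reverse=True)[:50]:
    print(f'{freq[utt]}\t{len(sources[utt])}\t{utt}')
</pre>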
{{Taskidea<br />
|type=research<br />
|mentors=Flammie<br />
|title=Hungarian Open Source dictionaries<br />
|description=There are currently 3+ open source Hungarian resources for morphological analysis/dictionaries. Study and document how to install these and how to get the words and their inflectional information out, and e.g. tabulate some examples of similarities and differences in word classes/tags/etc. See [[Hungarian]] for more info.<br />
|tags=hungarian<br />
|beginner=<br />
}}<br />
{{Taskidea<br />
|type=research<br />
|mentors=Vin, Jonathan, Anna<br />
|title=Create a UD-Apertium morphology mapping<br />
|description=Choose a language that has a Universal Dependencies treebank and tabulate a potential set of Apertium morph labels based on the (universal) UD morph labels. See Apertium's [[list of symbols]] and [http://universaldependencies.org/ UD]'s POS and feature tags for the labels.<br />
|tags=morphology, ud, dependencies<br />
|beginner=<br />
|multi=5<br />
}}<br />
{{Taskidea<br />
|type=research<br />
|mentors=Vin, Jonathan, Anna<br />
|title=Create an Apertium-UD morphology mapping<br />
|description=Choose a language that has an Apertium morphological analyser and adapt it to convert the morphology to UD morphology<br />
|tags=morphology, ud, dependencies<br />
|beginner=<br />
|multi=5<br />
}}<br />
{{Taskidea<br />
|type=research<br />
|mentors=Vin<br />
|title=Create a full verbal paradigm for an Indo-Aryan language<br />
|description=Choose a regular verb and create a paradigm with all possible tense/aspect/mood inflections for an Indo-Aryan language (except Hindi or Marathi). Use Masica's grammar as a reference.<br />
|tags=morphology, indo-aryan<br />
|beginner=<br />
|multi=10<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Vin<br />
|title=Create a syntactic analogy corpus for a particular POS/language.<br />
|description=Refer to the syntactic section of [https://www.aclweb.org/anthology/N/N16/N16-2002.pdf this paper]. Try to create a data set with more than 2000 * 8 = 16000 entries for a particular POS with any language, using a large corpus for frequency.<br />
|tags=morphology, embeddings<br />
|beginner=<br />
|multi=5<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Vin<br />
|title=Envision and create a quick utility for tasks like morphological lookup<br />
|description=Many tasks like morphological analysis are annoying to do by navigating to the right directory, typing out an entire pipeline etc. Write a bash script to simplify some of these procedures, taking into account the install paths and prefixes if necessary, e.g. echo "hargle" \| ~/analysers/apertium-eng/eng.automorf.bin ==> morph "hargle" eng. A rough sketch of the idea follows this entry.<br />
|tags=bash, scripting<br />
|beginner=yes<br />
|multi=10<br />
}}<br />
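The task asks for a bash script; the following Python sketch only illustrates the wrapping idea. The directory layout and the lt-proc invocation below are assumptions, not a fixed convention:<br />
<pre>
#!/usr/bin/env python3
# Sketch of a 'morph' convenience command. Usage: morph.py hargle eng
# Assumes compiled analysers live under ~/analysers/apertium-XXX/XXX.automorf.bin
# and are in lttoolbox format (so lt-proc can read them); adjust as needed.
import subprocess
import sys
from pathlib import Path

ANALYSER_DIR = Path.home() / 'analysers'

def morph(word, lang):
    binfile = ANALYSER_DIR / f'apertium-{lang}' / f'{lang}.automorf.bin'
    result = subprocess.run(['lt-proc', str(binfile)], input=word + '\n',
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == '__main__':
    word, lang = sys.argv[1], sys.argv[2]
    print(morph(word, lang))
</pre>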
{{Taskidea<br />
|type=research,code<br />
|mentors=Vin<br />
|title=Use open-source OCR to convert open-source non-text news corpora to text. Evaluate an analyser's coverage on them.<br />
|description=Many languages that have online newspapers do not use actual text to store the news but instead use images or GIFs :((( Find a newspaper for a language that lacks news text online (e.g. Marathi), check licenses, find an OCR tool and scrape a reasonably large corpus from the images if doing so would not violate CC/GPL. Evaluate the morphological analyser on it.<br />
|tags=python,morphology<br />
|beginner=<br />
}}<br />
{{Taskidea<br />
|type=research,quality<br />
|mentors=Shardul, Jonathan<br />
|tags=issues, python<br />
|title=Clean up open issues in [https://github.com/goavki/apertium-html-tools/issues html-tools], [https://github.com/goavki/phenny/issues begiak], or [https://github.com/goavki/apertium-apy/issues APy]<br />
|description=Go through issue threads for [https://github.com/goavki/apertium-html-tools/issues html-tools], [https://github.com/goavki/phenny/issues begiak], or [https://github.com/goavki/apertium-apy/issues APy], and find issues that have been solved in the code but are still open on GitHub. (The fact that they have been solved may not be evident from the comments thread alone.) Once you find such an issue, comment on the thread explaining what code/commit fixed it and how it behaves at the latest revision.<br />
|multi=15<br />
}}<br />
{{Taskidea<br />
|type=code,quality<br />
|mentors=Shardul, Jonathan<br />
|tags=tests, python, IRC<br />
|title=Get [https://github.com/goavki/phenny begiak] to build cleanly<br />
|description=Currently, [https://github.com/goavki/phenny begiak] does not build cleanly because of a number of failing tests. Find what is causing the tests to fail, and either fix the code or the tests if the code has changed its behavior. Document all your changes in the PR that you create.<br />
}}<br />
{{Taskidea<br />
|type=quality<br />
|mentors=Jonathan, Ilnar<br />
|title=Find stems in the Kazakh treebank that are not in the Kazakh analyser<br />
|description=There are quite a few analyses in the [http://svn.code.sf.net/p/apertium/svn/languages/apertium-kaz/texts/puupankki/puupankki.kaz.conllu Kazakh treebank] that don't exist in the [[apertium-kaz|Kazakh analyser]]. Find as many examples of missing stems as you can. Feel free to write a script to automate this so it's as exhaustive (and non-exhausting:) as possible; a rough sketch follows this entry. You may either add what you find to the analyser yourself, commit a list of the missing stems to apertium-kaz/dev, or send a list to your mentor so that they may do one of these.<br />
|tags=treebank, Kazakh, analyses<br />
|beginner=yes<br />
}}<br />
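A rough sketch of how such a script might look. The analyser command is left as an argument, since the exact binary and invocation depend on your checkout; the file names in the usage line are illustrative only:<br />
<pre>
#!/usr/bin/env python3
# Sketch: read surface forms from a CoNLL-U treebank on stdin, pipe them through
# an analyser command given on the command line, and print the forms that come
# back unknown (Apertium marks unknown words with an asterisk: ^form/*form$).
# Usage (illustrative): python3 missing_stems.py lt-proc kaz.automorf.bin < puupankki.kaz.conllu
import subprocess
import sys

analyser_cmd = sys.argv[1:]

forms = set()
for line in sys.stdin:
    cols = line.rstrip('\n').split('\t')
    if len(cols) == 10 and cols[0].isdigit():
        forms.add(cols[1])  # the FORM column

result = subprocess.run(analyser_cmd, input='\n'.join(sorted(forms)) + '\n',
                        capture_output=True, text=True, check=True)

for unit in result.stdout.split('$'):
    if '/*' in unit:
        print(unit.split('/')[0].strip().lstrip('^'))
</pre>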
{{Taskidea<br />
|type=quality<br />
|mentors=Jonathan, Ilnar<br />
|title=Find missing analyses in the Kazakh treebank that are not in the Kazakh analyser<br />
|description=There are quite a few analyses in the [http://svn.code.sf.net/p/apertium/svn/languages/apertium-kaz/texts/puupankki/puupankki.kaz.conllu Kazakh treebank] that don't exist in the [[apertium-kaz|Kazakh analyser]]. Find as many examples of missing analyses (for existing stems) as you can. Feel free to write a script to automate this so it's as exhaustive (and non-exhausting:) as possible. You may commit a list of the missing stems to apertium-kaz/dev or send a list to your mentor so that they may do this.<br />
|tags=treebank, Kazakh, analyses<br />
|beginner=yes<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Jonathan<br />
|title=Use apertium-init to bootstrap a new language module<br />
|description=Use [[Apertium-init]] to bootstrap a new language module that doesn't currently exist in Apertium. To see if a language is available, check [[languages]] and [[incubator]], and especially ask on IRC. Add enough stems and morphology to the module so that it analyses and generates at least 100 correct forms. Check your code into Apertium's codebase. [[Task ideas for Google Code-in/Add words from frequency list|Read more about adding stems...]]<br />
|tags=languages, bootstrap, dictionaries<br />
|beginner=yes<br />
|multi=25<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Jonathan<br />
|title=Use apertium-init to bootstrap a new language pair<br />
|description=Use [[Apertium-init]] to bootstrap a new translation pair between two languages which have monolingual modules already in Apertium. To see if a translation pair has already been made, check our [[SVN]] repository, and especially ask on IRC. Add 100 common stems to the dictionary. Check your work into Apertium's codebase.<br />
|tags=languages, bootstrap, dictionaries, translators<br />
|beginner=yes<br />
|multi=25<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Jonathan, mlforcada<br />
|title=Add a transfer rule to an existing translation pair<br />
|description=Add a transfer rule to an existing translation pair that fixes an error in translation. Document the rule on the [http://wiki.apertium.org/ Apertium wiki] by adding a [[regression testing|regression tests]] page similar to [[English_and_Portuguese/Regression_tests]] or [[Icelandic_and_English/Regression_tests]]. Check your code into Apertium's codebase. [[Task ideas for Google Code-in/Add transfer rule|Read more...]]<br />
|tags=languages, bootstrap, transfer<br />
|multi=25<br />
|dup=5<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Jonathan<br />
|title=Add stems to an existing translation pair<br />
|description=Add 1000 common stems to the dictionary of an existing translation pair. Check your work into Apertium's codebase. [[Task ideas for Google Code-in/Add words from frequency list|Read more about adding stems...]]<br />
|tags=languages, bootstrap, dictionaries, translators<br />
|multi=25<br />
|dup=5<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Jonathan<br />
|title=Write 10 lexical selection rules for an existing translation pair<br />
|description=Add 10 lexical selection rules to an existing translation pair. Check your work into Apertium's codebase. [[Task ideas for Google Code-in/Add lexical-select rules|Read more...]]<br />
|tags=languages, bootstrap, lexical selection, translators<br />
|multi=25<br />
|dup=5<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Jonathan<br />
|title=Write 10 constraint grammar rules for an existing language module<br />
|description=Add 10 constraint grammar rules to an existing language that you know. Check your work into Apertium's codebase. [[Task ideas for Google Code-in/Add constraint-grammar rules|Read more...]]<br />
|tags=languages, bootstrap, constraint grammar<br />
|multi=25<br />
|dup=5<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|mentors=Jonathan<br />
|title=Paradigm generator webpage<br />
|description=Write a standalone webpage that makes queries (through javascript) to an [[apertium-apy]] server to fill in morphological forms based on morphological tags that are hidden throughout the body of the page. For example, say you have the verb "say", and some tags like inf, past, pres.p3.sg—these forms would get filled in as "say", "said", "says".<br />
|tags=javascript, html, apy<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Anna<br />
|title=Train a new model for syntactic function labeller<br />
|description=Choose one of the languages Apertium uses in language pairs and prepare training data for the labeller from its UD treebank: replace UD tags with Apertium tags, parse the treebank, create fastText embeddings. Then train a new model on this data and evaluate its accuracy.<br />
|tags=python, UD, embeddings, machine learning<br />
|multi=5<br />
}}<br />
{{Taskidea<br />
|type=code,quality<br />
|mentors=Anna<br />
|title=Tuning the learning rate for the syntactic function labeller's RNN<br />
|description=The syntactic function labeller uses an RNN for training and for predicting the syntactic functions of words. Current models can be improved by tuning training parameters, e.g. the learning rate.<br />
|tags=python, machine learning<br />
}}<br />
</table><br />
<br />
<br />
[[Category:Google Code-in]]</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Task_ideas_for_Google_Code-in&diff=64733Task ideas for Google Code-in2017-11-15T06:47:53Z<p>Deltamachine: </p>
<hr />
<div>{{TOCD}}<br />
This is the task ideas page for [https://developers.google.com/open-source/gci/ Google Code-in]; here you can find ideas for interesting tasks that will improve your knowledge of Apertium and help you get into the world of open-source development.<br />
<br />
The people column lists people who you should get in contact with to request further information. All tasks are estimated to take an experienced developer a maximum of 2 hours, however:<br />
<br />
# '''this does not include time taken to [[Minimal installation from SVN|install]] / set up apertium (and relevant tools)'''.<br />
# this is the time the task is expected to take an experienced developer; you may find that you spend more time on it because of the learning curve. <br />
<br />
<!--Если ты не понимаешь английский язык или предпочитаешь работать над русским языком или другими языками России, смотри: [[Task ideas for Google Code-in/Russian]]--><br />
'''Categories:'''<br />
<br />
* {{sc|code}}: Tasks related to writing or refactoring code<br />
* {{sc|documentation}}: Tasks related to creating/editing documents and helping others learn more<br />
* {{sc|research}}: Tasks related to community management, outreach/marketing, or studying problems and recommending solutions<br />
* {{sc|quality}}: Tasks related to testing and ensuring code is of high quality.<br />
* {{sc|interface}}: Tasks related to user experience research or user interface design and interaction<br />
<br />
'''Clarification of "multiple task" types'''<br />
* multi = number of students who can do a given task<br />
* dup = number of times a student can do the same task<br />
<br />
You can find descriptions of some of the mentors [[List_of_Apertium_mentors | here]].<br />
<br />
==Task ideas==<br />
<table class="sortable wikitable" style="display: none"><br />
<!-- THE TASKS NEED TO BE HIDDEN FOR NOW,<br />
but feel free to remove style="display: none" to preview changes to this page.<br />
Just remember to put it back before saving<br />
JNW 2017-10-30<br />
--><br />
<tr><th>type</th><th>title</th><th>description</th><th>tags</th><th>mentors</th><th>bgnr?</th><th>multi?</th><th>duplicates</th></tr><br />
{{Taskidea<br />
|type=research<br />
|title=Document resources for a language<br />
|description=Document resources for a language that does not already have resources documented on the Apertium wiki. [[Task ideas for Google Code-in/Documentation of resources|read more...]]<br />
|tags=wiki, languages<br />
|mentors=Jonathan, Vin, Xavivars, Marc Riera<br />
|multi=40<br />
|beginner=yes<br />
}}{{Taskidea<br />
|type=research<br />
|title=Write a contrastive grammar<br />
|description=Document 6 differences between two (preferably related) languages and where they would need to be addressed in the [[Apertium pipeline]] (morph analysis, transfer, etc). Use a grammar book/resource for inspiration. Each difference should have no fewer than 3 examples. Put your work on the Apertium wiki under [[Language1_and_Language2/Contrastive_grammar]]. See [[Farsi_and_English/Pending_tests]] for an example of a contrastive grammar that a previous GCI student made.<br />
|mentors=Vin, Jonathan, Fran, mlforcada<br />
|tags=wiki, languages<br />
|beginner=yes<br />
|multi=40<br />
}}<br />
{{Taskidea|type=interface|mentors=Fran, Masha, Jonathan<br />
|tags=annotation, annotatrix<br />
|title=Nicely laid out interface for ud-annotatrix <br />
|description=Design an HTML layout for the annotatrix tool that makes best use of the space and functions nicely<br />
at different screen resolutions.<br />
}}<br />
{{Taskidea|type=interface|mentors=Fran, Masha, Jonathan, Vin<br />
|tags=annotation, annotatrix, css<br />
|title=Come up with a CSS style for annotatrix<br />
|description=<br />
}}<br />
{{Taskidea|type=code|mentors=Fran, Masha, Jonathan, Vin<br />
|tags=annotation, annotatrix, javascript, dependencies<br />
|title=SDparse to CoNLL-U converter in JavaScript<br />
|description=SDparse is a format for describing dependency trees; its entries look like relation(head, dependency). CoNLL-U is another<br />
format for describing dependency trees. Make a converter between the two formats. You will probably need to learn more about the specifics of these formats. The GitHub issue is [https://github.com/jonorthwash/ud-annotatrix/issues/88 here].<br />
}}<br />
{{Taskidea|type=quality|mentors=Fran, Masha, Vin<br />
|tags=annotation, annotatrix<br />
|title=Write a test for the format converters in annotatrix<br />
|description=<br />
|multi=yes<br />
}}<br />
{{Taskidea|type=code|mentors=Fran, Masha, Jonathan<br />
|tags=annotation, annotatrix, javascript<br />
|title=Write a function to detect invalid trees in the UD annotatrix software and advise the user about it<br />
|description=It is possible to detect invalid trees (such as those that have cycles). We would like to write a function to detect those kinds of trees and advise the user. The GitHub issue is [https://github.com/jonorthwash/ud-annotatrix/issues/96 here].<br />
}}<br />
{{Taskidea|type=documentation|mentors=Fran, Masha, Jonathan, Vin<br />
|tags=annotation, annotatrix, dependencies<br />
|title=Write a tutorial on how to use annotatrix to annotate a dependency tree<br />
|description=Give step-by-step instructions for annotating a dependency tree with Annotatrix. Make sure you include all possibilities in the app, for example tokenisation options.<br />
}}<br />
{{Taskidea|type=documentation|mentors=Fran, Masha, Vin<br />
|tags=annotation, annotatrix, video, dependencies<br />
|title=Make a video tutorial on annotating a dependency tree using the [https://github.com/jonorthwash/ud-annotatrix/ UD annotatrix software].<br />
|description=Give step-by-step instructions for annotating a dependency tree with Annotatrix. Make sure you include all possibilities available in the app, for example tokenisation options.<br />
}}<br />
{{Taskidea|type=quality|mentors=Masha|tags=xml, dictionaries, svn<br />
|title=Merge two versions of the Polish morphological dictionary<br />
|description=At some point in the past, someone deleted a lot of entries from the Polish morphological dictionary, and unfortunately we didn't notice at the time and have since added stuff to it. The objective of this task is to take the last<br />
version before the mass deletion and the current version and merge them.<br />
Getting a list of the changes: <br />
$ svn diff --old apertium-pol.pol.dix@73196 --new apertium-pol.pol.dix@73199 > changes.diff<br />
}}<br />
{{Taskidea|type=quality|mentors=fotonzade, Jonathan, Xavivars, Marc Riera, mlforcada<br />
|tags=xml, dictionaries, svn<br />
|title=Add 200 new entries to the bidix of language pair %AAA%-%BBB%<br />
|description=Our translation systems require large lexicons so as to provide production-quality coverage of any input data. This task requires the student to add 200 new words to a bidirectional dictionary.<br />
|multi=yes<br />
|bgnr=yes<br />
}}<br />
{{Taskidea|type=quality|mentors=fotonzade, Jonathan, Xavivars, Marc Riera, mlforcada<br />
|tags=xml, dictionaries, svn<br />
|title=Add 500 new entries to the bidix of language pair %AAA%-%BBB%<br />
|description=Our translation systems require large lexicons so as to provide production-quality coverage of any input data. This task requires the student to add 500 new words to a bidirectional dictionary.<br />
|multi=yes<br />
}}<br />
{{Taskidea|type=quality|mentors=fotonzade, Xavivars, Marc Riera, mlforcada<br />
|tags=disambiguation, svn<br />
|title=Disambiguate 500 tokens of text in %AAA%<br />
|description=Run some text through a morphological analyser and disambiguate the output. Contact the mentor beforehand to approve the choice of language and text.<br />
|multi=yes<br />
}}<br />
{{Taskidea|type=code|mentors=Fran, Katya|tags=morphology, languages, finite-state, fst<br />
|title=Use apertium-init to start a new morphological analyser for %AAA%<br />
|description=Use apertium-init to start a new morphological analyser (for a language we don't already <br />
have, e.g. %AAA%) and add 100 words.<br />
|multi=yes<br />
}}<br />
{{Taskidea<br />
|type=documentation<br />
|mentors=Jonathan, Flammie<br />
|title=add comments to .dix file symbol definitions<br />
|tags=dix<br />
}}<br />
{{Taskidea<br />
|type=documentation<br />
|mentors=Jonathan<br />
|title=find symbols that aren't on the list of symbols page<br />
|description=Go through symbol definitions in Apertium dictionaries in svn (.lexc and .dix format), and document any symbols you don't find on the [[List of symbols]] page. This task is fulfilled by adding at least one class of related symbols (e.g., xyz_*) or one major symbol (e.g., abc), along with notes about what it means.<br />
|tags=wiki,lexc,dix<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|title=conllu parser and searching<br />
|description=Write a script (preferably in python3) that will parse files in conllu format and perform basic searches, such as "find a node that has an nsubj relation to another node that has a noun POS" or "find all nodes with a cop label and a past feature". A rough starting-point sketch is given below this task.<br />
|tags=python, dependencies<br />
|mentors=Jonathan, Fran, Wei En, Anna<br />
}}<br />
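For anyone taking the task above, a rough, untested starting point in Python might look like the sketch below; the command-line file argument and the particular search (nsubj dependents of NOUN heads) are only illustrations, not requirements of the task.<br />
<pre><br />
import sys<br />
<br />
FIELDS = ['id', 'form', 'lemma', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc']<br />
<br />
def read_sentences(path):<br />
    """Yield each sentence as a list of token dicts parsed from a CoNLL-U file."""<br />
    sentence = []<br />
    with open(path, encoding='utf-8') as f:<br />
        for line in f:<br />
            line = line.strip()<br />
            if not line:<br />
                if sentence:<br />
                    yield sentence<br />
                sentence = []<br />
            elif not line.startswith('#'):<br />
                cols = line.split('\t')<br />
                if len(cols) == 10:<br />
                    sentence.append(dict(zip(FIELDS, cols)))<br />
    if sentence:<br />
        yield sentence<br />
<br />
def nsubj_of_noun(sentence):<br />
    """Return tokens that have an nsubj relation to a head that is a NOUN."""<br />
    by_id = {tok['id']: tok for tok in sentence}<br />
    return [tok for tok in sentence<br />
            if tok['deprel'] == 'nsubj'<br />
            and by_id.get(tok['head'], {}).get('upos') == 'NOUN']<br />
<br />
if __name__ == '__main__':<br />
    # e.g. python conllu_search.py sample.conllu<br />
    for sent in read_sentences(sys.argv[1]):<br />
        for tok in nsubj_of_noun(sent):<br />
            print(tok['form'])<br />
</pre><br />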
{{Taskidea<br />
|type=code<br />
|title=group and count possible lemmas output by guesser<br />
|mentors=Jonathan, Fran, Wei En<br />
|description=Currently a "guesser" version of Apertium transducers can output a list of possible analyses for unknown forms. Develop a new pipleine, preferably with shell scripts or python, that uses a guesser on all unknown forms in a corpus, and takes the list of all possible analyses, and output a hit count of the most common combinations of lemma and POS tag.<br />
|tags=guesser, transducers, shellscripts<br />
}}<br />
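A hedged sketch of just the counting step for the task above; it assumes analyses arrive on standard input in Apertium stream format, and the corpus handling and guesser invocation are left to the task itself.<br />
<pre><br />
# Count (lemma, first tag) combinations in guesser output read from stdin.<br />
# Assumes units in Apertium stream format, e.g. ^form/lemma1<n><sg>/lemma2<v>$<br />
import re<br />
import sys<br />
from collections import Counter<br />
<br />
counts = Counter()<br />
for line in sys.stdin:<br />
    for unit in re.findall(r'\^(.*?)\$', line):<br />
        for reading in unit.split('/')[1:]:  # drop the surface form<br />
            m = re.match(r'([^<]+)<([^>]+)>', reading)<br />
            if m:<br />
                counts[(m.group(1), m.group(2))] += 1  # (lemma, POS tag)<br />
<br />
for (lemma, pos), n in counts.most_common(20):<br />
    print(n, lemma, pos)<br />
</pre><br />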
{{Taskidea<br />
|type=code<br />
|title=vim mode/tools for annotating dependency corpora in CG3 format<br />
|mentors=Jonathan, Fran<br />
|description=includes formatting, syntax highlighting, navigation, adding/removing nodes, updating node numbers, etc.<br />
|tags=vim, dependencies, CG3<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|title=vim mode/tools for annotating dependency corpora in CoNLL-U format<br />
|mentors=Jonathan, Fran<br />
|description=includes formatting, syntax highlighting, navigation, adding/removing nodes, updating node numbers, etc.<br />
|tags=vim, dependencies, conllu<br />
}}{{Taskidea<br />
|type=quality<br />
|title=figure out one-to-many bug in the [[lsx module]]<br />
|mentors=Jonathan, Fran, Wei En, Irene<br />
|description=There is a bug in the [[lsx module]] referred to as the [http://wiki.apertium.org/wiki/Lsx_module#The_one-to-many_bug one-to-many bug] because lsx-proc will not convert one form to many given an appropriately compiled transducer. Your job is to figure out why this happens and fix it.<br />
|tags=C++, transducers, lsx<br />
}}{{Taskidea<br />
|type=code<br />
|title=add an option for reverse compiling to the [[lsx module]]<br />
|mentors=Jonathan, Fran, Wei En, Irene, Xavivars<br />
|description=this should be simple as it can just leverage the existing lttoolbox options for left-right / right-left compiling<br />
|tags=C++, transducers, lsx<br />
}}{{Taskidea<br />
|type=quality, code<br />
|title=clean up lsx-comp<br />
|mentors=Jonathan, Fran, Wei En, Irene, Xavivars<br />
|description=remove extraneous functions from lsx-comp and clean up the code<br />
|tags=C++, transducers, lsx<br />
}}{{Taskidea<br />
|type=quality, code<br />
|title=clean up lsx-proc<br />
|mentors=Jonathan, Fran, Wei En, Irene, Xavivars<br />
|description=remove extraneous functions from lsx-proc and clean up the code<br />
|tags=C++, transducers, lsx<br />
}}{{Taskidea<br />
|type=documentation<br />
|title=document usage of the lsx module<br />
|mentors= Irene<br />
|description= document which language pairs have included the lsx module in their packages, which have beta-tested the lsx module, and which are good candidates for including support for lsx. Add the results to [[Lsx_module/supported_languages | this wiki page]].<br />
|tags=C++, transducers, lsx<br />
|beginner=yes<br />
}}{{Taskidea<br />
|type=quality<br />
|title=beta testing the lsx-module<br />
|mentors=Jonathan, Fran, Wei En, Irene<br />
|description= [[Lsx_module#Creating_the_lsx-dictionary|create an lsx dictionary]] for any relevant and existing language pair that doesn't yet support it, adding 10-30 entries to it. Thoroughly test to make sure the output is as expected. Report bugs/non-supported features and add them to [[Lsx_module#Future_work| future work]]. Document your tested language pair by listing it under [[Lsx_module#Beta_testing]] and in [[Lsx_module/supported_languages | this wiki page]].<br />
|tags=C++, transducers, lsx<br />
|multi=yes<br />
|dup=yes<br />
}}{{Taskidea<br />
|type=code<br />
|title=fix an lsx bug / add an lsx feature<br />
|mentors=Jonathan, Fran, Wei En, Irene<br />
|description= if you've done the above task (beta testing the lsx-module) and discovered any bugs or unsupported features, fix them.<br />
|tags=C++, transducers, lsx<br />
|multi=yes<br />
|dup=yes<br />
}}{{Taskidea<br />
|type=code<br />
|title=script to test coverage over wikipedia corpus<br />
|mentors=Jonathan, Wei En, Shardul<br />
|description=Write a script (in python or ruby) that in one mode checks out a specified language module to a given directory, compiles it (or updates it if it already exists), and then gets the most recent nightly Wikipedia archive for that language and runs coverage over it (as much in RAM as possible). In another mode, it compiles the language pair in a docker instance that it then disposes of after successfully running coverage. Scripts exist in Apertium already for finding where a wikipedia is, extracting a wikipedia archive into a text file, and running coverage.<br />
|tags=python, ruby, wikipedia<br />
}}{{Taskidea<br />
|type=quality,code<br />
|tags=issues<br />
|title=fix any open ticket<br />
|description=Fix any open ticket in any of our issue trackers: [https://sourceforge.net/p/apertium/tickets/ main], [https://github.com/goavki/apertium-html-tools/issues html-tools], [https://github.com/goavki/phenny/issues begiak]. When you claim this task, let your mentor know which issue you plan to work on.<br />
|mentors=Jonathan, Wei En, Sushain, Shardul<br />
|multi=25<br />
|dup=10<br />
}}<br />
{{Taskidea<br />
|type=quality,code<br />
|title=make html-tools do better on Chrome's audit<br />
|tags=javascript, html, css, web<br />
|description=Currently, apertium.org and generally any [https://github.com/goavki/apertium-html-tools html-tools] installation fails lots of Chrome audit tests. As many as possible should be fixed. Ones that require substantial work should be filed as tickets and measures should be taken to prevent problems from reappearing (e.g. a test or linter rule). More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/201 #201]) and asynchronous discussion should occur there.<br />
|mentors=Jonathan, Sushain, Shardul<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=upgrade html-tools to Bootstrap 4<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] uses Bootstrap 3.x. Bootstrap 4 beta is out and we can upgrade (hopefully)! If an upgrade is not possible, you should document why it's not and ensure that it's easy to upgrade when the blockers are removed. More information may be available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/200 #200]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Shardul<br />
|bgnr=yes<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=display API endpoint on sandbox<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] has an "APy" mode where users can easily test out the API. However, it doesn't display the actual URL of the API endpoint and it would be nice to show that to the user. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/147 #147]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Shardul<br />
|bgnr=yes<br />
}}<br />
{{Taskidea<br />
|type=code,quality,research<br />
|title=set up a testing framework for html-tools<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] has no tests (sad!). This task requires researching what solutions there are for testing jQuery based web applications and putting one into place with a couple tests as a proof of concept. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/116 #116]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Shardul<br />
}}<br />
{{Taskidea<br />
|type=code,research<br />
|title=make html-tools automatically download translated files in Safari, IE, etc.<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] is capable of translating files. However, this translation does not always result in the file immediately being downloaded for the user on all browsers. It would be awesome if it did! This task requires researching what solutions there are and evaluating them against each other, and it may result in a conclusion that it just isn't possible (yet). More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/97 #97]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Unhammer, Shardul<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=make html-tools fail more gracefully when API is down<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] relies on an API endpoint to translate documents, files, etc. However, when this API is down the interface also breaks! This task requires fixing this breakage. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/207 #207]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Shardul<br />
|bgnr=yes<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=make html-tools properly align text in mixed RTL/LTR contexts<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] is capable of displaying results/allowing input for RTL languages in a LTR context (e.g. we're translating Arabic in an English website). However, this doesn't always look exactly how it should look, i.e. things are not aligned correctly. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/49 #49]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Shardul<br />
|bgnr=yes<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=de-conflict the 'make a suggestion' interface in html-tools<br />
|tags=javascript, html, css, web<br />
|description=There has been much demand for [https://github.com/goavki/apertium-html-tools html-tools] to support an interface for users making suggestions regarding e.g. incorrect translations (cf. Google Translate). An interface was designed for this purpose. However, since it has been a while since anyone touched it, the code now conflicts with the current master branch. This task requires de-conflicting this [https://github.com/goavki/apertium-html-tools/pull/74 branch] with master and providing screenshot/video(s) of the interface to show that it functions. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/74 #74]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Shardul<br />
}}<br />
{{Taskidea<br />
|type=code,quality<br />
|title=make html-tools capable of translating itself<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] supports website translation. However, if asked to translate itself, weird things happen and the interface does not properly load. This task requires figuring out the root problem and correcting the fault. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/203 #203]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Shardul<br />
|bgnr=yes<br />
}}<br />
{{Taskidea<br />
|type=interface<br />
|title=create mock-ups for variant support in html-tools<br />
|tags=javascript, html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] supports translation using language variants. However, we do not have first-class style/interface support for it. This task requires speaking with mentors/reading existing discussion to understand the problem and then produce design mockups for a solution. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/82 #82]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Fran, Shardul, Xavivars<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=refine the html-tools dictionary interface<br />
|tags=javascript, html, css, web<br />
|description=Significant progress has been made towards providing a dictionary-style interface within [https://github.com/goavki/apertium-html-tools html-tools]. This task requires refining the existing [https://github.com/goavki/apertium-html-tools/pull/184 PR] by de-conflicting it with master and resolving the interface concerns discussed [https://github.com/goavki/apertium-html-tools/pull/184#issuecomment-323597780 here]. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/105 #105]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan, Xavivars<br />
}}<br />
{{Taskidea<br />
|type=code,quality,interface<br />
|title=eliminate inline styles from html-tools<br />
|tags=html, css, web<br />
|description=Currently, [https://github.com/goavki/apertium-html-tools html-tools] has inline styles. These are not very maintainable and widely considered bad style. This task requires surveying the uses, removing all of them in a clean manner, i.e. semantically, and re-enabling the linter rule that will prevent them going forward. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/114 #114]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Shardul, Xavivars<br />
|bgnr=yes<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|title=refine the html-tools spell checking interface<br />
|tags=html, css, web<br />
|description=Spell checking is a feature that would greatly benefit [https://github.com/goavki/apertium-html-tools html-tools]. Significant effort has been put towards implementing an effective interface to provide spelling suggestions to users (this [https://github.com/goavki/apertium-html-tools/pull/176 PR] contains the current progress). This task requires solving the problems highlighted in the code review on the PR and fixing any other bugs uncovered in conversations with the mentors. More information is available in the issue tracker ([https://github.com/goavki/apertium-html-tools/issues/12 #12]) and asynchronous discussion should occur there.<br />
|mentors=Sushain, Jonathan<br />
}}<br />
{{Taskidea<br />
|type=quality<br />
|title=find an apertium module not developed in svn and import it<br />
|description=Find an Apertium module developed elsewhere (e.g., github) released under a compatible open license, and import it into [http://wiki.apertium.org/wiki/SVN Apertium's svn], being sure to attribute any authors (in an AUTHORS file) and keeping the original license. One place to look for such modules might be among the [https://wikis.swarthmore.edu/ling073/Category:Sp17_FinalProjects final projects] in a recent Computational Linguistics course.<br />
|mentors=Jonathan, Wei En<br />
|multi=10<br />
|dup=2<br />
}}{{Taskidea<br />
|type=code<br />
|title=add an incubator mode to the wikipedia scraper<br />
|tags=wikipedia, python<br />
|description=Add a mode to scrape a Wikipedia in incubator (e.g., [https://incubator.wikimedia.org/wiki/Wp/inh/Main_page the Ingush incubator]) to the [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py WikiExtractor] script<br />
|mentors=Jonathan, Wei En<br />
}}{{Taskidea<br />
|type=code,interface<br />
|title=add a translation mode interface to the geriaoueg plugin for firefox<br />
|description=Fork the [https://github.com/vigneshv59/geriaoueg-firefox geriaoueg firefox plugin] and add an interface for translation mode. It doesn't have to translate at this point, but it should communicate with the server (as it currently does) to load available languages.<br />
|tags=javascript<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=code, interface<br />
|title=add a translation mode interface to the geriaoueg plugin for chrome<br />
|description=Fork the [https://github.com/vigneshv59/geriaoueg-chrome geriaoueg chrome plugin] and add an interface for translation mode. It doesn't have to translate at this point, but it should communicate with the server (as it currently does) to load available languages.<br />
|tags=javascript<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=quality<br />
|title=update bidix included in apertium-init<br />
|description=There are some issues with the bidix currently included in [https://github.com/goavki/bootstrap/ apertium-init]: the alphabet should be empty (or non-existent?) and the "sg" tags shouldn't be in the example entries. It would also be good to have entries in two different languages, especially ones with incompatible POS sub-categories (e.g. casa{{tag|n}}{{tag|f}}). There is [https://github.com/goavki/bootstrap/issues/24 a github issue for this task].<br />
|tags=python, xml, dix<br />
|beginner=yes<br />
|mentors=Jonathan, Sushain<br />
}}{{Taskidea<br />
|type=code<br />
|title=apertium-init support for more features in hfst modules<br />
|description=Add optional support to hfst modules for enabling spelling modules, an extra twoc module for morphotactic constraints, and spellrelax. You'll want to figure out how to integrate this into the Makefile template. There is [https://github.com/goavki/bootstrap/issues/23 a github issue for this task].<br />
|tags=python, xml, Makefile<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=code, quality<br />
|title=make apertium-init README files show only relevant dictionary file<br />
|description=Currently in [https://github.com/goavki/bootstrap/ apertium-init], the README files for HFST modules show the "dix" file in the list of files, and it's likely that lttoolbox modules show "hfst" files in their README too. Check this and make it so that READMEs for these two types of monolingual modules display only the right dictionary files. There is [https://github.com/goavki/bootstrap/issues/26 a github issue for this task].<br />
|tags=python, xml, Makefile<br />
|mentors=Jonathan, Sushain<br />
}}{{Taskidea<br />
|type=code, quality<br />
|title=Write a script to add glosses to a monolingual dictionary from a bilingual dictionary<br />
|description=Write a script that matches bilingual dictionary entries (in dix format) to monolingual dictionary entries in one of the languages (in [[Apertium-specific conventions for lexc|lexc]] format) and adds glosses from the other side of the bilingual dictionary if not already there. The script should combine glosses into one when there's more than one in the bilingual dictionary. Some level of user control might be justified, from simply defaulting to a dry run unless otherwise specified, to controls for adding to versus replacing versus leaving alone existing glosses, and the like. A [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/inject-words-from-bidix-to-lexc.py prototype of this script] is available in SVN, though it's buggy and doesn't fully work—so this task may just end up being to debug it and make it work as intended. A good test case might be the [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-eng-kaz/apertium-eng-kaz.eng-kaz.dix English-Kazakh bilingual dictionary] and the [http://svn.code.sf.net/p/apertium/svn/languages/apertium-kaz/apertium-kaz.kaz.lexc Kazakh monolingual dictionary].<br />
|tags=python, lexc, dix, xml<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=code<br />
|title=Write a script to deduplicate and/or sort individual lexc lexica.<br />
|description=The lexc format is a way to specify a monolingual dictionary that gets compiled into a transducer: see [[Apertium-specific conventions for lexc]] and [[Lttoolbox and lexc#lexc]]. A single lexc file may contain quite a few individual lexicons of stems, e.g. for nouns, verbs, prepositions, etc. Write a script (in python or ruby) that reads a specified lexicon, and based on which option the user specifies, identifies and removes duplicates from the lexicon, and/or sorts the entries in the lexicon. Be sure to make a dry-run (i.e., do not actually make the changes) the default, and add different levels of debugging (such as displaying the number of duplicates versus printing each duplicate). Also consider allowing for different criteria for matching duplicates: e.g., whether or not the comment matches too. There are two scripts that parse lexc files already that would be a good point to start from: [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/lexccounter.py lexccounter.py] and [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/inject-words-from-bidix-to-lexc.py inject-words-from-bidix-to-lexc.py] (not fully functional).<br />
|tags=python, ruby, lexc<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=quality, interface<br />
|title=Interface improvement for Apertium Globe Viewer<br />
|description=The [https://github.com/jonorthwash/Apertium-Global-PairViewer Apertium Globe Viewer] is a tool to visualise the translation pairs that Apertium currently offers, similar to the [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/pairviewer/apertium.html apertium pair viewer]. Choose any [https://wikis.swarthmore.edu/ling073/User:Cpillsb1/Final_project interface or usability issue] listed in the tool's documentation in consultation with your mentor, file an [https://github.com/jonorthwash/Apertium-Global-PairViewer/issues issue], and fix it.<br />
|tags=javascript, maps<br />
|multi=3<br />
|dup=5<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=quality, code<br />
|title=Separate geographic and module data for Apertium Globe Viewer<br />
|description=The [https://github.com/jonorthwash/Apertium-Global-PairViewer Apertium Globe Viewer] is a tool to visualise the translation pairs that Apertium currently offers, similar to the [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/pairviewer/apertium.html apertium pair viewer]. Currently, geographic data for languages and pairs (latitude, longitude) is stored with the size of the dictionary, etc. Find a way to separate this data into distinct files (named sensibly), and at the same time make it possible to specify only the points for each language and not the endpoints for the arcs for language pairs (those should be trivial to generate dynamically).<br />
|tags=javascript, json<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=quality, code<br />
|title=Scraper of information needed for Apertium visualisers<br />
|description=There are currently three prototype visualisers for the translation pairs Apertium offers: the [https://github.com/jonorthwash/Apertium-Global-PairViewer Apertium Globe Viewer], the [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/pairviewer/apertium.html apertium pair viewer] and the [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/family-visualizations/ language family visualisation tool]. They all rely on data about Apertium linguistic modules, and that data has to be scraped. There are scripts that already do different pieces of this, but they are not unified: [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/wiki-tools/dixTable.py queries svn], [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/family-visualizations/overtime.rb queries svn revisions], [http://wiki.apertium.org/wiki/The_Right_Way_to_count_dix_stems counting bidix stems]. Evaluate how well these scripts work, and attempt to make them output data that is compatible with all the viewers (and/or modify the viewers to make sure they are compatible with the general output format).<br />
|tags=python, json, scrapers<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=quality<br />
|title=fix pairviewer's 2- and 3-letter code conflation problems<br />
|description=[[pairviewer]] doesn't always conflate languages that have two codes. E.g. sv/swe, nb/nob, de/deu, da/dan, uk/ukr, et/est, nl/nld, he/heb, ar/ara, eus/eu are each two separate nodes, but should instead each be collapsed into one node. Figure out why this isn't happening and fix it. Also, implement an algorithm to generate 2-to-3-letter mappings for available languages based on having the identical language name in languages.json instead of loading the huge list from codes.json; try to make this as processor- and memory-efficient as possible. <br />
|tags=javascript<br />
|mentors=Jonathan<br />
}}{{Taskidea<br />
|type=quality, code<br />
|title=split nor into nob and nno in pairviewer<br />
|description=Currently in [[pairviewer]], nor is displayed as a language separately from nob and nno. However, the nor pair actually consists of both an nob and an nno component. Figure out a way for pairviewer (or pairsOut.py / get_all_lang_pairs.py) to detect this split. So instead of having swe-nor, there would be swe-nob and swe-nno displayed (connected seamlessly with other nob-* and nno-* pairs), though the paths between the nodes would each still give information about the swe-nor pair. Implement a solution, trying to make sure it's future-proof (i.e., will work with similar sorts of things in the future).<br />
|mentors=Jonathan, Fran, Unhammer<br />
|tags=javascript<br />
}}{{Taskidea<br />
|type=quality, code<br />
|title=add support to pairviewer for regional and alternate orthographic modes<br />
|description=Currently in [[pairviewer]], there is no way to detect or display modes like zh_TW. Add support to pairsOut.py / get_all_lang_pairs.py to detect pairs containing abbreviations like this, as well as alternate orthographic modes in pairs (e.g. uzb_Latn and uzb_Cyrl). Also, figure out a way to display these nicely in the pairviewer's front-end. Get creative. I can imagine something like zh_CN and zh_TW nodes that are in some fixed relation to zho (think Mickey Mouse configuration?). Run some ideas by your mentor and implement what's decided on.<br />
|mentors=Jonathan, Fran<br />
|tags=javascript<br />
}}{{Taskidea<br />
|type=code<br />
|title=Extend visualisation of pairs involving a language in language family visualisation tool<br />
|description=The [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/family-visualizations/ language family visualisation tool] currently has a visualisation of all pairs involving a selected language. Extend this to include pairs that involve those languages, and so on, until there are no more pairs. This should result in a graph of quite a few languages, with the current language in the middle. Note that if language x is the center, and there are x-y and x-z pairs, but also a y-z pair, this should display the y-z pair with a link, not with an extra z and y node each, connected to the original y and z nodes, respectively. The best way to do this may involve some sort of filtering of the data.<br />
|mentors=Jonathan<br />
|tags=javascript<br />
}}{{Taskidea<br />
|type=code<br />
|title=Scrape Crimean Tatar Quran translation from a website<br />
|description=Bible and Quran translations often serve as a parallel corpus useful for solving NLP tasks because both texts are available in many languages. Your goal in this task is to write a program in the language of your choice which scrapes the Quran translation in the Crimean Tatar language available on the following website: http://crimean.org/islam/koran/dizen-qurtnezir/. You can adapt the scraper described on the [[Writing a scraper]] page or write your own from scratch. The output should be plain text in Tanzil format ('text with aya numbers'). You can see examples of that format on the http://tanzil.net/trans/ page. When scraping, please be polite and request data at a reasonable rate.<br />
|mentors=Ilnar, Jonathan, fotonzade<br />
|tags=scraper<br />
}}{{Taskidea<br />
|type=code<br />
|title=Scrape Quran translations from a website<br />
|description=Bible and Quran translations often serve as a parallel corpus useful for solving NLP tasks because both texts are available in many languages. Your goal in this task is to write a program in the language of your choice which scrapes the Quran translations available on the following website: http://www.quran-ebook.com/. You can adapt the scraper described on the [[Writing a scraper]] page or write your own from scratch. The output should be plain text in Tanzil format ('text with aya numbers'). You can see examples of that format on the http://tanzil.net/trans/ page. Before starting, check whether the translation is not already available on the Tanzil project's page (no need to re-scrape those, but you should use them to test the output of your program). Although the format of the translations seems to be the same and thus your program is expected to work for all of them, the translations we are most interested in are the following: [http://www.quran-ebook.com/azerbaijan_version2/1.html Azerbaijani version 2], [http://www.quran-ebook.com/bashkir_version/index_ba.html Bashkir], [http://www.quran-ebook.com/chechen_version/index_cech.html Chechen], [http://www.quran-ebook.com/karachayevo_version/index_krc.html Karachay] and [http://www.quran-ebook.com/kyrgyzstan_version/index_kg.html Kyrgyz]. When scraping, please be polite and request data at a reasonable rate.<br />
|mentors=Ilnar, Jonathan, fotonzade<br />
|tags=scraper<br />
}}{{Taskidea<br />
|type=documentation<br />
|title=Unified documentation on Apertium visualisers<br />
|description=There are currently three prototype visualisers for the translation pairs Apertium offers: [https://github.com/jonorthwash/Apertium-Global-PairViewer Apertium Globe Viewer] and [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/pairviewer/apertium.html apertium pair viewer] and [http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/family-visualizations/ language family visualisation tool]. Make a page on the Apertium wiki that showcases these three visualisers and links to further documentation on each. If documentation for any of them is available somewhere other than the Apertium wiki, then (assuming compatible licenses) integrate it into the Apertium wiki, with a link back to the original.<br />
|mentors=Jonathan<br />
|tags=wiki, visualisers<br />
}}{{Taskidea|type=research|mentors=Jonathan<br />
|title=Investigate FST backends for Swype-type input<br />
|description=Investigate what options exist for implementing an FST (of the sort used in Apertium [[spell checking]]) for auto-correction into an existing open source Swype-type input method on Android. You don't need to do any coding, but you should determine what would need to be done to add an FST backend into the software. Write up your findings on the Apertium wiki.<br />
|mentors=Jonathan<br />
|tags=spelling,android<br />
}}{{Taskidea|type=research|mentors=Jonathan<br />
|title=tesseract interface for apertium languages<br />
|description=Find out what it would take to integrate apertium or voikkospell into tesseract. Thoroughly document the available options on the wiki. <br />
|tags=spelling,ocr<br />
}}{{Taskidea<br />
|type=documentation<br />
|mentors=Jonathan, Shardul<br />
|title=Integrate documentation of the Apertium deformatter/reformatter into system architecture page<br />
|description=Integrate documentation of the Apertium deformatter and reformatter into the wiki page on the [[Apertium system architecture]].<br />
|tags=wiki, architecture<br />
}}{{Taskidea<br />
|type=documentation<br />
|mentors=Jonathan, Shardul<br />
|title=Document a full example through the Apertium pipeline<br />
|description=Come up with an example sentence that could hypothetically rely on each stage of the [[Apertium pipeline]], and show the input and output of each stage under the [[Apertium_system_architecture#Example_translation_at_each_stage|Example translation at each stage]] section on the Apertium wiki.<br />
|tags=wiki, architecture<br />
}}{{Taskidea<br />
|type=documentation<br />
|mentors=Jonathan, Shardul<br />
|title=Create a visual overview of structural transfer rules<br />
|description=Based on an [https://wikis.swarthmore.edu/ling073/Structural_transfer existing overview of Apertium structural transfer rules], come up with a visual presentation of transfer rules that shows what parts of a set of rules correspond to which changes in input and output, and also which definitions are used where in the rules. Get creative—you can do this all in any format easily viewed across platforms, especially as a webpage using modern effects like those offered by d3 or similar.<br />
|tags=wiki, architecture, visualisations, transfer<br />
}}{{Taskidea<br />
|type=documentation<br />
|mentors=Jonathan<br />
|title=Complete the Linguistic Data chart on Apertium system architecture wiki page<br />
|description=With the assistance of the Apertium community (our [[IRC]] channel) and the resources available on the Apertium wiki, fill in the remaining cells of the table in the "Linguistic data" section of the [[Apertium system architecture]] wiki page.<br />
|tags=wiki, architecture<br />
|beginner=yes<br />
}}{{Taskidea<br />
|type=research<br />
|mentors=Fran<br />
|title=Do a literature review on anaphora resolution<br />
|description=Anaphora resolution (see the [[anaphora resolution|wiki page]]) is the task of determining, for a pronoun or other item with reference, what it refers to. Do a literature review and write up common methods with their success rates.<br />
|tags=anaphora, rbmt, engine<br />
|beginner=<br />
}}{{Taskidea<br />
|type=research<br />
|mentors=Fran<br />
|title=Write up grammatical tables for a grammar of a language that Apertium doesn't have an analyser for<br />
|description=Many descriptive grammars have useful tables that can be used for building morphological analysers. Unfortunately they are in Google Books or on paper and not easily machine-processable. The objective is to find a grammar of a language for which Apertium doesn't have a morphological analyser and write up the tables on a Wiki page.<br />
|tags=grammar, books, data-entry<br />
|beginner=<br />
}}{{Taskidea<br />
|type=research<br />
|mentors=Fran, Xavivars<br />
|title=Phrasebooks and frequency<br />
|description=Apertium is quite terrible in general with phrasebook style sentences in most languages. Try translating "what's up" from English to Spanish. The objective of this task is to look for phrasebook/filler type sentences/utterances in parallel corpora of film subtitles and on the internet and order them by frequency/generality. Frequency is the number of times you see the utterance; generality is how many different places you see it in. A small counting sketch is given below this task. <br />
|tags=phrasebook, translation<br />
|beginner=<br />
}}<br />
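A small counting sketch for the frequency/generality idea in the task above; the directory of plain-text subtitle files and the one-utterance-per-line assumption are only for illustration.<br />
<pre><br />
# frequency = how many times an utterance occurs overall;<br />
# generality = in how many different files it occurs.<br />
import glob<br />
from collections import Counter, defaultdict<br />
<br />
frequency = Counter()<br />
sources = defaultdict(set)<br />
<br />
for path in glob.glob('subtitles/*.txt'):<br />
    with open(path, encoding='utf-8') as f:<br />
        for line in f:<br />
            utt = line.strip().lower()<br />
            if 0 < len(utt.split()) <= 6:  # keep short, phrasebook-like lines<br />
                frequency[utt] += 1<br />
                sources[utt].add(path)<br />
<br />
ranked = sorted(frequency, key=lambda u: (len(sources[u]), frequency[u]), reverse=True)<br />
for utt in ranked[:50]:<br />
    print(len(sources[utt]), frequency[utt], utt)<br />
</pre><br />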
{{Taskidea<br />
|type=research<br />
|mentors=Flammie<br />
|title=Hungarian Open Source dictionaries<br />
|description=There are currently 3+ open source Hungarian resources for morphological analysis/dictionaries. Study and document how to install these and how to get the words and their inflectional information out, and e.g. tabulate some examples of similarities and differences in word classes/tags/stuff. See [[Hungarian]] for more info.<br />
|tags=hungarian<br />
|beginner=<br />
}}<br />
{{Taskidea<br />
|type=research<br />
|mentors=Vin, Jonathan, Anna<br />
|title=Create a UD-Apertium morphology mapping<br />
|description=Choose a language that has a Universal Dependencies treebank and tabulate a potential set of Apertium morph labels based on the (universal) UD morph labels. See Apertium's [[list of symbols]] and [http://universaldependencies.org/ UD]'s POS and feature tags for the labels.<br />
|tags=morphology, ud, dependencies<br />
|beginner=<br />
|multi=5<br />
}}<br />
{{Taskidea<br />
|type=research<br />
|mentors=Vin, Jonathan, Anna<br />
|title=Create an Apertium-UD morphology mapping<br />
|description=Choose a language that has an Apertium morphological analyser and adapt it to convert the morphology to UD morphology<br />
|tags=morphology, ud, dependencies<br />
|beginner=<br />
|multi=5<br />
}}<br />
{{Taskidea<br />
|type=research<br />
|mentors=Vin<br />
|title=Create a full verbal paradigm for an Indo-Aryan language<br />
|description=Choose a regular verb and create a paradigm with all possible tense/aspect/mood inflections for an Indo-Aryan language (except Hindi or Marathi). Use Masica's grammar as a reference.<br />
|tags=morphology, indo-aryan<br />
|beginner=<br />
|multi=10<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Vin<br />
|title=Create a syntactic analogy corpus for a particular POS/language.<br />
|description=Refer to the syntactic section of [https://www.aclweb.org/anthology/N/N16/N16-2002.pdf this paper]. Try to create a data set with more than 2000 * 8 = 16000 entries for a particular POS in any language, using a large corpus for frequency. A small sketch of generating analogy quadruples is given below this task.<br />
|tags=morphology, embeddings<br />
|beginner=<br />
|multi=5<br />
}}<br />
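A small sketch of generating analogy quadruples from base/inflected pairs, as mentioned in the task above; the two-column input file is an assumption for illustration only.<br />
<pre><br />
# Turn a list of (base form, inflected form) pairs for one POS into<br />
# "a b c d" analogy entries, e.g. "walk walked talk talked".<br />
import itertools<br />
import sys<br />
<br />
pairs = []<br />
with open(sys.argv[1], encoding='utf-8') as f:  # one "base inflected" pair per line<br />
    for line in f:<br />
        cols = line.split()<br />
        if len(cols) == 2:<br />
            pairs.append(tuple(cols))<br />
<br />
with open('analogies.txt', 'w', encoding='utf-8') as out:<br />
    for (a, b), (c, d) in itertools.permutations(pairs, 2):<br />
        out.write('{} {} {} {}\n'.format(a, b, c, d))<br />
</pre><br />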
{{Taskidea<br />
|type=code<br />
|mentors=Vin<br />
|title=Envision and create a quick utility for tasks like morphological lookup<br />
|description=Many tasks like morphological analysis are annoying to do by navigating to the right directory, typing out an entire pipeline, etc. Write a bash script to simplify some of these procedures, taking into account the install paths and prefixes if necessary, e.g. echo "hargle" \| ~/analysers/apertium-eng/eng.automorf.bin ==> morph "hargle" eng<br />
|tags=bash, scripting<br />
|beginner=yes<br />
|multi=10<br />
}}<br />
{{Taskidea<br />
|type=research,code<br />
|mentors=Vin<br />
|title=Use open-source OCR to convert open-source non-text news corpora to text. Evaluate an analyser's coverage on them.<br />
|description=Many languages that have online newspapers do not use actual text to store the news but instead use images or GIFs :((( Find a newspaper for a language that lacks news text online (e.g. Marathi), check licenses, find an OCR tool and scrape a reasonably large corpus from the images if doing so would not violate CC/GPL. Evaluate the morphological analyser on it. A rough OCR sketch is given below this task.<br />
|tags=python,morphology<br />
|beginner=<br />
}}<br />
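A very rough OCR sketch for the task above, using the pytesseract wrapper around Tesseract; the image directory, the language code ('mar') and the availability of Marathi traineddata are all assumptions.<br />
<pre><br />
import glob<br />
from PIL import Image<br />
import pytesseract<br />
<br />
# OCR every scraped news image and append the text to a plain-text corpus.<br />
with open('corpus.txt', 'w', encoding='utf-8') as out:<br />
    for path in sorted(glob.glob('news_images/*.png')):<br />
        text = pytesseract.image_to_string(Image.open(path), lang='mar')<br />
        out.write(text + '\n')<br />
</pre><br />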
{{Taskidea<br />
|type=research,quality<br />
|mentors=Shardul, Jonathan<br />
|tags=issues, python<br />
|title=Clean up open issues in [https://github.com/goavki/apertium-html-tools/issues html-tools], [https://github.com/goavki/phenny/issues begiak], or [https://github.com/goavki/apertium-apy/issues APy]<br />
|description=Go through issue threads for [https://github.com/goavki/apertium-html-tools/issues html-tools], [https://github.com/goavki/phenny/issues begiak], or [https://github.com/goavki/apertium-apy/issues APy], and find issues that have been solved in the code but are still open on GitHub. (The fact that they have been solved may not be evident from the comments thread alone.) Once you find such an issue, comment on the thread explaining what code/commit fixed it and how it behaves at the latest revision.<br />
|multi=15<br />
}}<br />
{{Taskidea<br />
|type=code,quality<br />
|mentors=Shardul, Jonathan<br />
|tags=tests, python, IRC<br />
|title=Get [https://github.com/goavki/phenny begiak] to build cleanly<br />
|description=Currently, [https://github.com/goavki/phenny begiak] does not build cleanly because of a number of failing tests. Find what is causing the tests to fail, and either fix the code or the tests if the code has changed its behavior. Document all your changes in the PR that you create.<br />
}}<br />
{{Taskidea<br />
|type=quality<br />
|mentors=Jonathan, Ilnar<br />
|title=Find stems in the Kazakh treebank that are not in the Kazakh analyser<br />
|description=There are quite a few analyses in the [http://svn.code.sf.net/p/apertium/svn/languages/apertium-kaz/texts/puupankki/puupankki.kaz.conllu Kazakh treebank] that don't exist in the [[apertium-kaz|Kazakh analyser]]. Find as many examples of missing stems as you can. Feel free to write a script to automate this so it's as exhaustive (and non-exhausting:) as possible. You may either add what you find to the analyser yourself, commit a list of the missing stems to apertium-kaz/dev, or send a list to your mentor so that they may do one of these. A rough starting-point sketch for such a script is given below this task.<br />
|tags=treebank, Kazakh, analyses<br />
|beginner=yes<br />
}}<br />
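A rough starting-point sketch for the task above; the path to the apertium-kaz checkout and the 'kaz-morph' mode name are assumptions that should be adjusted to your setup, and the script only flags forms the analyser does not recognise at all.<br />
<pre><br />
import re<br />
import subprocess<br />
import sys<br />
<br />
# Collect surface form -> treebank lemma from the CoNLL-U treebank.<br />
forms = {}<br />
with open(sys.argv[1], encoding='utf-8') as f:  # e.g. puupankki.kaz.conllu<br />
    for line in f:<br />
        cols = line.rstrip('\n').split('\t')<br />
        if len(cols) == 10 and cols[0].isdigit():<br />
            forms[cols[1]] = cols[2]<br />
<br />
# Run all the forms through the analyser in one go.<br />
analysed = subprocess.run(<br />
    ['apertium', '-d', '/path/to/apertium-kaz', 'kaz-morph'],<br />
    input='\n'.join(forms), capture_output=True, text=True).stdout<br />
<br />
# Unknown words come back as ^form/*form$; report them with the treebank lemma.<br />
for form in sorted(set(re.findall(r'\^([^/$]+)/\*', analysed))):<br />
    print(form, forms.get(form, ''))<br />
</pre><br />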
{{Taskidea<br />
|type=quality<br />
|mentors=Jonathan, Ilnar<br />
|title=Find missing analyses in the Kazakh treebank that are not in the Kazakh analyser<br />
|description=There are quite a few analyses in the [http://svn.code.sf.net/p/apertium/svn/languages/apertium-kaz/texts/puupankki/puupankki.kaz.conllu Kazakh treebank] that don't exist in the [[apertium-kaz|Kazakh analyser]]. Find as many examples of missing analyses (for existing stems) as you can. Feel free to write a script to automate this so it's as exhaustive (and non-exhausting:) as possible. You may commit a list of the missing stems to apertium-kaz/dev or send a list to your mentor so that they may do this.<br />
|tags=treebank, Kazakh, analyses<br />
|beginner=yes<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Jonathan<br />
|title=Use apertium-init to bootstrap a new language module<br />
|description=Use [[Apertium-init]] to bootstrap a new language module that doesn't currently exist in Apertium. To see if a language is available, check [[languages]] and [[incubator]], and especially ask on IRC. Add enough stems and morphology to the module so that it analyses and generates at least 100 correct forms. Check your code into Apertium's codebase. [[Task ideas for Google Code-in/Add words from frequency list|Read more about adding stems...]]<br />
|tags=languages, bootstrap, dictionaries<br />
|beginner=yes<br />
|multi=25<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Jonathan<br />
|title=Use apertium-init to bootstrap a new language pair<br />
|description=Use [[Apertium-init]] to bootstrap a new translation pair between two languages which have monolingual modules already in Apertium. To see if a translation pair has already been made, check our [[SVN]] repository, and especially ask on IRC. Add 100 common stems to the dictionary. Check your work into Apertium's codebase.<br />
|tags=languages, bootstrap, dictionaries, translators<br />
|beginner=yes<br />
|multi=25<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Jonathan, mlforcada<br />
|title=Add a transfer rule to an existing translation pair<br />
|description=Add a transfer rule to an existing translation pair that fixes an error in translation. Document the rule on the [http://wiki.apertium.org/ Apertium wiki] by adding a [[regression testing|regression tests]] page similar to [[English_and_Portuguese/Regression_tests]] or [[Icelandic_and_English/Regression_tests]]. Check your code into Apertium's codebase. [[Task ideas for Google Code-in/Add transfer rule|Read more...]]<br />
|tags=languages, bootstrap, transfer<br />
|multi=25<br />
|dup=5<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Jonathan<br />
|title=Add stems to an existing translation pair<br />
|description=Add 1000 common stems to the dictionary of an existing translation pair. Check your work into Apertium's codebase. [[Task ideas for Google Code-in/Add words from frequency list|Read more about adding stems...]]<br />
|tags=languages, bootstrap, dictionaries, translators<br />
|multi=25<br />
|dup=5<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Jonathan<br />
|title=Write 10 lexical selection rules for an existing translation pair<br />
|description=Add 10 lexical selection rules to an existing translation pair. Check your work into Apertium's codebase. [[Task ideas for Google Code-in/Add lexical-select rules|Read more...]]<br />
|tags=languages, bootstrap, lexical selection, translators<br />
|multi=25<br />
|dup=5<br />
}}<br />
{{Taskidea<br />
|type=code<br />
|mentors=Jonathan<br />
|title=Write 10 constraint grammar rules for an existing language module<br />
|description=Add 10 constraint grammar rules to an existing language that you know. Check your work into Apertium's codebase. [[Task ideas for Google Code-in/Add constraint-grammar rules|Read more...]]<br />
|tags=languages, bootstrap, constraint grammar<br />
|multi=25<br />
|dup=5<br />
}}<br />
{{Taskidea<br />
|type=code,interface<br />
|mentors=Jonathan<br />
|title=Paradigm generator webpage<br />
|description=Write a standalone webpage that makes queries (through javascript) to an [[apertium-apy]] server to fill in morphological forms based on morphological tags that are hidden throughout the body of the page. For example, say you have the verb "say", and some tags like inf, past, pres.p3.sg—these forms would get filled in as "say", "said", "says". A sketch of the kind of APy request involved is given below this task.<br />
|tags=javascript, html, apy<br />
}}<br />
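Not the webpage itself, but a hedged sketch of the kind of request it would make to [[apertium-apy]]; the /generate endpoint, its parameters and the default port 2737 should be double-checked against your APy instance.<br />
<pre><br />
import requests<br />
<br />
def generate(server, lang, tagged):<br />
    """Ask APy to generate a surface form for an Apertium-tagged lemma."""<br />
    r = requests.get(server + '/generate', params={'lang': lang, 'q': tagged})<br />
    r.raise_for_status()<br />
    return r.json()<br />
<br />
print(generate('http://localhost:2737', 'eng', '^say<vblex><past>$'))<br />
</pre><br />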
<br />
</table><br />
<br />
<br />
[[Category:Google Code-in]]</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Task_ideas_for_Google_Code-in/Getting_started&diff=64732Task ideas for Google Code-in/Getting started2017-11-15T06:41:11Z<p>Deltamachine: </p>
<hr />
<div>{{TOCD}}<br />
<br />
This page will describe some steps you can take to get involved with the Apertium project in the Google Code-in. First of all, thanks for reading! We're very enthusiastic about getting new contributors to Apertium and to helping spread our passion for language technology.<br />
<br />
== First steps ==<br />
So, what are the first steps? <br />
<br />
* '''[[IRC|Talk to us!]]''' This is the most important step! Nothing in Apertium is too hard with the right amount of help. And we like helping, so just get in contact. The best way to contact us is on [[IRC]], and the best way to use IRC is with a client like irssi,<ref>https://irssi.org/</ref> weechat,<ref>https://weechat.org/</ref> hexchat,<ref>https://hexchat.github.io/</ref> or LimeChat<ref>https://itunes.apple.com/us/app/limechat/id414030210?mt=12</ref>. A good tip is to hang out on IRC, even if no-one is talking when you enter. People can be in different time zones, and channel activity peaks depending on the time.<br />
<br />
:Here's a list of the IRC nicks and wiki usernames of some of the mentors who are regulars on IRC:<br />
:{|class="wikitable sortable"<br />
|-<br />
! GCI name !! IRC nick !! wiki username !! Email address<br />
|-<br />
| Jonathan W || firespeaker, jonorthwash || [[User:Firespeaker|Firespeaker]] || jonathan.north.washington@gmail.com<br />
|-<br />
| Francis Tyers || spectie, spectei, spectre || [[User:Francis_Tyers|Francis Tyers]] || francis.tyers@gmail.com<br />
|-<br />
| Maria Shejanova || maryszmary || Masha || masha.shejanova@gmail.com<br />
|-<br />
| Aida Sundetova || aida27 || Aida || ?<br />
|-<br />
| Kevin Brubeck Unhammer || Unhammer || [[User:Unhammer|Unhammer]] || unhammer+apertium@mm.st<br />
|-<br />
| Vinit Ravishankar || vin-ivar || [[User:Vin-ivar|Vin-ivar]] || <br />
|-<br />
| Memduh Gökırmak || fotonzade || || memduhg@gmail.com<br />
|-<br />
| Sushain Cherivirala || sushain, sushain97 || [[User:Sushain|Sushain]] || sushain97@gmail.com<br />
|-<br />
| Xavi Ivars || xavivars || [[User:Xavivars|Xavi Ivars]] || xavi.ivars@gmail.com<br />
|-<br />
| Irene Tang || irene_ || [[User:Irene|Irene]] || irenetang14@gmail.com<br />
|-<br />
| Shardul Chiplunkar || shardulc || [[User:Shardulc|Shardulc]] || shardul.chiplunkar@gmail.com<br />
|-<br />
| Anna Kondratjeva || deltamachine || [[User:deltamachine|deltamachine]] || an-an-kondratjeva@yandex.ru<br />
|-<br />
| Vinay Singh || SilentFlame || [[User:SilentFlame|SilentFlame]] || csvinay.d@gmail.com<br />
|-<br />
| Jaipal Singh Goud || Schindler || Schindler || jpsinghgoud@gmail.com<br />
|-<br />
| Matthew Marting || m5w, m5w_ || M5w || ?<br />
|-<br />
| Tommi Pirinen || Flammie || || ffflammie@gmail.com<br />
|-<br />
| Inari Listenmaa || inariksit || [[User:Inariksit|Inariksit]] || ?<br />
|-<br />
| Marc Riera || mrieratrad || [[User:Marcriera|Marc Riera]] || marc.riera.irigoyen@gmail.com<br />
|-<br />
| Ng Wei En || wei2912 || [[User:Wei2912|Wei En]] || weien1292@gmail.com<br />
|-<br />
| Marina Kustova || edgeandpearl || [[User:Edgeandpearl|edgeandpearl]] || marinakoustova@gmail.com<br />
|}<br />
<br />
<br />
* '''[[Installation|Install Apertium]]:''' Not all tasks require Apertium to be installed, but if you're planning to work with Apertium, it's a good idea to do this early. <br />
<br />
* '''Find an interesting task:''' Browse the task list and pick something that matches your interests and skills; if you're not sure what to pick, ask on IRC.<br />
<br />
== Useful guidelines ==<br />
Things you might want to know.<br />
<br />
=== Access ===<br />
For some tasks, you may need access to Apertium resources, like the '''wiki''' or our '''[[subversion|subversion repository]]'''. Usually this is no problem—you just need to ask a mentor or an org admin (ask on IRC above).<br />
<br />
=== Tasks on github ===<br />
For tasks relating to code on github (e.g., [[begiak]], [[APy]], and [[html-tools]]), you just need to clone the relevant repository, make your changes, and submit a pull request.<br />
<br />
=== "Fix any bug" tasks ===<br />
For tasks that point you at a repository and ask you to fix any bug, you should decide on a bug and tell your mentor which one you want to work on when you claim the task. You are also encouraged to come onto IRC (see above) and ask which bug might be a good one to work on given your background—i.e., discussing it with a mentor ahead of time.<br />
<br />
=== Where is apertium code? ===<br />
Apertium code is housed in several places:<br />
* Most code, including the core tools, translation and language modules, and a number of other things, live in our '''[[subversion|svn repo]]'''. The language data is found in the following places:<br />
** [[Languages|/languages]] - where stable monolingual language packages live<br />
** [[Incubator|/incubator]] - where the initial stages of language data development take place, and sometimes stagnate<br />
** [[Nursery|/nursery]] - where translation modules that have begun to become useful/usable live<br />
** [[Staging|/staging]] - where translation modules that are nearly ready—but are still not quite ready for production-environment use—live<br />
** [[Trunk|/trunk]] - where translation modules that are fully developed and considered stable live; also here is the main code base, etc.<br />
* Many tools are also in svn, specifically [https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/ /trunk/apertium-tools].<br />
* Several tools live on '''GitHub''', including '''[[begiak]]''' (our IRC bot), '''[[APy]]''' (our web API), and '''[[html-tools]]''' (our website framework). The latter two of these are synchronised back into SVN (in [https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/ /trunk/apertium-tools]), but the main development for all three occurs on GitHub.<br />
<br />
=== Language and translation modules ===<br />
* Most translation modules are structured in the form of <tt>apertium-xxx-yyy</tt>, meaning it's a module that translates from language xxx to language yyy (and potentially the other way around).<br />
** Some older language modules use two letter abbreviations, like <tt>apertium-xx-yy</tt>, but the standard now is three-letter<br />
** Monolingual language modules are named <tt>apertium-xxx</tt>, where xxx is the ISO 639-3 code for the language<br />
** All but some older translation modules rely on monolingual language modules<br />
* Some monolingual language modules are based on [[HFST]], and some are based on [[lttoolbox]].<br />
* You can [[installation|install]] pre-compiled language and translation modules for end-user use from our package repositories, but if you'd like to work on the data, you need to download the relevant one(s) and compile it/them yourself.<br />
* You can [[installation|install]] pre-compiled core tools from our package repositories for end-user use or for developing language modules, but if you'd like to work on a particular tool, you need to download and compile it yourself.<br />
<br />
==Links==<br />
<br />
<references/><br />
<br />
[[Category:Google Code-in|Getting started]]</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Shallow_syntactic_function_labeller&diff=64249Shallow syntactic function labeller2017-08-31T14:20:25Z<p>Deltamachine: /* Bugs */</p>
<hr />
<div>This is [http://wiki.apertium.org/wiki/User:Deltamachine/proposal Google Summer of Code 2017 project]<br />
<br />
A repository for the whole project: https://github.com/deltamachine/shallow_syntactic_function_labeller<br />
<br />
A workplan and progress notes can be found here: [[Shallow syntactic function labeller/Workplan]]<br />
<br />
== What was done ==<br />
1. All needed data for North Sami, Kurmanji, Breton, Kazakh and English was prepared: there are two scripts, one of which creates datasets from UD treebanks (it is able to handle Kurmanji, Breton, Kazakh and English) and the second of which creates datasets from VISL treebanks (it is able to handle North Sami).<br />
<br />
2. A simple RNN which is able to label sentences was built. It works with fastText embeddings for every tag seen in the corpus: the embedding for a word is just the sum of the embeddings of all the word's tags (a toy illustration is given below this list).<br />
<br />
3. The labeller itself was created. Also a testpack for two language pairs was built: it contains all the needed data for sme-nob and kmr-eng, the labeller and an installation script.<br />
<br />
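A toy illustration (not the project code) of the "word embedding = sum of its tag embeddings" idea from point 2 above; the vectors here are random stand-ins rather than the trained fastText embeddings.<br />
<pre><br />
import numpy as np<br />
<br />
# Random stand-in vectors for a handful of Apertium tags.<br />
tag_vectors = {tag: np.random.rand(8) for tag in ['n', 'f', 'sg', 'v', 'past']}<br />
<br />
def word_embedding(tags):<br />
    """A word's embedding is the sum of the embeddings of its tags."""<br />
    return sum(tag_vectors[t] for t in tags)<br />
<br />
print(word_embedding(['n', 'f', 'sg']))<br />
</pre><br />
<br />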
== List of commits ==<br />
All commits are listed below:<br />
<br />
https://github.com/deltamachine/shallow_syntactic_function_labeller/commits/master<br />
<br />
== Description ==<br />
The shallow syntactic function labeller takes a string in Apertium stream format, parses it into a sequence of morphological tags and gives it to a classifier. The classifier is a simple RNN model trained on prepared datasets which were made from parsed syntax-labelled corpora (mostly UD-treebanks). The classifier analyzes the given sequence of morphological tags, gives a sequence of labels as an output and the labeller applies these labels to the original string.<br />
<br />
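As a rough illustration of the first step only (this is not the project code), pulling the tag sequence out of a line in Apertium stream format could look like the simplified sketch below; it keeps just the last reading of each unit and ignores escaping.<br />
<pre><br />
import re<br />
<br />
line = '^peyam<n><f><sg><con><def>$ ^xwe<prn><ref><mf><sp>$'<br />
for unit in re.findall(r'\^([^$]+)\$', line):<br />
    reading = unit.split('/')[-1]          # keep only the last reading<br />
    tags = re.findall(r'<([^>]+)>', reading)<br />
    print(tags)                            # e.g. ['n', 'f', 'sg', 'con', 'def']<br />
</pre><br />
<br />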
=== Labeller in the pipeline ===<br />
The labeller runs between morphological analyzer or disambiguator and pretransfer.<br />
<br />
For example, in sme-nob it runs between sme-nob-disam and sme-nob-pretransfer, like an original syntax module.<br />
<br />
<pre><br />
... | cg-proc 'sme-nob.mor.rlx.bin' | python 'sme-nob-labeller.py' | apertium-pretransfer | lt-proc -b 'sme-nob.autobil.bin' | ...<br />
</pre><br />
<br />
=== Language pairs support ===<br />
Currently the labeller works with following language pairs:<br />
* sme-nob: the labeller may fully replace the original syntax module (it doesn't have all the functionality of the original CG, but works pretty good anyway)<br />
* kmr-eng: may be tested in the pipeline, but the pair has only a few rules that look at syntax labels<br />
<br />
Also there is all the needed data for Breton, Kazakh and English (https://github.com/deltamachine/shallow_syntactic_function_labeller/tree/master/models), but at the moment br-fr, kk-tat and en-ca just don't have syntax rules, so we cannot test the labeller on them.<br />
<br />
=== Labelling performance ===<br />
The results of validating the labeller on the test set are shown below (accuracy = mean accuracy score on the test set).<br />
<br />
{|class=wikitable<br />
|-<br />
! Language !! Accuracy<br />
|-<br />
| North Sami || 81.6%<br />
|-<br />
| Kurmanji || 84%<br />
|-<br />
| Breton || 79.7%<br />
|-<br />
| Kazakh || 82.6%<br />
|-<br />
| English || 79.8%<br />
|}<br />
<br />
== Installation ==<br />
<br />
=== Prerequisites ===<br />
1. Python libraries:<br />
* DyNet (installation instructions can be found here: http://dynet.readthedocs.io/en/latest/python.html)<br />
* Streamparser (https://github.com/goavki/streamparser)<br />
<br />
2. Precompiled language pairs which support the labeller (sme-nob, kmr-eng)<br />
<br />
=== How to install a testpack ===<br />
NB: currently the testpack contains syntax modules only for sme-nob and kmr-eng.<br />
<br />
<pre><br />
git clone https://github.com/deltamachine/sfl_testpack.git<br />
cd sfl_testpack<br />
</pre><br />
<br />
The ''setup.py'' script adds all the needed files to the language pair directory and updates the mode files.<br />
<br />
'''Arguments:'''<br />
* ''work_mode:'' '''-lb''' for installing the labeller and changing the modes, '''-cg''' for reverting the changes and using the original syntax module (sme-nob.syn.rlx.bin or kmr-eng.prob) in the pipeline.<br />
* ''lang:'' '''-sme''' for installing/uninstalling the labeller only for sme-nob, '''-kmr''' - only for kmr-eng, '''-all''' - for both. <br />
<br />
For example, this command will install the labeller and add it to the pipeline for both pairs:<br />
<pre><br />
python setup.py -lb -all<br />
</pre><br />
<br />
And this command will revert the mode changes for sme-nob:<br />
<pre><br />
python setup.py -cg -sme<br />
</pre><br />
<br />
== Bugs ==<br />
1. <s>Installation script changes eng-kmr pipeline along with kmr-eng</s><br />
<br />
2. <s>Problems with tags order (syntactic label is not the last tag)</s><br />
<br />
3. <s>Words-without-a-label bug</s><br />
<pre><br />
<spectre> is it possible that some words don't get a label ?<br />
<spectre> $ echo "Barzanî di peyama xwe de behsa mijarên girîng û kirîtîk kir." | apertium -d . kmr-eng-tagger<br />
<spectre> ^Barzanî<np><ant><m><sg><obl><@dobj>$ ^di<pr><@case>$ ^peyam<n><f><sg><con><def><@nmod>$ ^xwe<prn><ref><mf><sp><@nmod:poss>$ <br />
^de<post><@case>$ ^behs<n><f><sg><con><def>$ ^mijar<n><f><pl><con><def><@nmod:poss>$ ^girîng<adj><@amod>$ ^û<cnjcoo><@cc>$ ^*kirîtîk$<br />
^kirin<vblex><tv><past><p3><sg>$^..<sent><@punct>$<br />
</pre><br />
<br />
2 and 3 seem to be fixed, but it should be checked carefully.<br />
<br />
== To do ==<br />
* Do more tests. MORE.<br />
* '''Fix bugs'''<br />
* Refactor the main code.<br />
* '''Continue improving the performance of the models.'''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Shallow_syntactic_function_labeller&diff=64248Shallow syntactic function labeller2017-08-31T14:19:38Z<p>Deltamachine: /* Bugs */</p>
<hr />
<div>This is [http://wiki.apertium.org/wiki/User:Deltamachine/proposal Google Summer of Code 2017 project]<br />
<br />
A repository for the whole project: https://github.com/deltamachine/shallow_syntactic_function_labeller<br />
<br />
A workplan and progress notes can be found here: [[Shallow syntactic function labeller/Workplan]]<br />
<br />
== What was done ==<br />
1. All the data needed for North Sami, Kurmanji, Breton, Kazakh and English was prepared: there are two scripts, one of which creates datasets from UD treebanks (it handles Kurmanji, Breton, Kazakh and English) and the other from VISL treebanks (it handles North Sami).<br />
<br />
2. A simple RNN that labels sentences was built. It works with fastText embeddings for every tag seen in the corpus: the embedding of a word is simply the sum of the embeddings of its tags (a minimal model sketch is given after this list).<br />
<br />
3. The labeller itself was created. A testpack for two language pairs was also built: it contains all the needed data for sme-nob and kmr-eng, the labeller and an installation script.<br />
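<br />
As a rough illustration of the model from point 2, here is a minimal DyNet sketch of an RNN that maps a sequence of per-word vectors (e.g. summed tag embeddings) to a sequence of label indices; the dimensions and parameter names are made up and this is not the project's actual model code:<br />
<pre><br />
import dynet as dy<br />
<br />
pc = dy.ParameterCollection()<br />
EMB_DIM, HID_DIM, N_LABELS = 100, 64, 40  # hypothetical sizes<br />
<br />
rnn = dy.SimpleRNNBuilder(1, EMB_DIM, HID_DIM, pc)<br />
W_out = pc.add_parameters((N_LABELS, HID_DIM))<br />
b_out = pc.add_parameters((N_LABELS,))<br />
<br />
def label_sequence(word_vectors):<br />
    # word_vectors: list of per-word float vectors of length EMB_DIM<br />
    dy.renew_cg()<br />
    W, b = dy.parameter(W_out), dy.parameter(b_out)<br />
    state = rnn.initial_state()<br />
    labels = []<br />
    for vec in word_vectors:<br />
        state = state.add_input(dy.inputVector(vec))<br />
        scores = W * state.output() + b<br />
        labels.append(int(scores.npvalue().argmax()))<br />
    return labels<br />
</pre><br />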
<br />
== List of commits ==<br />
All commits are listed below:<br />
<br />
https://github.com/deltamachine/shallow_syntactic_function_labeller/commits/master<br />
<br />
== Description ==<br />
The shallow syntactic function labeller takes a string in Apertium stream format, parses it into a sequence of morphological tags and gives it to a classifier. The classifier is a simple RNN model trained on datasets made from syntax-labelled corpora (mostly UD treebanks). It analyzes the given sequence of morphological tags and outputs a sequence of labels, which the labeller then applies to the original string.<br />
<br />
=== Labeller in the pipeline ===<br />
The labeller runs between the morphological analyzer (or disambiguator) and pretransfer.<br />
<br />
For example, in sme-nob it runs between sme-nob-disam and sme-nob-pretransfer, just like the original syntax module.<br />
<br />
<pre><br />
... | cg-proc 'sme-nob.mor.rlx.bin' | python 'sme-nob-labeller.py' | apertium-pretransfer | lt-proc -b 'sme-nob.autobil.bin' | ...<br />
</pre><br />
<br />
=== Language pairs support ===<br />
Currently the labeller works with the following language pairs:<br />
* sme-nob: the labeller may fully replace the original syntax module (it doesn't have all the functionality of the original CG, but works pretty well anyway)<br />
* kmr-eng: may be tested in the pipeline, but the pair has only a few rules that look at syntax labels<br />
<br />
There is also all the needed data for Breton, Kazakh and English (https://github.com/deltamachine/shallow_syntactic_function_labeller/tree/master/models), but at the moment br-fr, kk-tat and en-ca simply don't have syntax rules, so the labeller cannot be tested on them.<br />
<br />
=== Labelling performance ===<br />
The results of validating the labeller on the test set (accuracy = mean accuracy score on the test set).<br />
<br />
{|class=wikitable<br />
|-<br />
! Language !! Accuracy<br />
|-<br />
| North Sami || 81.6%<br />
|-<br />
| Kurmanji || 84%<br />
|-<br />
| Breton || 79.7%<br />
|-<br />
| Kazakh || 82.6%<br />
|-<br />
| English || 79.8%<br />
|}<br />
<br />
== Installation ==<br />
<br />
=== Prerequisites ===<br />
1. Python libraries:<br />
* DyNet (installation instructions can be found here: http://dynet.readthedocs.io/en/latest/python.html)<br />
* Streamparser (https://github.com/goavki/streamparser)<br />
<br />
2. Precompiled language pairs which support the labeller (sme-nob, kmr-eng)<br />
<br />
=== How to install a testpack ===<br />
NB: currently the testpack contains syntax modules only for sme-nob and kmr-eng.<br />
<br />
<pre><br />
git clone https://github.com/deltamachine/sfl_testpack.git<br />
cd sfl_testpack<br />
</pre><br />
<br />
The ''setup.py'' script adds all the needed files to the language pair directory and updates the mode files.<br />
<br />
'''Arguments:'''<br />
* ''work_mode:'' '''-lb''' for installing the labeller and changing the modes, '''-cg''' for reverting the changes and using the original syntax module (sme-nob.syn.rlx.bin or kmr-eng.prob) in the pipeline.<br />
* ''lang:'' '''-sme''' for installing/uninstalling the labeller only for sme-nob, '''-kmr''' - only for kmr-eng, '''-all''' - for both. <br />
<br />
For example, this command will install the labeller and add it to the pipeline for both pairs:<br />
<pre><br />
python setup.py -lb -all<br />
</pre><br />
<br />
And this command will revert the mode changes for sme-nob:<br />
<pre><br />
python setup.py -cg -sme<br />
</pre><br />
<br />
== Bugs ==<br />
1. <s>Installation script changes eng-kmr pipeline along with kmr-eng</s><br />
<br />
2. <s>Problems with tags order (syntactic label is not the last tag)</s><br />
<br />
3. <s>Words-without-a-label bug</s><br />
<pre><br />
<spectre> is it possible that some words don't get a label ?<br />
<spectre> $ echo "Barzanî di peyama xwe de behsa mijarên girîng û kirîtîk kir." | apertium -d . kmr-eng-tagger<br />
<spectre> ^Barzanî<np><ant><m><sg><obl><@dobj>$ ^di<pr><@case>$ ^peyam<n><f><sg><con><def><@nmod>$ ^xwe<prn><ref><mf><sp><@nmod:poss>$ <br />
^de<post><@case>$ ^behs<n><f><sg><con><def>$ ^mijar<n><f><pl><con><def><@nmod:poss>$ ^girîng<adj><@amod>$ ^û<cnjcoo><@cc>$ ^*kirîtîk$<br />
^kirin<vblex><tv><past><p3><sg>$^..<sent><@punct>$<br />
</pre><br />
<br />
2 and 3 seem to be fixed, but it should be checked carefully.<br />
<br />
== To do ==<br />
* Do more tests. MORE.<br />
* '''Fix bugs'''<br />
* Refactor the main code.<br />
* '''Continue improving the performance of the models.'''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Shallow_syntactic_function_labeller&diff=64124Shallow syntactic function labeller2017-08-28T13:44:05Z<p>Deltamachine: /* What was done */</p>
<hr />
<div>This is [http://wiki.apertium.org/wiki/User:Deltamachine/proposal Google Summer of Code 2017 project]<br />
<br />
A repository for the whole project: https://github.com/deltamachine/shallow_syntactic_function_labeller<br />
<br />
A workplan and progress notes can be found here: [[Shallow syntactic function labeller/Workplan]]<br />
<br />
== What was done ==<br />
1. All the data needed for North Sami, Kurmanji, Breton, Kazakh and English was prepared: there are two scripts, one of which creates datasets from UD treebanks (it handles Kurmanji, Breton, Kazakh and English) and the other from VISL treebanks (it handles North Sami); a simplified extraction sketch is given after this list.<br />
<br />
2. A simple RNN that labels sentences was built. It works with fastText embeddings for every tag seen in the corpus: the embedding of a word is simply the sum of the embeddings of its tags.<br />
<br />
3. The labeller itself was created. A testpack for two language pairs was also built: it contains all the needed data for sme-nob and kmr-eng, the labeller and an installation script.<br />
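<br />
To illustrate point 1, here is a simplified Python sketch of how training examples could be extracted from a UD treebank: the morphological description of each token becomes the input sequence and the dependency relation becomes the target label. The CoNLL-U column layout is standard, but the exact output format the real scripts produce may differ:<br />
<pre><br />
def conllu_to_examples(path):<br />
    # Yield (tag_sequences, labels) pairs, one per sentence in a CoNLL-U file.<br />
    tags, labels = [], []<br />
    with open(path, encoding='utf-8') as f:<br />
        for line in f:<br />
            line = line.strip()<br />
            if not line:                  # sentence boundary<br />
                if tags:<br />
                    yield tags, labels<br />
                tags, labels = [], []<br />
            elif not line.startswith('#'):<br />
                cols = line.split('\t')<br />
                if cols[0].isdigit():             # skip multiword tokens like "1-2"<br />
                    feats = [] if cols[5] == '_' else cols[5].split('|')<br />
                    tags.append([cols[3]] + feats)  # UPOS + morphological features<br />
                    labels.append(cols[7])          # dependency relation, e.g. 'nsubj'<br />
        if tags:<br />
            yield tags, labels<br />
</pre><br />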
<br />
== List of commits ==<br />
All commits are listed below:<br />
<br />
https://github.com/deltamachine/shallow_syntactic_function_labeller/commits/master<br />
<br />
== Description ==<br />
The shallow syntactic function labeller takes a string in Apertium stream format, parses it into a sequence of morphological tags and gives it to a classifier. The classifier is a simple RNN model trained on datasets made from syntax-labelled corpora (mostly UD treebanks). It analyzes the given sequence of morphological tags and outputs a sequence of labels, which the labeller then applies to the original string.<br />
<br />
=== Labeller in the pipeline ===<br />
The labeller runs between the morphological analyzer (or disambiguator) and pretransfer.<br />
<br />
For example, in sme-nob it runs between sme-nob-disam and sme-nob-pretransfer, just like the original syntax module.<br />
<br />
<pre><br />
... | cg-proc 'sme-nob.mor.rlx.bin' | python 'sme-nob-labeller.py' | apertium-pretransfer | lt-proc -b 'sme-nob.autobil.bin' | ...<br />
</pre><br />
<br />
=== Language pairs support ===<br />
Currently the labeller works with the following language pairs:<br />
* sme-nob: the labeller may fully replace the original syntax module (it doesn't have all the functionality of the original CG, but works pretty well anyway)<br />
* kmr-eng: may be tested in the pipeline, but the pair has only a few rules that look at syntax labels<br />
<br />
There is also all the needed data for Breton, Kazakh and English (https://github.com/deltamachine/shallow_syntactic_function_labeller/tree/master/models), but at the moment br-fr, kk-tat and en-ca simply don't have syntax rules, so the labeller cannot be tested on them.<br />
<br />
=== Labelling performance ===<br />
The results of validating the labeller on the test set (accuracy = mean accuracy score on the test set).<br />
<br />
{|class=wikitable<br />
|-<br />
! Language !! Accuracy<br />
|-<br />
| North Sami || 81.6%<br />
|-<br />
| Kurmanji || 84%<br />
|-<br />
| Breton || 79.7%<br />
|-<br />
| Kazakh || 82.6%<br />
|-<br />
| English || 79.8%<br />
|}<br />
<br />
== Installation ==<br />
<br />
=== Prerequisites ===<br />
1. Python libraries:<br />
* DyNet (installation instructions can be found here: http://dynet.readthedocs.io/en/latest/python.html)<br />
* Streamparser (https://github.com/goavki/streamparser)<br />
<br />
2. Precompiled language pairs which support the labeller (sme-nob, kmr-eng)<br />
<br />
=== How to install a testpack ===<br />
NB: currently the testpack contains syntax modules only for sme-nob and kmr-eng.<br />
<br />
<pre><br />
git clone https://github.com/deltamachine/sfl_testpack.git<br />
cd sfl_testpack<br />
</pre><br />
<br />
The ''setup.py'' script adds all the needed files to the language pair directory and updates the mode files.<br />
<br />
'''Arguments:'''<br />
* ''work_mode:'' '''-lb''' for installing the labeller and changing the modes, '''-cg''' for reverting the changes and using the original syntax module (sme-nob.syn.rlx.bin or kmr-eng.prob) in the pipeline.<br />
* ''lang:'' '''-sme''' for installing/uninstalling the labeller only for sme-nob, '''-kmr''' - only for kmr-eng, '''-all''' - for both. <br />
<br />
For example, this command will install the labeller and add it to the pipeline for both pairs:<br />
<pre><br />
python setup.py -lb -all<br />
</pre><br />
<br />
And this command will revert the mode changes for sme-nob:<br />
<pre><br />
python setup.py -cg -sme<br />
</pre><br />
<br />
== Bugs ==<br />
* <s>Installation script changes eng-kmr pipeline along with kmr-eng</s><br />
* <s>Problems with tags order (syntactic label is not the last tag)</s> - seems to be fixed, but it should be checked carefully.<br />
* Words-without-a-label bug<br />
<pre><br />
<spectre> is it possible that some words don't get a label ?<br />
<spectre> $ echo "Barzanî di peyama xwe de behsa mijarên girîng û kirîtîk kir." | apertium -d . kmr-eng-tagger<br />
<spectre> ^Barzanî<np><ant><m><sg><obl><@dobj>$ ^di<pr><@case>$ ^peyam<n><f><sg><con><def><@nmod>$ ^xwe<prn><ref><mf><sp><@nmod:poss>$ <br />
^de<post><@case>$ ^behs<n><f><sg><con><def>$ ^mijar<n><f><pl><con><def><@nmod:poss>$ ^girîng<adj><@amod>$ ^û<cnjcoo><@cc>$ ^*kirîtîk$<br />
^kirin<vblex><tv><past><p3><sg>$^..<sent><@punct>$<br />
</pre><br />
<br />
== To do ==<br />
* Do more tests. MORE.<br />
* '''Fix bugs'''<br />
* Refactor the main code.<br />
* '''Continue improving the performance of the models.'''</div>Deltamachinehttps://wiki.apertium.org/w/index.php?title=Shallow_syntactic_function_labeller&diff=64118Shallow syntactic function labeller2017-08-28T13:19:06Z<p>Deltamachine: /* To do */</p>
<hr />
<div>This is [http://wiki.apertium.org/wiki/User:Deltamachine/proposal Google Summer of Code 2017 project]<br />
<br />
A repository for the whole project: https://github.com/deltamachine/shallow_syntactic_function_labeller<br />
<br />
A workplan and progress notes can be found here: [[Shallow syntactic function labeller/Workplan]]<br />
<br />
== What was done ==<br />
1. All the data needed for North Sami, Kurmanji, Breton, Kazakh and English was prepared: there are two scripts, one of which creates datasets from UD treebanks (it handles Kurmanji, Breton, Kazakh and English) and the other from VISL treebanks (it handles North Sami).<br />
<br />
2. A simple RNN that labels sentences was built. It works with fastText embeddings for every tag seen in the corpus: the embedding of a word is simply the sum of the embeddings of its tags.<br />
<br />
3. A testpack for two language pairs was built: it contains all the needed data for sme-nob and kmr-eng, the labeller itself and an installation script.<br />
<br />
== List of commits ==<br />
All commits are listed below:<br />
<br />
https://github.com/deltamachine/shallow_syntactic_function_labeller/commits/master<br />
<br />
== Description ==<br />
The shallow syntactic function labeller takes a string in Apertium stream format, parses it into a sequence of morphological tags and gives it to a classifier. The classifier is a simple RNN model trained on datasets made from syntax-labelled corpora (mostly UD treebanks). It analyzes the given sequence of morphological tags and outputs a sequence of labels, which the labeller then applies to the original string.<br />
<br />
=== Labeller in the pipeline ===<br />
The labeller runs between the morphological analyzer (or disambiguator) and pretransfer.<br />
<br />
For example, in sme-nob it runs between sme-nob-disam and sme-nob-pretransfer, just like the original syntax module.<br />
<br />
<pre><br />
... | cg-proc 'sme-nob.mor.rlx.bin' | python 'sme-nob-labeller.py' | apertium-pretransfer | lt-proc -b 'sme-nob.autobil.bin' | ...<br />
</pre><br />
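<br />
A labeller script in such a pipeline is just a stdin-to-stdout filter; the following minimal Python sketch shows the shape of such a wrapper (the real sme-nob-labeller.py of course runs the RNN classifier instead of the placeholder below):<br />
<pre><br />
#!/usr/bin/env python3<br />
import sys<br />
<br />
def label(stream):<br />
    # Placeholder: the real script parses the stream, runs the RNN<br />
    # classifier and appends a syntactic label to every lexical unit.<br />
    return stream<br />
<br />
if __name__ == '__main__':<br />
    sys.stdout.write(label(sys.stdin.read()))<br />
</pre><br />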
<br />
=== Language pairs support ===<br />
Currently the labeller works with the following language pairs:<br />
* sme-nob: the labeller may fully replace the original syntax module (it doesn't have all the functionality of the original CG, but works pretty well anyway)<br />
* kmr-eng: may be tested in the pipeline, but the pair has only a few rules that look at syntax labels<br />
<br />
There is also all the needed data for Breton, Kazakh and English (https://github.com/deltamachine/shallow_syntactic_function_labeller/tree/master/models), but at the moment br-fr, kk-tat and en-ca simply don't have syntax rules, so the labeller cannot be tested on them.<br />
<br />
=== Labelling performance ===<br />
The results of validating the labeller on the test set (accuracy = mean accuracy score on the test set).<br />
<br />
{|class=wikitable<br />
|-<br />
! Language !! Accuracy<br />
|-<br />
| North Sami || 81.6%<br />
|-<br />
| Kurmanji || 84%<br />
|-<br />
| Breton || 79.7%<br />
|-<br />
| Kazakh || 82.6%<br />
|-<br />
| English || 79.8%<br />
|}<br />
<br />
== Installation ==<br />
<br />
=== Prerequisites ===<br />
1. Python libraries:<br />
* DyNet (installation instructions can be found here: http://dynet.readthedocs.io/en/latest/python.html)<br />
* Streamparser (https://github.com/goavki/streamparser)<br />
<br />
2. Precompiled language pairs which support the labeller (sme-nob, kmr-eng)<br />
<br />
=== How to install a testpack ===<br />
NB: currently the testpack contains syntax modules only for sme-nob and kmr-eng.<br />
<br />
<pre><br />
git clone https://github.com/deltamachine/sfl_testpack.git<br />
cd sfl_testpack<br />
</pre><br />
<br />
The ''setup.py'' script adds all the needed files to the language pair directory and updates the mode files.<br />
<br />
'''Arguments:'''<br />
* ''work_mode:'' '''-lb''' for installing the labeller and changing the modes, '''-cg''' for reverting the changes and using the original syntax module (sme-nob.syn.rlx.bin or kmr-eng.prob) in the pipeline.<br />
* ''lang:'' '''-sme''' for installing/uninstalling the labeller only for sme-nob, '''-kmr''' - only for kmr-eng, '''-all''' - for both. <br />
<br />
For example, this command will install the labeller and add it to the pipeline for both pairs:<br />
<pre><br />
python setup.py -lb -all<br />
</pre><br />
<br />
And this command will revert the mode changes for sme-nob:<br />
<pre><br />
python setup.py -cg -sme<br />
</pre><br />
<br />
== Bugs ==<br />
* <s>Installation script changes eng-kmr pipeline along with kmr-eng</s><br />
* <s>Problems with tags order (syntactic label is not the last tag)</s> - seems to be fixed, but it should be checked carefully.<br />
* Words-without-a-label bug<br />
<pre><br />
<spectre> is it possible that some words don't get a label ?<br />
<spectre> $ echo "Barzanî di peyama xwe de behsa mijarên girîng û kirîtîk kir." | apertium -d . kmr-eng-tagger<br />
<spectre> ^Barzanî<np><ant><m><sg><obl><@dobj>$ ^di<pr><@case>$ ^peyam<n><f><sg><con><def><@nmod>$ ^xwe<prn><ref><mf><sp><@nmod:poss>$ <br />
^de<post><@case>$ ^behs<n><f><sg><con><def>$ ^mijar<n><f><pl><con><def><@nmod:poss>$ ^girîng<adj><@amod>$ ^û<cnjcoo><@cc>$ ^*kirîtîk$<br />
^kirin<vblex><tv><past><p3><sg>$^..<sent><@punct>$<br />
</pre><br />
<br />
== To do ==<br />
* Do more tests. MORE.<br />
* '''Fix bugs'''<br />
* Refactor the main code.<br />
* '''Continue improving the performance of the models.'''</div>Deltamachine