User:Naan Dhaan/User friendly lexical training
The lexical selection module selects the right sentence for the context, based on lexical selection rules, from the multiple (ambiguous) sentences output by the transfer module. These rules can be written manually or inferred automatically by training on a corpus. However, the training process is tedious: it involves several tools such as irstlm, fast-align and moses, and several scripts such as extract-sentences, extract-freq-lexicon and process-tagger-output, all of which require a lot of manual configuration.
The goal of this project is to make training as simple and automated as possible, with little involvement from the user. In a nutshell, there should be a single config file, and the user runs the entire training with one driver script. Regression tests are then designed around the driver script so that it keeps working in the face of updates to the third-party tools. Finally, training is run on different corpora to add lexical selection rules to language pairs that have few or no such rules, thereby improving the quality of translation.
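To make the intended workflow concrete, here is a minimal sketch, in Python, of how such a driver script might validate its environment before training: it reads the single config file and checks that the required third-party tools are on PATH. The config key (CORPUS), the tool list and the file layout are illustrative assumptions, not the actual lexical_selection_training.py interface.

<syntaxhighlight lang="python">
import shutil
import sys

import yaml  # requires PyYAML

# Hypothetical list of third-party tools the driver depends on.
REQUIRED_TOOLS = ["fast_align", "lt-proc", "multitrans"]


def missing_tools(tools):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def load_config(path):
    """Read the single training config file (key names here are illustrative)."""
    with open(path) as f:
        return yaml.safe_load(f)


if __name__ == "__main__":
    config = load_config(sys.argv[1])  # e.g. python driver.py config.yml
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        sys.exit("missing tools: " + ", ".join(missing))
    print("all tools found; corpus:", config.get("CORPUS", "<unset>"))
</syntaxhighlight>

Failing early on missing tools keeps the user from discovering a broken setup halfway through a long training run.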
== Work Plan ==
{|
! Time Period !! Details !! Deliverable
|-
| Community Bonding Period<br>May 17-31
|
|
* driver script can validate if the required tools are set up
|-
| Community Bonding Period<br>June 1-7
|
* reading apertium documentation
|
|-
| June 8-14
|
|
* driver script can now clean the corpus, tag it and generate alignments
|-
| June 15-21
|
* full driver script complete (requires testing)
|
* driver script can now generate rules
|-
| June 22-28
|
* bug fixes
|
|-
| June 29 - July 5
|
* some more bug fixes
|
|-
| July 6-12
|
|
|-
| July 13-19
|
|
|-
| July 20-26
|
|
|-
| July 27 - Aug 2
|
|
|-
| Aug 3 - Aug 9
|
|
* no more than MAX_RULES rules are generated for every (slword, ngram) pair, and only with crispiness > CRISPHOLD (sketched after this table)
* non-parallel training can take both corpus and lang model as input
|-
| Aug 10 - Aug 16
|
* multitrans infinite ambiguous sentences output bug fixed
* default freq 0.0 issue fixed
* non-parallel lexical selection training done (as per https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/User-friendly_lexical_selection_training, can be improved)
* Github action fixes
|
* lexical_selection_training.py can do non-parallel corpora training
|-
| Aug 17 - Aug 23
|
* cleaning scripts
* moving ambiguous and wrap to common, thus reducing the code
* wrapping error while extracting freq fixed
* multitrans wiki fixes
* other fixes in apertium-lex-tools
|
* non-parallel corpora training time further reduced as a result of applying filters and removing redundant read_frequencies from biltrans-count-patterns-ngrams.py
|}
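The rule-generation constraint in the Aug 3 - Aug 9 deliverable (at most MAX_RULES rules per (slword, ngram) pair, each with crispiness above CRISPHOLD) can be pictured with the following minimal sketch. The tuple layout, the default values and the function name are assumptions for illustration, not the actual apertium-lex-tools internals.

<syntaxhighlight lang="python">
from collections import defaultdict

MAX_RULES = 5    # illustrative defaults; the real values come from the config
CRISPHOLD = 1.5


def filter_rules(candidate_rules):
    """candidate_rules: iterable of (slword, ngram, tlword, crispiness) tuples.

    Keep only rules whose crispiness exceeds CRISPHOLD, and at most
    MAX_RULES of them per (slword, ngram) pair.
    """
    grouped = defaultdict(list)
    for slword, ngram, tlword, crispiness in candidate_rules:
        if crispiness > CRISPHOLD:
            grouped[(slword, ngram)].append((crispiness, tlword))

    selected = []
    for (slword, ngram), scored in grouped.items():
        # keep the most crisp (most reliable) rules for this pair
        for crispiness, tlword in sorted(scored, reverse=True)[:MAX_RULES]:
            selected.append((slword, ngram, tlword, crispiness))
    return selected
</syntaxhighlight>

Sorting by crispiness before truncating means that, whenever the cap is hit, the most reliable candidate rules are the ones that survive.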