Difference between revisions of "User:Naan Dhaan/User friendly lexical training"

Latest revision as of 16:41, 24 August 2021

The lexical selection module uses lexical selection rules to choose, in context, the right sentence from the multiple (ambiguous) sentences output by the transfer module. These rules can be written manually or inferred automatically by training on a corpus. However, the training process is tedious: it involves tools such as irstlm, fast-align and moses, and scripts such as extract-sentences, extract-freq-lexicon and process-tagger-output, all of which require a lot of manual configuration.
The goal of this project is to make training as simple and automated as possible, with little user involvement. In a nutshell, there should be a single config file, and the user runs the entire training through a driver script. The project also aims to design regression tests for the driver script so that it keeps working when the third-party tools are updated, and to train on different corpora and add lexical selection rules to languages that have few or none, thereby improving translation quality.
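The idea of validating a single config file and the required tools before training can be sketched roughly as follows. This is a minimal illustration, not the code of check_config.py: the `CORPUS` key, the helper names and the tool names in the usage part are assumptions made for the example (only `IS_PARALLEL` is named on this page), and the real script may check different things.

```python
import shutil

# Assumed key names: IS_PARALLEL appears in the work plan, CORPUS is a
# placeholder chosen for this illustration.
REQUIRED_KEYS = ("CORPUS", "IS_PARALLEL")


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent from the config mapping."""
    return [key for key in required if key not in config]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def check(config, tools, which=shutil.which):
    """Collect human-readable problems; an empty list means all is fine."""
    problems = [f"missing config key: {k}" for k in missing_keys(config)]
    problems += [f"tool not found: {t}" for t in missing_tools(tools, which)]
    return problems


if __name__ == "__main__":
    config = {"CORPUS": "corpus.txt", "IS_PARALLEL": True}
    # Tool names here are placeholders, not the names check_config.py uses.
    for problem in check(config, ["fast_align", "moses"]):
        print(problem)
```

Passing the lookup function as a parameter (`which`) keeps the check testable without the third-party tools installed, which is the kind of property a regression test for the driver script would rely on.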

Work Plan

{|
! Time Period !! Details !! Deliverable
|-
| Community Bonding Period May 17-31
|
* helper script check_config.py to check if the configuration and tools are fine
* automated test script to test check_config.py
| driver script can validate if the required tools are set up
|-
| Community Bonding Period June 1-7
| reading apertium documentation
|
|-
| June 8-14
|
* added installation instructions in README
* incorporated clean_corpus in the driver script
* added code for tagging
* added code for aligning
| driver script can now clean the corpus, tag it and generate alignments
|-
| June 15-21
| full driver script complete (requires testing)
| driver script can now generate rules
|-
| June 22-28
| bug fixes
|
|-
| June 29 - July 5
| some more bug fixes
|
|-
| July 6-12
|
* GitHub Actions tutorials
* some minor fixes, like formatting strings with f"", and fixes in apertium-lex-tools
|
|-
| July 13-19
|
* added GitHub Actions for training and checking config
* incorporated changes of apertium-lex-tools (60a6ae9)
|
* lexical_selection.py takes the config file as an optional input
* GitHub Actions
* no need to run the Makefile in apertium-lex-tools/scripts to generate process-tagger-output, and the path of apertium-lex-tools is not required in the config file. Installing apertium-lex-tools installs everything in the standard paths
|-
| July 20-26
|
* revisited the lexical selection scripts for rule extraction and made some fixes in them
* initiated the non-parallel corpora training script (bash)
|
|-
| July 27 - Aug 2
|
* added check_config for non-parallel corpora training
* GitHub Action for non-parallel corpora training
* replaced maxent with MLE and trained on the full corpus
* added functionality for fetching the top N rules
* some fixes in apertium-lex-tools
|
* GitHub Actions for non-parallel corpora training
* passing false to 'IS_PARALLEL' in the config does non-parallel corpora training (up to check_config for now). As of now, it takes a corpus as input for the target side.
* the top MAX_NGRAMS rules are selected for every (sl, ngram) pair
|-
| Aug 3 - Aug 9
|
* added MAX_RULES and CRISPHOLD to filter the rules
* added an option for a binary language model as input
* fixed IRSTLM installation bug
* added non-parallel corpora training in lexical_selection_training.py
* some fixes in apertium-lex-tools
|
* lexical_selection_training.py can do non-parallel corpora training. However, there are some issues with multitrans
* no more than MAX_RULES rules are generated for every (slword, ngram) pair, and only those with crispiness > CRISPHOLD
* non-parallel training can take either a corpus or a language model as input
|-
| Aug 10 - Aug 16
|
* fixed the multitrans bug that output infinite ambiguous sentences
* fixed the default freq 0.0 issue
* non-parallel lexical selection training done (as per https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/User-friendly_lexical_selection_training; can be improved)
* GitHub Action fixes
|
* lexical_selection_training.py can do non-parallel corpora training
|-
| Aug 17 - Aug 23
|
* cleaned up scripts
* moved ambiguous and wrap to common, thus reducing the code
* fixed a wrapping error while extracting freq
* multitrans wiki fixes
* other fixes in apertium-lex-tools
|
* non-parallel corpora training time further reduced as a result of applying filters and removing the redundant read_frequencies from biltrans-count-patterns-ngrams.py
|}
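The MAX_RULES and CRISPHOLD filter mentioned in the Aug 3 - Aug 9 row can be illustrated with a small sketch. It assumes each candidate rule already carries a crispiness score (this page does not say how crispiness is computed), and the function name, tuple layout and sample data are made up for the example; it is not the code used in apertium-lex-tools.

```python
from collections import defaultdict


def filter_rules(candidates, max_rules, crisphold):
    """Keep, per (slword, ngram) pair, at most max_rules candidates whose
    crispiness is strictly greater than crisphold.

    candidates: iterable of (slword, ngram, translation, crispiness) tuples.
    """
    grouped = defaultdict(list)
    for slword, ngram, translation, crispiness in candidates:
        if crispiness > crisphold:
            grouped[(slword, ngram)].append((crispiness, translation))

    selected = {}
    for key, scored in grouped.items():
        # Highest crispiness first; the sort is stable, so ties keep input order.
        scored = sorted(scored, key=lambda item: item[0], reverse=True)
        selected[key] = [translation for _, translation in scored[:max_rules]]
    return selected


# Made-up candidates, purely for illustration.
candidates = [
    ("bank", "river bank", "orilla", 3.0),
    ("bank", "river bank", "banco", 1.2),
    ("bank", "river bank", "ribera", 2.5),
    ("bank", "bank account", "banco", 4.0),
]
print(filter_rules(candidates, max_rules=2, crisphold=1.5))
```

Filtering on the threshold before grouping keeps the per-pair lists small, so the later sort only touches candidates that could still become rules.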

