https://wiki.apertium.org/w/api.php?action=feedcontributions&user=Naan+Dhaan&feedformat=atomApertium - User contributions [en]2024-03-28T21:13:20ZUser contributionsMediaWiki 1.34.1https://wiki.apertium.org/w/index.php?title=User:Naan_Dhaan/User_friendly_lexical_training&diff=73722User:Naan Dhaan/User friendly lexical training2021-08-24T16:41:28Z<p>Naan Dhaan: GSoC final week</p>
<hr />
<div>The lexical selection module selects the right sentence in context, based on lexical selection rules, from the multiple (ambiguous) sentences output by the transfer module. These rules can be written manually or inferred automatically by training on a corpus. However, the training process is rather tedious: it involves various tools (irstlm, fast-align, moses, etc.) and various scripts (extract-sentences, extract-freq-lexicon, process-tagger-output, etc.), all of which require a lot of manual configuration.<br /><br />
The goal of this project is to make this process as simple and automated as possible, with as little involvement from the user as possible. In a nutshell, there should be a single config file, and the user should be able to run the entire training with one driver script. Finally, design regression tests for the driver script so that it keeps working in the face of updates to the third-party tools. Also, train on different corpora and add lexical selection rules to languages which have few to no lexical selection rules, thereby improving the quality of translation.<br />
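<br />
For illustration, the single config file could look something like the sketch below (IS_PARALLEL, MAX_RULES and CRISPHOLD appear in the work-plan notes further down; the file format and the remaining keys are assumptions for illustration, not the project's actual layout):<br />
<pre><br />
# corpus input (illustrative keys)<br />
CORPUS = corpus.txt<br />
IS_PARALLEL = true   # false switches to non-parallel corpora training<br />
<br />
# rule filtering<br />
MAX_RULES = 10       # at most this many rules per (slword, ngram) pair<br />
CRISPHOLD = 1.5      # keep only rules with crispiness > CRISPHOLD<br />
</pre><br />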
<br />
== Work Plan ==<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! Time Period<br />
! Details<br />
! Deliverable<br />
|-<br />
| Community Bonding Period<br />
May 17-31<br />
|<br />
* helper script check_config.py to check if the configuration and tools are fine<br />
* automated test script to test check_config.py<br />
| driver script can validate that the required tools are set up<br />
|-<br />
| Community Bonding Period<br />
June 1-7<br />
|<br />
reading apertium documentation<br />
|<br />
|-<br />
| June 8-14<br />
|<br />
* added installation instructions to the README<br />
* incorporated clean_corpus into the driver script<br />
* added code for tagging<br />
* added code for aligning<br />
| driver script can now clean the corpus, tag it and generate alignments<br />
|-<br />
| June 15-21<br />
| full driver script complete (still requires testing)<br />
| driver script can now generate rules<br />
|-<br />
| June 22-28<br />
| bug fixes<br />
|<br />
|-<br />
| June 29 - July 5<br />
| some more bug fixes<br />
|<br />
|-<br />
| July 6-12<br />
| <br />
* GitHub Actions tutorials<br />
* some minor fixes, like formatting strings with f-strings, and fixes in apertium-lex-tools<br />
|<br />
|-<br />
| July 13-19<br />
|<br />
* added GitHub Actions for training and for checking the config<br />
* incorporated changes from apertium-lex-tools (commit 60a6ae9)<br />
|<br />
* lexical_selection.py takes the config file as an optional input<br />
* GitHub Actions<br />
* no need to run the Makefile in apertium-lex-tools/scripts to generate process-tagger-output, and the path of apertium-lex-tools is no longer required in the config file; installing apertium-lex-tools installs everything in the standard paths<br />
|-<br />
| July 20-26<br />
|<br />
* revisited the lexical selection scripts for rule extraction and made some fixes in them<br />
* initiated the non-parallel corpora training script (bash)<br />
|<br />
|-<br />
| July 27- Aug 2<br />
|<br />
* added check_config for non-parallel corpora training<br />
* GitHub Action for non-parallel corpora training<br />
* replaced maxent with MLE and trained on the full corpus<br />
* added functionality for fetching the top N rules<br />
* some fixes in apertium-lex-tools<br />
|<br />
* GitHub Actions for non-parallel corpora training<br />
* passing false to 'IS_PARALLEL' in the config runs non-parallel corpora training (only up to check_config for now); as of now, it takes a corpus as input for the target side<br />
* the top MAX_NGRAMS rules are selected for every (sl, ngram) pair<br />
|-<br />
| Aug 3- Aug 9<br />
|<br />
* added MAX_RULES and CRISPHOLD to filter the rules<br />
* added an option to take a binary lang model as input<br />
* fixed an IRSTLM installation bug<br />
* added non-parallel corpora training to lexical_selection_training.py<br />
* some fixes in apertium-lex-tools<br />
|<br />
* lexical_selection_training.py can do non-parallel corpora training; however, there are some issues with multitrans<br />
* no more than MAX_RULES rules are generated for every (slword, ngram) pair, each with crispiness > CRISPHOLD<br />
* non-parallel training can take both a corpus and a lang model as input<br />
|-<br />
| Aug 10- Aug 16<br />
|<br />
* fixed the multitrans bug of producing infinite ambiguous-sentence output<br />
* fixed the default freq 0.0 issue<br />
* non-parallel lexical selection training done (as per https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/User-friendly_lexical_selection_training; can be improved)<br />
* GitHub Actions fixes<br />
|<br />
* lexical_selection_training.py can do non-parallel corpora training<br />
|-<br />
| Aug 17- Aug 23<br />
|<br />
* cleaned up the scripts<br />
* moved ambiguous and wrap to common, thus reducing the code<br />
* fixed a wrapping error while extracting frequencies<br />
* multitrans wiki fixes<br />
* other fixes in apertium-lex-tools<br />
|<br />
* non-parallel corpora training time further reduced as a result of applying filters and removing redundant read_frequencies from biltrans-count-patterns-ngrams.py<br />
|}</div>Naan Dhaanhttps://wiki.apertium.org/w/index.php?title=Multitrans&diff=73715Multitrans2021-08-22T05:40:06Z<p>Naan Dhaan: </p>
<hr />
<div>'''multitrans''' is a program found in apertium-lex-tools, used as a helper when training (see [[Learning rules from parallel and non-parallel corpora]]).<br />
<br />
==Modes==<br />
<br />
===-b | --biltrans===<br />
This will output the source along with all target translations, like <code>lt-proc -b</code>.<br />
<br />
Doing just<br />
<pre><br />
multitrans -b sl-tl.autobil.bin<br />
</pre><br />
is equivalent to doing <code>lt-proc -b sl-tl.autobil.bin</code> if the input consists of just correctly formatted lexical units (lt-proc -b fails on some misformattings that multitrans ignores).<br />
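<br />
For example, with the nor-eng entry used in the examples below, the untrimmed -b output would look something like this (illustrative, inferred from the -m and -t examples on this page):<br />
<pre><br />
$ echo '^obsternasig<adj><pst><sg><ind>$' |multitrans -b nor-eng.autobil.bin<br />
^obsternasig<adj><pst><sg><ind>/obstinate<adj><pst><sg><ind>/obdurate<adj><pst><sg><ind>/stubborn<adj><pst><sg><ind>/refractory<adj><pst><sg><ind>$<br />
</pre><br />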
<br />
===-p | --tagger-output===<br />
This will output the source side only, so used alone it behaves like <code>cat</code>, but used with -t you can trim the tags to what the bidix has.<br />
<br />
So if bidix has an entry for kake&lt;n&gt;&lt;f&gt;, you'll get<br />
<pre><br />
$ echo '^kake<n><f><sg><def>$' |multitrans -p -t nno-nob.autobil.bin<br />
^kake<n><f><*>$<br />
</pre><br />
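<br />
Used without -t, the input passes through unchanged, illustrating the cat-like behaviour mentioned above:<br />
<pre><br />
$ echo '^kake<n><f><sg><def>$' |multitrans -p nno-nob.autobil.bin<br />
^kake<n><f><sg><def>$<br />
</pre><br />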
<br />
===-m | --multitrans===<br />
This will output one entry per line, each a single source/translation pair, e.g.<br />
<pre><br />
$ echo '^obsternasig<adj><pst><sg><ind>$' |multitrans -m nor-eng.autobil.bin<br />
.[][0 0].[] ^obsternasig<adj><pst><sg><ind>/obstinate<adj><pst><sg><ind>$<br />
.[][0 1].[] ^obsternasig<adj><pst><sg><ind>/obdurate<adj><pst><sg><ind>$<br />
.[][0 2].[] ^obsternasig<adj><pst><sg><ind>/stubborn<adj><pst><sg><ind>$<br />
.[][0 3].[] ^obsternasig<adj><pst><sg><ind>/refractory<adj><pst><sg><ind>$<br />
</pre><br />
<br />
==Options==<br />
===-t | --trim-lines===<br />
Trims off tags that don't appear in bidix, e.g. if bidix has an entry for kake&lt;n&gt;&lt;f&gt;:<br />
<pre><br />
$ echo '^kake<n><f><sg><def>$' |multitrans -p -t nno-nob.autobil.bin<br />
^kake<n><f><*>$<br />
</pre><br />
<br />
Can be used with -m or -b as well:<br />
<pre><br />
$ echo '^obsternasig<adj><pst><sg><ind>$' |multitrans -m -t nor-eng.autobil.bin<br />
.[][0 0].[] ^obsternasig<adj><*>/obstinate<adj><*>$<br />
.[][0 1].[] ^obsternasig<adj><*>/obdurate<adj><*>$<br />
.[][0 2].[] ^obsternasig<adj><*>/stubborn<adj><*>$<br />
.[][0 3].[] ^obsternasig<adj><*>/refractory<adj><*>$<br />
<br />
$ echo '^obsternasig<adj><pst><sg><ind>$' |multitrans -b -t nor-eng.autobil.bin<br />
^obsternasig<adj><*>/obstinate<adj><*>/obdurate<adj><*>/stubborn<adj><*>/refractory<adj><*>$<br />
</pre><br />
<br />
===-f | --filter-lines===<br />
Applies filters to the sentences. When applied, it outputs only sentences that have ambiguous words, <code>fertility < 10000</code> (the number of sentence combinations that can be formed using the ambiguous words) and <code>coverage >= 90</code> (a filter related to the number of ambiguous words).<br />
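<br />
A hypothetical invocation, combining -f with the -b mode over a tagged corpus (the file names are illustrative):<br />
<pre><br />
multitrans -b -t -f nor-eng.autobil.bin < corpus.tagged > candidates.txt<br />
</pre><br />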
<br />
===-n | --number-lines===<br />
Numbers the lines. Doesn't seem to make a difference under the -m mode.<br />
<br />
===-z | --null-flush===<br />
https://wiki.apertium.org/wiki/Null_flush<br />
<br />
[[Category:Lexical selection]]</div>Naan Dhaanhttps://wiki.apertium.org/w/index.php?title=IRSTLM&diff=73668IRSTLM2021-08-05T15:15:07Z<p>Naan Dhaan: exporting IRSTLM</p>
<hr />
<div>''IRSTLM'' is a free and open-source toolkit for building exact statistical language models using memory-mapping. The language models are compatible with those created with the closed-source SRILM Toolkit.<br />
<br />
See the homepage at https://hlt-mt.fbk.eu/technologies/irstlm<br />
<br />
<br />
==Installation==<br />
see https://github.com/irstlm-team/irstlm<br />
or<br />
<pre><br />
svn checkout svn://svn.code.sf.net/p/irstlm/code/trunk irstlm<br />
cd irstlm<br />
cmake -G "Unix Makefiles" -DCMAKE_INSTALL_PREFIX=/path/prefix<br />
make -j4<br />
make install<br />
</pre><br />
<br />
If you get a <code>stdlib.h</code> error, see https://github.com/irstlm-team/irstlm/issues/22<br />
<br />
== Make a language model ==<br />
<pre><br />
# if you specified /path/prefix previously<br />
export IRSTLM=/path/prefix<br />
# else<br />
export IRSTLM=/usr/local<br />
<br />
$IRSTLM/bin/build-lm.sh -i incorpus.txt -o out.lm.gz -t tmp/<br />
</pre><br />
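<br />
If you later need the model in IRSTLM's binary format (e.g. where a binary lang model is accepted as input), it can be converted with <code>compile-lm</code> (a minimal sketch; check the options of your version):<br />
<pre><br />
$IRSTLM/bin/compile-lm out.lm.gz out.blm<br />
</pre><br />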
<br />
==See also==<br />
* [[Moses]] (includes alternative LM system KenLM)<br />
* [[Using GIZA++]]<br />
* [[RandLM]] - a randomised LM, based on [http://en.wikipedia.org/wiki/Bloom_filter Bloom Filters]<br />
<br />
[[Category:Tools]]</div>Naan Dhaanhttps://wiki.apertium.org/w/index.php?title=User:Naan_Dhaan&diff=73473User:Naan Dhaan2021-05-31T04:57:51Z<p>Naan Dhaan: </p>
<hr />
<div>* '''IRC nick''': naan_dhaan*/vivekvelda*<br />
* '''Active time''': 3:30 GMT to 16:30 GMT. However, you can tag me any time. I will respond once I am free<br />
I work on lexical training, i.e. generating new lexical-selection rules automatically from language corpora.<br />
* https://github.com/vivekvardhanadepu/user-friendly-lexical-training<br />
* https://github.com/vivekvardhanadepu/apertium-lexical-training</div>Naan Dhaanhttps://wiki.apertium.org/w/index.php?title=Install_Apertium_core_by_compiling&diff=73435Install Apertium core by compiling2021-05-25T09:48:18Z<p>Naan Dhaan: /* Configure, build and install */</p>
<hr />
<div>Compile from source if you are planning to work on Apertium core, or if you have an operating system not covered by packaging or virtual environments (check the [[Installation | Install overview]]).<br />
<br />
<br />
==Unix (GNU/Linux, Apple, BSD)==<br />
{{TOCD}}<br />
The below instructions are built from GNU/Linux material, but should be much the same on Apple, BSD, and other Unix-like systems.<br />
<br />
If you are on an Apple system, you may want to look at [[Prerequisites for Mac OS X]] then [[Apertium on Mac OS X]] before returning here.<br />
<br />
===Install the prerequisites===<br />
Install prerequisites,<br />
<br />
*[[Prerequisites for nix|*nix (in general)]]<br />
<br />
====Notes for different systems====<br />
These notes have been made at different times. Some may be out-of-date. However, if you are having difficulties, they may contain some tips,<br />
<br />
*[[Prerequisites for Debian|Ubuntu / Debian / other Debian-based]]<br />
*[[Prerequisites for RPM|RHEL / CentOS / Fedora / OpenSUSE]]<br />
*[[Prerequisites for openSUSE|openSUSE]]<br />
*[[Prerequisites for Mac OS X|Mac OS X]]<br />
*[[Prerequisites for Arch Linux|Arch Linux]]<br />
*[[Prerequisites for Gentoo|Gentoo]]<br />
*[[Prerequisites for FreeBSD|FreeBSD]] (untested)<br />
*[[Prerequisites for Slackware|Slackware]] <br />
*[[Apertium_on_SliTaz|SliTaz]]<br />
*[[Apertium_on_Mageia|Mageia]]<br />
<br />
<br />
=== Use git to download the code ===<br />
The main code,<br />
<br />
<pre><br />
git clone https://github.com/apertium/lttoolbox.git<br />
git clone https://github.com/apertium/apertium.git<br />
git clone https://github.com/apertium/apertium-lex-tools.git<br />
</pre><br />
<br />
<br />
* ''Note: Please make sure that the directory where you put these files (i.e. where you run the <code>git</code> command) doesn't contain spaces or other special characters, as these may cause errors while compiling/linking.''<br />
<br />
===Set up environment===<br />
By default, Apertium is installed under the directory <code>/usr/local</code>, which requires root (sudo) access when installing. If that's fine with you, begin by pasting these lines into your terminal:<br />
<br />
<pre><br />
LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}<br />
export LD_LIBRARY_PATH<br />
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:${PKG_CONFIG_PATH}<br />
export PKG_CONFIG_PATH<br />
</pre><br />
<br />
You should also put those lines in your <code>~/.bashrc</code> so you don't have to paste them into every terminal you open.<br />
<br />
However, if you want Apertium installed somewhere else or don't want to install it as root, instead paste these lines into your terminal:<br />
<br />
<pre><br />
PREFIX=$HOME/local # or wherever you want apertium stuff installed<br />
LD_LIBRARY_PATH=$PREFIX/lib:${LD_LIBRARY_PATH}<br />
export LD_LIBRARY_PATH<br />
PKG_CONFIG_PATH=$PREFIX/lib/pkgconfig:${PKG_CONFIG_PATH}<br />
export PKG_CONFIG_PATH<br />
</pre><br />
<br />
You should also put those lines in your <code>~/.bashrc</code> so you don't have to paste them into every terminal you open.<br />
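<br />
For example, one way to append the <code>$PREFIX</code> variant to <code>~/.bashrc</code> is with a heredoc (just a sketch; editing the file in any editor works equally well, and the quoted 'EOF' keeps the lines unexpanded):<br />
<pre><br />
cat >> ~/.bashrc <<'EOF'<br />
PREFIX=$HOME/local # or wherever you want apertium stuff installed<br />
LD_LIBRARY_PATH=$PREFIX/lib:${LD_LIBRARY_PATH}<br />
export LD_LIBRARY_PATH<br />
PKG_CONFIG_PATH=$PREFIX/lib/pkgconfig:${PKG_CONFIG_PATH}<br />
export PKG_CONFIG_PATH<br />
EOF<br />
</pre><br />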
<br />
<br />
===Configure, build and install===<br />
The next step is to configure, build and install each of the modules you checked out, in this order:<br />
# <code>lttoolbox</code><br />
# <code>apertium</code><br />
# <code>apertium-lex-tools</code><br />
<br />
<code>cd</code> to each of the directories before you run the commands shown below.<br />
<br />
If you didn't specify <code>$PREFIX</code> above, or don't know what this means, then do this in each directory,<br />
<br />
<pre><br />
./autogen.sh<br />
make<br />
sudo make install<br />
sudo ldconfig<br />
</pre><br />
<br />
When doing this, you might encounter a "cannot find pkg-config" error or something similar. You can solve this by installing pkg-config (on Debian/Ubuntu: <code>sudo apt-get install pkg-config</code>).<br />
<br />
If you specified a <code>$PREFIX</code> (e.g. to avoid installing as root), then you need to reset the prefix on 'autogen', so do this in each directory,<br />
<br />
<pre><br />
./autogen.sh --prefix=$PREFIX<br />
make<br />
make install<br />
ldconfig -n $PREFIX/lib<br />
</pre><br />
<br />
<br />
(If you're on a Mac, you don't need to do <code>ldconfig</code>; don't worry if it fails.)<br />
<br />
To compile process-tagger-output, do this in the <code>apertium-lex-tools</code> directory,<br />
<pre><br />
cd scripts<br />
make<br />
</pre><br />
<br />
To install yasmet or IRSTLM along with apertium-lex-tools, add <code>--with-yasmet</code> or <code>--with-irstlm</code> respectively to <code>./autogen.sh</code>.<br />
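<br />
For example, to build apertium-lex-tools with both (a sketch; add <code>--prefix=$PREFIX</code> as above if you are not installing as root):<br />
<pre><br />
cd apertium-lex-tools<br />
./autogen.sh --with-yasmet --with-irstlm<br />
make<br />
sudo make install<br />
</pre><br />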
<br />
If you had any trouble, see [[Installation troubleshooting]].<br />
<br />
==Windows==<br />
If you do not want to use the [[Apertium VirtualBox]], you can compile Apertium for Windows using Cygwin; documentation for how to compile on Windows manually is at [[Apertium on Windows]].<br />
<br />
There is also a script at [[Apertium guide for Windows users]], but it is currently out-of-date and in need of updating.<br />
<br />
<br />
<br />
==Troubleshooting==<br />
Compiles go wrong. Of course they do,<br />
* Search the page [[Installation Troubleshooting]] for your error message.<br />
<br />
<br />
==Continuing==<br />
One way to test that you have something, immediately, is to try invoking a tool. Without language data you can't see a translation, but you can see the help. Try,<br />
<br />
<pre><br />
lt-proc<br />
</pre><br />
<br />
If you know you need the HFST or CG3 modules, see [[Installation of grammar libraries]]. You may also be interested in the many tips at [[Bash completion]].<br />
<br />
You may want to write a new language pair, but to test the install you can download an existing one. Follow the instructions for [[Install language data by compiling]]. Or, if your system has packaging, download a language package (but beware, a package manager may pull in an old package of Apertium core, too). <br />
<br />
<br />
[[Category:Installation]]<br />
[[Category:Documentation in English]]</div>Naan Dhaanhttps://wiki.apertium.org/w/index.php?title=Multitrans&diff=73408Multitrans2021-05-09T10:21:18Z<p>Naan Dhaan: </p>
<hr />
<div>'''multitrans''' is a program found in apertium-lex-tools, used as a helper when training (see [[Learning rules from parallel and non-parallel corpora]]).<br />
<br />
==modes==<br />
<br />
===-b===<br />
This will output the source along with all target translations, like <code>lt-proc -b</code>.<br />
<br />
Doing just<br />
<pre><br />
multitrans -b sl-tl.autobil.bin<br />
</pre><br />
is equivalent to doing <code>lt-proc -b sl-tl.autobil.bin</code> if the input consists of just correctly formatted lexical units (lt-proc -b fails on some misformattings that multitrans ignores).<br />
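<br />
Going by the <code>-m</code> example below, untrimmed <code>-b</code> output would look something like this (a sketch; the actual translations depend on your bidix):<br />
<pre><br />
$ echo '^obsternasig<adj><pst><sg><ind>$' |multitrans -b nor-eng.autobil.bin<br />
^obsternasig<adj><pst><sg><ind>/obstinate<adj><pst><sg><ind>/obdurate<adj><pst><sg><ind>/stubborn<adj><pst><sg><ind>/refractory<adj><pst><sg><ind>$<br />
</pre><br />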
<br />
===-p===<br />
This will output the source side only; used alone it behaves like <code>cat</code>, but used with <code>-t</code> you can trim the tags to what the bidix has.<br />
<br />
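Used alone, it passes lexical units through unchanged (a sketch, following the example below):<br />
<pre><br />
$ echo '^kake<n><f><sg><def>$' |multitrans -p nno-nob.autobil.bin<br />
^kake<n><f><sg><def>$<br />
</pre><br />
<br />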
So if bidix has an entry for kake&lt;n&gt;&lt;f&gt;, you'll get<br />
<pre><br />
$ echo '^kake<n><f><sg><def>$' |multitrans -p -t nno-nob.autobil.bin<br />
^kake<n><f><*>$<br />
</pre><br />
<br />
===-m===<br />
This will output one entry on each line, pairing the source with one of its translations, e.g. <br />
<pre><br />
$ echo '^obsternasig<adj><pst><sg><ind>$' |multitrans -m nor-eng.autobil.bin<br />
.[][0 0].[] ^obsternasig<adj><pst><sg><ind>/obstinate<adj><pst><sg><ind>$<br />
.[][0 1].[] ^obsternasig<adj><pst><sg><ind>/obdurate<adj><pst><sg><ind>$<br />
.[][0 2].[] ^obsternasig<adj><pst><sg><ind>/stubborn<adj><pst><sg><ind>$<br />
.[][0 3].[] ^obsternasig<adj><pst><sg><ind>/refractory<adj><pst><sg><ind>$<br />
</pre><br />
<br />
==Options==<br />
===-t===<br />
Trims off tags that don't appear in bidix, e.g. if bidix has an entry for kake&lt;n&gt;&lt;f&gt;:<br />
<pre><br />
$ echo '^kake<n><f><sg><def>$' |multitrans -p -t nno-nob.autobil.bin<br />
^kake<n><f><*>$<br />
</pre><br />
<br />
Can be used with -m or -b as well:<br />
<pre><br />
$ echo '^obsternasig<adj><pst><sg><ind>$' |multitrans -m -t nor-eng.autobil.bin<br />
.[][0 0].[] ^obsternasig<adj><*>/obstinate<adj><*>$<br />
.[][0 1].[] ^obsternasig<adj><*>/obdurate<adj><*>$<br />
.[][0 2].[] ^obsternasig<adj><*>/stubborn<adj><*>$<br />
.[][0 3].[] ^obsternasig<adj><*>/refractory<adj><*>$<br />
<br />
$ echo '^obsternasig<adj><pst><sg><ind>$' |multitrans -b -t nor-eng.autobil.bin<br />
^obsternasig<adj><*>/obstinate<adj><*>/obdurate<adj><*>/stubborn<adj><*>/refractory<adj><*>$<br />
</pre><br />
<br />
===-f===<br />
what does this do?<br />
<br />
===-n===<br />
Numbers the lines. Doesn't seem to make a difference under the -m mode.<br />
<br />
<br />
[[Category:Lexical selection]]</div>Naan Dhaanhttps://wiki.apertium.org/w/index.php?title=Multitrans&diff=73407Multitrans2021-05-09T10:19:33Z<p>Naan Dhaan: /* -b */</p>
<hr />
<div>'''multitrans''' is a program found in apertium-lex-tools, used as a helper when training (see [[Learning rules from parallel and non-parallel corpora]]).<br />
<br />
==modes==<br />
<br />
===-b===<br />
This will output the source along with all target translations, like <code>lt-proc -b</code>.<br />
<br />
Doing just<br />
<pre><br />
multitrans -b sl-tl.autobil.bin<br />
</pre><br />
is equivalent to doing <code>lt-proc -b sl-tl.autobil.bin</code> if the input consists of just correctly formatted lexical units (lt-proc -b fails on some misformattings that multitrans ignores).<br />
<br />
===-p===<br />
This will output the source side only; used alone it behaves like <code>cat</code>, but used with <code>-t</code> you can trim the tags to what the bidix has.<br />
<br />
So if bidix has an entry for kake&lt;n&gt;&lt;f&gt;, you'll get<br />
<pre><br />
$ echo '^kake<n><f><sg><def>$' |multitrans nno-nob.autobil.bin -p -t<br />
^kake<n><f><*>$<br />
</pre><br />
<br />
===-m===<br />
This will output one entry on each line, pairing the source with one of its translations, e.g. <br />
<pre><br />
$ echo '^obsternasig<adj><pst><sg><ind>$' |multitrans nor-eng.autobil.bin -m<br />
.[][0 0].[] ^obsternasig<adj><pst><sg><ind>/obstinate<adj><pst><sg><ind>$<br />
.[][0 1].[] ^obsternasig<adj><pst><sg><ind>/obdurate<adj><pst><sg><ind>$<br />
.[][0 2].[] ^obsternasig<adj><pst><sg><ind>/stubborn<adj><pst><sg><ind>$<br />
.[][0 3].[] ^obsternasig<adj><pst><sg><ind>/refractory<adj><pst><sg><ind>$<br />
</pre><br />
<br />
==Options==<br />
===-t===<br />
Trims off tags that don't appear in bidix, e.g. if bidix has an entry for kake&lt;n&gt;&lt;f&gt;:<br />
<pre><br />
$ echo '^kake<n><f><sg><def>$' |multitrans nno-nob.autobil.bin -p -t<br />
^kake<n><f><*>$<br />
</pre><br />
<br />
Can be used with -m or -b as well:<br />
<pre><br />
$ echo '^obsternasig<adj><pst><sg><ind>$' |multitrans nor-eng.autobil.bin -m -t<br />
.[][0 0].[] ^obsternasig<adj><*>/obstinate<adj><*>$<br />
.[][0 1].[] ^obsternasig<adj><*>/obdurate<adj><*>$<br />
.[][0 2].[] ^obsternasig<adj><*>/stubborn<adj><*>$<br />
.[][0 3].[] ^obsternasig<adj><*>/refractory<adj><*>$<br />
<br />
$ echo '^obsternasig<adj><pst><sg><ind>$' |multitrans nor-eng.autobil.bin -b -t<br />
^obsternasig<adj><*>/obstinate<adj><*>/obdurate<adj><*>/stubborn<adj><*>/refractory<adj><*>$<br />
</pre><br />
<br />
===-f===<br />
what does this do?<br />
<br />
===-n===<br />
Numbers the lines. Doesn't seem to make a difference under the -m mode.<br />
<br />
<br />
[[Category:Lexical selection]]</div>Naan Dhaanhttps://wiki.apertium.org/w/index.php?title=IRC/Matrix&diff=73406IRC/Matrix2021-05-09T10:00:46Z<p>Naan Dhaan: /* Remove [m] from your IRC nick */</p>
<hr />
<div>If you want persistent IRC history/logs and notifications without having to keep a computer online all the time, but you don't have a server (or don't know how to set one up), you can use the Matrix network to stay connected. <br />
<br />
[[Image:Riot-matrix-Step1join.png|thumb|400px|right|Before logging in]]<br />
<br />
<br />
'''To get started''', open<br />
<br />
https://riot.im/app/#/room/#freenode_#apertium:matrix.org<br />
<br />
<br />
It'll say "''Click here'' to join the discussion". If you don't have a Matrix account already, just do that, and enter a username, click Continue and fill out the captcha and you're in!<br />
<br />
(If you already have a Matrix account, instead click "Login" and enter your details.)<br />
<br />
[[Image:Riot-matrix-step2profit.png|thumb|400px|right|Logged in]]<br />
<br />
Once you're in, you should, as soon as possible, click the cogwheel (or open https://riot.im/app/#/settings) to '''set a password and e-mail''' for your Matrix account.<br />
<br />
<br />
The web client can send desktop notifications, at least if you use Firefox (see your [https://riot.im/app/#/settings settings] if they're not enabled), but there is also a regular [https://riot.im/desktop.html desktop version of Riot] for Mac, Windows and GNU/Linux.<br />
<br />
<br />
== Details ==<br />
<br />
'''Element''' (formerly Riot) is a client for the Matrix network. Matrix is a sort of supercharged IRC network/protocol, which "bridges" into regular IRC networks like Freenode but also provides a host of other features.<br />
<br />
Read more about the relation between Matrix and IRC at https://opensource.com/article/17/5/introducing-riot-IRC – including how to change or register your IRC nick.<br />
<br />
See https://matrix.org/ for the "backend" bits.<br />
<br />
<br />
Note that your IRC chats will go through the matrix.org server. For public, logged channels like #apertium this isn't any concern, but in one-on-one conversations there will be one more server that could technically log things (although one-on-one conversations on IRC can potentially be logged by Freenode too). Matrix is free and open source, so you can set up your own Matrix server, but that would defeat the point of this being a low-maintenance way to get a persistent IRC connection (and in that case, https://weechat.org/ is much simpler to set up).<br />
<br />
On the other hand, if you're chatting with other Matrix users, it actually becomes more secure, since Matrix provides end-to-end encryption between them.<br />
<br />
<br />
== Remove [m] from your IRC nick ==<br />
* Open a private chat with <code>@appservice-irc:matrix.org</code> and tell it <code>!nick chat.freenode.net NewNickGoesHere</code><br />
* See also: https://github.com/matrix-org/matrix-appservice-irc/blob/master/HOWTO.md#changing-nicks<br />
* See also: https://opensource.com/article/17/5/introducing-riot-IRC for information about how to change your IRC nick (and more details in general on using chatting through Matrix).<br />
<br />
== Join new channels ==<br />
Use this template (replace "ChannelName" with the name of the channel you want to join):<br />
<br />
https://riot.im/app/#/room/#freenode_#ChannelName:matrix.org<br />
<br />
[[Category:Users]] <br />
[[Category:Contact]]</div>Naan Dhaan