Difference between revisions of "User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi/progress"
(Created page with "Progress on Automatic_blank_handling ==Current task== ===lttoolbox=== * Make lt-proc correctly disperse inline bl...") |
|||
Line 1: | Line 1: | ||
Progress on [[https://wiki.apertium.org/wiki/User:Pmodi/GSOC_2020_proposal:_Hindi-Punjabi]] |
|||
Progress on [[Ideas_for_Google_Summer_of_Code/Automatic_blank_handling|Automatic_blank_handling]] |
|||
==Current task== |
==Current task(Community bonding Week 1)== |
||
===Finalising Frequency lists & Learning Persian script=== |
|||
===lttoolbox=== |
|||
* DONE - Frequency lists using WikiExtractor on latest dump. |
|||
* Make lt-proc correctly disperse inline blanks onto each lexical unit until the next <code><nowiki>[</nowiki></code> |
|||
* DONE - Read wiki on generating freq lists and monolingual building dictionaries. |
|||
* Task: Create a pull request to https://github.com/unhammer/lttoolbox/ with tests in https://github.com/unhammer/lttoolbox/tree/master/tests/lt_proc |
|||
* IN PROGRESS - Finish scraping more texts and merge results with the current freq. list. |
|||
* IN PROGRESS - Practicing basic understanding of Persian script. |
|||
==TODO(Next week - Community Bonding Week 2)== |
|||
==TODO== |
|||
* Fix bidirectional translation i.e. pan->hin |
|||
===hfst=== |
|||
* Fix transfer rules for postpositions. |
|||
* Make hfst-proc correctly disperse inline blanks onto each lexical unit until the next <code><nowiki>[</nowiki></code> |
|||
* Start work on function words addition to dictionary. |
|||
* Task: Create a pull request to https://github.com/hfst/hfst/ with tests in https://github.com/hfst/hfst/tree/master/test/tools/ |
|||
===transfer (non-chunking)=== |
|||
* Test if current transfer.cc handles non-chunking/single-stage transfer correctly, if not, fix |
|||
* Task: PR to https://github.com/unhammer/apertium/ with tests showing working transfer.cc for single-stage/non-chunking transfer, with inline vs block-level blank handling and test that rules using misnumbered/missing b-elements should not mess up formatting |
|||
===postchunk=== |
|||
(Should be done after interchunk is complete) |
|||
* Task: PR to https://github.com/unhammer/apertium/ including tests showing working postchunk blank handling – test that rules using misnumbered/missing b-elements should not mess up formatting |
|||
===etc=== |
|||
* Ensure all other modules are fine with the new format for inline blanks (e.g. cg-proc) |
|||
* Work on other deformatters (mediawiki? latex?) |
|||
==Done== |
|||
(Some of these are from coding challenges) |
|||
===deformatting prototypes=== |
|||
# Make the HTML format handler <code>apertium-deshtml</code> turn "<i>foo <b>bar</b></i>" into "<nowiki>[{<i>}]foo [{<i><b>}]bar</nowiki>" |
|||
#* Code at https://github.com/SilentFlame/apertium/blob/master/challenge-1.cpp |
|||
#* make <code>apertium-deshtml</code> *not* wrap tags like {{tag|p}} or {{tag|div}} in <code>{}</code> (ie. only for inline tags) |
|||
#* Code at https://github.com/SilentFlame/apertium/blob/master/challenge-2.cpp |
|||
===pretransfer=== |
|||
* Make pretransfer disperse tags when splitting lexical units https://github.com/unhammer/apertium/commit/39bd7d9fa45c64586d3a9b0f1a7df89e7d007c1a , code cleanup: |
|||
#* Fork https://github.com/unhammer/apertium and check out and compile the <code>master</code> branch |
|||
#* then in a different folder, do <code>git clone -b blank-handling https://github.com/junaidiiith/apertium</code> |
|||
#* from junaidiiith/blank-handling, copy over the changes that were made there to apertium_pretransfer.cc into your fork of unhammer/apertium, along with the pretransfer tests |
|||
#* ensure tests pass |
|||
#* PR at https://github.com/unhammer/apertium/pull/4 |
|||
===transfer (chunker)=== |
|||
# Fix a memory bug |
|||
#* uncommenting apertium/transfer.cc:1259 <code> // delete[] format;</code> in the blank handling branch leads to a double-free – find out why and ensure we're correctly releasing memory |
|||
#* Install valgrind from your package manager or http://valgrind.org/, then compile your program with -O0 -g3, then run <code>valgrind -v --leak-check=full apertium/apertium-transfer</code> and read the output |
|||
===Interchunk=== |
|||
Interchunk needs to ignore the "pos" argument to b elements, and output each superblank exactly once, preferably where the rule has a b element (if there are not enough b's, output the rest at the end of the rule). |
|||
Interchunk shouldn't have to deal with wordblanks, since we can't look inside chunks when in interchunk. |
|||
# Apply changes to transfer.cc to interchunk.cc |
|||
#* Check <code>git clone -b blank-handling https://github.com/unhammer/apertium</code> |
|||
#* Apply the <code>git diff 4c7c4f8f1b..2025182991</code> from transfer.cc to interchunk.cc |
|||
#* Try to make it compile and run – report things that didn't seem to have a 1-1 correspondence |
|||
#* Write tests for interchunk, like those for transfer at https://github.com/unhammer/apertium/tree/blank-handling/tests |
|||
===Deformatters=== |
|||
* Complete prototype HTML deformatters |
|||
** Current prototype code at https://github.com/junaidiiith/apertium / https://github.com/junaidiiith/Apertium_Code and https://github.com/SilentFlame/apertium/ |
|||
** Task: Create a clean pull request to https://github.com/unhammer with HTML deformatter and reformatter, including tests |
|||
===Reformatters=== |
|||
* Make reformat turn inline-blanks back into real tags |
|||
** <nowiki>[{<i>}]foo [{<i><b>}]bar</nowiki> should become <i>foo</i> <i><b>bar</b></i> |
|||
** prototypes exist for this in https://github.com/junaidiiith/apertium / https://github.com/junaidiiith/Apertium_Code |
Revision as of 20:46, 9 May 2020
Progress on https://wiki.apertium.org/wiki/User:Pmodi/GSOC_2020_proposal:_Hindi-Punjabi
Current task (Community Bonding Week 1)
Finalising Frequency lists & Learning Persian script
- DONE - Frequency lists using WikiExtractor on latest dump.
- DONE - Read the wiki pages on generating frequency lists and building monolingual dictionaries.
- IN PROGRESS - Finish scraping more texts and merge results with the current freq. list.
- IN PROGRESS - Building a basic understanding of the Persian script.
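The frequency-list items above (count tokens in extracted text, then merge newly scraped counts into the current list) could be sketched roughly as follows. This is only an illustration, not the script actually used for this project: the function names, the token pattern (Devanagari and Gurmukhi code-point ranges, skipping the danda marks) and the "count token" output layout are all assumptions.

```python
import re
from collections import Counter

# Assumption: a token is a run of Devanagari (U+0900-U+097F, minus the
# danda marks U+0964/U+0965) or Gurmukhi (U+0A00-U+0A7F) characters.
# A plain \w+ is avoided because it splits words at vowel signs.
TOKEN_RE = re.compile(r"[\u0900-\u0963\u0966-\u097F\u0A00-\u0A7F]+")


def count_tokens(lines):
    """Count tokens over an iterable of text lines (e.g. extracted wiki text)."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line))
    return counts


def merge_counts(current, new):
    """Merge newly scraped counts into the current frequency list."""
    return current + new


def format_freq_list(counts):
    """Return 'count token' lines, most frequent first, ties sorted by token."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{count} {token}" for token, count in ordered]


if __name__ == "__main__":
    wiki_counts = count_tokens(["घर पानी। घर"])
    scraped_counts = count_tokens(["घर ਘਰ"])
    for row in format_freq_list(merge_counts(wiki_counts, scraped_counts)):
        print(row)
```

Keeping the counts in a `Counter` means that merging a new batch of scraped text is a single addition, so the list can be regenerated whenever more text is collected.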
TODO (Next week: Community Bonding Week 2)
- Fix bidirectional translation, i.e. the pan->hin direction.
- Fix transfer rules for postpositions.
- Start adding function words to the dictionary.