Difference between revisions of "User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi/progress"
(Created page with "Progress on Automatic_blank_handling ==Current task== ===lttoolbox=== * Make lt-proc correctly disperse inline bl...") |
|||
Line 1: | Line 1: | ||
+ | Progress on [[https://wiki.apertium.org/wiki/User:Pmodi/GSOC_2020_proposal:_Hindi-Punjabi]] |
||
− | Progress on [[Ideas_for_Google_Summer_of_Code/Automatic_blank_handling|Automatic_blank_handling]] |
||
− | ==Current task== |
+ | ==Current task(Community bonding Week 1)== |
+ | ===Finalising Frequency lists & Learning Persian script=== |
||
− | ===lttoolbox=== |
||
+ | * DONE - Frequency lists using WikiExtractor on latest dump. |
||
− | * Make lt-proc correctly disperse inline blanks onto each lexical unit until the next <code><nowiki>[</nowiki></code> |
||
+ | * DONE - Read wiki on generating freq lists and monolingual building dictionaries. |
||
− | * Task: Create a pull request to https://github.com/unhammer/lttoolbox/ with tests in https://github.com/unhammer/lttoolbox/tree/master/tests/lt_proc |
||
+ | * IN PROGRESS - Finish scraping more texts and merge results with the current freq. list. |
||
+ | * IN PROGRESS - Practicing basic understanding of Persian script. |
||
+ | ==TODO(Next week - Community Bonding Week 2)== |
||
− | ==TODO== |
||
+ | * Fix bidirectional translation i.e. pan->hin |
||
− | ===hfst=== |
||
+ | * Fix transfer rules for postpositions. |
||
− | * Make hfst-proc correctly disperse inline blanks onto each lexical unit until the next <code><nowiki>[</nowiki></code> |
||
+ | * Start work on function words addition to dictionary. |
||
− | * Task: Create a pull request to https://github.com/hfst/hfst/ with tests in https://github.com/hfst/hfst/tree/master/test/tools/ |
||
− | |||
− | ===transfer (non-chunking)=== |
||
− | * Test if current transfer.cc handles non-chunking/single-stage transfer correctly, if not, fix |
||
− | * Task: PR to https://github.com/unhammer/apertium/ with tests showing working transfer.cc for single-stage/non-chunking transfer, with inline vs block-level blank handling and test that rules using misnumbered/missing b-elements should not mess up formatting |
||
− | |||
− | ===postchunk=== |
||
− | (Should be done after interchunk is complete) |
||
− | |||
− | * Task: PR to https://github.com/unhammer/apertium/ including tests showing working postchunk blank handling – test that rules using misnumbered/missing b-elements should not mess up formatting |
||
− | |||
− | ===etc=== |
||
− | * Ensure all other modules are fine with the new format for inline blanks (e.g. cg-proc) |
||
− | * Work on other deformatters (mediawiki? latex?) |
||
− | |||
− | ==Done== |
||
− | (Some of these are from coding challenges) |
||
− | |||
− | ===deformatting prototypes=== |
||
− | # Make the HTML format handler <code>apertium-deshtml</code> turn "<i>foo <b>bar</b></i>" into "<nowiki>[{<i>}]foo [{<i><b>}]bar</nowiki>" |
||
− | #* Code at https://github.com/SilentFlame/apertium/blob/master/challenge-1.cpp |
||
− | #* make <code>apertium-deshtml</code> *not* wrap tags like {{tag|p}} or {{tag|div}} in <code>{}</code> (ie. only for inline tags) |
||
− | #* Code at https://github.com/SilentFlame/apertium/blob/master/challenge-2.cpp |
||
− | |||
− | ===pretransfer=== |
||
− | * Make pretransfer disperse tags when splitting lexical units https://github.com/unhammer/apertium/commit/39bd7d9fa45c64586d3a9b0f1a7df89e7d007c1a , code cleanup: |
||
− | #* Fork https://github.com/unhammer/apertium and check out and compile the <code>master</code> branch |
||
− | #* then in a different folder, do <code>git clone -b blank-handling https://github.com/junaidiiith/apertium</code> |
||
− | #* from junaidiiith/blank-handling, copy over the changes that were made there to apertium_pretransfer.cc into your fork of unhammer/apertium, along with the pretransfer tests |
||
− | #* ensure tests pass |
||
− | #* PR at https://github.com/unhammer/apertium/pull/4 |
||
− | |||
− | ===transfer (chunker)=== |
||
− | # Fix a memory bug |
||
− | #* uncommenting apertium/transfer.cc:1259 <code> // delete[] format;</code> in the blank handling branch leads to a double-free – find out why and ensure we're correctly releasing memory |
||
− | #* Install valgrind from your package manager or http://valgrind.org/, then compile your program with -O0 -g3, then run <code>valgrind -v --leak-check=full apertium/apertium-transfer</code> and read the output |
||
− | |||
− | ===Interchunk=== |
||
− | Interchunk needs to ignore the "pos" argument to b elements, and output each superblank exactly once, preferably where the rule has a b element (if there are not enough b's, output the rest at the end of the rule). |
||
− | Interchunk shouldn't have to deal with wordblanks, since we can't look inside chunks when in interchunk. |
||
− | |||
− | # Apply changes to transfer.cc to interchunk.cc |
||
− | #* Check <code>git clone -b blank-handling https://github.com/unhammer/apertium</code> |
||
− | #* Apply the <code>git diff 4c7c4f8f1b..2025182991</code> from transfer.cc to interchunk.cc |
||
− | #* Try to make it compile and run – report things that didn't seem to have a 1-1 correspondence |
||
− | #* Write tests for interchunk, like those for transfer at https://github.com/unhammer/apertium/tree/blank-handling/tests |
||
− | |||
− | ===Deformatters=== |
||
− | * Complete prototype HTML deformatters |
||
− | ** Current prototype code at https://github.com/junaidiiith/apertium / https://github.com/junaidiiith/Apertium_Code and https://github.com/SilentFlame/apertium/ |
||
− | ** Task: Create a clean pull request to https://github.com/unhammer with HTML deformatter and reformatter, including tests |
||
− | |||
− | ===Reformatters=== |
||
− | |||
− | * Make reformat turn inline-blanks back into real tags |
||
− | ** <nowiki>[{<i>}]foo [{<i><b>}]bar</nowiki> should become <i>foo</i> <i><b>bar</b></i> |
||
− | ** prototypes exist for this in https://github.com/junaidiiith/apertium / https://github.com/junaidiiith/Apertium_Code |
Revision as of 20:46, 9 May 2020
Progress on https://wiki.apertium.org/wiki/User:Pmodi/GSOC_2020_proposal:_Hindi-Punjabi
Current task (Community Bonding Week 1)
Finalising Frequency lists & Learning Persian script
- DONE - Generated frequency lists using WikiExtractor on the latest dump.
- DONE - Read the wiki pages on generating frequency lists and building monolingual dictionaries.
- IN PROGRESS - Finish scraping more texts and merge the results into the current frequency list.
- IN PROGRESS - Building a basic understanding of the Persian script.
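The frequency-list steps above (counting words in WikiExtractor output, then merging counts from additionally scraped text) could look roughly like the sketch below. It is only an illustration under stated assumptions, not the script actually used for this project: the function names `count_words` and `merge` are hypothetical, the tokenizer simply treats runs of Devanagari or Gurmukhi characters as words, and the sample sentences are made up.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Devanagari or Gurmukhi characters.
# The danda marks (U+0964, U+0965) are left out of the ranges so that
# they act as separators instead of sticking to the last word.
TOKEN = re.compile(r"[\u0900-\u0963\u0966-\u097F\u0A00-\u0A7F]+")


def count_words(lines):
    """Count tokens in plain-text lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # WikiExtractor wraps each article in <doc ...> and </doc> lines;
        # skip them so markup does not end up in the counts.
        if line.startswith("<doc") or line.startswith("</doc"):
            continue
        counts.update(TOKEN.findall(line))
    return counts


def merge(*freq_lists):
    """Sum several frequency lists into one (hypothetical helper)."""
    total = Counter()
    for freq in freq_lists:
        total.update(freq)
    return total


# Made-up sample data standing in for the dump and the scraped texts.
wiki = count_words(['<doc id="1" url="" title="x">', "यह घर है।", "</doc>"])
scraped = count_words(["यह किताब है।"])
merged = merge(wiki, scraped)

# Print in the usual "count word" layout, most frequent first.
for word, n in merged.most_common():
    print(n, word)
```

An explicit character range is used instead of `\w+` because Python's `\w` does not match Devanagari vowel signs, which would split words in the middle.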
TODO (Next week - Community Bonding Week 2)
- Fix bidirectional translation, i.e. the pan->hin direction.
- Fix transfer rules for postpositions.
- Start adding function words to the dictionary.