Difference between revisions of "User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi/progress"

From Apertium
(Created page with "Progress on Automatic_blank_handling ==Current task== ===lttoolbox=== * Make lt-proc correctly disperse inline bl...")
 
Line 1: Line 1:
  +
Progress on [https://wiki.apertium.org/wiki/User:Pmodi/GSOC_2020_proposal:_Hindi-Punjabi GSOC 2020 proposal: Hindi-Punjabi]
Progress on [[Ideas_for_Google_Summer_of_Code/Automatic_blank_handling|Automatic_blank_handling]]
 
   
==Current task==
+
==Current task (Community Bonding Week 1)==
  +
===Finalising Frequency lists & Learning Persian script===
===lttoolbox===
 
  +
* DONE - Frequency lists using WikiExtractor on latest dump.
* Make lt-proc correctly disperse inline blanks onto each lexical unit until the next <code><nowiki>[</nowiki></code>
 
  +
* DONE - Read the wiki pages on generating frequency lists and building monolingual dictionaries.
* Task: Create a pull request to https://github.com/unhammer/lttoolbox/ with tests in https://github.com/unhammer/lttoolbox/tree/master/tests/lt_proc
 
  +
* IN PROGRESS - Finish scraping more texts and merge results with the current freq. list.
  +
* IN PROGRESS - Building a basic understanding of the Persian script.
   
   
  +
==TODO (Next week - Community Bonding Week 2)==
==TODO==
 
  +
* Fix bidirectional translation, i.e. the pan->hin direction.
===hfst===
 
  +
* Fix transfer rules for postpositions.
* Make hfst-proc correctly disperse inline blanks onto each lexical unit until the next <code><nowiki>[</nowiki></code>
 
  +
* Start adding function words to the dictionary.
* Task: Create a pull request to https://github.com/hfst/hfst/ with tests in https://github.com/hfst/hfst/tree/master/test/tools/
 
 
===transfer (non-chunking)===
 
* Test whether the current transfer.cc handles non-chunking/single-stage transfer correctly; if not, fix it
 
* Task: PR to https://github.com/unhammer/apertium/ with tests showing working transfer.cc for single-stage/non-chunking transfer, with inline vs block-level blank handling and test that rules using misnumbered/missing b-elements should not mess up formatting
 
 
===postchunk===
 
(Should be done after interchunk is complete)
 
 
* Task: PR to https://github.com/unhammer/apertium/ including tests showing working postchunk blank handling – test that rules using misnumbered/missing b-elements should not mess up formatting
 
 
===etc===
 
* Ensure all other modules are fine with the new format for inline blanks (e.g. cg-proc)
 
* Work on other deformatters (mediawiki? latex?)
 
 
==Done==
 
(Some of these are from coding challenges)
 
 
===deformatting prototypes===
 
# Make the HTML format handler <code>apertium-deshtml</code> turn "&lt;i&gt;foo &lt;b&gt;bar&lt;/b&gt;&lt;/i&gt;" into "<nowiki>[{&lt;i&gt;}]foo [{&lt;i&gt;&lt;b&gt;}]bar</nowiki>"
 
#* Code at https://github.com/SilentFlame/apertium/blob/master/challenge-1.cpp
 
#* Make <code>apertium-deshtml</code> *not* wrap tags like {{tag|p}} or {{tag|div}} in <code>{}</code> (i.e. wrap only inline tags)
 
#* Code at https://github.com/SilentFlame/apertium/blob/master/challenge-2.cpp
 
 
===pretransfer===
 
* Make pretransfer disperse tags when splitting lexical units https://github.com/unhammer/apertium/commit/39bd7d9fa45c64586d3a9b0f1a7df89e7d007c1a , code cleanup:
 
#* Fork https://github.com/unhammer/apertium and check out and compile the <code>master</code> branch
 
#* then in a different folder, do <code>git clone -b blank-handling https://github.com/junaidiiith/apertium</code>
 
#* from junaidiiith/blank-handling, copy over the changes that were made there to apertium_pretransfer.cc into your fork of unhammer/apertium, along with the pretransfer tests
 
#* ensure tests pass
 
#* PR at https://github.com/unhammer/apertium/pull/4
 
 
===transfer (chunker)===
 
# Fix a memory bug
 
#* uncommenting apertium/transfer.cc:1259 <code> // delete[] format;</code> in the blank handling branch leads to a double-free – find out why and ensure we're correctly releasing memory
 
#* Install valgrind from your package manager or http://valgrind.org/, then compile your program with -O0 -g3, then run <code>valgrind -v --leak-check=full apertium/apertium-transfer</code> and read the output
 
 
===Interchunk===
 
Interchunk needs to ignore the "pos" argument to b elements, and output each superblank exactly once, preferably where the rule has a b element (if there are not enough b's, output the rest at the end of the rule).
 
Interchunk shouldn't have to deal with wordblanks, since we can't look inside chunks when in interchunk.
 
 
# Apply changes to transfer.cc to interchunk.cc
 
#* Check <code>git clone -b blank-handling https://github.com/unhammer/apertium</code>
 
#* Apply the <code>git diff 4c7c4f8f1b..2025182991</code> from transfer.cc to interchunk.cc
 
#* Try to make it compile and run – report things that didn't seem to have a 1-1 correspondence
 
#* Write tests for interchunk, like those for transfer at https://github.com/unhammer/apertium/tree/blank-handling/tests
 
 
===Deformatters===
 
* Complete prototype HTML deformatters
 
** Current prototype code at https://github.com/junaidiiith/apertium / https://github.com/junaidiiith/Apertium_Code and https://github.com/SilentFlame/apertium/
 
** Task: Create a clean pull request to https://github.com/unhammer with HTML deformatter and reformatter, including tests
 
 
===Reformatters===
 
 
* Make reformat turn inline-blanks back into real tags
 
** <nowiki>[{&lt;i&gt;}]foo [{&lt;i&gt;&lt;b&gt;}]bar</nowiki> should become &lt;i&gt;foo&lt;/i&gt; &lt;i&gt;&lt;b&gt;bar&lt;/b&gt;&lt;/i&gt;
 
** prototypes exist for this in https://github.com/junaidiiith/apertium / https://github.com/junaidiiith/Apertium_Code
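
The reformatting step described above can be illustrated with a minimal Python sketch. It is not the code from the linked prototypes: the function names <code>close_tags</code> and <code>reformat_inline_blanks</code> are hypothetical, and it assumes each inline blank applies only to the single whitespace-delimited word that follows it, which is a simplification of how lexical units appear in the real stream.

```python
import re

# One inline blank such as "[{<i><b>}]" followed by the word it applies to.
# Assumption: the blank covers exactly the next whitespace-delimited word.
INLINE_BLANK = re.compile(r"\[\{((?:<[^<>]+>)+)\}\](\S+)")
# Captures the element name at the start of an opening tag.
TAG_NAME = re.compile(r"<\s*([A-Za-z][A-Za-z0-9]*)")


def close_tags(opening: str) -> str:
    """Build the closing tags for a run of opening tags, innermost first."""
    names = TAG_NAME.findall(opening)
    return "".join(f"</{name}>" for name in reversed(names))


def reformat_inline_blanks(text: str) -> str:
    """Turn each inline blank back into real tags around its word."""
    def restore(match: re.Match) -> str:
        opening, word = match.group(1), match.group(2)
        return f"{opening}{word}{close_tags(opening)}"

    return INLINE_BLANK.sub(restore, text)


print(reformat_inline_blanks("[{<i>}]foo [{<i><b>}]bar"))
```

Because every word carries its full set of tags, the output repeats the enclosing <code>i</code> element for each word instead of restoring the original nesting, which matches the expected result given above.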
 

Revision as of 20:46, 9 May 2020

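Progress on the GSOC 2020 proposal: Hindi-Punjabi (https://wiki.apertium.org/wiki/User:Pmodi/GSOC_2020_proposal:_Hindi-Punjabi)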

Current task (Community Bonding Week 1)

Finalising Frequency lists & Learning Persian script

  • DONE - Frequency lists using WikiExtractor on latest dump.
  • DONE - Read the wiki pages on generating frequency lists and building monolingual dictionaries.
  • IN PROGRESS - Finish scraping more texts and merge results with the current freq. list.
  • IN PROGRESS - Building a basic understanding of the Persian script.
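
A rough sketch of the merge step for the frequency lists, in Python. This is not the script actually used for the project: the function names, the whitespace tokenisation, the punctuation set and the tab-separated "count word" output format are all assumptions made for illustration, and the input is assumed to be plain text already extracted from the dump.

```python
from collections import Counter

# Assumption: tokens are split on whitespace and stripped of this
# punctuation, including the danda and double danda.
PUNCT = ".,;:!?\"'()[]{}।॥-"


def count_words(text: str) -> Counter:
    """Count whitespace-separated tokens after stripping punctuation."""
    counts = Counter()
    for raw in text.split():
        token = raw.strip(PUNCT)
        if token:
            counts[token] += 1
    return counts


def merge_freq_lists(*lists: Counter) -> Counter:
    """Add the counts of several frequency lists together."""
    merged = Counter()
    for freq in lists:
        merged.update(freq)
    return merged


def format_freq_list(freq: Counter) -> list[str]:
    """Render as 'count<TAB>word' lines, most frequent first."""
    ordered = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{count}\t{word}" for word, count in ordered]


current = count_words("घर में घर है।")   # stands in for the existing list
scraped = count_words("घर है")          # stands in for newly scraped text
for line in format_freq_list(merge_freq_lists(current, scraped)):
    print(line)
```

Splitting on whitespace rather than on a <code>\w+</code> regular expression is deliberate here: Python's <code>\w</code> does not match Devanagari vowel signs, so it would break Hindi words apart.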


TODO (Next week - Community Bonding Week 2)

  • Fix bidirectional translation, i.e. the pan->hin direction.
  • Fix transfer rules for postpositions.
  • Start adding function words to the dictionary.