Difference between revisions of "User:Popcorndude/GSoC2021Proposal"

Latest revision as of 13:35, 17 April 2021

Google Summer of Code 2021 proposal draft

Contact

Name: Daniel Swanson

Email: awesomeevildudes@gmail.com

IRC: popcorndude

GitHub: https://github.com/mr-martian

Timezone: UTC-5

Proposal: UNIpertium

This project is really 3 projects UNIfied only by the fact that they all start with UNI:

UNIcode Support

This portion of the project corresponds to the idea described at Ideas for Google Summer of Code/Robust tokenisation. All string processing would be transferred to ICU and all inputs would be normalized. A program inserted at the beginning of the pipeline would normalize incoming text, and compilers would emit warnings for non-normalized input.

The normalization will likely be a modified version of NFC. See https://github.com/apertium/organisation/issues/24 for discussion.
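As a rough illustration of the normalize-then-warn idea, here is a minimal Python sketch. It rests on several assumptions: Python's standard `unicodedata` module stands in for ICU, plain NFC stands in for the modified NFC still under discussion, and the function names `normalize_text` and `warn_if_not_normalized` are hypothetical, not part of any Apertium tool.

```python
import sys
import unicodedata


def normalize_text(text: str) -> str:
    """Return text in plain NFC (the proposed modified NFC is not modelled here)."""
    return unicodedata.normalize("NFC", text)


def warn_if_not_normalized(text: str, source: str = "<input>") -> bool:
    """Print a warning to stderr and return True when text is not in NFC."""
    if unicodedata.is_normalized("NFC", text):
        return False
    print(f"warning: {source} contains non-normalized text", file=sys.stderr)
    return True


# 'e' followed by COMBINING ACUTE ACCENT, i.e. a decomposed "é".
decomposed = "cafe\u0301"
composed = normalize_text(decomposed)
```

The first function plays the role of the program at the start of the pipeline, and the second plays the role of the check a compiler might run on its source files. `unicodedata.is_normalized` requires Python 3.8 or later.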

See also https://github.com/apertium/lttoolbox/issues/81 and https://github.com/apertium/lttoolbox/issues/85

UNIt-Testing Framework

Our regression tests are few and far between and almost none of them are run automatically. This part of the project would be the creation of a Grand Unified Unit-Testing Framework™, probably mainly derived from https://github.com/TinoDidriksen/regtest (bikeshedding of the exact feature set has been rescheduled to before the coding period begins).

I would also plan to enable this testing on all existing repositories unless the maintainers really didn't want it. This will entail converting any existing tests from whatever state they are currently in.
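To make the regression-testing idea concrete, here is a minimal Python sketch of the core loop: run each stored input through a pipeline and report any output that differs from the recorded one. This is an illustration only, not the format or interface of regtest; `run_regression` and `toy_pipeline` are hypothetical names, and a real runner would call an actual translation mode rather than a toy function.

```python
from typing import Callable, Dict, List, Tuple


def run_regression(
    pipeline: Callable[[str], str],
    expected: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every case whose output changed."""
    failures = []
    for source, want in expected.items():
        got = pipeline(source)
        if got != want:
            failures.append((source, want, got))
    return failures


def toy_pipeline(text: str) -> str:
    """Stand-in for a real pipeline; it merely upper-cases its input."""
    return text.upper()


# The second case is deliberately recorded with a stale expected output.
cases = {"hello": "HELLO", "world": "World"}
failures = run_regression(toy_pipeline, cases)
```

An empty `failures` list means nothing regressed, which is the condition a continuous integration job would check.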

See also https://github.com/apertium/apertium-init/issues/51 , User:Ilnar.salimzyan/On_testing , https://github.com/TinoDidriksen/regtest/wiki , Testvoc , https://github.com/apertium/apertium-ckb-eng/tree/master/t , https://github.com/giellalt/regtest-kal , and Github_Actions

UNIversal Dependencies-Based Structural Transfer

I have worked out in broad outlines how to get apertium-recursive's rtx-proc to parse a dependency grammar. This third phase would be implementing that concept and seeing if it's actually useful.

If it works, we'll have a way of creating UD parsers (though I suppose we already have that with CG). I think the more useful thing, though, is that I have some ideas for a UD generator that takes the rules for the parser and uses them to reorder a set of relations. The result would be a semi-monolingualized syntactic transfer system, where translation pairs could immediately start using the UD parsers in the monolingual repos and would only need rules for particular problematic constructions.
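The reordering idea can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the tuple layout, the `linearize` function, and the `dep_first` table (which says, per relation, whether a dependent goes before its head) are hypothetical and do not reflect how apertium-recursive or the proposed generator would actually represent rules. The sketch only reorders words; it does not translate them.

```python
from typing import Dict, List, Tuple

# (id, form, head id, relation); head 0 marks the root, as in CoNLL-U.
Token = Tuple[int, str, int, str]


def linearize(tokens: List[Token], dep_first: Dict[str, bool]) -> List[str]:
    """Order words so each dependent precedes or follows its head per dep_first."""
    children: Dict[int, List[Token]] = {}
    for tok in tokens:
        children.setdefault(tok[2], []).append(tok)

    def visit(tok: Token) -> List[str]:
        before: List[str] = []
        after: List[str] = []
        for child in children.get(tok[0], []):
            # Relations missing from the table default to following the head.
            target = before if dep_first.get(child[3], False) else after
            target.extend(visit(child))
        return before + [tok[1]] + after

    out: List[str] = []
    for root in children.get(0, []):
        out.extend(visit(root))
    return out


# "the red house" as a set of relations headed by the noun.
tree = [(1, "the", 3, "det"), (2, "red", 3, "amod"), (3, "house", 0, "root")]
english_order = linearize(tree, {"det": True, "amod": True})
# A grammar that places adjectives after the noun.
noun_first_order = linearize(tree, {"det": True, "amod": False})
```

Swapping only the ordering table changes the output order, which is the sense in which monolingual rules alone could drive reordering without pair-specific rules.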

There's probably also cool stuff we could do with treebanks.

See also [1] [2] [3] [4] [5]

Background

I am a senior at Swarthmore College studying math and linguistics. I have a lot of experience with Python and a decent knowledge of C++. I am a native speaker of English and can read Spanish and Biblical Hebrew.

Repositories I maintain:

Repositories I have been very involved in:

Coding Challenge

Preliminary UD work: https://github.com/mr-martian/ud-experiments/tree/master

Status as of Feb 9: working Python compiler from minimal formalism to trx.

Feb 23: the coding challenge listed at Ideas_for_Google_Summer_of_Code/Robust_tokenisation can be found at https://github.com/mr-martian/gsoc2021-coding-challenge and a draft of the file structure for the unit-testing framework can be found at User:Popcorndude/Unit-Testing.

Mar 2: Unit-testing draft is probably nearly ready to send to the mailing list for discussion.

Mar 2 again: This slight generalization of Tino's regression system (User:Popcorndude/Regression-Testing) should cover pretty much everything we need without separate unit tests.

Work Plan

The following work plan is written assuming the number of hours that GSoC recommends (17 or so per week), but it is very likely that having fun or not having much else to do will result in me putting in substantially more time than that. Either this will balance out my inevitable underestimate of the complexity of the task, or I'll end up way ahead and start inventing new tasks for myself.

{| class="wikitable"
! Time Period !! Goal !! Details !! Deliverable
|-
| Community Bonding Period
May 17-June 5
| Finalize implementation plan for Unicode support
|
* Decide on normalization scheme
* Figure out how ICU custom normalizers work
* Determine which ICU types to make default
* Propose Grand Unified Unit-Testing Framework™
* Bikeshed Grand Unified Unit-Testing Framework™
| Draft of string-handling guidelines (= updated version of Code style)
|-
| Week 1
June 6-12
| Convert Lttoolbox
|
* Apply plan from Community Bonding Period
| PR on lttoolbox repository
|-
| Week 2
June 13-19
| Convert Apertium
|
* Apply plan from Community Bonding Period
| PR on apertium repository
|-
| Week 3
June 20-26
| Convert remaining tools
|
* Apply plan from Community Bonding Period
| PRs on apertium-separable, apertium-anaphora, apertium-recursive, and apertium-lex-tools
|-
| Week 4
June 27-July 3
| Transition week
|
* Deal with anything that comes up in testing
* Finalize string-handling coding guidelines
* Implement/port regression test system
| Unicode transition branches ready for merging
|-
| Week 5
July 4-10
| Prepare Grand Unified Unit-Testing Framework™
|
* Test regression system
* Figure out continuous integration setup
* Incorporate regression tests into apertium-init
* Use apertium-init to begin updating existing repos
| Working test framework
|-
| '''midterm evaluation'''
| Full Unicode support
| Full Unicode support with proper tokenization and normalization in all native pipeline modules and coding guidelines for future contributors
|-
| Week 6
July 11-17
| Roll out Grand Unified Unit-Testing Framework™
|
* Use apertium-init to update existing repos
| Partial rollout of testing framework
|-
| Week 7
July 18-24
| Probably more Grand Unified Unit-Testing Framework™ rollout
|
* Use apertium-init to update existing repos
* Probably declare undying hatred of makefiles
| Working regression tests in all repositories
|-
| Week 8
July 25-31
| UD experimentation
|
* Convert parser output to CONLL-U
* Transfer of some sort
* Deal with the inevitable issues that arise from rolling out changes to several hundred repositories
| Yet Another Transfer System
|-
| Week 9
August 1-7
| Further UD experimentation
|
* Try some actual UD grammars
** See how well it translates with no pair-specific rules, just the reorderings specified by the monolingual grammar
** See how fast or slow it ends up being
* Port compiler and whatnot to C++ if it turns out to be any good
| Preliminary evaluation of the effectiveness of UD transfer
|-
| Week 10
August 8-14
| Unknown
| What I write here will have little bearing on what actually happens and also by not making a plan for this week, I ensure that my overall plan can be at most 90% wrong.
| Unknown
|-
| '''final evaluation'''
| Project done
| Unicode support with accompanying developer documentation, unit-testing framework for all language and pair repositories, and some potentially useful messing around with UD-based transfer
|}

I have no other commitments this summer and would be able to work on this project full-time.