Difference between revisions of "User:Popcorndude/GSoC2021Proposal"
Revision as of 21:10, 9 February 2021
Google Summer of Code 2021 proposal draft
Contact
Name: Daniel Swanson
Email: awesomeevildudes@gmail.com
IRC: popcorndude
GitHub: https://github.com/mr-martian
Timezone: UTC-5
Proposal: UNIpertium
This project is really 3 projects UNIfied only by the fact that they all start with UNI:
UNIcode Support
This portion of the project corresponds to the idea described at Ideas for Google Summer of Code/Robust tokenisation. All string processing would be moved to ICU, and all inputs would be normalized.
The normalization will likely be a modified version of NFC. See https://github.com/apertium/organisation/issues/24 for discussion.
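As a rough illustration of what normalizing inputs buys us, here is a minimal sketch in Python. It uses the standard library's `unicodedata` module as a stand-in for ICU and applies plain NFC rather than the modified version under discussion; the function name `normalize_text` is made up for this example.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Return text in NFC (canonical decomposition, then composition).

    Assumption: plain NFC via Python's stdlib, not ICU and not the
    modified normalization discussed in the linked issue.
    """
    return unicodedata.normalize("NFC", text)


decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "\u00e9"     # precomposed U+00E9 LATIN SMALL LETTER E WITH ACUTE

# The two strings render identically but compare unequal before normalization.
assert decomposed != composed
assert normalize_text(decomposed) == composed
```

Without a step like this, a dictionary entry stored in one form would silently fail to match input typed in the other.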
See also https://github.com/apertium/lttoolbox/issues/81 and https://github.com/apertium/lttoolbox/issues/85
UNIt-Testing Framework
Our regression tests are few and far between and almost none of them are run automatically. This part of the project would be the creation of a Grand Unified Unit-Testing Framework™, probably mainly derived from https://github.com/TinoDidriksen/regtest (though I have included time in the schedule for bikeshedding the exact feature set).
I would also plan to enable this testing on all existing repositories unless the maintainers really didn't want it. This will entail converting any existing tests from whatever state they are currently in.
See also https://github.com/apertium/apertium-init/issues/51 and User:Ilnar.salimzyan/On_testing
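To make the idea concrete, the core of a regression test is just "run each input through the pipeline and compare with the stored expected output". The sketch below is a hypothetical illustration of that loop in Python; it is not regtest's actual interface or file format, and `run_regression` and `toy_pipeline` are names invented for the example.

```python
from typing import Callable, Dict, List, Tuple


def run_regression(
    pipeline: Callable[[str], str],
    expected: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Run every input through `pipeline` and collect the mismatches.

    Returns a list of (input, expected_output, actual_output) tuples,
    one per failing case; an empty list means everything passed.
    """
    failures = []
    for source, want in expected.items():
        got = pipeline(source)
        if got != want:
            failures.append((source, want, got))
    return failures


def toy_pipeline(text: str) -> str:
    """Hypothetical stand-in for a real translation or analysis pipeline."""
    return text.upper()


cases = {"hello": "HELLO", "world": "WORLD", "test": "Test"}
failures = run_regression(toy_pipeline, cases)
for source, want, got in failures:
    print(f"FAIL {source!r}: expected {want!r}, got {got!r}")
```

A real framework would call the installed tools instead of a Python function and would need a way to accept changed output as the new expected value, which is part of what the bikeshedding time is for.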
UNIversal Dependencies-Based Structural Transfer
I have worked out in broad outlines how to get apertium-recursive's rtx-proc to parse a dependency grammar. This third phase would be implementing that concept and seeing if it's actually useful.
If it works, we'll have a way of creating UD parsers (though I suppose we already have that with CG). The more useful thing, I think, is that I have some ideas for a UD generator that takes the rules written for the parser and uses them to reorder a set of relations. The result of that would be a semi-monolingualized syntactic transfer system, where translation pairs could immediately start using the UD parsers in the monolingual repos and would only need rules for particular problematic constructions.
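As a very rough picture of what "reordering a set of relations" could mean, the following Python sketch linearizes an unordered set of dependency relations using per-language rules that say whether each relation's dependent goes before or after its head. Everything here is an assumption made for illustration: the rule format, the `linearize` function, and the toy sentence are hypothetical and are not how rtx-proc works or necessarily how the final design would look.

```python
from typing import Dict, List, Tuple

# (head, relation label, dependent); words double as node identifiers,
# so this toy version assumes every word form in the sentence is unique.
Relation = Tuple[str, str, str]


def linearize(root: str, relations: List[Relation], order: Dict[str, str]) -> List[str]:
    """Place each dependent subtree before or after its head.

    `order` maps a relation label to "before" or "after"; labels that
    are missing default to "after". Assumes the relations form a tree.
    """
    before: List[str] = []
    after: List[str] = []
    for head, label, dep in relations:
        if head != root:
            continue
        subtree = linearize(dep, relations, order)
        if order.get(label, "after") == "before":
            before.extend(subtree)
        else:
            after.extend(subtree)
    return before + [root] + after


relations = [
    ("saw", "nsubj", "dog"),
    ("saw", "obj", "cat"),
    ("dog", "det", "the"),
    ("cat", "det", "a"),
]

# Hypothetical ordering rules for an SVO and an SOV language.
svo = {"nsubj": "before", "obj": "after", "det": "before"}
sov = {"nsubj": "before", "obj": "before", "det": "before"}

print(linearize("saw", relations, svo))
print(linearize("saw", relations, sov))
```

The same set of relations yields a different word order under each rule set, which is the property that would let a pair reuse monolingual rules and only add rules for constructions where this default goes wrong.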
There's probably also cool stuff we could do with treebanks.
Background
I am a senior at Swarthmore College studying math and linguistics. I have a lot of experience with Python and a decent knowledge of C++. I am a native speaker of English and can read Spanish and Biblical Hebrew.
Repositories I maintain:
- https://github.com/apertium/apertium-recursive
- https://github.com/apertium/lexd
- https://github.com/apertium/apertium-wad
- https://github.com/apertium/apertium-bkl
Repositories I have been very involved in:
- https://github.com/apertium/apertium-eng-kir
- https://github.com/apertium/apertium-separable
- https://github.com/apertium/apertium-init
- https://github.com/apertium/apertium-lex-tools
Coding Challenge
Preliminary UD work: https://github.com/mr-martian/ud-experiments/tree/master
Work Plan
The following work plan is written assuming the number of hours that GSoC recommends (17 or so per week), but it is very likely that having fun or not having much else to do will result in me putting in substantially more time than that. Either this will balance out my inevitable underestimate of the complexity of the task, or I'll end up way ahead and start inventing new tasks for myself.
Time Period | Goal | Details | Deliverable |
---|---|---|---|
Community Bonding Period (May 17-June 5) | Finalize implementation plan for Unicode support | | Draft of string-handling guidelines (= updated version of Code style) |
Week 1 (June 6-12) | Convert Lttoolbox | | PR on lttoolbox repository |
Week 2 (June 13-19) | Convert Apertium | | PR on apertium repository |
Week 3 (June 20-26) | Convert remaining tools | | PRs on apertium-separable, apertium-anaphora, apertium-recursive, and apertium-lex-tools |
Week 4 (June 27-July 3) | Transition week | | Unicode transition branches ready for merging |
Week 5 (July 4-10) | Propose Grand Unified Unit-Testing Framework™ | | Nothing terribly concrete |
midterm evaluation | Full Unicode support | Full Unicode support with proper tokenization and normalization in all native pipeline modules and coding guidelines for future contributors | |
Week 6 (July 11-17) | Finish initial UD compiler | | Preliminary UD-based transfer system |
Week 7 (July 18-24) | Implement Grand Unified Unit-Testing Framework™ | | Working unit-testing framework |
Week 8 (July 25-31) | Framework rollout | | Test reports everywhere |
Week 9 (August 1-7) | Further UD experimentation | | Preliminary evaluation of the effectiveness of UD transfer |
Week 10 (August 8-14) | Unknown | What I write here will have little bearing on what actually happens, and by not making a plan for this week, I ensure that my overall plan can be at most 90% wrong. | Unknown |
final evaluation | Project done | Unicode support with accompanying developer documentation, unit-testing framework for all language and pair repositories, and some potentially useful messing around with UD-based transfer | |
I have no other commitments this summer and would be able to work on this project full-time.