User:Popcorndude/Recursive Transfer

Latest revision as of 04:56, 4 April 2019

Google Summer of Code 2019 proposal draft

Contact

Name: Daniel Swanson

Email: awesomeevildudes@gmail.com

IRC: popcorndude

GitHub: https://github.com/mr-martian

Timezone: UTC-5

Proposal

I would like to implement an alternative to the current chunking system for structural transfer as described at Ideas_for_Google_Summer_of_Code/Robust_recursive_transfer. The new system would take a set of recursively defined rules and generate a GLR parser from them, which would make it much easier to handle long-distance phrasal reordering and would probably also significantly reduce the size of existing rule sets. A draft of the formalism for these rules can be found at User:Popcorndude/Recursive_Transfer/Formalism.

This project would benefit the community by making it much easier to write transfer rules for syntactically dissimilar languages. To the extent that it makes rule sets smaller, it would presumably also make them easier to maintain.

Background

I am a sophomore at Swarthmore College studying math and linguistics. Last year I took a class in computational linguistics using Apertium and this year I am a course assistant for that class. Last summer I worked on a personal translation project (code here) which involved a lot of structural transfer and writing a recursive descent parser.

I have a lot of experience with Python and a basic knowledge of C++. I am a native speaker of English and can read Spanish and Biblical Hebrew.

I have been interested in rule-based machine translation for several years, particularly as it might be applied to Bible translation. I am interested in Apertium because it already does pretty much everything I was trying to do with the system I was building on my own except for complex syntactic relations, and this GSoC project would fill that gap.

Coding Challenge

All my code is on GitHub at https://github.com/mr-martian/GSoC19-recursive

3/4/19

So far I have reimplemented the Python script from the prototype and added support for attribute categories and parameterized nodes.

Example:

gender = m f;
#noun $gender -> #(n.$gender);
#adj $gender -> #(adj.$gender);
NP $gender -> noun adj { 2 1 } ;

The first line defines a category "gender" consisting of <m> and <f>. The second and third lines define the lexical categories "noun" and "adj", which match things of the form "word<n><C>" and "word<adj><C>", respectively, where C is <m> or <f>. The last line defines a non-terminal node "NP" which matches a noun followed by an adjective of the same gender and outputs them in reverse order, so "carro<n><m>/car<n> rojo<adj><m>/red<adj>" would become "rojo<adj><m>/red<adj> carro<n><m>/car<n>" but "carro<n><m>/car<n> roja<adj><f>" would not be matched.
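As a rough illustration, the matching and agreement behaviour described above could be written in Python along the following lines. This is a hypothetical sketch, not the code in the repository: the names parse_unit, match_category, and apply_np are invented for this example, and it assumes that the first side of each lexical unit carries exactly a part-of-speech tag followed by a gender tag.

```python
import re

# Hypothetical sketch; none of these names come from the actual prototype.
GENDER = {"m", "f"}  # corresponds to: gender = m f;


def parse_unit(unit):
    """Return the tags on the first side of a unit such as 'carro<n><m>/car<n>'."""
    first_side = unit.split("/", 1)[0]
    return re.findall(r"<([^>]+)>", first_side)


def match_category(tags, pos):
    """Match '#(pos.$gender)': the tag `pos` followed by one gender tag.

    Returns the gender value, or None if the tags do not fit the pattern.
    """
    if len(tags) == 2 and tags[0] == pos and tags[1] in GENDER:
        return tags[1]
    return None


def apply_np(units):
    """NP $gender -> noun adj { 2 1 }: swap the pair if the genders agree.

    Returns the reordered units, or None if the rule does not match.
    """
    if len(units) != 2:
        return None
    noun, adj = units
    noun_gender = match_category(parse_unit(noun), "n")
    adj_gender = match_category(parse_unit(adj), "adj")
    if noun_gender is None or noun_gender != adj_gender:
        return None
    return [adj, noun]


if __name__ == "__main__":
    print(apply_np(["carro<n><m>/car<n>", "rojo<adj><m>/red<adj>"]))
    print(apply_np(["carro<n><m>/car<n>", "roja<adj><f>"]))
```

The sketch hard-codes a single rule; a real parser-generator would build the matching code from the rule file rather than having it written by hand.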

3/12/19

I rewrote a portion of the English->Spanish noun phrase rules in a potential transfer formalism. https://github.com/mr-martian/GSoC19-recursive/blob/master/eng-spa.rtx

Could you add comments to the rules that give examples of what they do? —Firespeaker (talk) 04:40, 22 March 2019 (CET)

Done Popcorndude (talk) 15:07, 22 March 2019 (CET)

Work Plan

Community Bonding Period (May 6-26)
Goal: Finalize formalism
Details:
  • Read up on GLR parsers
  • Decide variable semantics and syntax
  • See if there's a good way to handle interpolation (e.g. inserting clitics after first word of phrase)
Deliverable: Full description of planned formalism

Week 1 (May 27-June 2)
Goal: Begin parser
Details:
  • Get input
  • Match and build trees based on literal tags and attribute categories
Deliverable: Minimal parser

Week 2 (June 3-9)
Goal: Add variables
Details:
  • Agreement
  • Passing variables up the tree
  • Setting variables for child nodes
Deliverable: Minimal parser with agreement

Week 3 (June 10-16)
Goal: Test with eng->spa
Details:
  • Noun phrases (this was started in the coding challenge)
  • Basic verb phrases (some agreement, if time)
Deliverable: Simple eng->spa parser

Week 4 (June 17-23)
Goal: Continue parser
Details:
  • Weights
  • Conditionals
  • Multiple output nodes
  • Anything else deemed necessary during Community Bonding or testing
Deliverable: Majority of initial specifications implemented

Evaluation 1
Goal: Basic parser done
Deliverable: Parser-generator compliant with majority of initial specifications and rudimentary eng->spa instantiation

Week 5 (June 24-30)
Goal: Finish parser and continue eng->spa
Details:
  • Finish anything left over from week 4
  • Finish verb phrases
Deliverable: Fully implemented parser and working eng->spa for simple sentences

Week 6 (July 1-7)
Goal: Finish eng->spa and write reverser
Details:
  • Convert any remaining eng->spa rules
  • Evaluate parser against chunking system
    • Metrics: accuracy, speed of parser, compilation speed
  • Write script to automatically reverse a ruleset
    • All features currently described are at least in principle reversible
Deliverable: System comparison and rule-reverser

Week 7 (July 8-14)
Goal: Evaluation and testing
Details:
  • Evaluate the output of the reverser against current spa->eng system
  • Write tests for all features
  • Begin adding error messages
Deliverable: Test suite and report on the general effectiveness of direct rule-reversal

Week 8 (July 15-21)
Goal: Optimization and interface
Details:
  • Speed up the parser and compiler where possible
  • Build interfaces for compiler, parser, and reverser
  • Clean up code
  • Re-evaluate speed
Deliverable: Command-line interfaces and updated system comparison

Evaluation 2
Goal: Complete program
Deliverable: Optimized and polished parser-generator compliant with initial specifications, and complete eng->spa transfer rules

Week 9 (July 22-28)
Goal: Do spa->eng
Details:
  • Identify differences between generated spa->eng and chunking spa->eng
  • Fix generated spa->eng rules
  • Report on effort required to correct reverser
Deliverable: Working spa->eng rules and report on the usefulness of rule-reverser

Week 10 (July 29-August 4)
Goal: Documentation
Details:
  • Convert initial specifications to full documentation
  • Write tutorial
  • Write recipe book containing at least minimal examples of everything listed at User_talk:Popcorndude/Recursive_Transfer#Linguistic.2Ftransfer_phenomena
Deliverable: Complete documentation of system

Weeks 11 and 12 (August 5-18)
Goal: Buffer zone
Details: These weeks will be used for one of the following, depending on preceding weeks and discussions with mentors:
  • Make up for delays in prior weeks
  • Converting another language pair
  • Experimenting with automated conversion of chunking rules
  • Writing a ruleset composer for generating a preliminary ruleset from two other pairs (e.g. combine eng->spa and spa->cat to get approximate rules for eng->cat)
Deliverable: TBD

Final evaluation
Goal: Project done
Deliverable: Complete, fully documented system with full ruleset for at least one language pair
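
To give a sense of what the rule-reverser planned for week 6 might do in the simplest case, here is a minimal Python sketch. It is an assumption-laden illustration rather than a design: the Rule class and reverse_rule function are hypothetical names, and the sketch only covers rules whose output is a pure reordering of the matched children (such as NP -> noun adj { 2 1 }). Rules that insert, delete, or modify nodes would need additional handling.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical representation of a reordering rule; not part of any real format.
@dataclass(frozen=True)
class Rule:
    name: str            # non-terminal being defined, e.g. "NP"
    pattern: List[str]   # child categories in source order
    order: List[int]     # 1-based output order, e.g. [2, 1]


def reverse_rule(rule):
    """Build the rule for the opposite translation direction.

    The reversed rule matches what the original rule outputs and restores
    the original order, so its output order is the inverse permutation.
    """
    size = len(rule.pattern)
    if sorted(rule.order) != list(range(1, size + 1)):
        raise ValueError("this sketch only reverses pure reorderings")
    new_pattern = [rule.pattern[i - 1] for i in rule.order]
    new_order = [rule.order.index(p) + 1 for p in range(1, size + 1)]
    return Rule(rule.name, new_pattern, new_order)


if __name__ == "__main__":
    np_rule = Rule("NP", ["noun", "adj"], [2, 1])
    print(reverse_rule(np_rule))
```

Reversing a rule twice gives back the original rule, which is a convenient sanity check for any test suite built around the reverser.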

I have no other commitments this summer and would be able to work on this project full-time.