User:Popcorndude/Recursive Transfer/Progress
Contents
Work Plan (from proposal)
Time Period | Goal | Details | Deliverable |
---|---|---|---|
Community Bonding Period
May 6-26 |
Finalize formalism |
|
Full description of planned formalism |
Week 1
May 27-June 2 |
Begin parser |
|
Minimal parser |
Week 2
June 3-9 |
Add variables |
|
Minimal parser with agreement |
Week 3
June 10-16 |
Test with eng->spa |
|
Simple eng->spa parser |
Week 4
June 17-23 |
Continue parser |
|
Majority of initial specifications implemented |
evaluation 1 | Basic parser done | Parser-generator compliant with majority of initial specifications and rudimentary eng->spa instantiation | |
Week 5
June 24-30 |
Finish parser and continue eng->spa |
|
Fully implemented parser and working eng->spa for simple sentences |
Week 6
July 1-7 |
Finish eng->spa and write reverser |
|
System comparison and rule-reverser |
Week 7
July 8-14 |
Evaluation and testing |
|
Test suite and report on the general effectiveness of direct rule-reversal |
Week 8
July 15-21 |
Optimization and interface |
|
Command-line interfaces and updated system comparison |
evaluation 2 | Complete program | Optimized and polished parser-generator compliant with initial specifications, and complete end->spa transfer rules | |
Week 9
July 22-28 |
Do spa->eng |
|
Working spa->eng rules and report on the usefulness of rule-reverser |
Week 10
July 29-August 4 |
Documentation |
|
Complete documentation of system |
Weeks 11 and 12
August 5-18 |
Buffer zone |
These weeks will be used for one of the following, depending on preceding weeks and discussions with mentors:
|
TBD |
final evaluation | Project done | Complete, fully documented system with full ruleset for at least one language pair |
Community Bonding
Todo list
Determine exact semantics of lexical unit tag-matchingAre they ordered?Are they consecutive?
- See if anyone has input on formalism syntax in general
- Mechanism for clitic-insertion
- e.g. V2, Wackernagel
- Read about GLR parser algorithms
- Find reading materials
- Is there anything that can be done to make this finite-state? (probably not)
- Should we just start with the naive implementation (what the Python script does) as a baseline?
Conjoined lexical units - just treat as consecutive elements with no blank between?Syntax for mapping between sets of tags (e.g. <o3pl> -> <p3><pl>, <o3sg> -> <p3><sg>)Conditional output (e.g. modal verbs in English)Make sure all syntax is written down- Begin writing tests
- Some way to match absence of a tag
May 25: LU tags are unordered (basically, every tag operation has the same semantics as <clip>, <equal><clip>..., or <let><clip>... in the chunker). Various other things have syntax but that syntax may not be properly documented yet.
Week 1
May 27: As it turns out, I should have accounted for building a compiler/parser-generator in the workplan. We can currently mostly parse the file. Todo for tomorrow: finish parsing reduction rules and add some more error messages.
May 28: Can now parse a basic test file and we're maybe 2/3 of the way through generating the LR parsing table (multiple output nodes may be more complicated than I thought).
May 29: Idea: in a rule like
clitic NP -> @adj @cli @n {2} {1 _1 _2 3};
The parser just has to treat NP
as both terminal and non-terminal, and when it applies this rule it can then reduce clitic
and pretend that NP
was the next token in the stream.
In discussion with mentors it was decided that we should try using Bison instead of building a new parser. The Bison parsers now compile, but they don't seem to succeed at outputting anything. Further analysis required.
May 31: The bytecode interpreter for interchunk seems to be working.
Things that now exist:
- Python brute-force interpreter for formalism (works)
- C++ parser and parse-table-generator for formalism (partially tested)
- Python formalism->Yacc converter (might be close to working)
- Python t2x->bytecode compiler (works)
- C++ bytecode interpreter for interchunk (works for toy examples, not fully tested)
I could make the last one recursive in maybe half an hour.
I tested the bytecode on eng->spa (because what else would one use for a stress test :) ). It appears to work, but I didn't look very closely at the output because it gets stuck in an infinite loop if the input has a non-ASCII character in it.
Week 2
June 3: There are several inputs + rulesets for which the bytecode works just fine. There are a few, however, which give a segfault in the output code. The bytecode is now documented: User:Popcorndude/Recursive_Transfer/Bytecode
June 4: The interpreter can now handle eng->spa with a 67k word corpus. The error rate seems to be around 5% compared with apertium-interchunk (though I've had some difficulty counting). A lot of the errors seem to the be the interpreter spitting out newlines for some reason.