User:Popcorndude/Recursive Transfer/Progress
Contents
Work Plan (from proposal)
Time Period | Goal | Details | Deliverable |
---|---|---|---|
Community Bonding Period
May 6-26 |
Finalize formalism |
|
Full description of planned formalism |
Week 1
May 27-June 2 |
Begin parser |
|
Minimal parser |
Week 2
June 3-9 |
Add variables |
|
Minimal parser with agreement |
Week 3
June 10-16 |
Test with eng->spa |
|
Simple eng->spa parser |
Week 4
June 17-23 |
Continue parser |
|
Majority of initial specifications implemented |
evaluation 1 | Basic parser done | Parser-generator compliant with majority of initial specifications and rudimentary eng->spa instantiation | |
Week 5
June 24-30 |
Finish parser and continue eng->spa |
|
Fully implemented parser and working eng->spa for simple sentences |
Week 6
July 1-7 |
Finish eng->spa and write reverser |
|
System comparison and rule-reverser |
Week 7
July 8-14 |
Evaluation and testing |
|
Test suite and report on the general effectiveness of direct rule-reversal |
Week 8
July 15-21 |
Optimization and interface |
|
Command-line interfaces and updated system comparison |
evaluation 2 | Complete program | Optimized and polished parser-generator compliant with initial specifications, and complete end->spa transfer rules | |
Week 9
July 22-28 |
Do spa->eng |
|
Working spa->eng rules and report on the usefulness of rule-reverser |
Week 10
July 29-August 4 |
Documentation |
|
Complete documentation of system |
Weeks 11 and 12
August 5-18 |
Buffer zone |
These weeks will be used for one of the following, depending on preceding weeks and discussions with mentors:
|
TBD |
final evaluation | Project done | Complete, fully documented system with full ruleset for at least one language pair |
Community Bonding
Todo list
Determine exact semantics of lexical unit tag-matchingAre they ordered?Are they consecutive?
- See if anyone has input on formalism syntax in general
- Mechanism for clitic-insertion
- e.g. V2, Wackernagel
- Read about GLR parser algorithms
- Find reading materials
- Is there anything that can be done to make this finite-state? (probably not)
- Should we just start with the naive implementation (what the Python script does) as a baseline?
Conjoined lexical units - just treat as consecutive elements with no blank between?Syntax for mapping between sets of tags (e.g. <o3pl> -> <p3><pl>, <o3sg> -> <p3><sg>)Conditional output (e.g. modal verbs in English)Make sure all syntax is written down- Begin writing tests
- Some way to match absence of a tag
May 25: LU tags are unordered (basically, every tag operation has the same semantics as <clip>, <equal><clip>..., or <let><clip>... in the chunker). Various other things have syntax but that syntax may not be properly documented yet.
Week 1
May 27: As it turns out, I should have accounted for building a compiler/parser-generator in the workplan. We can currently mostly parse the file. Todo for tomorrow: finish parsing reduction rules and add some more error messages.
May 28: Can now parse a basic test file and we're maybe 2/3 of the way through generating the LR parsing table (multiple output nodes may be more complicated than I thought).
May 29: Idea: in a rule like
clitic NP -> @adj @cli @n {2} {1 _1 _2 3};
The parser just has to treat NP
as both terminal and non-terminal, and when it applies this rule it can then reduce clitic
and pretend that NP
was the next token in the stream.
In discussion with mentors it was decided that we should try using Bison instead of building a new parser. The Bison parsers now compile, but they don't seem to succeed at outputting anything. Further analysis required.
May 31: The bytecode interpreter for interchunk seems to be working.
Things that now exist:
- Python brute-force interpreter for formalism (works)
- C++ parser and parse-table-generator for formalism (partially tested)
- Python formalism->Yacc converter (might be close to working)
- Python t2x->bytecode compiler (works)
- C++ bytecode interpreter for interchunk (works for toy examples, not fully tested)
I could make the last one recursive in maybe half an hour.
I tested the bytecode on eng->spa (because what else would one use for a stress test :) ). It appears to work, but I didn't look very closely at the output because it gets stuck in an infinite loop if the input has a non-ASCII character in it.
Week 2
June 3: There are several inputs + rulesets for which the bytecode works just fine. There are a few, however, which give a segfault in the output code. The bytecode is now documented: User:Popcorndude/Recursive_Transfer/Bytecode
June 4: The interpreter can now handle eng->spa with a 67k word corpus. The error rate seems to be around 5% compared with apertium-interchunk (though I've had some difficulty counting). A lot of the errors seem to the be the interpreter spitting out newlines for some reason.
Notes for tomorrow: if the chunk content is empty it will get treated like a blank and throw off rule application - maybe keep {} as part of chunk data? Also, remove <call-macro n="f_bcond"> from rule 56 and see how much that changes things.
June 5: The only remaining difference in output between apertium-interchunk and the bytecode is that interchunk deduplicates blanks. We had a meeting and decided that the next step was to rewrite eng->spa in the formalism, make a compiler for the formalism, and then compare that to chunker/interchunk/postchunk. Initial rewriting is 1/4-1/3 done and the parser is nearly ready to begin actually compiling.
June 6-7: rtx-comp2 will now turn the basic formalism into both bytecode and the pattern files that apertium-preprocess-transfer generates. More work has been done on eng->spa, but we don't have agreement yet, so on average we're about on track.
Week 3
Todo List
- The formalism is in flux again
Clipping?- Variables???
- Bytecode is also in flux
- Data types
Manner of chunk construction
- Bytecode is incomplete
rejectrule- interpolation???
make a separate file that lists all the codes so the compiler isn't relying on literals
- eng->spa
- t1x - about
120109 rules to go - t2x - 61 rules
- t3x - 37 rules
- t1x - about
- Compiler
- Agreement (rejectrule or change transducer?)
- Conditionals
- Compiler for t?x issues
- Name conflicts
- t2x doesn't generate new chunks, so rules might conflict with t3x
- Maybe change the rule applier so it applies as many rules as possible before getting more input?
June 10: worked on bytecode a bit, cleaned up string construction in compiler, added added verb rules to eng-spa.rtx but haven't checked how well they align with the t1x versions. The compiler can now handle most of eng-spa.rtx with the exception of conditionals and the character #
, which apparently has to be escaped in regexs.
June 11: with all the conditionals commented out, eng-spa.rtx actually works now.
June 12: Conditionals work now. Some thought should probably be given to what are valid ways to write the operators and how many there should be (there are currently 46 ways to write 17 operators but NOT can only be ~). This leaves for tomorrow implicit agreement in the pattern, multiple output chunks, and maybe weights.