Difference between revisions of "User:Popcorndude/Recursive Transfer/Progress"
Popcorndude (talk | contribs) (→Week 7) |
Popcorndude (talk | contribs) (→Week 8) |
||
Line 268: | Line 268: | ||
== Week 8 == |
== Week 8 == |
||
July 15: Made a workaround for completed branches getting discarded, fixed some bugs, and merged the output binary files. |
|||
July 16: Assorted VM optimizations and added lookahead. |
|||
== Week 9 == |
== Week 9 == |
||
== Week 10 == |
== Week 10 == |
Revision as of 13:51, 17 July 2019
Contents
Work Plan (from proposal)
Time Period | Goal | Details | Deliverable |
---|---|---|---|
Community Bonding Period
May 6-26 |
Finalize formalism |
|
Full description of planned formalism |
Week 1
May 27-June 2 |
Begin parser |
|
Minimal parser |
Week 2
June 3-9 |
Add variables |
|
Minimal parser with agreement |
Week 3
June 10-16 |
Test with eng->spa |
|
Simple eng->spa parser |
Week 4
June 17-23 |
Continue parser |
|
Majority of initial specifications implemented |
evaluation 1 | Basic parser done | Parser-generator compliant with majority of initial specifications and rudimentary eng->spa instantiation | |
Week 5
June 24-30 |
Finish parser and continue eng->spa |
|
Fully implemented parser and working eng->spa for simple sentences |
Week 6
July 1-7 |
Finish eng->spa and write reverser |
|
System comparison and rule-reverser |
Week 7
July 8-14 |
Evaluation and testing |
|
Test suite and report on the general effectiveness of direct rule-reversal |
Week 8
July 15-21 |
Optimization and interface |
|
Command-line interfaces and updated system comparison |
evaluation 2 | Complete program | Optimized and polished parser-generator compliant with initial specifications, and complete end->spa transfer rules | |
Week 9
July 22-28 |
Do spa->eng |
|
Working spa->eng rules and report on the usefulness of rule-reverser |
Week 10
July 29-August 4 |
Documentation |
|
Complete documentation of system |
Weeks 11 and 12
August 5-18 |
Buffer zone |
These weeks will be used for one of the following, depending on preceding weeks and discussions with mentors:
|
TBD |
final evaluation | Project done | Complete, fully documented system with full ruleset for at least one language pair |
Community Bonding
Todo list
Determine exact semantics of lexical unit tag-matchingAre they ordered?Are they consecutive?
- See if anyone has input on formalism syntax in general
- Mechanism for clitic-insertion
- e.g. V2, Wackernagel
- Read about GLR parser algorithms
- Find reading materials
- Is there anything that can be done to make this finite-state? (probably not)
- Should we just start with the naive implementation (what the Python script does) as a baseline?
Conjoined lexical units - just treat as consecutive elements with no blank between?Syntax for mapping between sets of tags (e.g. <o3pl> -> <p3><pl>, <o3sg> -> <p3><sg>)Conditional output (e.g. modal verbs in English)Make sure all syntax is written down- Begin writing tests
- Some way to match absence of a tag
May 25: LU tags are unordered (basically, every tag operation has the same semantics as <clip>, <equal><clip>..., or <let><clip>... in the chunker). Various other things have syntax but that syntax may not be properly documented yet.
Week 1
May 27: As it turns out, I should have accounted for building a compiler/parser-generator in the workplan. We can currently mostly parse the file. Todo for tomorrow: finish parsing reduction rules and add some more error messages.
May 28: Can now parse a basic test file and we're maybe 2/3 of the way through generating the LR parsing table (multiple output nodes may be more complicated than I thought).
May 29: Idea: in a rule like
clitic NP -> @adj @cli @n {2} {1 _1 _2 3};
The parser just has to treat NP
as both terminal and non-terminal, and when it applies this rule it can then reduce clitic
and pretend that NP
was the next token in the stream.
In discussion with mentors it was decided that we should try using Bison instead of building a new parser. The Bison parsers now compile, but they don't seem to succeed at outputting anything. Further analysis required.
May 31: The bytecode interpreter for interchunk seems to be working.
Things that now exist:
- Python brute-force interpreter for formalism (works)
- C++ parser and parse-table-generator for formalism (partially tested)
- Python formalism->Yacc converter (might be close to working)
- Python t2x->bytecode compiler (works)
- C++ bytecode interpreter for interchunk (works for toy examples, not fully tested)
I could make the last one recursive in maybe half an hour.
I tested the bytecode on eng->spa (because what else would one use for a stress test :) ). It appears to work, but I didn't look very closely at the output because it gets stuck in an infinite loop if the input has a non-ASCII character in it.
Week 2
June 3: There are several inputs + rulesets for which the bytecode works just fine. There are a few, however, which give a segfault in the output code. The bytecode is now documented: User:Popcorndude/Recursive_Transfer/Bytecode
June 4: The interpreter can now handle eng->spa with a 67k word corpus. The error rate seems to be around 5% compared with apertium-interchunk (though I've had some difficulty counting). A lot of the errors seem to the be the interpreter spitting out newlines for some reason.
Notes for tomorrow: if the chunk content is empty it will get treated like a blank and throw off rule application - maybe keep {} as part of chunk data? Also, remove <call-macro n="f_bcond"> from rule 56 and see how much that changes things.
June 5: The only remaining difference in output between apertium-interchunk and the bytecode is that interchunk deduplicates blanks. We had a meeting and decided that the next step was to rewrite eng->spa in the formalism, make a compiler for the formalism, and then compare that to chunker/interchunk/postchunk. Initial rewriting is 1/4-1/3 done and the parser is nearly ready to begin actually compiling.
June 6-7: rtx-comp2 will now turn the basic formalism into both bytecode and the pattern files that apertium-preprocess-transfer generates. More work has been done on eng->spa, but we don't have agreement yet, so on average we're about on track.
Week 3
Todo List
- The formalism is in flux again
Clipping?- Variables???
- Bytecode is also in flux
- Data types
Manner of chunk construction
- Bytecode is incomplete
rejectrule- interpolation???
make a separate file that lists all the codes so the compiler isn't relying on literals
- eng->spa
- t1x - about
120109 rules to go - t2x - 61 rules
- t3x - 37 rules
- t1x - about
- Compiler
Agreement (rejectrule or change transducer?)Conditionals- Case (capitalization)
- Compiler for t?x issues
- Name conflicts
- t2x doesn't generate new chunks, so rules might conflict with t3x
- Maybe change the rule applier so it applies as many rules as possible before getting more input?
June 10: worked on bytecode a bit, cleaned up string construction in compiler, added added verb rules to eng-spa.rtx but haven't checked how well they align with the t1x versions. The compiler can now handle most of eng-spa.rtx with the exception of conditionals and the character #
, which apparently has to be escaped in regexs.
June 11: with all the conditionals commented out, eng-spa.rtx actually works now.
June 12: Conditionals work now. Some thought should probably be given to what are valid ways to write the operators and how many there should be (there are currently 46 ways to write 17 operators but NOT can only be ~). This leaves for tomorrow implicit agreement in the pattern, multiple output chunks, and maybe weights.
June 13: Given the importance of output patterns, variable lists at the beginnings of rules have now been made optional. Weights work. They can also be left off now. Documentation has been updated. I'm thinking implicit agreement might actually be undesirable (further discussion needed).
June 14: Changed the outer loop of transfer and now the whole program is about 90x faster. Worked on eng->spa a bit, got kind of stuck. Started writing tests. The second one written revealed bugs related to default attributes (probably why I was stuck on eng->spa).
Week 4
June 17: fixed some default attribute bugs, worked on eng-spa, added overrides for output patterns, added constant attributes and chunk/lu distinctions to the parser (will add to compiler tomorrow), started redoing compileClip() to hopefully generate less bytecode most of the time.
June 18: Significantly restructured parts of the compiler. It now handles multiple output chunks and most of the names are a bit closer to what they're actually doing now. We're up to 25 basic tests, which all pass, but running the eng-spa ruleset gives some confusing failures.
June 19-21: Multipass parsing is inefficient - time to try LR. Unfortunately, Lttoolbox/match_exe.h won't really work for tracking multiple states, so some rewriting of it was necessary. The initial attempt relied far too heavily on std::map and std::set, resulting in runtimes around 10s (multipass was running around 3s for the test corpus as was chunker/interchunk).
Week 5
June 24: Started over on the match_exe rewrite, making it more like the original. This brought runtimes back down to 2.5s. Further optimization got down to 2.1-2.2s.
June 25: Unfortunately, the implementation of LR occasionally screws up if there's a shift-reduce conflict and the shift later turns out to be a dead end, so we're switching to GLR. The switch has had the interesting effect of breaking some but not all of the tests (previous changes have generally either broken either all the tests or just the one that deals with weights).
June 26: Found the bug, wrote some documentation and a test, found some more bugs. Hopefully tomorrow we can find out whether the difference in output from multipass is just due to the nature of the rules (several were modified to account for how multipass worked).
June 27: Memory leaks. Also cleaning up the src/ directory.
June 28: Output-time rules and pool allocators.
Week 6
July 1: Tried a few optimizations that didn't help very much. Added docstrings to pretty much everything in the compiler header.
July 2: Implemented most of an XML compiler. Added LUCOUNT to the bytecode.
July 3: More compiler stuff.
July 5: Compiled eng-spa and started fixing variations with chunker/interchunk. Also fixed bugs that popped up due to rtx-comp being used by someone other than me.
Week 7
July 8: The XML compiler hit some roadblocks with rules that build literal chunks. The other compiler, meanwhile, now handles rewrite rules and output conditionals that refer to chunk tags.
July 9: Worked on filtering the parse forest which led to some nice speedups.
July 10: Thoroughly redid output conditionals.
July 11: Tag order + output conditionals = macros, I guess.
July 12: Started over on converting eng-spa rules and fixed a few bugs.
Week 8
July 15: Made a workaround for completed branches getting discarded, fixed some bugs, and merged the output binary files.
July 16: Assorted VM optimizations and added lookahead.