User:Popcorndude/Recursive Transfer/Progress

From Apertium
Jump to navigation Jump to search

Work Plan (from proposal)[edit]

Time Period Goal Details Deliverable
Community Bonding Period

May 6-26

Finalize formalism
  • Read up on GLR parsers
  • Decide variable semantics and syntax
  • See if there's a good way to handle interpolation (e.g. inserting clitics after first word of phrase)
Full description of planned formalism
Week 1

May 27-June 2

Begin parser
  • Get input
  • Match and build trees based on literal tags and attribute categories
Minimal parser
Week 2

June 3-9

Add variables
  • Agreement
  • Passing variables up the tree
  • Setting variables for child nodes
Minimal parser with agreement
Week 3

June 10-16

Test with eng->spa
  • Noun phrases (this was started in the coding challenge)
  • Basic verb phrases (some agreement, if time)
Simple eng->spa parser
Week 4

June 17-23

Continue parser
  • Weights
  • Conditionals
  • Multiple output nodes
  • Anything else deemed necessary during Community Bonding or testing
Majority of initial specifications implemented
evaluation 1 Basic parser done Parser-generator compliant with majority of initial specifications and rudimentary eng->spa instantiation
Week 5

June 24-30

Finish parser and continue eng->spa
  • Finish anything left over from week 4
  • Finish verb phrases
Fully implemented parser and working eng->spa for simple sentences
Week 6

July 1-7

Finish eng->spa and write reverser
  • Convert any remaining eng->spa rules
  • Evaluate parser against chunking system
    • Metrics: accuracy, speed of parser, compilation speed
  • Write script to automatically reverse a ruleset
    • All features currently described are at least in princible reversible
System comparison and rule-reverser
Week 7

July 8-14

Evaluation and testing
  • Evaluate the output of the reverser against current spa->eng system
  • Write tests for all features
  • Begin adding error messages
Test suite and report on the general effectiveness of direct rule-reversal
Week 8

July 15-21

Optimization and interface
  • Speed up the parser and compiler where possible
  • Build interfaces for compiler, parser, and reverser
  • Clean up code
  • Re-evaluate speed
Command-line interfaces and updated system comparison
evaluation 2 Complete program Optimized and polished parser-generator compliant with initial specifications, and complete end->spa transfer rules
Week 9

July 22-28

Do spa->eng
  • Identify differences between generated spa->eng and chunking spa->eng
  • Fix generated spa->eng rules
  • Report on effort required to correct reverser
Working spa->eng rules and report on the usefulness of rule-reverser
Week 10

July 29-August 4

Documentation Complete documentation of system
Weeks 11 and 12

August 5-18

Buffer zone

These weeks will be used for one of the following, depending on preceding weeks and discussions with mentors:

  • Make up for delays in prior weeks
  • Converting another language pair
  • Experimenting with automated conversion of chunking rules
  • Writing a ruleset composer for generating a preliminary ruleset from two other pairs (e.g. combine eng->spa and spa->cat to get approximate rules for eng->cat)
TBD
final evaluation Project done Complete, fully documented system with full ruleset for at least one language pair

Community Bonding[edit]

Todo list[edit]

  • Determine exact semantics of lexical unit tag-matching
    • Are they ordered?
    • Are they consecutive?
  • See if anyone has input on formalism syntax in general
  • Mechanism for clitic-insertion
    • e.g. V2, Wackernagel
  • Read about GLR parser algorithms
    • Find reading materials
    • Is there anything that can be done to make this finite-state? (probably not)
    • Should we just start with the naive implementation (what the Python script does) as a baseline?
  • Conjoined lexical units - just treat as consecutive elements with no blank between?
  • Syntax for mapping between sets of tags (e.g. <o3pl> -> <p3><pl>, <o3sg> -> <p3><sg>)
  • Conditional output (e.g. modal verbs in English)
  • Make sure all syntax is written down
  • Begin writing tests
  • Some way to match absence of a tag

May 25: LU tags are unordered (basically, every tag operation has the same semantics as <clip>, <equal><clip>..., or <let><clip>... in the chunker). Various other things have syntax but that syntax may not be properly documented yet.

Week 1[edit]

May 27: As it turns out, I should have accounted for building a compiler/parser-generator in the workplan. We can currently mostly parse the file. Todo for tomorrow: finish parsing reduction rules and add some more error messages.

May 28: Can now parse a basic test file and we're maybe 2/3 of the way through generating the LR parsing table (multiple output nodes may be more complicated than I thought).

May 29: Idea: in a rule like

clitic NP -> @adj @cli @n {2} {1 _1 _2 3};

The parser just has to treat NP as both terminal and non-terminal, and when it applies this rule it can then reduce clitic and pretend that NP was the next token in the stream.

In discussion with mentors it was decided that we should try using Bison instead of building a new parser. The Bison parsers now compile, but they don't seem to succeed at outputting anything. Further analysis required.

May 31: The bytecode interpreter for interchunk seems to be working.

Things that now exist:

  • Python brute-force interpreter for formalism (works)
  • C++ parser and parse-table-generator for formalism (partially tested)
  • Python formalism->Yacc converter (might be close to working)
  • Python t2x->bytecode compiler (works)
  • C++ bytecode interpreter for interchunk (works for toy examples, not fully tested)

I could make the last one recursive in maybe half an hour.

I tested the bytecode on eng->spa (because what else would one use for a stress test :) ). It appears to work, but I didn't look very closely at the output because it gets stuck in an infinite loop if the input has a non-ASCII character in it.

Week 2[edit]

June 3: There are several inputs + rulesets for which the bytecode works just fine. There are a few, however, which give a segfault in the output code. The bytecode is now documented: User:Popcorndude/Recursive_Transfer/Bytecode

June 4: The interpreter can now handle eng->spa with a 67k word corpus. The error rate seems to be around 5% compared with apertium-interchunk (though I've had some difficulty counting). A lot of the errors seem to the be the interpreter spitting out newlines for some reason.

Notes for tomorrow: if the chunk content is empty it will get treated like a blank and throw off rule application - maybe keep {} as part of chunk data? Also, remove <call-macro n="f_bcond"> from rule 56 and see how much that changes things.

June 5: The only remaining difference in output between apertium-interchunk and the bytecode is that interchunk deduplicates blanks. We had a meeting and decided that the next step was to rewrite eng->spa in the formalism, make a compiler for the formalism, and then compare that to chunker/interchunk/postchunk. Initial rewriting is 1/4-1/3 done and the parser is nearly ready to begin actually compiling.

June 6-7: rtx-comp2 will now turn the basic formalism into both bytecode and the pattern files that apertium-preprocess-transfer generates. More work has been done on eng->spa, but we don't have agreement yet, so on average we're about on track.

Week 3[edit]

Todo List[edit]

  • The formalism is in flux again
    • Clipping?
    • Variables???
  • Bytecode is also in flux
    • Data types
    • Manner of chunk construction
  • Bytecode is incomplete
    • rejectrule
    • interpolation???
    • make a separate file that lists all the codes so the compiler isn't relying on literals
  • eng->spa
    • t1x - about 120 109 rules to go
    • t2x - 61 rules
    • t3x - 37 rules
  • Compiler
    • Agreement (rejectrule or change transducer?)
    • Conditionals
    • Case (capitalization)
  • Compiler for t?x issues
    • Name conflicts
    • t2x doesn't generate new chunks, so rules might conflict with t3x
  • Maybe change the rule applier so it applies as many rules as possible before getting more input?

June 10: worked on bytecode a bit, cleaned up string construction in compiler, added added verb rules to eng-spa.rtx but haven't checked how well they align with the t1x versions. The compiler can now handle most of eng-spa.rtx with the exception of conditionals and the character #, which apparently has to be escaped in regexs.

June 11: with all the conditionals commented out, eng-spa.rtx actually works now.

June 12: Conditionals work now. Some thought should probably be given to what are valid ways to write the operators and how many there should be (there are currently 46 ways to write 17 operators but NOT can only be ~). This leaves for tomorrow implicit agreement in the pattern, multiple output chunks, and maybe weights.

June 13: Given the importance of output patterns, variable lists at the beginnings of rules have now been made optional. Weights work. They can also be left off now. Documentation has been updated. I'm thinking implicit agreement might actually be undesirable (further discussion needed).

June 14: Changed the outer loop of transfer and now the whole program is about 90x faster. Worked on eng->spa a bit, got kind of stuck. Started writing tests. The second one written revealed bugs related to default attributes (probably why I was stuck on eng->spa).

Week 4[edit]

June 17: fixed some default attribute bugs, worked on eng-spa, added overrides for output patterns, added constant attributes and chunk/lu distinctions to the parser (will add to compiler tomorrow), started redoing compileClip() to hopefully generate less bytecode most of the time.

June 18: Significantly restructured parts of the compiler. It now handles multiple output chunks and most of the names are a bit closer to what they're actually doing now. We're up to 25 basic tests, which all pass, but running the eng-spa ruleset gives some confusing failures.

June 19-21: Multipass parsing is inefficient - time to try LR. Unfortunately, Lttoolbox/match_exe.h won't really work for tracking multiple states, so some rewriting of it was necessary. The initial attempt relied far too heavily on std::map and std::set, resulting in runtimes around 10s (multipass was running around 3s for the test corpus as was chunker/interchunk).

Week 5[edit]

June 24: Started over on the match_exe rewrite, making it more like the original. This brought runtimes back down to 2.5s. Further optimization got down to 2.1-2.2s.

June 25: Unfortunately, the implementation of LR occasionally screws up if there's a shift-reduce conflict and the shift later turns out to be a dead end, so we're switching to GLR. The switch has had the interesting effect of breaking some but not all of the tests (previous changes have generally either broken either all the tests or just the one that deals with weights).

June 26: Found the bug, wrote some documentation and a test, found some more bugs. Hopefully tomorrow we can find out whether the difference in output from multipass is just due to the nature of the rules (several were modified to account for how multipass worked).

June 27: Memory leaks. Also cleaning up the src/ directory.

June 28: Output-time rules and pool allocators.

Week 6[edit]

July 1: Tried a few optimizations that didn't help very much. Added docstrings to pretty much everything in the compiler header.

July 2: Implemented most of an XML compiler. Added LUCOUNT to the bytecode.

July 3: More compiler stuff.

July 5: Compiled eng-spa and started fixing variations with chunker/interchunk. Also fixed bugs that popped up due to rtx-comp being used by someone other than me.

Week 7[edit]

July 8: The XML compiler hit some roadblocks with rules that build literal chunks. The other compiler, meanwhile, now handles rewrite rules and output conditionals that refer to chunk tags.

July 9: Worked on filtering the parse forest which led to some nice speedups.

July 10: Thoroughly redid output conditionals.

July 11: Tag order + output conditionals = macros, I guess.

July 12: Started over on converting eng-spa rules and fixed a few bugs.

Week 8[edit]

July 15: Made a workaround for completed branches getting discarded, fixed some bugs, and merged the output binary files.

July 16: Assorted VM optimizations and added lookahead.

July 17: Tree output.

July 18: More tree output.

July 19: Writing out a full example parse and trying to improve conditional parsing.

Week 9[edit]

July 22 & 23: Working on debug options.

July 24: Evaluate t*x compilation.

July 25: Optimize regexps a bit and make random sentence generator.

July 26: Make a random path selector for transducers so the random sentence generator can output tags.

Week 10[edit]

July 29: Assorted error checking and some work on kir-eng.

July 30: Implemented lexicalization in rtx-comp.

July 31: Vim syntax file.

August 1: Lexicalization and lookahead for trx-comp, start on header file comments.

August 2: Finish header files, add DTD.

Week 11[edit]

August 5: Even more output for -e, add conditionals for clips, and work on kir->eng a bit.

August 6: Change syntax a bit to be slightly more consistent. Track down minor compiler bugs.

August 7: Bug-chasing and some work on kir-eng.

August 8: Capitalization and interpolation.

Week 12[edit]

August 12: Work on paper and agree on semantics for global variables.

August 13: Compiler refactoring and global variables.