User:Popcorndude/Recursive Transfer/Progress

Work Plan (from proposal)

Time Period	Goal	Details	Deliverable
Community Bonding Period May 6-26	Finalize formalism	Read up on GLR parsers Decide variable semantics and syntax See if there's a good way to handle interpolation (e.g. inserting clitics after first word of phrase)	Full description of planned formalism
Week 1 May 27-June 2	Begin parser	Get input Match and build trees based on literal tags and attribute categories	Minimal parser
Week 2 June 3-9	Add variables	Agreement Passing variables up the tree Setting variables for child nodes	Minimal parser with agreement
Week 3 June 10-16	Test with eng->spa	Noun phrases (this was started in the coding challenge) Basic verb phrases (some agreement, if time)	Simple eng->spa parser
Week 4 June 17-23	Continue parser	Weights Conditionals Multiple output nodes Anything else deemed necessary during Community Bonding or testing	Majority of initial specifications implemented
evaluation 1	Basic parser done		Parser-generator compliant with majority of initial specifications and rudimentary eng->spa instantiation
Week 5 June 24-30	Finish parser and continue eng->spa	Finish anything left over from week 4 Finish verb phrases	Fully implemented parser and working eng->spa for simple sentences
Week 6 July 1-7	Finish eng->spa and write reverser	Convert any remaining eng->spa rules Evaluate parser against chunking system Metrics: accuracy, speed of parser, compilation speed Write script to automatically reverse a ruleset All features currently described are at least in princible reversible	System comparison and rule-reverser
Week 7 July 8-14	Evaluation and testing	Evaluate the output of the reverser against current spa->eng system Write tests for all features Begin adding error messages	Test suite and report on the general effectiveness of direct rule-reversal
Week 8 July 15-21	Optimization and interface	Speed up the parser and compiler where possible Build interfaces for compiler, parser, and reverser Clean up code Re-evaluate speed	Command-line interfaces and updated system comparison
evaluation 2	Complete program		Optimized and polished parser-generator compliant with initial specifications, and complete end->spa transfer rules
Week 9 July 22-28	Do spa->eng	Identify differences between generated spa->eng and chunking spa->eng Fix generated spa->eng rules Report on effort required to correct reverser	Working spa->eng rules and report on the usefulness of rule-reverser
Week 10 July 29-August 4	Documentation	Convert initial specifications to full documentation Write tutorial Write recipe book containing at least minimal examples of everything listed at User_talk:Popcorndude/Recursive_Transfer#Linguistic.2Ftransfer_phenomena	Complete documentation of system
Weeks 11 and 12 August 5-18	Buffer zone	These weeks will be used for one of the following, depending on preceding weeks and discussions with mentors: Make up for delays in prior weeks Converting another language pair Experimenting with automated conversion of chunking rules Writing a ruleset composer for generating a preliminary ruleset from two other pairs (e.g. combine eng->spa and spa->cat to get approximate rules for eng->cat)	TBD
final evaluation	Project done		Complete, fully documented system with full ruleset for at least one language pair

Community Bonding

Todo list

~~Determine exact semantics of lexical unit tag-matching~~
- ~~Are they ordered?~~
- ~~Are they consecutive?~~
See if anyone has input on formalism syntax in general
Mechanism for clitic-insertion
- e.g. V2, Wackernagel
Read about GLR parser algorithms
- Find reading materials
- Is there anything that can be done to make this finite-state? (probably not)
- Should we just start with the naive implementation (what the Python script does) as a baseline?
~~Conjoined lexical units - just treat as consecutive elements with no blank between?~~
~~Syntax for mapping between sets of tags (e.g. <o3pl> -> <p3><pl>, <o3sg> -> <p3><sg>)~~
~~Conditional output (e.g. modal verbs in English)~~
~~Make sure all syntax is written down~~
Begin writing tests
Some way to match absence of a tag

May 25: LU tags are unordered (basically, every tag operation has the same semantics as <clip>, <equal><clip>..., or <let><clip>... in the chunker). Various other things have syntax but that syntax may not be properly documented yet.

Week 1

May 27: As it turns out, I should have accounted for building a compiler/parser-generator in the workplan. We can currently mostly parse the file. Todo for tomorrow: finish parsing reduction rules and add some more error messages.

May 28: Can now parse a basic test file and we're maybe 2/3 of the way through generating the LR parsing table (multiple output nodes may be more complicated than I thought).

May 29: Idea: in a rule like

clitic NP -> @adj @cli @n {2} {1 _1 _2 3};

The parser just has to treat NP as both terminal and non-terminal, and when it applies this rule it can then reduce clitic and pretend that NP was the next token in the stream.

In discussion with mentors it was decided that we should try using Bison instead of building a new parser. The Bison parsers now compile, but they don't seem to succeed at outputting anything. Further analysis required.

May 31: The bytecode interpreter for interchunk seems to be working.

Things that now exist:

Python brute-force interpreter for formalism (works)
C++ parser and parse-table-generator for formalism (partially tested)
Python formalism->Yacc converter (might be close to working)
Python t2x->bytecode compiler (works)
C++ bytecode interpreter for interchunk (works for toy examples, not fully tested)

I could make the last one recursive in maybe half an hour.

I tested the bytecode on eng->spa (because what else would one use for a stress test :) ). It appears to work, but I didn't look very closely at the output because it gets stuck in an infinite loop if the input has a non-ASCII character in it.

Week 2

June 3: There are several inputs + rulesets for which the bytecode works just fine. There are a few, however, which give a segfault in the output code. The bytecode is now documented: User:Popcorndude/Recursive_Transfer/Bytecode

June 4: The interpreter can now handle eng->spa with a 67k word corpus. The error rate seems to be around 5% compared with apertium-interchunk (though I've had some difficulty counting). A lot of the errors seem to the be the interpreter spitting out newlines for some reason.

Notes for tomorrow: if the chunk content is empty it will get treated like a blank and throw off rule application - maybe keep {} as part of chunk data? Also, remove <call-macro n="f_bcond"> from rule 56 and see how much that changes things.

June 5: The only remaining difference in output between apertium-interchunk and the bytecode is that interchunk deduplicates blanks. We had a meeting and decided that the next step was to rewrite eng->spa in the formalism, make a compiler for the formalism, and then compare that to chunker/interchunk/postchunk. Initial rewriting is 1/4-1/3 done and the parser is nearly ready to begin actually compiling.

June 6-7: rtx-comp2 will now turn the basic formalism into both bytecode and the pattern files that apertium-preprocess-transfer generates. More work has been done on eng->spa, but we don't have agreement yet, so on average we're about on track.

Week 3

Todo List

The formalism is in flux again
- ~~Clipping?~~
- Variables???
Bytecode is also in flux
- Data types
- ~~Manner of chunk construction~~
Bytecode is incomplete
- ~~rejectrule~~
- interpolation???
- ~~make a separate file that lists all the codes so the compiler isn't relying on literals~~
eng->spa
- t1x - about ~~120~~ 109 rules to go
- t2x - 61 rules
- t3x - 37 rules
Compiler
- Agreement (rejectrule or change transducer?)
- Conditionals
Compiler for t?x issues
- Name conflicts
- t2x doesn't generate new chunks, so rules might conflict with t3x
Maybe change the rule applier so it applies as many rules as possible before getting more input?

June 10: worked on bytecode a bit, cleaned up string construction in compiler, added added verb rules to eng-spa.rtx but haven't checked how well they align with the t1x versions. The compiler can now handle most of eng-spa.rtx with the exception of conditionals and the character #, which apparently has to be escaped in regexs.

June 11: with all the conditionals commented out, eng-spa.rtx actually works now.

June 12: Conditionals work now. Some thought should probably be given to what are valid ways to write the operators and how many there should be (there are currently 46 ways to write 17 operators but NOT can only be ~). This leaves for tomorrow implicit agreement in the pattern, multiple output chunks, and maybe weights.

User:Popcorndude/Recursive Transfer/Progress

Contents

Work Plan (from proposal)

Community Bonding

Todo list

Week 1

Week 2

Week 3

Todo List

Week 4

Week 5

Week 6

Week 7

Week 8

Week 9

Week 10

Week 11

Week 12

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools