Apertium-recursive/Parser
A full walkthrough of parsing an output sentence can be found at Apertium-recursive/Example.
Pseudocode[edit]
def checkForReduce(branch): result = [] if it would be possible to shift another token to this branch: result.append(copy(branch)) rule = get best rule from current transducer state (longest pattern, highest weight) while rule != -1: take the appropriate number of chunks off branch and pass them to the virtual machine if the rule is rejected: get the next best rule else: add the output to the output to the branch update the transducer state call checkForReduce() append updated branch(es) to result break if result is empty: result.append(branch) else if the next non-blank in the input stream matches a lookahead path from the current transducer state: result.append(branch) return result parseGraph = the parse stack (actually a directed graph in this case) next = readToken() while true: add next to each branch of parseGraph, calling checkForReduce() for each branch in parseGraph: if the number of live transducer states is 0: discard branch if another branch ends with a parse tree covering the same span of input and has a higher weight: discard branch if parseGraph is empty: output everything in the branch that had the fewest nodes if there is a tie, choose the branch with the highest weight if there is still a tie, choose the one that happens to be listed first next = readToken() if the input stream is now empty: output next (it's a blank) end
Notes[edit]
The premise of this algorithm is that the information contained in an LR parse table is also contained in the transducers generated by apertium-preprocess-transfer
.
Since we allow incomplete parses, ACCEPT
and ERROR
are equivalent and both are represented by the transducer having no live states.
SHIFT
is represented by an outgoing connection from the current state, and REDUCE
is represented by the current node being final.
Since there are rules which generate different output in different situations as well as rules which generate multiple outputs, the simplest way do deal with them is to take the output of a reduction and act as if it were normal input. The result is that pattern matching is linear in the number of nodes in the tree, rather than being linear in just the leaves, but this increase appears to be cancelled out by the decrease in the amount of checking and rejecting that would have to be done in the virtual machine otherwise.
This algorithm currently branches on shift-reduce conflicts but not reduce-reduce conflicts. The reason for this is that it seemed to me like we had enough information to decide reduce-reduce conflicts by applying LRLM and it also limits the number of times we need to fork the stack, which helps with speed. If it is desired, I can change this, however.
Since reductions treat rule outputs as if they were inputs, I have allowed for the possibility of rules which output multiple nodes. The effect of such rules will probably in generally be to reorder the input without reducing, which will be useful for handling some clitics.