Apertium-recursive/Parser

From Apertium
Jump to navigation Jump to search

A full walkthrough of parsing an output sentence can be found at Apertium-recursive/Example.

Pseudocode

def checkForReduce(branch):
  result = []
  if it would be possible to shift another token to this branch:
    result.append(copy(branch))
  rule = get best rule from current transducer state (longest pattern, highest weight)
  while rule != -1:
    take the appropriate number of chunks off branch and pass them to the virtual machine
    if the rule is rejected:
      get the next best rule
    else:
      add the output to the output to the branch
      update the transducer state
      call checkForReduce()
      append updated branch(es) to result
      break
  if result is empty:
    result.append(branch)
  else if the next non-blank in the input stream matches a lookahead path from the current transducer state:
    result.append(branch)
  return result

parseGraph = the parse stack (actually a directed graph in this case)
next = readToken()
while true:
  add next to each branch of parseGraph, calling checkForReduce()
  for each branch in parseGraph:
    if the number of live transducer states is 0:
      discard branch
    if another branch ends with a parse tree covering the same span of input and has a higher weight:
      discard branch
  if parseGraph is empty:
    output everything in the branch that had the fewest nodes
    if there is a tie, choose the branch with the highest weight
    if there is still a tie, choose the one that happens to be listed first
  next = readToken()
  if the input stream is now empty:
    output next (it's a blank)
    end

Notes

The premise of this algorithm is that the information contained in an LR parse table is also contained in the transducers generated by apertium-preprocess-transfer.

Since we allow incomplete parses, ACCEPT and ERROR are equivalent and both are represented by the transducer having no live states.

SHIFT is represented by an outgoing connection from the current state, and REDUCE is represented by the current node being final.

Since there are rules which generate different output in different situations as well as rules which generate multiple outputs, the simplest way do deal with them is to take the output of a reduction and act as if it were normal input. The result is that pattern matching is linear in the number of nodes in the tree, rather than being linear in just the leaves, but this increase appears to be cancelled out by the decrease in the amount of checking and rejecting that would have to be done in the virtual machine otherwise.

This algorithm currently branches on shift-reduce conflicts but not reduce-reduce conflicts. The reason for this is that it seemed to me like we had enough information to decide reduce-reduce conflicts by applying LRLM and it also limits the number of times we need to fork the stack, which helps with speed. If it is desired, I can change this, however.

Since reductions treat rule outputs as if they were inputs, I have allowed for the possibility of rules which output multiple nodes. The effect of such rules will probably in generally be to reorder the input without reducing, which will be useful for handling some clitics.