Apertium-recursive/Parser

From Apertium
< Apertium-recursive
Revision as of 20:01, 26 June 2019 by Popcorndude (talk | contribs) (Describe the algorithm)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Pseudocode

def checkForReduce(branch):
  result = []
  if it would be possible to shift another token to this branch:
    result.append(copy(branch))
  rule = get best rule from current transducer state (longest pattern, highest weight)
  while rule != -1:
    take the appropriate number of chunks off branch and pass them to the virtual machine
    if the rule is rejected:
      get the next best rule
    else:
      add the output to the output to the branch
      update the transducer state
      call checkForReduce()
      append updated branch(es) to result
      break
  if result is empty:
    result.append(branch)
  return result

parseGraph = the parse stack (actually a directed graph in this case)
next = readToken()
while true:
  add next to each branch of parseGraph, calling checkForReduce()
  for each branch in parseGraph:
    if the number of transducer states is 0:
      discard branch
  if parseGraph is empty:
    output everything in the branch that had the fewest nodes
    if there is a tie, choose the branch with the highest weight
    if there is still a tie, choose the one that happens to be listed first
  next = readToken()
  if the input stream is now empty:
    output next (it's a blank)
    end

Notes

The premise of this algorithm is that the information contained in an LR parse table is also contained in the transducers generated by apertium-preprocess-transfer.

Since we allow incomplete parses, ACCEPT and ERROR are equivalent and both are represented by the transducer having no live states.

SHIFT is represented by an outgoing connection from the current state, and REDUCE is represented by the current node being final.

Since there are rules which generate different output in different situations as well as rules which generate multiple outputs, the simplest way do deal with them is to take the output of a reduction and act as if it were normal input. The result is that pattern matching is linear in the number of nodes in the tree, rather than being linear in just the leaves, but this increase appears to be cancelled out by the decrease in the amount of checking and rejecting that would have to be done in the virtual machine otherwise.

This algorithm currently branches on shift-reduce conflicts but not reduce-reduce conflicts. The reason for this is that it seemed to me like we had enough information to decide reduce-reduce conflicts by applying LRLM and it also limits the number of times we need to fork the stack, which helps with speed. If it is desired, I can change this, however.

The nice thing about checking for shift-reduce conflicts, is that if there is a possible reduction, the token on the top of the stack is definitely an LU or a chunk and if there is a conflicting shift, then the token to be shifted is definitely a blank, so we can check whether any shifts are possible by stepping the transducer forward by a space and checking whether there are any live states.