Apertium-recursive/Bytecode

The first two characters of the file give, in order, the length of the longest pattern and the number of rules.
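As a rough illustration of reading that header, here is a minimal Python sketch. It assumes that each of the two values is stored as the code point of a single character; the page does not spell out the integer encoding, so treat that as a guess. The name read_header is hypothetical.

```python
from typing import Tuple


def read_header(bytecode: str) -> Tuple[int, int]:
    """Return (longest_pattern, rule_count) from the first two characters."""
    if len(bytecode) < 2:
        raise ValueError("bytecode is too short to contain a header")
    # Assumption: each value is the code point of one character.
    longest_pattern = ord(bytecode[0])
    rule_count = ord(bytecode[1])
    return longest_pattern, rule_count
```

For example, under that assumption a file beginning with chr(3) + chr(2) would describe two rules whose longest pattern has three elements.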

Code      Name              Action
R [int]   rule              marks the start of a new rule composed of the next [int] characters
s [int]   string            pushes the next [int] characters onto the stack as a literal string
j [int]   jump              increments the instruction pointer by [int]
? [int]   jump if not       pops a bool off the stack and increments the instruction pointer by [int] if it is false
& [int]   and               pops [int] bools off the stack and pushes whether all of them are true
[int]     or                pops [int] bools off the stack and pushes whether any of them are true
!         not               logically negates the top of the stack
= / =#    equal             pushes whether the first two strings popped are the same (=# ignores case)
( / (#    begins with       pushes whether the first string popped occurs at the beginning of the second ((# ignores case)
) / )#    ends with         pushes whether the first string popped occurs at the end of the second ()# ignores case)
[ / [#    begins with list  pushes whether the second string popped begins with any member of the list named by the first string popped ([# ignores case)
] / ]#    ends with list    pushes whether the second string popped ends with any member of the list named by the first string popped (]# ignores case)
c / c#    contains          pushes whether the first string popped appears anywhere in the second (c# ignores case)
n / n#    in                pushes whether the second string popped is a member of the list named by the first (n# ignores case)
>         begin let         indicates that the next clip or var statement should not be evaluated
* / *#    end let clip      pops a value and an unevaluated clip and sets the clip to the value (*# copies the case of the value to the clip)
4 / 4#    end let var       pops a value and a variable name and sets the variable to the value (4# copies the case of the value to the variable)
< [int]   out               pops [int] chunks off the stack and appends them to the output queue in the order that they were pushed (in recursive mode, the output queue is later passed back to the rule applier)
. [int]   clip              if preceded by >, pushes [int] onto the stack, otherwise pops a string off the stack and retrieves that property of the position indicated by [int]
$         var               if preceded by >, does nothing, otherwise pops a string off the stack and pushes the value of the variable with that name
G         get case          pops a string off the stack and pushes "AA", "Aa", or "aa" depending on its case
A         copy case         pops a string off the stack and copies its case onto the next string on the stack
+ [int]   concat            pops [int] strings off the stack, concatenates them, and pushes the result
{ [int]   chunk             pops [int] items off the stack and puts them into a chunk (there are currently problems with this command)
p         pseudolemma       pops a chunk off the stack and pushes its pseudolemma
(space)   space             pushes a blank containing a single space onto the stack
_ [int]   blank             pushes the superblank after position [int] onto the stack
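To make the stack behaviour concrete, here is a small Python sketch of an interpreter for a handful of the opcodes above (s, j, ?, &, !, = and +). It is not the apertium-recursive implementation. It assumes that an [int] operand is the code point of the single character after the opcode, that jump offsets count from the character after the operand, and that concat joins strings in the order they were pushed. The names run and s are hypothetical.

```python
def s(text: str) -> str:
    """Hypothetical helper: encode a 'string' instruction for `text`."""
    return "s" + chr(len(text)) + text


def run(code: str) -> list:
    """Interpret a small subset of the opcodes and return the final stack."""
    stack = []
    ip = 0  # instruction pointer
    while ip < len(code):
        op = code[ip]
        ip += 1
        n = 0
        if op in "sj?&+":
            # Assumption: the operand is the code point of the next character.
            n = ord(code[ip])
            ip += 1
        if op == "s":    # string: push the next n characters
            stack.append(code[ip:ip + n])
            ip += n
        elif op == "j":  # jump: skip n characters
            ip += n
        elif op == "?":  # jump if not: skip n characters when the bool is false
            if not stack.pop():
                ip += n
        elif op == "&":  # and: pop n bools, push whether all are true
            values = [stack.pop() for _ in range(n)]
            stack.append(all(values))
        elif op == "!":  # not: negate the top of the stack
            stack.append(not stack.pop())
        elif op == "=":  # equal: compare the two strings popped
            first = stack.pop()
            second = stack.pop()
            stack.append(first == second)
        elif op == "+":  # concat: join n strings in pushed order (assumption)
            parts = [stack.pop() for _ in range(n)]
            stack.append("".join(reversed(parts)))
        else:
            raise ValueError("opcode not covered by this sketch: %r" % op)
    return stack


def if_equal(left: str, right: str) -> str:
    """Build a program that pushes "yes" if left == right, else "no"."""
    then_branch = s("yes") + "j" + chr(len(s("no")))
    return s(left) + s(right) + "=" + "?" + chr(len(then_branch)) + then_branch + s("no")
```

Under these assumptions, run(if_equal("ab", "ab")) leaves "yes" on the stack and run(if_equal("ab", "cd")) leaves "no", which shows how ? and j combine into an if/else.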

Features of .t?x that aren't covered yet:

  • reject-current-rule (add skip_rules list as input to interchunk_do_pass)
  • mlu
  • lu-count
  • clip side (also add anaphora as an option)

How it works

There is an object called parseTower, which is an array of arrays (which I call "layers"). When tokens are read from the input stream, they are added to layer 0. longestPattern is the length of the longest pattern of any rule, and MAXLAYERS is an optional user-defined limit on the recursion (currently 1).

def do_pass():
  if any layer contains more tokens than longestPattern, use the highest one
  else if there is more input return and wait for it to be read in
  else use the lowest layer that contains tokens
  
  for the layer chosen, attempt to match as in apertium-interchunk
  if any rules match, apply the longest one
  else move the first token in this layer to the next layer

def interchunk():
  while parseTower and the input stream are not both empty:
    if there is input, read 1 token
    do_pass()
    if the number of layers has reached MAXLAYERS: output everything in the top layer
    if longestPattern tokens have been shifted to the top layer without matching, output the first one
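
The layer-selection step at the top of do_pass, and the "move the first token to the next layer" fallback, could be sketched in Python roughly as follows. This only models parseTower as a list of lists and leaves rule matching out entirely. The function names choose_layer and shift_first_token are hypothetical, and growing the tower on demand is an assumption.

```python
from typing import List, Optional


def choose_layer(parse_tower: List[list], longest_pattern: int,
                 more_input: bool) -> Optional[int]:
    """Return the index of the layer to work on, or None to wait or stop."""
    # Layers holding more tokens than the longest pattern take priority;
    # among those, use the highest one.
    overfull = [i for i, layer in enumerate(parse_tower)
                if len(layer) > longest_pattern]
    if overfull:
        return max(overfull)
    # Otherwise, if input remains, wait for more tokens to be read.
    if more_input:
        return None
    # Otherwise use the lowest layer that still contains tokens.
    nonempty = [i for i, layer in enumerate(parse_tower) if layer]
    return min(nonempty) if nonempty else None


def shift_first_token(parse_tower: List[list], index: int) -> None:
    """Move the first token of layer `index` to the end of layer `index + 1`."""
    if index + 1 == len(parse_tower):
        parse_tower.append([])  # assumption: the tower grows on demand
    parse_tower[index + 1].append(parse_tower[index].pop(0))
```

With longest_pattern = 2, a tower such as [["a", "b", "c"], ["d", "e", "f"]] selects layer 1, while [["a"], ["b"]] selects nothing while input remains and layer 0 once the input is exhausted.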