Difference between revisions of "Apertium-recursive/Bytecode"

(update table)

Revision as of 19:12, 10 June 2019

The first 2 characters of the file are the length of the longest pattern and the number of rules. Each rule begins with a byte specifying the length of the rule.

An [int] after a name indicates that the instruction is two characters long and that the second character is to be interpreted as an integer.
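The layout described above can be illustrated with a minimal sketch. It assumes that each length is stored as the code point of a single character, which this page does not state outright; <code>split_rules</code> and the example string are hypothetical names made up for the illustration, not part of apertium-recursive.

```python
def split_rules(bytecode: str):
    """Split a bytecode string into (longest_pattern, list_of_rule_bodies).

    Assumption: every length is stored as the code point of one character.
    """
    longest_pattern = ord(bytecode[0])   # first character of the file
    rule_count = ord(bytecode[1])        # second character of the file
    rules = []
    pos = 2
    for _ in range(rule_count):
        length = ord(bytecode[pos])      # length prefix of this rule
        rules.append(bytecode[pos + 1 : pos + 1 + length])
        pos += 1 + length                # skip the prefix and the rule body
    return longest_pattern, rules


# Hypothetical file: longest pattern 3, two rules of 2 and 4 characters.
example = chr(3) + chr(2) + chr(2) + "ab" + chr(4) + "cdef"
longest, rules = split_rules(example)
```

Here <code>longest</code> is 3 and <code>rules</code> holds the two rule bodies without their length prefixes.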

{| class="wikitable" border="1"
|-
! Name
! Action
|-
| drop
| pops the top of the stack
|-
| dup
| pushes a copy of the top element
|-
| over
| pushes a copy of the second element
|-
| swap
| exchanges the first and second elements
|-
| string [int]
| pushes the next [int] characters onto the stack as a literal string
|-
| int [int]
| pushes [int] onto the stack
|-
| pushfalse
| pushes false onto the stack
|-
| pushtrue
| pushes true onto the stack
|-
| jump [int]
| increments the instruction pointer by [int]
|-
| jumpontrue [int]
| pops a bool off the stack and increments the instruction pointer by [int] if it is true
|-
| jumponfalse [int]
| pops a bool off the stack and increments the instruction pointer by [int] if it is false
|-
| and
| pops 2 bools off the stack and pushes whether both of them are true
|-
| or
| pops 2 bools off the stack and pushes whether either of them is true
|-
| not
| logically negates the top of the stack
|-
| equal
| pushes whether the first two strings popped are the same
|-
| isprefix
| pushes whether the first string popped occurs at the beginning of the second
|-
| issuffix
| pushes whether the first string popped occurs at the end of the second
|-
| issubstring
| pushes whether the first string popped appears anywhere in the second
|-
| equalcl
| <code>equal</code>, but ignores case
|-
| isprefixcl
| <code>isprefix</code>, but ignores case
|-
| issuffixcl
| <code>issuffix</code>, but ignores case
|-
| issubstringcl
| <code>issubstring</code>, but ignores case
|-
| hasprefix
| pushes whether the second string popped begins with any member of the list named by the first string popped
|-
| hassuffix
| pushes whether the second string popped ends with any member of the list named by the first string popped
|-
| in
| pushes whether the second string popped is a member of the list named by the first
|-
| hasprefixcl
| <code>hasprefix</code>, but ignores case
|-
| hassuffixcl
| <code>hassuffix</code>, but ignores case
|-
| incl
| <code>in</code>, but ignores case
|-
| getcase
| pushes "aa", "Aa", or "AA", depending on the case of the first string popped
|-
| setcase
| pops two strings, copies the case of the first to the second, and pushes the result
|-
| fetchvar
| pops a string and pushes the value of the variable with that name
|-
| setvar
| pops two strings and sets the second as the value of the variable named by the first
|-
| sourceclip
| pops an int and a string, pushes the value of the source-side clip identified by them
|-
| targetclip
| pops an int and a string, pushes the value of the target-side clip identified by them
|-
| referenceclip
| pops an int and a string, pushes the value of the reference-side clip identified by them
|-
| setclip
| pops an int and two strings, sets the second string as the value of the target-side clip identified by the int and the first string
|-
| chunk
| creates an empty chunk and pushes it
|-
| appendchild
| pops a chunk and appends it as a child to the chunk underneath it (which remains on the stack)
|-
| appendsurface
| pops a string and appends it to the target-side surface chunk underneath it (which remains on the stack)
|-
| appendallchildren
| pops a chunk and appends all of its children as children to the chunk underneath it (which remains on the stack)
|-
| output
| pops a chunk and appends it to the output queue
|-
| blank
| pops an int and pushes the corresponding blank (or a single space if the int is 0)
|-
| concat
| pops two strings, concatenates them, and pushes the result
|-
| rejectrule
| aborts evaluation of the current rule and attempts to match a different one
|}
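To make the stack discipline concrete, here is a rough sketch of an interpreter for a handful of the instructions listed above. It is an illustration under several assumptions, not the apertium-recursive implementation: instructions are given as hypothetical <code>(name, argument)</code> pairs rather than encoded characters, jump offsets therefore count instructions rather than characters, <code>concat</code> is assumed to join the strings in the order they were pushed, and case-insensitive comparison is approximated with <code>str.lower()</code>.

```python
def run(program):
    """Run a list of (name, argument) pairs and return the final stack."""
    stack = []
    ip = 0
    while ip < len(program):
        name, arg = program[ip]
        ip += 1                              # advance past this instruction
        if name == "drop":
            stack.pop()
        elif name == "dup":
            stack.append(stack[-1])
        elif name == "over":
            stack.append(stack[-2])
        elif name == "swap":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif name in ("string", "int"):
            stack.append(arg)                # literal carried by the pair
        elif name == "pushtrue":
            stack.append(True)
        elif name == "pushfalse":
            stack.append(False)
        elif name == "jump":
            ip += arg                        # relative jump (assumption)
        elif name == "jumpontrue":
            if stack.pop():
                ip += arg
        elif name == "jumponfalse":
            if not stack.pop():
                ip += arg
        elif name == "and":
            a, b = stack.pop(), stack.pop()
            stack.append(a and b)
        elif name == "or":
            a, b = stack.pop(), stack.pop()
            stack.append(a or b)
        elif name == "not":
            stack.append(not stack.pop())
        elif name in ("equal", "equalcl", "isprefix", "isprefixcl"):
            first, second = stack.pop(), stack.pop()
            if name.endswith("cl"):          # the "cl" variants ignore case
                first, second = first.lower(), second.lower()
            if name.startswith("equal"):
                stack.append(first == second)
            else:
                stack.append(second.startswith(first))
        elif name == "concat":
            first, second = stack.pop(), stack.pop()
            stack.append(second + first)     # push order (assumption)
        else:
            raise ValueError("instruction not covered by this sketch: " + name)
    return stack


def prefix_test(candidate):
    """Build a small program that pushes "yes" or "no"."""
    return [
        ("string", "Apertium"),
        ("string", candidate),
        ("isprefixcl", None),    # is candidate a prefix, ignoring case?
        ("jumponfalse", 2),      # skip the "yes" branch when false
        ("string", "yes"),
        ("jump", 1),             # skip the "no" branch
        ("string", "no"),
    ]


result_match = run(prefix_test("aper"))
result_miss = run(prefix_test("xyz"))
```

The two example programs differ only in the candidate prefix, so they exercise both branches of <code>jumponfalse</code>.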

== How it works ==

There is an object called parseTower, which is an array of arrays (which I call "layers"). When tokens are read from the input stream, they are added to layer 0. longestPattern is the length of the longest pattern of any rule, and MAXLAYERS is an optional user-defined limit on the recursion (currently 1).

def do_pass():
  if any layer contains more tokens than longestPattern, use the highest one
  else if there is more input return and wait for it to be read in
  else use the lowest layer that contains tokens
  
  for the layer chosen, attempt to match as in apertium-interchunk
  if any rules match, apply the longest one
  else move the first token in this layer to the next layer

def interchunk():
  while parseTower and the input stream are not both empty:
    if there is input, read 1 token
    do_pass()
    if the number of layers has reached MAXLAYERS: output everything in the top layer
    if longestPattern tokens have been shifted to the top layer without matching, output the first one
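The layer selection and the shift step in the pseudocode above can be sketched as follows. This covers only those two steps; rule matching is left out because it depends on apertium-interchunk. The function names <code>choose_layer</code> and <code>shift_first_token</code>, and the use of plain Python lists for parseTower, are assumptions made for the illustration.

```python
def choose_layer(parse_tower, longest_pattern, more_input):
    """Return the index of the layer to work on, or None to wait for input."""
    overfull = [i for i, layer in enumerate(parse_tower)
                if len(layer) > longest_pattern]
    if overfull:
        return max(overfull)         # highest layer with too many tokens
    if more_input:
        return None                  # wait for more tokens to be read in
    nonempty = [i for i, layer in enumerate(parse_tower) if layer]
    return min(nonempty) if nonempty else None


def shift_first_token(parse_tower, layer):
    """Move the first token of a layer to the next layer (no rule matched)."""
    if layer + 1 == len(parse_tower):
        parse_tower.append([])       # create the next layer on demand
    parse_tower[layer + 1].append(parse_tower[layer].pop(0))


tower = [["a", "b", "c"], ["d"]]
picked_overfull = choose_layer(tower, 2, more_input=True)
picked_waiting = choose_layer(tower, 3, more_input=True)
picked_lowest = choose_layer(tower, 3, more_input=False)

small_tower = [["a", "b"]]
shift_first_token(small_tower, 0)
```

With a longest pattern of 2, layer 0 holds too many tokens and is chosen even though input remains; with a longest pattern of 3, the function either waits for input or falls back to the lowest non-empty layer.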