Difference between revisions of "Apertium-recursive/Bytecode"

From Apertium
(add todo list)
 
(16 intermediate revisions by one other user not shown)
Line 1: Line 1:
=== File Structure ===
The first 2 characters of the file are the length of the longest pattern and the number of rules.

Recursive transfer bytecode files are written using <code>Lttoolbox/compression.h</code>. The structure of the file is as follows:

Length of the longest input-time pattern (including blanks)
Number of input-time rules
[
for each input-time rule:
the length of the pattern
the rule
]
Number of output-time rules
Bytecode of each rule
Number of global chunk variable slots
Alphabet for the pattern transducer
Pattern transducer
The mapping from final states to rules
Attribute patterns
Global variables
Lists

=== Datatypes ===

The datatypes available to bytecode instructions are <code>string</code>, <code>integer</code>, <code>boolean</code>, and <code>Chunk</code>, where Chunk objects represent lexical units, chunks, and blanks.

=== Bytecode Operations ===

[int] after the name indicates that this instruction is two characters long and the second is to be interpreted as an integer.


{| class="wikitable" border="1"
{| class="wikitable" border="1"
|-
|-
! Code
! Name
! Name
! Action
! Action
! Stack before
! Stack after
|-
|-
| R [int]
| drop
| pop the top of the stack
| rule
| <pre>
| marks the start of a new rule composed of the next [int] characters
[1] X
[2] ...
</pre>
| <pre>
[1] ...
</pre>
|-
|-
| s [int]
| dup
| push a copy of the top element
| string
| <pre>
[1] X
[2] ...
</pre>
| <pre>
[1] X
[2] X
[3] ...
</pre>
|-
| over
| push a copy of the second element
| <pre>
[1] X
[2] Y
[3] ...
</pre>
| <pre>
[1] Y
[2] X
[3] Y
[4] ...
</pre>
|-
| swap
| exchange the first and second elements
| <pre>
[1] X
[2] Y
[3] ...
</pre>
| <pre>
[1] Y
[2] X
[3] ...
</pre>
|-
| string [int]
| pushes the next [int] characters onto the stack as a literal string
| pushes the next [int] characters onto the stack as a literal string
| <pre>
[1] ...
</pre>
| <pre>
[1] string
[2] ...
</pre>
|-
| int [int]
| pushes [int] onto the stack
| <pre>
[1] ...
</pre>
| <pre>
[1] int
[2] ...
</pre>
|-
| pushfalse
| pushes false onto the stack
| <pre>
[1] ...
</pre>
| <pre>
[1] false
[2] ...
</pre>
|-
| pushtrue
| pushes true onto the stack
| <pre>
[1] ...
</pre>
| <pre>
[1] true
[2] ...
</pre>
|-
|-
| j [int]
| jump [int]
| jump
| increments the instruction pointer by [int]
| increments the instruction pointer by [int]
| <pre>
[1] ...
</pre>
| <pre>
[1] ...
</pre>
|-
|-
| ? [int]
| jumpontrue [int]
| pops a bool off the stack and increments the instruction pointer by [int] if it is true
| jump if not
| <pre>
| pops a bool off the stack, increments the instruction pointer by [int] if it is false
[1] bool
[2] ...
</pre>
| <pre>
[1] ...
</pre>
|-
| jumponfalse [int]
| pops a bool off the stack and increments the instruction pointer by [int] if it is false
| <pre>
[1] bool
[2] ...
</pre>
| <pre>
[1] ...
</pre>
|-
|-
| & [int]
| and
| and
| pops [int] bools off the stack and pushes whether all of them are true
| pops 2 bools off the stack and pushes whether both of them are true
| <pre>
[1] bool
[2] bool
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| <code>| [int]</code>
| or
| or
| pops [int] bools off the stack and pushes whether any of them are true
| pops 2 bools off the stack and pushes whether either of them is true
| <pre>
[1] bool
[2] bool
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| !
| not
| not
| logically negates top of stack
| logically negates top of stack
| <pre>
[1] bool
[2] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| = / =#
| equal
| equal
| push whether the first two strings popped are the same (=# ignores case)
| push whether the first two strings popped are the same
| <pre>
[1] string
[2] string
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| isprefix
| ( / (#
| push whether the first string popped occurs at the beginning of the second
| begins with
| <pre>
| push whether the first string popped occurs at the beginning of the second (<code>(#</code> ignores case)
[1] string (part)
[2] string (whole)
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| issuffix
| ) / )#
| push whether the first string popped occurs at the end of the second
| ends with
| <pre>
| push whether the first string popped occurs at the end of the second (<code>)#</code> ignores case)
[1] string (part)
[2] string (whole)
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| issubstring
| [ / [#
| pushes whether the first string popped appears anywhere in the second
| begins with list
| <pre>
| push whether the second string popped begins with any member of the list named by the first string popped ([# ignores case)
[1] string (part)
[2] string (whole)
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| ] / ]#
| equalcl
| <code>equal</code>, but ignores case
| ends with list
| <pre>
| push whether the second string popped ends with any member of the list named by the first string popped (]# ignores case)
[1] string
[2] string
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| isprefixcl
| c / c#
| <code>isprefix</code>, but ignores case
| contains
| <pre>
| push whether the first string popped appears anywhere in the second (c# ignores case)
[1] string (part)
[2] string (whole)
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
| issuffixcl
| <code>issuffix</code>, but ignores case
| <pre>
[1] string (part)
[2] string (whole)
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
| issubstringcl
| <code>issubstring</code>, but ignores case
| <pre>
[1] string (part)
[2] string (whole)
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
| hasprefix
| push whether the second string popped begins with any member of the list named by the first string popped
| <pre>
[1] string (list)
[2] string
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
| hassuffix
| push whether the second string popped ends with any member of the list named by the first string popped
| <pre>
[1] string (list)
[2] string
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| n / n#
| in
| in
| push whether the second string popped is a member of the list named by the first (n# ignores case)
| push whether the second string popped is a member of the list named by the first
| <pre>
[1] string (list)
[2] string
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| hasprefixcl
| >
| <code>hasprefix</code>, but ignores case
| begin let
| <pre>
| indicates that the next clip or var statement should not be evaluated
[1] string (list)
[2] string
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| hassuffixcl
| * / *#
| <code>hassuffix</code>, but ignores case
| end let clip
| <pre>
| pops a value and an unevaluated clip and sets the clip to the value (*# copies the case of the value to the clip)
[1] string (list)
[2] string
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| 4 / 4#
| incl
| <code>in</code>, but ignores case
| end let var
| <pre>
| pops a value and a variable name and sets the variable to the value (4# copies the case of the value to the variable)
[1] string (list)
[2] string
[3] ...
</pre>
| <pre>
[1] bool
[2] ...
</pre>
|-
|-
| < [int]
| getcase
| pushes "aa", "Aa", or "AA", depending on the case of the first string popped
| out
| <pre>
| pops [int] chunks off the stack and appends them to the output queue in the order that they were pushed (in recursive mode, the output queue is later passed back to the rule applier)
[1] string (text)
[2] ...
</pre>
| <pre>
[1] string (case)
[2] ...
</pre>
|-
|-
| . [int]
| setcase
| pops two strings, copies the case of the first to the second and pushes the result
| clip
| <pre>
| if preceded by >, pushes [int] onto the stack, otherwise pops a string off the stack and retrieves that property of the position indicated by [int]
[1] string (case)
[2] string (text)
[3] ...
</pre>
| <pre>
[1] string (text)
[2] ...
</pre>
|-
|-
| fetchvar
| $
| pops a string and pushes the value of the variable with that name
| var
| <pre>
| if preceded by >, do nothing, otherwise pops a string off the stack and pushes the value of the variable with that name
[1] string (name)
[2] ...
</pre>
| <pre>
[1] string (value)
[2] ...
</pre>
|-
|-
| G
| setvar
| pops two strings and sets the second as the value of the variable named by the first
| get case
| <pre>
| pops a string off the stack, pushes "AA", "Aa", or "aa" depending on its case
[1] string (name)
[2] string (value)
[3] ...
</pre>
| <pre>
[1] ...
</pre>
|-
|-
| fetchchunk
| + [int]
| pops an integer and pushes the value of the chunk variable at that index
| concat
| <pre>
| pops [int] strings off the stack, concatenates them, and pushes the result
[1] int
[2] ...
</pre>
| <pre>
[1] chunk
[2] ...
</pre>
|-
| setchunk
| pops an integer and a chunk and sets the chunk as the value of the chunk variable at that index
| <pre>
[1] int
[2] chunk
[3] ...
</pre>
| <pre>
[1] ...
</pre>
|-
| pushinput
| pops an int and pushes the corresponding input chunk
| <pre>
[1] int
[2] ...
</pre>
| <pre>
[1] chunk
[2] ...
</pre>
|-
| sourceclip
| pops a string and a chunk, pushes the value of the corresponding source-side clip
| <pre>
[1] string (part)
[2] chunk
[3] ...
</pre>
| <pre>
[1] string (clip)
[2] ...
</pre>
|-
| targetclip
| pops a string and a chunk, pushes the value of the corresponding target-side clip
| <pre>
[1] string (part)
[2] chunk
[3] ...
</pre>
| <pre>
[1] string (clip)
[2] ...
</pre>
|-
| referenceclip
| pops a string and a chunk, pushes the value of the corresponding reference-side clip
| <pre>
[1] string (part)
[2] chunk
[3] ...
</pre>
| <pre>
[1] string (clip)
[2] ...
</pre>
|-
| setclip
| pops an int and two strings, sets the second string as the value of the target-side clip identified by the int and the first string. If the integer is 0, the chunk on top of the stack is used.
| <pre>
[1] int
[2] string (part)
[3] string (value)
[4] (chunk)
[5] ...
</pre>
| <pre>
[1] (chunk)
[2] ...
</pre>
|-
|-
| { [int]
| chunk
| chunk
| creates an empty chunk and pushes it
| pops [int] items off the stack and puts them into a chunk (there are currently problems with this command)
| <pre>
[1] ...
</pre>
| <pre>
[1] chunk
[2] ...
</pre>
|-
|-
| appendchild
| p
| pops a chunk and appends it as a child to the chunk underneath it (which remains on the stack)
| pseudolemma
| <pre>
| pop a chunk off the stack and push its pseudolemma
[1] chunk (child)
[2] chunk (parent)
[3] ...
</pre>
| <pre>
[1] chunk (parent)
[2] ...
</pre>
|-
|-
| appendsurface
| (space)
| pops a string and appends it to the target-side surface chunk underneath it (which remains on the stack)
| space
| <pre>
| push a blank containing a single space onto the stack
[1] string
[2] chunk
[3] ...
</pre>
| <pre>
[1] chunk
[2] ...
</pre>
|-
| appendsurfacesl
| pops a string and appends it to the source-side surface chunk underneath it (which remains on the stack)
| <pre>
[1] string
[2] chunk
[3] ...
</pre>
| <pre>
[1] chunk
[2] ...
</pre>
|-
| appendsurfaceref
| pops a string and appends it to the reference-side surface chunk underneath it (which remains on the stack)
| <pre>
[1] string
[2] chunk
[3] ...
</pre>
| <pre>
[1] chunk
[2] ...
</pre>
|-
| appendallchildren
| pops a chunk and appends all of its children as children to the chunk underneath it (which remains on the stack)
| <pre>
[1] chunk (source)
[2] chunk (destination)
[3] ...
</pre>
| <pre>
[1] chunk (destination)
[2] ...
</pre>
|-
| output
| pops a chunk and appends it to the output queue
| <pre>
[1] chunk
[2] ...
</pre>
| <pre>
[1] ...
</pre>
|-
| appendallinput
| append the entire input queue as children of the chunk on top of the stack
| <pre>
[1] chunk
[2] ...
</pre>
| <pre>
[1] chunk
[2] ...
</pre>
|-
|-
| _ [int]
| blank
| blank
| pops an int and pushes the corresponding blank (or a single space if the int is 0)
| push the superblank after position [int] onto the stack
| <pre>
[1] int
[2] ...
</pre>
| <pre>
[1] chunk (blank)
[2] ...
</pre>
|-
| outputall
| moves everything in the input queue to the output queue and ends the rule execution (creates a no-op rule)
| <pre>
[1] ...
</pre>
| <pre>
[1] ...
</pre>
|-
| concat
| pops two strings, concatenates them, and pushes the result
| <pre>
[1] string X
[2] string Y
[3] ...
</pre>
| <pre>
[1] string YX
[2] ...
</pre>
|-
| rejectrule
| abort evaluation of current rule and attempt to match a different one
| <pre>
[1] ...
</pre>
| <pre>
[1] ...
</pre>
|-
| distag
| removes initial < and final > from the string on top of the stack (this makes compiling comparisons easier)
| <pre>
[1] string (tag)
[2] ...
</pre>
| <pre>
[1] string (text)
[2] ...
</pre>
|-
| getrule
| pop an int and push the index of the output rule associated with the chunk in that position.
| <pre>
[1] int (position)
[2] ...
</pre>
| <pre>
[1] int (rule)
[2] ...
</pre>
|-
| setrule
| pop two ints, a position and a rule, and set the output rule associated with the chunk in that position. 0 refers to the top of the stack
| <pre>
[1] int (position)
[2] int (rule)
[3] [chunk]
[4] ...
</pre>
| <pre>
[1] chunk
[2] ...
</pre>
|-
| lucount
| push a string corresponding to the number of chunks in the input to the rule
| <pre>
[1] ...
</pre>
| <pre>
[1] string (number)
[2] ...
</pre>
|-
| conjoin
| push a joiner blank onto the stack
| <pre>
[1] ...
</pre>
| <pre>
[1] + (chunk)
[2] ...
</pre>
|}
|}


[[Category:Recursive transfer]]
Features of .t?x that aren't covered yet:
* reject-current-rule (add skip_rules list as input to interchunk_do_pass)
* mlu
* lu-count
* clip side (also add anaphora as an option)

Latest revision as of 06:09, 1 June 2023

File Structure

Recursive transfer bytecode files are written using Lttoolbox/compression.h. The structure of the file is as follows:

Length of the longest input-time pattern (including blanks)
Number of input-time rules
[
  for each input-time rule:
  the length of the pattern
  the rule
]
Number of output-time rules
Bytecode of each rule
Number of global chunk variable slots
Alphabet for the pattern transducer
Pattern transducer
The mapping from final states to rules
Attribute patterns
Global variables
Lists
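
The layout above can be pictured as a record whose header counts are derived from the rule lists. The following is only a sketch of the first few sections: the class and field names are hypothetical, it does not read or write the Lttoolbox compression format, and it assumes that the first field equals the largest per-rule pattern length.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InputRule:
    pattern_length: int  # length of this rule's pattern
    rule: str            # the rule itself, kept as an opaque string here


@dataclass
class BytecodeFile:
    # Hypothetical model of the first sections only; the transducer,
    # attribute patterns, variables and lists are left out.
    input_rules: List[InputRule] = field(default_factory=list)
    output_rules: List[str] = field(default_factory=list)
    chunk_var_slots: int = 0

    def header(self):
        """Return (longest pattern, number of input rules, number of output rules)."""
        # Assumption: the longest pattern is the maximum of the per-rule lengths.
        longest = max((r.pattern_length for r in self.input_rules), default=0)
        return longest, len(self.input_rules), len(self.output_rules)
```

For instance, a file with two input-time rules of pattern lengths 3 and 5 and one output-time rule would have the header `(5, 2, 1)` under these assumptions.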

Datatypes

The datatypes available to bytecode instructions are string, integer, boolean, and Chunk, where Chunk objects represent lexical units, chunks, and blanks.
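
As an illustration of the four datatypes, the sketch below maps Python values onto their names. The `Chunk` class here is a hypothetical stand-in with a single made-up field, not the class used by the real interpreter, and `datatype_of` is an invented helper.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a lexical unit, chunk, or blank."""
    text: str = ""


def datatype_of(value):
    """Return the name of the bytecode datatype that a Python value models."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, str):
        return "string"
    if isinstance(value, Chunk):
        return "Chunk"
    raise TypeError("not a bytecode datatype: %r" % (value,))
```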

Bytecode Operations

[int] after the name indicates that this instruction is two characters long and the second is to be interpreted as an integer.
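
A decoder for this convention might look like the sketch below. The opcode characters passed in by the caller are placeholders, since this page does not list the characters used by the current revision, and reading the operand with `ord()` is an assumption about how the integer is encoded.

```python
def decode(code, int_ops, string_op):
    """Yield (opcode, operand) pairs from a bytecode string.

    int_ops   -- set of opcode characters that are followed by an [int]
    string_op -- the opcode in int_ops whose [int] counts literal characters
    """
    i = 0
    while i < len(code):
        op = code[i]
        i += 1
        if op not in int_ops:
            yield op, None
            continue
        n = ord(code[i])  # assumption: the second character is the integer
        i += 1
        if op == string_op:
            # the next n characters are a literal string, not instructions
            yield op, code[i:i + n]
            i += n
        else:
            yield op, n
```

With the placeholder opcodes `s` (string) and `j` (jump), the input `"s" + chr(2) + "hi" + "j" + chr(3) + "!"` decodes to a string instruction carrying `"hi"`, a jump instruction carrying `3`, and a bare `!`.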

Name Action Stack before Stack after
drop pop the top of the stack
[1] X
[2] ...
[1] ...
dup push a copy of the top element
[1] X
[2] ...
[1] X
[2] X
[3] ...
over push a copy of the second element
[1] X
[2] Y
[3] ...
[1] Y
[2] X
[3] Y
[4] ...
swap exchange the first and second elements
[1] X
[2] Y
[3] ...
[1] Y
[2] X
[3] ...
string [int] pushes the next [int] characters onto the stack as a literal string
[1] ...
[1] string
[2] ...
int [int] pushes [int] onto the stack
[1] ...
[1] int
[2] ...
pushfalse pushes false onto the stack
[1] ...
[1] false
[2] ...
pushtrue pushes true onto the stack
[1] ...
[1] true
[2] ...
jump [int] increments the instruction pointer by [int]
[1] ...
[1] ...
jumpontrue [int] pops a bool off the stack and increments the instruction pointer by [int] if it is true
[1] bool
[2] ...
[1] ...
jumponfalse [int] pops a bool off the stack and increments the instruction pointer by [int] if it is false
[1] bool
[2] ...
[1] ...
and pops 2 bools off the stack and pushes whether both of them are true
[1] bool
[2] bool
[3] ...
[1] bool
[2] ...
or pops 2 bools off the stack and pushes whether either of them is true
[1] bool
[2] bool
[3] ...
[1] bool
[2] ...
not logically negates top of stack
[1] bool
[2] ...
[1] bool
[2] ...
equal push whether the first two strings popped are the same
[1] string
[2] string
[3] ...
[1] bool
[2] ...
isprefix push whether the first string popped occurs at the beginning of the second
[1] string (part)
[2] string (whole)
[3] ...
[1] bool
[2] ...
issuffix push whether the first string popped occurs at the end of the second
[1] string (part)
[2] string (whole)
[3] ...
[1] bool
[2] ...
issubstring pushes whether the first string popped appears anywhere in the second
[1] string (part)
[2] string (whole)
[3] ...
[1] bool
[2] ...
equalcl equal, but ignores case
[1] string
[2] string
[3] ...
[1] bool
[2] ...
isprefixcl isprefix, but ignores case
[1] string (part)
[2] string (whole)
[3] ...
[1] bool
[2] ...
issuffixcl issuffix, but ignores case
[1] string (part)
[2] string (whole)
[3] ...
[1] bool
[2] ...
issubstringcl issubstring, but ignores case
[1] string (part)
[2] string (whole)
[3] ...
[1] bool
[2] ...
hasprefix push whether the second string popped begins with any member of the list named by the first string popped
[1] string (list)
[2] string
[3] ...
[1] bool
[2] ...
hassuffix push whether the second string popped ends with any member of the list named by the first string popped
[1] string (list)
[2] string
[3] ...
[1] bool
[2] ...
in push whether the second string popped is a member of the list named by the first
[1] string (list)
[2] string
[3] ...
[1] bool
[2] ...
hasprefixcl hasprefix, but ignores case
[1] string (list)
[2] string
[3] ...
[1] bool
[2] ...
hassuffixcl hassuffix, but ignores case
[1] string (list)
[2] string
[3] ...
[1] bool
[2] ...
incl in, but ignores case
[1] string (list)
[2] string
[3] ...
[1] bool
[2] ...
getcase pushes "aa", "Aa", or "AA", depending on the case of the first string popped
[1] string (text)
[2] ...
[1] string (case)
[2] ...
setcase pops two strings, copies the case of the first to the second and pushes the result
[1] string (case)
[2] string (text)
[3] ...
[1] string (text)
[2] ...
fetchvar pops a string and pushes the value of the variable with that name
[1] string (name)
[2] ...
[1] string (value)
[2] ...
setvar pops two strings and sets the second as the value of the variable named by the first
[1] string (name)
[2] string (value)
[3] ...
[1] ...
fetchchunk pops an integer and pushes the value of the chunk variable at that index
[1] int
[2] ...
[1] chunk
[2] ...
setchunk pops an integer and a chunk and sets the chunk as the value of the chunk variable at that index
[1] int
[2] chunk
[3] ...
[1] ...
pushinput pops an int and pushes the corresponding input chunk
[1] int
[2] ...
[1] chunk
[2] ...
sourceclip pops a string and a chunk, pushes the value of the corresponding source-side clip
[1] string (part)
[2] chunk
[3] ...
[1] string (clip)
[2] ...
targetclip pops a string and a chunk, pushes the value of the corresponding target-side clip
[1] string (part)
[2] chunk
[3] ...
[1] string (clip)
[2] ...
referenceclip pops a string and a chunk, pushes the value of the corresponding reference-side clip
[1] string (part)
[2] chunk
[3] ...
[1] string (clip)
[2] ...
setclip pops an int and two strings, sets the second string as the value of the target-side clip identified by the int and the first string. If the integer is 0, the chunk on top of the stack is used.
[1] int
[2] string (part)
[3] string (value)
[4] (chunk)
[5] ...
[1] (chunk)
[2] ...
chunk creates an empty chunk and pushes it
[1] ...
[1] chunk
[2] ...
appendchild pops a chunk and appends it as a child to the chunk underneath it (which remains on the stack)
[1] chunk (child)
[2] chunk (parent)
[3] ...
[1] chunk (parent)
[2] ...
appendsurface pops a string and appends it to the target-side surface chunk underneath it (which remains on the stack)
[1] string
[2] chunk
[3] ...
[1] chunk
[2] ...
appendsurfacesl pops a string and appends it to the source-side surface chunk underneath it (which remains on the stack)
[1] string
[2] chunk
[3] ...
[1] chunk
[2] ...
appendsurfaceref pops a string and appends it to the reference-side surface chunk underneath it (which remains on the stack)
[1] string
[2] chunk
[3] ...
[1] chunk
[2] ...
appendallchildren pops a chunk and appends all of its children as children to the chunk underneath it (which remains on the stack)
[1] chunk (source)
[2] chunk (destination)
[3] ...
[1] chunk (destination)
[2] ...
output pops a chunk and appends it to the output queue
[1] chunk
[2] ...
[1] ...
appendallinput append the entire input queue as children of the chunk on top of the stack
[1] chunk
[2] ...
[1] chunk
[2] ...
blank pops an int and pushes the corresponding blank (or a single space if the int is 0)
[1] int
[2] ...
[1] chunk (blank)
[2] ...
outputall moves everything in the input queue to the output queue and ends the rule execution (creates a no-op rule)
[1] ...
[1] ...
concat pops two strings, concatenates them, and pushes the result
[1] string X
[2] string Y
[3] ...
[1] string YX
[2] ...
rejectrule abort evaluation of current rule and attempt to match a different one
[1] ...
[1] ...
distag removes initial < and final > from the string on top of the stack (this makes compiling comparisons easier)
[1] string (tag)
[2] ...
[1] string (text)
[2] ...
getrule pop an int and push the index of the output rule associated with the chunk in that position.
[1] int (position)
[2] ...
[1] int (rule)
[2] ...
setrule pop two ints, a position and a rule, and set the output rule associated with the chunk in that position. 0 refers to the top of the stack
[1] int (position)
[2] int (rule)
[3] [chunk]
[4] ...
[1] chunk
[2] ...
lucount push a string corresponding to the number of chunks in the input to the rule
[1] ...
[1] string (number)
[2] ...
conjoin push a joiner blank onto the stack
[1] ...
[1] + (chunk)
[2] ...
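
To make the stack diagrams concrete, here is a minimal sketch of an interpreter for a handful of the operations in the table. It is not the real interpreter: instructions are written as hypothetical `(name, operand)` tuples instead of characters, jump offsets count instructions and are assumed to apply after the jump itself has been read, and chunks, clips, variables, and lists are left out. The end of the Python list plays the role of position [1].

```python
def run(program):
    """Execute a list of (name, operand) tuples and return the final stack."""
    stack = []
    ip = 0
    while ip < len(program):
        op, arg = program[ip]
        ip += 1  # jump offsets below are relative to the next instruction
        if op in ("string", "int"):
            stack.append(arg)
        elif op == "pushtrue":
            stack.append(True)
        elif op == "pushfalse":
            stack.append(False)
        elif op == "drop":
            stack.pop()
        elif op == "dup":
            stack.append(stack[-1])
        elif op == "over":
            stack.append(stack[-2])
        elif op == "swap":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == "not":
            stack.append(not stack.pop())
        elif op == "and":
            a, b = stack.pop(), stack.pop()
            stack.append(a and b)
        elif op == "or":
            a, b = stack.pop(), stack.pop()
            stack.append(a or b)
        elif op == "equal":
            a, b = stack.pop(), stack.pop()
            stack.append(a == b)
        elif op == "isprefix":
            part, whole = stack.pop(), stack.pop()
            stack.append(whole.startswith(part))
        elif op == "concat":
            x, y = stack.pop(), stack.pop()
            stack.append(y + x)  # the string pushed earlier comes first
        elif op == "jump":
            ip += arg
        elif op == "jumpontrue":
            if stack.pop():
                ip += arg
        elif op == "jumponfalse":
            if not stack.pop():
                ip += arg
        else:
            raise ValueError("operation not covered by this sketch: " + op)
    return stack
```

For example, `run([("string", "ab"), ("string", "cd"), ("concat", None)])` leaves the single string `"abcd"` on the stack, matching the `YX` result shown for concat.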