Difference between revisions of "Apertium-regtest"
| Popcorndude (talk | contribs) m (→File Structure Example:  typo) | Popcorndude (talk | contribs)  | ||
Revision as of 21:29, 12 July 2021
This is a proposal for a regression testing system for use in language modules and translation pairs.
See https://github.com/TinoDidriksen/regtest/wiki for examples of what using it might look like in practice.
Overview
The regression testing system will run a corpus through a pipeline (whether analysis or translation or whatever) recording the output of each step. These outputs will be compared to the expected outputs and an optional list of ideal outputs.
If the actual output matches neither the expected output nor any of the ideal outputs, this counts as a failing test.
Differences are presented to the developer who can choose to accept some or all of the actual output as the new expected output.
In this way, the regression tests help to ensure that changes to the system do not cause anything to get worse and that the test data is an accurate reflection of the current state of the system while minimizing the effort required of the developer to keep things up-to-date.
Specification
The test runner can be run in either static mode (which functions as a test that can pass or fail) or in interactive mode, also called dynamic mode (which updates the data to reflect the state of the translator).
The test runner will by default check for a file named tests/tests.json. This file will contain one or more entries of the form
{[name]: {
  "mode": [mode-name],
  "input": [file-name]}}
Input Corpus
Where name is the name of this corpus, mode-name names a pipeline mode (usually abc-xyz or xyz-abc), and the value of "input": is a text file where each line contains an input sentence. Line breaks can be included in the input by writing \n and comments beginning with # will be ignored.
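The input-file conventions above can be sketched in Python. This is only an illustration, not the test runner's code: the function name read_corpus is hypothetical, and it assumes that a comment is a whole line starting with # and that blank lines carry no sentence.

```python
def read_corpus(text):
    """Return the input sentences in `text`, one per non-comment line."""
    sentences = []
    for line in text.splitlines():
        # Assumption: comments occupy a whole line and start with '#';
        # blank lines are skipped as well.
        if not line.strip() or line.startswith("#"):
            continue
        # A literal backslash-n in the file stands for a real line break.
        sentences.append(line.replace("\\n", "\n"))
    return sentences


corpus = "# greetings\nhello world\nfirst line\\nsecond line\n"
print(read_corpus(corpus))
```

Here the third line of the sample corpus yields a single input sentence that contains a line break.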
Mode Specification
The mode will be read from modes.xml and each step will be named in the same fashion as gendebug="yes". That is, using debug-suff if present, otherwise trying to guess a standard suffix, and finally falling back to NAMEME. If more than one step has the same debug suffix, they will be numbered sequentially.
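The naming rule can be sketched as follows. Several details here are assumptions rather than part of the proposal: the GUESSES table is a hypothetical stand-in for whatever standard suffixes the modes tooling guesses, the function name name_steps is invented for illustration, and repeated suffixes are numbered from 1 by appending the number directly to the suffix.

```python
from collections import Counter

# Hypothetical guesses for programs that have no debug-suff attribute;
# the real table used when processing modes.xml may differ.
GUESSES = {"lt-proc": "morph", "apertium-tagger": "tagger"}


def name_steps(steps):
    """Name pipeline steps given (program, debug_suff or None) pairs in order."""
    # Prefer debug-suff, then a guessed suffix, then the NAMEME fallback.
    base = [suff or GUESSES.get(prog, "NAMEME") for prog, suff in steps]
    totals = Counter(base)
    seen = Counter()
    names = []
    for suffix in base:
        if totals[suffix] > 1:
            # Assumption: duplicates become suffix1, suffix2, ...
            seen[suffix] += 1
            names.append(f"{suffix}{seen[suffix]}")
        else:
            names.append(suffix)
    return names


pipeline = [
    ("lt-proc", None),
    ("cg-proc", "disam"),
    ("apertium-transfer", "transfer"),
    ("apertium-transfer", "transfer"),
    ("some-tool", None),
]
print(name_steps(pipeline))
```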
If the input file is not intended to be passed through the entire pipeline, the option "start-step": [suffix] can be added, where suffix is the suffix of one of the steps in the pipeline.
If the test does not correspond to any pipeline in modes.xml, "mode": [mode-name] can be replaced with "command": [cmd], where cmd is an arbitrary bash command which will be run in the main directory of the repository. For the purposes of expected.txt and gold.txt, this will be treated as a pipeline containing a single step named all.
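As a rough illustration of these options, the following Python sketch loads a sample configuration and reports where each corpus gets its pipeline from. The corpus names late and custom, their file names, the mode abc-xyz, and the command cat are invented for the example, and the helper pipeline_source is hypothetical. It reads "replaced with" as meaning that exactly one of "mode" and "command" is present, which is an interpretation of the text rather than a stated rule.

```python
import json


def pipeline_source(name, entry):
    """Return ("mode", mode-name) or ("command", cmd) for one tests.json entry."""
    if "input" not in entry:
        raise ValueError(f"{name}: missing 'input'")
    # Assumption: "command" replaces "mode", so exactly one of them is given.
    if ("mode" in entry) == ("command" in entry):
        raise ValueError(f"{name}: expected exactly one of 'mode' or 'command'")
    key = "mode" if "mode" in entry else "command"
    return key, entry[key]


# Invented sample configuration covering the three variants described above.
CONFIG = json.loads("""
{
    "general": {"mode": "wad-tagger", "input": "general-input.txt"},
    "late": {"mode": "abc-xyz", "start-step": "transfer",
             "input": "late-input.txt"},
    "custom": {"command": "cat", "input": "custom-input.txt"}
}
""")

for name, entry in CONFIG.items():
    print(name, pipeline_source(name, entry), entry.get("start-step"))
```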
Other
For each step, the test runner will check for files named [name].[step-name].expected.txt and [name].[step-name].gold.txt in the same directory as the input file.
expected.txt is assumed to be the output of a previous run and gold.txt is assumed to be the ideal output. gold.txt can contain multiple ideal outputs for each line.
An individual input line is considered a passing test if its output appears in either expected.txt or gold.txt for each of the relevant steps, and a failing test otherwise. By default only the final step of the pipeline is considered relevant. A list of relevant steps can be provided by setting "relevant": [suffixes...] (for example, "relevant": ["morph", "transfer", "postgen"]).
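The pass/fail rule can be written as a short Python sketch. The function names and the dictionary-based data layout are assumptions made for illustration; the proposal itself only fixes the rule, not how the outputs are stored.

```python
def relevant_steps(entry, pipeline):
    """Steps that decide pass/fail: the "relevant" list, else the final step."""
    return entry.get("relevant", pipeline[-1:])


def line_passes(outputs, expected, gold, relevant):
    """Check one input line.

    outputs:  step name -> actual output for this line
    expected: step name -> output recorded from a previous run (may be absent)
    gold:     step name -> list of ideal outputs (may be absent)
    """
    for step in relevant:
        actual = outputs[step]
        # The line fails as soon as one relevant step matches neither file.
        if actual != expected.get(step) and actual not in gold.get(step, []):
            return False
    return True


outputs = {"morph": "^a/a<n>$", "postgen": "b"}
expected = {"morph": "^a/a<n>$", "postgen": "c"}
gold = {"postgen": ["b", "d"]}
print(line_passes(outputs, expected, gold, ["morph", "postgen"]))
```

In this example the postgen output differs from the expected output but matches one of the ideal outputs, so the line still passes.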
In dynamic mode, differences between the output and the files will be presented to the user, who will have the option to add the output to either file.
See https://github.com/TinoDidriksen/regtest/wiki for images of what the workflow in dynamic mode might look like. A command line interface may also be available.
Repository Structure
If the test runner does not find a file named tests/tests.json, it will guess that the tests live in test-[name] (for example apertium-eng would have test repository test-eng) and offer to clone that repository if it exists.
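The fallback can be sketched like this. The function name guess_test_repo is hypothetical, and the sketch assumes that [name] is the module's directory name with the apertium- prefix removed, as in the apertium-eng example; it only computes the name and does not clone anything.

```python
from pathlib import Path


def guess_test_repo(module_dir):
    """Guess the test repository name for a module directory."""
    name = Path(module_dir).name
    prefix = "apertium-"
    # Assumption: [name] is the directory name without the "apertium-" prefix.
    if name.startswith(prefix):
        name = name[len(prefix):]
    return f"test-{name}"


print(guess_test_repo("apertium-eng"))
```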
File Structure Example
apertium-wad/test/tests.json
{
    "general": {
        "mode": "wad-tagger",
        "input": "general-input.txt"
    }
}
apertium-wad/test/general-input.txt
wona pasi siri muandu
apertium-wad/test/general-disam-expected.txt
Each entry is delimited by blanks containing the hash of the corresponding input line in order to track insertions and deletions in the input file and also so that line breaks in the input will not cause problems.
The lines are sorted by hash rather than being in the same order as the input for simplicity and to minimize the diffs resulting from reorganizing the input.
[APxOFSXUCZrF#0] ^muandu/muandu<num>$ [/APxOFSXUCZrF]
[ZeKQm_Ed8zYn#0] ^wona/wona<n>$ ^pasi/pa<det><def><mid><pl><nh>$ [/ZeKQm_Ed8zYn]
[ge00E0i-0UxQ#0] ^siri/ra<v><p3><pl><nh><o3sg>/siri<num>/ri<v><p3><pl><nh>$ [/ge00E0i-0UxQ]
apertium-wad/test/general-disam-gold.txt
Like the expected output, gold output is delimited and sorted by hash. Multiple possible ideals are separated by [/option].
[ZeKQm_Ed8zYn] ^wona/wona<n>$ ^pasi/pa<det><def><mid><pl><nh>$ [/option] [/ZeKQm_Ed8zYn]
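A reader for both file formats might look like the following Python sketch. It is based only on the two samples above: the function name parse_entries is hypothetical, the number after # in the opening blank is treated as opaque and ignored, and hashes are assumed to consist of letters, digits, underscores, and hyphens.

```python
import re

# An opening blank such as [HASH#0] or [HASH], any content, then [/HASH].
ENTRY = re.compile(r"\[([A-Za-z0-9_-]+)(?:#\d+)?\](.*?)\[/\1\]", re.DOTALL)


def parse_entries(text):
    """Map each hash to the list of outputs stored under it."""
    entries = {}
    for match in ENTRY.finditer(text):
        key, body = match.group(1), match.group(2)
        # Gold files separate alternatives with [/option]; expected files
        # contain no separator and therefore yield a single output.
        options = [part.strip() for part in body.split("[/option]")]
        entries[key] = [opt for opt in options if opt]
    return entries


expected_text = (
    "[APxOFSXUCZrF#0] ^muandu/muandu<num>$ [/APxOFSXUCZrF] "
    "[ZeKQm_Ed8zYn#0] ^wona/wona<n>$ ^pasi/pa<det><def><mid><pl><nh>$ "
    "[/ZeKQm_Ed8zYn]"
)
gold_text = (
    "[ZeKQm_Ed8zYn] ^wona/wona<n>$ ^pasi/pa<det><def><mid><pl><nh>$ "
    "[/option] [/ZeKQm_Ed8zYn]"
)
print(parse_entries(expected_text))
print(parse_entries(gold_text))
```

Because entries are looked up by hash, the order in which they appear in the file does not matter to this reader.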
Conversion Process
Any existing tests will be converted to this format with their inputs placed in an input file and their outputs in the appropriate gold.txt. All expected.txt files will be filled in with the current output of the pipeline.
Any test which does not correspond to an existing mode or the beginning of an existing mode will use command.
In any monolingual module for which I cannot find existing tests, I will select a few random forms from the analyzer as the corpus. In translation pairs, I will take a few sentences from other pairs which use the same languages. In both cases, gold.txt will be left empty.

