Apertium-regtest

Latest revision as of 12:17, 6 June 2023

Apertium-regtest is a program for managing regression tests and corpora.

For each input, rather than treating the output as simply correct or incorrect, Apertium-regtest has three possible designations: incorrect, expected, and gold. Expected outputs are what the pipeline should produce given the rules that have been written, whereas gold outputs are what it would produce if the pipeline were perfect. This allows the same tests to be used both to ensure that there are no regressions and to measure the quality of the translator.

Overview

The regression testing system runs a corpus through a pipeline (analysis, translation, or any other mode), recording the output of each step. These outputs are compared to the expected outputs and to an optional list of ideal outputs.

If the actual output matches neither the expected output nor any of the ideal outputs, this counts as a failing test.

Differences are presented to the developer who can choose to accept some or all of the actual output as the new expected output.

In this way, the regression tests help to ensure that changes to the system do not cause anything to get worse and that the test data is an accurate reflection of the current state of the system while minimizing the effort required of the developer to keep things up-to-date.

Terminology

There are 4 types of corpus files that Apertium-regtest deals with:

name | description | comments
input | text to be passed through the pipeline | This is a simple text file with one entry per line.
output | temporary files containing the current output of the pipeline | This is what each step of the pipeline output the last time you ran apertium-regtest. You can ignore these files.
expected | the output of the pipeline as of the last commit | This is what each step of the pipeline output the last time someone checked the output and confirmed that it hadn't gotten any worse.
gold | ideal outputs | We want what's in expected to be as close to one of these as possible.

A "corpus" for Apertium-regtest can be any input to one of the translation or analysis pipelines that a repository defines. It will normally be a list of words or sentences, but it can also be analyses (such as when testing the generator).

Typical Workflow

  1. run apertium-regtest web and open the link in your browser
  2. make changes to dictionaries, transfer rules, etc.
  3. recompile
  4. in the browser, select one or all of the corpora to rerun tests for
  5. if any of the entries in the corpus have gotten worse, return to step 2
  6. accept changes
  7. commit changed test files along with dictionaries, transfer rules, etc.

Adding a New Test Corpus

  1. create a new file under test/
  2. put some words, sentences, or whatever in that file
  3. add that input file to test/tests.json
  4. re-run apertium-regtest
  5. add the generated files to git

Example: If we're trying to add a test for fra-oci_gascon, the input file could be test/fra-oci_gascon-input.txt. We could then add the following to test/tests.json:

"fra-oci_gascon": {
  "input": "fra-oci_gascon-input.txt",
  "mode": "fra-oci_gascon"
}

Specification

The test runner has 2 modes: static and dynamic.

Static mode runs the input corpus through the pipeline and compares it against expected. If any of the outputs don't match, the test fails.

Dynamic mode begins the same way, but instead of failing when an output doesn't match expected, it presents those outputs to the user. If the new output is more correct than what is in expected, the user can accept it, in which case it becomes the new expected. On the other hand, if the output has gotten worse, the user can reject it, in which case the test is still failing and the rules in the pipeline need to be fixed.

The test runner will by default check for a file named test/tests.json. This file will contain one or more entries of the form

[name]: {
  "mode": [mode-name],
  "input": [file-name]
}

Input Corpus

Where name is the name of this corpus, mode-name names a pipeline mode (usually abc-xyz or xyz-abc), and the value of "input" is the name of a text file in which each line contains an input sentence. Line breaks can be included in the input by writing \n, and comments beginning with # will be ignored.
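As a rough illustration of the input format described above, the following Python sketch reads a corpus file's contents into a list of entries. It is not taken from apertium-regtest itself: the function name read_corpus_lines is hypothetical, and it assumes that a comment is a whole line starting with # and that blank lines are skipped, neither of which is specified above.

```python
def read_corpus_lines(text):
    """Return the input entries of a corpus, one per non-comment line."""
    entries = []
    for raw in text.splitlines():
        # Assumption: comments are whole lines starting with '#',
        # and blank lines carry no input.
        if not raw.strip() or raw.startswith("#"):
            continue
        # A literal backslash followed by 'n' stands for a line break.
        entries.append(raw.replace("\\n", "\n"))
    return entries


sample = "# greetings\nwona pasi\nline one\\nline two\n"
for entry in read_corpus_lines(sample):
    print(repr(entry))
```

Here the second entry comes back as a single input containing a real line break, which is why the expected and gold files need their own delimiters rather than relying on one entry per line.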

Mode Specification

The mode will be read from modes.xml and each step will be named in the same fashion as gendebug="yes". That is, using debug-suff if present, otherwise trying to guess a standard suffix, and finally falling back to NAMEME. If more than one step has the same debug suffix, they will be numbered sequentially.

If the input file is not intended to be passed through the entire pipeline, the option "start-step": [suffix] can be added, where suffix is the suffix of one of the steps in the pipeline.

If the test does not correspond to any pipeline in modes.xml, "mode": [mode-name] can be replaced with "command": [cmd], where cmd is an arbitrary bash command which will be run in the main directory of the repository. For the purposes of expected.txt and gold.txt, this will be treated as a pipeline containing a single step named all.
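The shape of an entry described so far can be checked with a few lines of Python. This is only a sketch under stated assumptions: check_entry is a hypothetical helper, not part of apertium-regtest, the "broken" entry is invented for illustration, and the rule that every entry needs "input" plus exactly one of "mode" or "command" is inferred from the description above rather than confirmed against the tool.

```python
import json


def check_entry(name, entry):
    """Return a list of problems found in one corpus entry (empty if none)."""
    problems = []
    if "input" not in entry:
        problems.append(f'{name}: missing "input"')
    # "command" replaces "mode", so exactly one of the two is expected.
    if ("mode" in entry) == ("command" in entry):
        problems.append(f'{name}: give exactly one of "mode" or "command"')
    return problems


config = json.loads("""
{
  "fra-oci_gascon": {"input": "fra-oci_gascon-input.txt", "mode": "fra-oci_gascon"},
  "broken": {"input": "broken-input.txt"}
}
""")

for name, entry in config.items():
    if name == "settings":
        # "settings" holds options for the whole file, not a corpus.
        continue
    for problem in check_entry(name, entry):
        print(problem)
```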

Directory Structure

There are 4 types of files associated with a particular corpus: input, output, expected, and gold. There are two arrangements of files currently supported: flat and nested.

In flat mode, all files are placed in the same directory. In nested mode, output, expected, and gold each have a separate subdirectory within test/.

Flat mode is the default, but nested can be specified by adding the following to test/tests.json:

"settings": {
  "structure": "nested"
}

A flat directory can be automatically converted to a nested one using this script.

name | flat filename | nested filename
input | specified in test/tests.json | specified in test/tests.json
output | test/[corpus].[step].output.txt | test/output/[corpus].[step].txt
expected | test/[corpus].[step].expected.txt | test/expected/[corpus].[step].txt
gold | test/[corpus].[step].gold.txt | test/gold/[corpus].[step].txt
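The naming scheme in the table can be written out as a small Python function. This is a sketch that follows the table literally; the function name corpus_file and its parameters are hypothetical and do not come from apertium-regtest.

```python
from pathlib import PurePosixPath


def corpus_file(corpus, step, kind, structure="flat", test_dir="test"):
    """Build the path of an output, expected, or gold file for one step."""
    if kind not in ("output", "expected", "gold"):
        raise ValueError(f"unknown file type: {kind}")
    base = PurePosixPath(test_dir)
    if structure == "nested":
        # Nested: one subdirectory per file type.
        return str(base / kind / f"{corpus}.{step}.txt")
    # Flat: the file type is part of the file name.
    return str(base / f"{corpus}.{step}.{kind}.txt")


print(corpus_file("general", "disam", "expected"))
print(corpus_file("general", "disam", "expected", structure="nested"))
```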

Other

An individual input line is considered a passing test if it either matches expected or appears in gold for each of the relevant steps and failing otherwise. By default, only the final step of the pipeline is considered relevant. A list of relevant steps can be provided by setting "relevant": [suffixes...] (for example, "relevant": ["morph", "transfer", "postgen"]).
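The pass/fail rule above can be sketched in Python as follows. The data layout is an assumption made for illustration (dictionaries keyed by step suffix, with gold holding a list of acceptable outputs per step), and line_passes is a hypothetical name rather than a function of the tool.

```python
def line_passes(output, expected, gold, relevant):
    """Check one input line against expected and gold for each relevant step.

    output, expected: dict mapping step suffix -> output string
    gold: dict mapping step suffix -> list of acceptable output strings
    relevant: list of step suffixes that count toward the result
    """
    for step in relevant:
        actual = output[step]
        matches_expected = actual == expected.get(step)
        appears_in_gold = actual in gold.get(step, [])
        if not (matches_expected or appears_in_gold):
            return False
    return True


output = {"morph": "a", "postgen": "b"}
expected = {"morph": "a", "postgen": "x"}
gold = {"postgen": ["b", "c"]}

print(line_passes(output, expected, gold, ["morph", "postgen"]))
print(line_passes(output, expected, {}, ["morph", "postgen"]))
```

In the first call the postgen output differs from expected but appears in gold, so the line passes; without gold, the same line fails.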

For modes which output Apertium stream format, it is often desirable to ignore differences in ordering so that, for example, ^runs/run<n><sg>/run<v><pres><p3><sg>$ and ^runs/run<v><pres><p3><sg>/run<n><sg>$ will be counted as matching. This can be done either by setting "sort": true to apply sorting to all modes in the pipeline, or by providing a list such as "sort": ["morph", "biltrans"].
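One way to make such a comparison order-insensitive is to sort the readings inside each lexical unit before comparing. The Python sketch below illustrates the idea only; it is not the tool's actual implementation, sort_readings is a hypothetical name, and it assumes that the text contains no escaped ^, $, or / characters.

```python
import re

# One lexical unit: everything between ^ and the next $.
UNIT = re.compile(r"\^(.*?)\$")


def sort_readings(line):
    """Sort the readings of every lexical unit, keeping the surface form first."""
    def fix(match):
        surface, *readings = match.group(1).split("/")
        return "^" + "/".join([surface] + sorted(readings)) + "$"
    return UNIT.sub(fix, line)


a = "^runs/run<n><sg>/run<v><pres><p3><sg>$"
b = "^runs/run<v><pres><p3><sg>/run<n><sg>$"
print(sort_readings(a) == sort_readings(b))
```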

In dynamic mode, differences between the output and the files will be presented to the user, who will have the option to add the output to either file.

See https://github.com/TinoDidriksen/regtest/wiki for images of what the workflow in dynamic mode might look like. A command line interface is also available.

Repository Structure

If the test runner does not find a file named test/tests.json, it will guess that the tests live in a separate repository named test-[name] (for example, apertium-eng would have the test repository test-eng) and offer to clone that repository if it exists.
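The naming convention can be expressed as a one-line transformation. The Python sketch below covers only the name guess, not the cloning step, and guess_test_repo is a hypothetical helper rather than part of the tool.

```python
def guess_test_repo(repo_name):
    """Map a repository name such as apertium-eng to its test repository name."""
    prefix = "apertium-"
    # Drop the prefix if present; otherwise use the name unchanged.
    name = repo_name[len(prefix):] if repo_name.startswith(prefix) else repo_name
    return "test-" + name


print(guess_test_repo("apertium-eng"))
```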

File Structure Example

apertium-wad/test/tests.json

In general, this file is set up once at the beginning and rarely edited after that.

{
    "general": {
        "mode": "wad-tagger",
        "input": "general-input.txt"
    }
}

apertium-wad/test/general-input.txt

Any sentence, phrase, or form that already works or that you are trying to get working can and should be added to one of the input files.

wona pasi
siri
muandu

apertium-wad/test/general-disam-expected.txt

Each entry is delimited by blanks containing the hash of the corresponding input line in order to track insertions and deletions in the input file and also so that line breaks in the input will not cause problems.

The lines are sorted by hash rather than being in the same order as the input for simplicity and to minimize the diffs resulting from reorganizing the input.

Expected files are intended to show exactly the current output of the system and thus should not be edited by hand.

[APxOFSXUCZrF#0] ^muandu/muandu<num>$
[/APxOFSXUCZrF]
[ZeKQm_Ed8zYn#0] ^wona/wona<n>$ ^pasi/pa<det><def><mid><pl><nh>$
[/ZeKQm_Ed8zYn]
[ge00E0i-0UxQ#0] ^siri/ra<v><p3><pl><nh><o3sg>/siri<num>/ri<v><p3><pl><nh>$
[/ge00E0i-0UxQ]
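Although these files should not be edited by hand, reading them from a script can be useful. The following Python sketch parses the format shown above, based only on this example; parse_expected is a hypothetical helper, the meaning of the number after # is not stated here and is kept as an opaque integer, and the regular expression assumes that hashes consist of letters, digits, _ and -.

```python
import re

# [HASH#N] content ... [/HASH], where content may span several lines.
ENTRY = re.compile(r"\[([\w-]+)#(\d+)\] ?(.*?)\n?\[/\1\]", re.DOTALL)


def parse_expected(text):
    """Return a dict mapping (hash, number) to the recorded output."""
    return {
        (m.group(1), int(m.group(2))): m.group(3)
        for m in ENTRY.finditer(text)
    }


sample = (
    "[APxOFSXUCZrF#0] ^muandu/muandu<num>$\n"
    "[/APxOFSXUCZrF]\n"
    "[ZeKQm_Ed8zYn#0] ^wona/wona<n>$ ^pasi/pa<det><def><mid><pl><nh>$\n"
    "[/ZeKQm_Ed8zYn]\n"
)

for key, value in parse_expected(sample).items():
    print(key, value)
```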

apertium-wad/test/general-disam-gold.txt

Like the expected output, gold output is delimited and sorted by hash. Multiple possible ideals are separated by [/option].

Since the entries are delimited by hashes, the recommended way to interact with these files is via the provided interfaces. However, there are instances where you might want to add values directly. For small corpora, you can copy the corresponding expected.txt file and edit the entries. For larger corpora there are some conversion scripts available at https://github.com/apertium/apertium-regtest/tree/master/tools. Bug User:Popcorndude if you need more of these.

[ZeKQm_Ed8zYn]
^wona/wona<n>$ ^pasi/pa<det><def><mid><pl><nh>$ [/option]
[/ZeKQm_Ed8zYn]
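For scripts that need to read gold files, the format above can be parsed along the following lines. This Python sketch is based only on the example shown; parse_gold is a hypothetical helper, not one of the provided tools, and it assumes that hashes consist of letters, digits, _ and - and that no ideal output itself contains the text [/option].

```python
import re

# [HASH] newline, options each followed by [/option], then [/HASH].
ENTRY = re.compile(r"\[([\w-]+)\]\n?(.*?)\n?\[/\1\]", re.DOTALL)


def parse_gold(text):
    """Return a dict mapping each hash to its list of ideal outputs."""
    gold = {}
    for m in ENTRY.finditer(text):
        options = [part.strip() for part in m.group(2).split("[/option]")]
        # Drop the empty piece left after the final [/option].
        gold[m.group(1)] = [opt for opt in options if opt]
    return gold


sample = (
    "[ZeKQm_Ed8zYn]\n"
    "^wona/wona<n>$ ^pasi/pa<det><def><mid><pl><nh>$ [/option]\n"
    "[/ZeKQm_Ed8zYn]\n"
)

print(parse_gold(sample))
```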

Conversion Process

Any existing tests will be converted to this format with their inputs placed in an input file and their outputs in the appropriate gold.txt. All expected.txt files will be filled in with the current output of the pipeline.

Any test which does not correspond to an existing mode or the beginning of an existing mode will use command.

In any module for which I cannot find existing tests, I will select a few random forms from the analyzer as the corpus. gold.txt will be left empty.