Difference between revisions of "Inconditional section"

From Apertium


The <section> element in a dix file can be of type standard, inconditional, postblank or preblank.[https://github.com/apertium/lttoolbox/blob/master/lttoolbox/dix.dtd#L59]

The section type is used to change how tokenisation works.


==inconditional==


Normally, analysis must be separated by spaces (or other blanks), but an analysis in an 'inconditional' section can appear right after or before other analyses ('standard' or non-standard) or even right next to unknowns. The 'inconditional' section of a dictionary typically contains punctuation, and such things.
An 'inconditional' section of a dictionary typically contains punctuation, and such things. The section type is used to change how tokenisation works.


In detail:
Analysis in lttoolbox works in a left-to-right longest match fashion. We read characters from input, trying to match them in the transducer, and if we've reached a final transition when at the "end" of the input word, we can output the analysis (or we can try matching something even longer, but if not we use the match we found). But how do we know when we're at the end of an input word? Any "blank" character (not in <alphabet> in .dix files) is allowed to separate words, so e.g. spaces or other strange characters can separate words. But we may also say that certain words-with-analyses can separate words – to mark an analysis as "may act as word-separator", we put it in the 'inconditional' section.


<!--
This is wrong:
"Inconditional means 'if you see this, stop processing immediately and start reading a new word'. Stop when you reach the end of a possible transduction."
We can have '.' in the inconditional section and still get one long analysis of "Dr. Octagon" if "Dr. Octagon" is in e.g. the standard section. But if there was no analysis of the string up until the '.', then we may immediately output an unknown.
-->


In summary, a space (or other blank) is not required to end the analysis of the preceding string, and neither is a blank required to start a new analysis afterwards.



Latest revision as of 07:20, 8 June 2019


The <section> element in a dix file can be of type standard, inconditional, postblank or preblank.[1]

The section type is used to change how tokenisation works.

inconditional

Normally, analyses must be separated by spaces (or other blanks), but an analysis in an 'inconditional' section can appear immediately before or after other analyses (from a 'standard' section or any other type), or even right next to unknown words. The 'inconditional' section of a dictionary typically contains punctuation and similar tokens.

In detail: Analysis in lttoolbox works in a left-to-right, longest-match fashion. We read characters from the input and try to match them in the transducer. If we have reached a final state when we are at the "end" of the input word, we can output the analysis (we first try to match something even longer, and fall back to the match we found if that fails). But how do we know when we are at the end of an input word? Any "blank" character (one not listed in <alphabet> in the .dix file) may separate words, so spaces and other non-alphabetic characters all act as separators. We may also declare that certain words-with-analyses can separate words: to mark an analysis as "may act as word-separator", we put it in the 'inconditional' section.


$ echo 23men | apertium -d . en-it-anmor
^23/23<num>$^men/man<n><pl>$^./.<sent>$

In the above, we don't need the space between 23 and men because numbers are in an 'inconditional' section.
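
The matching rule described above can be sketched as a toy model in Python. This is not lttoolbox's implementation: the function name tokenise, the plain dictionaries standing in for the compiled transducer, and the handling of unknown words (an unmatched run of alphabetic characters is emitted as one unknown) are all assumptions made for illustration. The only point it tries to capture is that a 'standard' match is accepted only when the next character is not alphabetic, while an 'inconditional' match is accepted regardless.

```python
def tokenise(text, standard, inconditional, alphabet):
    """Toy left-to-right longest-match tokeniser (illustrative only)."""
    out = []
    i = 0
    while i < len(text):
        best = None  # (end index, analysis) of the longest accepted match
        for j in range(i + 1, len(text) + 1):
            surface = text[i:j]
            # A word boundary: end of input, or next char is not alphabetic.
            at_boundary = j == len(text) or text[j] not in alphabet
            if surface in inconditional:
                # Inconditional entries need no boundary after them.
                best = (j, inconditional[surface])
            elif surface in standard and at_boundary:
                best = (j, standard[surface])
        if best is not None:
            end, analysis = best
            out.append(f"^{text[i:end]}/{analysis}$")
            i = end
        elif text[i] in alphabet:
            # Assumption: an unmatched alphabetic run becomes one unknown word.
            end = i
            while end < len(text) and text[end] in alphabet:
                end += 1
            out.append(f"^{text[i:end]}/*{text[i:end]}$")
            i = end
        else:
            # Anything else is passed through as a blank.
            out.append(text[i])
            i += 1
    return "".join(out)


ALPHABET = set("abcdefghijklmnopqrstuvwxyz")
STANDARD = {"men": "man<n><pl>"}
INCONDITIONAL = {"23": "23<num>"}

print(tokenise("23men", STANDARD, INCONDITIONAL, ALPHABET))
# prints ^23/23<num>$^men/man<n><pl>$
```

In this model, moving the "23" entry into the standard dictionary makes the match fail, because "23" is then followed directly by the alphabetic character "m".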


We can get some weird effects by putting plain characters in 'inconditional':

<dictionary>
  <alphabet>ab</alphabet>
  <sdefs>
    <sdef n="aa"/>
    <sdef n="ab"/>
  </sdefs>
  <section id="foo" type="inconditional">
    <e><p><l>a</l><r>a<s n="aa"/></r></p></e>
    <e><p><l>aa</l><r>aa<s n="aa"/></r></p></e>
  </section>
</dictionary>

$ echo aaa | lt-proc sample.bin
^aa/aa<aa>$^a/a<aa>$

$ echo aaaa | lt-proc sample.bin
^aa/aa<aa>$^aa/aa<aa>$

$ echo aaaaa | lt-proc sample.bin
^aa/aa<aa>$^aa/aa<aa>$^a/a<aa>$
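
One way to read these results: since every entry is inconditional, no blank is needed between matches, so the analyser takes the longest entry available at the current position and then starts over at the very next character. The short Python sketch below mimics that greedy behaviour for this sample dictionary only. The function names and the hard-coded tag are hypothetical, and the sketch is a simplification, not a description of lt-proc's internals.

```python
def greedy_split(text, entries):
    """Repeatedly take the longest entry that matches at the current position."""
    tokens = []
    i = 0
    while i < len(text):
        candidates = [e for e in entries if text.startswith(e, i)]
        if not candidates:
            raise ValueError(f"no entry matches at position {i}")
        longest = max(candidates, key=len)
        tokens.append(longest)
        i += len(longest)
    return tokens


def show(tokens, tag="aa"):
    """Format tokens in the stream format used above (tag is hard-coded here)."""
    return "".join(f"^{t}/{t}<{tag}>$" for t in tokens)


ENTRIES = ["a", "aa"]  # the two surface forms in the sample dictionary

for word in ("aaa", "aaaa", "aaaaa"):
    print(word, show(greedy_split(word, ENTRIES)))
```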

postblank / preblank

The postblank and preblank sections work exactly like inconditional with respect to how they tokenise the input. The only difference is that anything in a postblank section will make lt-proc output a space after the token (in preblank, before the token).

So if "☃" is in postblank (tagged sent), and "foo" and "bar" are in a regular section (tagged n), then we get:

$ echo 'foo☃bar' | lt-proc analyser.bin
^foo/foo<n>$^☃/☃<sent>$ ^bar/bar<n>$

If "☃" were in preblank, we'd get:

$ echo 'foo☃bar' | lt-proc analyser.bin
^foo/foo<n>$ ^☃/☃<sent>$^bar/bar<n>$
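
The difference between the two outputs can be summarised with a small Python sketch. It assumes the input has already been tokenised, and only models where the extra space ends up. The render function and the (surface, analysis, section type) tuples are hypothetical and chosen for illustration.

```python
def render(tokens):
    """Join analysed tokens, adding a space as postblank/preblank would."""
    out = []
    for surface, analysis, section in tokens:
        unit = f"^{surface}/{analysis}$"
        if section == "postblank":
            unit = unit + " "   # space after the token
        elif section == "preblank":
            unit = " " + unit   # space before the token
        out.append(unit)
    return "".join(out)


def snowman_example(section):
    """Build the foo☃bar example with ☃ in the given section type."""
    return [
        ("foo", "foo<n>", "standard"),
        ("☃", "☃<sent>", section),
        ("bar", "bar<n>", "standard"),
    ]


print(render(snowman_example("postblank")))
print(render(snowman_example("preblank")))
```

With "inconditional" as the section type, the same sketch emits the three units with no space at all, which is the only way it differs from the other two types here.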

Why is this useful?

TODO

See also