Difference between revisions of "Apertium stream format"

From Apertium
Latest revision as of 07:12, 29 March 2022

This page describes the stream format used in the Apertium machine translation platform.

Characters

Reserved

Reserved characters should only appear escaped in the input stream unless they are part of a lexical unit, chunk or superblank.

  • The characters ^ and $ are reserved for delimiting lexical units
  • The character / is reserved for delimiting analyses in ambiguous lexical units
  • The characters < and > are reserved for encapsulating tags
  • The characters { and } are reserved for delimiting chunks
  • The character \ is the escape character
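A minimal sketch of escaping and unescaping, assuming that only the characters listed above need a backslash. The names RESERVED, escape and unescape are hypothetical and are not part of any Apertium tool:

```python
# Reserved characters taken from the list above (an assumption of this sketch).
RESERVED = "^$/<>{}\\"


def escape(text):
    """Prefix every reserved character in plain text with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


def unescape(text):
    """Drop one level of backslash escaping."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            # Keep the escaped character, drop the backslash itself.
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(escape("price: 5$ (a/b)"))
```

Escaping plain text before it enters the stream and unescaping it afterwards should round-trip the text unchanged.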

Special

The following have special meaning at the start of an analysis:

  • Asterisk, '*' -- Unanalysed word.
  • At sign, '@' -- Untranslated lemma.
  • Hash sign, '#'
    • In morphological generation -- Unable to generate surface form from lexical unit (escape this to use # in lemmas)
    • In morphological analysis -- Start of invariable part of multiword marker (escape this to use # in lemmas)
  • Plus symbol, '+' -- Joined lexical units (escape this to use + in lemmas)
  • Tilde '~' -- Word needs treating by post-generator
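As a rough illustration, a script could look at the first character of an analysis to see which of these marks it carries. This is only a sketch: LEADING_MARKS and leading_mark are hypothetical names, and the descriptions simply restate the list above.

```python
# Meaning of the first character of an analysis, restating the list above.
LEADING_MARKS = {
    "*": "unanalysed word",
    "@": "untranslated lemma",
    "#": "not generated (generation) or invariable part (analysis)",
    "~": "needs treating by post-generator",
}


def leading_mark(analysis):
    """Return a description of the leading mark, or None if there is none.

    An escaped character such as "\\#" starts with a backslash, so it is
    not reported as a mark.
    """
    return LEADING_MARKS.get(analysis[:1])


if __name__ == "__main__":
    for analysis in ["*hypercholesterolemia", "@house<n><sg>", "vino<n><m><sg>"]:
        print(analysis, "->", leading_mark(analysis))
```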

Python parsing library

If you're writing a Python script that needs to handle the Apertium stream format, try the excellent https://github.com/apertium/streamparser library, which lets you do things like

import sys
from streamparser import parse_file, mainpos, reading_to_string
for blank, lu in parse_file(sys.stdin, with_text=True):
    analyses = lu.readings
    firstreading = analyses[0]
    surfaceform = lu.wordform
    # rewrite to print only the first reading (and surface/word form):
    print("^{}/{}$".format(surfaceform,
                           reading_to_string(firstreading)))
    # convenience function to grab the first part of speech of the first reading
    # (bound to a new name so the imported mainpos() is not overwritten):
    pos = mainpos(firstreading)

etc. without having to worry about superblanks and escaped characters and such :-)

Here's an example used in testvoc. It splits ambiguous readings like ^foo/bar<n>/fie<ij>$ into ^foo/bar<n>$ ^foo/fie<ij>$, keeping the (super)blanks and newlines in between unchanged:

from streamparser import parse_file, reading_to_string
import sys
for blank, lu in parse_file(sys.stdin, with_text=True):
    print(blank+" ".join("^{}/{}$".format(lu.wordform, reading_to_string(r))
                         for r in lu.readings),
          end="")

Here's a one-liner to print the lemmas of each word:

$ echo fisk bank kake|lt-proc nno-nob.automorf.bin|python3 -c  'import sys, streamparser; print ("\n".join("\t".join(set(s.baseform for r in lu.readings for s in r)) for lu in streamparser.parse_file(sys.stdin)))'


An alternative Python library: https://github.com/krvoje/apertium-transfer-dsl/blob/master/apertium/stream_entities.py and https://github.com/krvoje/apertium-transfer-dsl/blob/master/apertium/stream_reader.py

Common Lisp parsing library

cl-apertium-stream (https://github.com/veer66/cl-apertium-stream) is a Common Lisp library for parsing the Apertium stream format and for generating it from parsed data. It is based on a discontinued Ruby library (https://github.com/veer66/reinarb). cl-apertium-stream is data-driven: its parsed data is a combination of lists, keywords and strings, without any new types or classes, so further processing relies on ordinary list operations. It handles the Apertium stream format with declarative Esrap (https://github.com/scymtym/esrap) rules.

Formatted input

See also: Format handling

F = formatted text, T = text to be analysed.

Formatted text is treated as a single whitespace by all stages.


[<em>]this is[<\/em> ]a[ <b>]test.[][<\/b>]

|____|       |_______| |____|     |_______|
   |            |        |            |
   F            F        F            F
    
[<em>]this is[<\/em> ]a[ <b>]test.[][<\/b>]
      |______|        |      |____|
          |           |        | 
          T           T        T
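The F/T split in the diagram can be sketched in a few lines of Python. This is an illustration only, assuming that formatted text is whatever sits between an unescaped [ and the next unescaped ]; split_format is a hypothetical helper, not an Apertium API:

```python
def split_format(stream):
    """Split a stream into ("F", text) and ("T", text) pieces.

    "F" pieces are bracketed format blocks, "T" pieces are text to be
    analysed. A backslash escapes the character that follows it.
    """
    parts = []
    buf = []
    in_format = False
    i = 0
    while i < len(stream):
        ch = stream[i]
        if ch == "\\" and i + 1 < len(stream):
            # Copy the escape sequence through untouched.
            buf.append(stream[i:i + 2])
            i += 2
            continue
        if ch == "[" and not in_format:
            if buf:
                parts.append(("T", "".join(buf)))
            buf = ["["]
            in_format = True
        elif ch == "]" and in_format:
            buf.append("]")
            parts.append(("F", "".join(buf)))
            buf = []
            in_format = False
        else:
            buf.append(ch)
        i += 1
    if buf:
        parts.append(("F" if in_format else "T", "".join(buf)))
    return parts


if __name__ == "__main__":
    example = r"[<em>]this is[<\/em> ]a[ <b>]test.[][<\/b>]"
    for kind, text in split_format(example):
        print(kind, repr(text))
```

On the example line from the diagram, the "T" pieces come out as "this is", "a" and "test.".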

Analyses

S = surface form, L = lemma.


^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$

   |    | |________|
   S    L    TAGS
        |______|
        ANALYSIS

|_____________________________________________|
          AMBIGUOUS LEXICAL UNIT

^vino<n><m><sg>$

|______________|
 DISAMBIGUATED
  LEXICAL UNIT

^dímelo/decir<vblex><imp><p2><sg>+me<prn><enc><p1><mf><sg>+lo<prn><enc><p3><nt>/decir<vblex><imp><p2><sg>+me<prn><enc><p1><mf><sg>+lo<prn><enc><p3><m><sg>$

                                 |____________________________________________|
                                                JOINED MORPHEMES

^take it away/take<vblex><sep><inf>+prpers<prn><obj><p3><nt><sg># away/take<vblex><sep><pres>+prpers<prn><obj><p3><nt><sg># away$

              |___|                                             |_____|
                |                                                   |
             LEMMA HEAD                                        LEMMA QUEUE
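To make the structure above concrete, here is a small hand-rolled sketch that takes one lexical unit apart without any library. The helpers split_unescaped and parse_lexical_unit are hypothetical; the sketch ignores the invariable "# ..." part of multiwords and escaped < inside lemmas, so real scripts are better served by streamparser.

```python
import re

TAG = re.compile(r"<([^<>]+)>")


def split_unescaped(text, sep):
    """Split text on sep, skipping separators escaped with a backslash."""
    parts, buf, i = [], [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            buf.append(text[i:i + 2])
            i += 2
        elif text[i] == sep:
            parts.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(text[i])
            i += 1
    parts.append("".join(buf))
    return parts


def parse_lexical_unit(lu):
    """Return (surface form, analyses) for a string like ^S/L<tag>...$.

    Each analysis is a list of (lemma, tags) pairs, one pair per joined
    lexical unit (the parts separated by +).
    """
    if not (lu.startswith("^") and lu.endswith("$")):
        raise ValueError("not a lexical unit: " + lu)
    surface, *analyses = split_unescaped(lu[1:-1], "/")
    parsed = []
    for analysis in analyses:
        joined = []
        for part in split_unescaped(analysis, "+"):
            lemma = part.split("<", 1)[0]
            joined.append((lemma, TAG.findall(part)))
        parsed.append(joined)
    return surface, parsed


if __name__ == "__main__":
    print(parse_lexical_unit("^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$"))
```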

Chunks

See also: Chunking

^Verbcj<SV><vblex><ifi><p3><sg>{^come<vblex><ifi><p3><sg>$}$ ^pr<PREP>{^to<pr>$}$ ^det_nom<SN><f><sg>{^the<det><def><3>$ ^beach<n><3>$}$

   |   |______________________||__________________________|                                                          |
 CHUNK      CHUNK TAGS              LEXICAL UNITS IN                                                               LINKED
  NAME                                  THE CHUNK                                                                   TAG

   |________________________________________|
                       |
                     CHUNK



^det_nom<SN><f><sg>{^the<det><def><3>$ ^beach<n><3>$}$

                                   |______________|
                                          |
                                POINTERS TO CHUNK TAGS
        <1> <2> <3>     
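The chunk layout and the numbered pointers can be illustrated with a short sketch. It assumes, as the diagram suggests, that a tag <N> inside the chunk points at the N-th chunk tag; parse_chunk and resolve_links are hypothetical names, and escaped characters are not handled.

```python
import re

CHUNK = re.compile(r"\^([^<{]+)((?:<[^<>]+>)*)\{(.*)\}\$")
TAG = re.compile(r"<([^<>]+)>")


def parse_chunk(chunk):
    """Return (chunk name, chunk tags, lexical units inside the chunk)."""
    match = CHUNK.fullmatch(chunk)
    if match is None:
        raise ValueError("not a chunk: " + chunk)
    name, tag_string, inner = match.groups()
    units = re.findall(r"\^.*?\$", inner)
    return name, TAG.findall(tag_string), units


def resolve_links(unit, chunk_tags):
    """Replace numbered tags such as <3> with the chunk tag they point to."""
    def substitute(match):
        tag = match.group(1)
        if tag.isdecimal() and 1 <= int(tag) <= len(chunk_tags):
            return "<" + chunk_tags[int(tag) - 1] + ">"
        return match.group(0)
    return TAG.sub(substitute, unit)


if __name__ == "__main__":
    name, tags, units = parse_chunk(
        "^det_nom<SN><f><sg>{^the<det><def><3>$ ^beach<n><3>$}$")
    print(name, tags)
    for unit in units:
        print(resolve_links(unit, tags))
```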

See also