[[Format du flux Apertium|En français]]

{{TOCD}}

This page describes the stream format used in the Apertium machine translation platform.

==Special characters==
 
===Reserved===

Reserved characters should only appear escaped in the input stream unless they are part of a lexical unit, chunk or superblank.

* The characters <code>^</code> and <code>$</code> are reserved for delimiting lexical units
* The character <code>/</code> is reserved for delimiting analyses in ambiguous lexical units
* The characters <code>&lt;</code> and <code>&gt;</code> are reserved for encapsulating tags
* The characters <code>{</code> and <code>}</code> are reserved for delimiting chunks
* The character <code>\</code> is the escape character
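For illustration, here is a minimal Python sketch (my own, not part of any Apertium tool) of escaping and unescaping these reserved characters with the backslash:

```python
# Minimal sketch (not an official Apertium API): backslash-escape the
# reserved stream characters, and undo the escaping again.
RESERVED = set("^$/<>{}\\")


def escape(text):
    """Prefix every reserved character with a backslash."""
    return "".join("\\" + c if c in RESERVED else c for c in text)


def unescape(text):
    """Drop the backslash before each escaped character."""
    out, chars = [], iter(text)
    for c in chars:
        # After a backslash, take the next character literally.
        out.append(next(chars) if c == "\\" else c)
    return "".join(out)
```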
 
===Special===

The following have special meaning at the start of an analysis:

* Asterisk, '<code><nowiki>*</nowiki></code>' -- Unanalysed word.
* At sign, '<code><nowiki>@</nowiki></code>' -- Untranslated [[lemma]].
* Hash sign, '<code><nowiki>#</nowiki></code>'
** In morphological generation -- Unable to generate [[surface form]] from [[lexical unit]] (escape this to use # in lemmas)
** In morphological analysis -- Start of the invariable part of a multiword marker (escape this to use # in lemmas)
* Plus symbol, '<code><nowiki>+</nowiki></code>' -- Joined lexical units (escape this to use + in lemmas)
* Tilde, '<code><nowiki>~</nowiki></code>' -- Word needs treating by the [[post-generator]]
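As a quick illustration, here is a hypothetical helper (the names are mine, not Apertium's) that maps the marker at the start of an analysis to its meaning:

```python
# Hypothetical helper (names are illustrative, not from Apertium):
# map the marker at the start of an analysis string to its meaning.
MARKERS = {
    "*": "unanalysed word",
    "@": "untranslated lemma",
    "#": "generation failed",
    "~": "needs post-generation",
}


def classify(analysis):
    """Return the meaning of the leading marker, or "ok" if there is none."""
    return MARKERS.get(analysis[:1], "ok")
```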
==Python parsing library==

If you're writing a Python script that needs to handle the Apertium stream format, try the excellent https://github.com/apertium/streamparser which lets you do

<pre>
from streamparser import parse_file, mainpos, reading_to_string

for blank, lu in parse_file(file, with_text=True):  # file: any open file object
    analyses = lu.readings
    firstreading = analyses[0]
    surfaceform = lu.wordform
    # rewrite to print only the first reading (and surface/word form):
    print("^{}/{}$".format(surfaceform,
                           reading_to_string(firstreading)))
    # convenience function to grab the first part of speech of the first reading:
    pos = mainpos(lu)
</pre>

etc. without having to worry about superblanks and escaped characters and such :-)
Here's an example used in testvoc; this one splits ambiguous readings like <code><nowiki>^foo/bar<n>/fie<ij>$</nowiki></code> into <code><nowiki>^foo/bar<n>$ ^foo/fie<ij>$</nowiki></code>, keeping the (super)blanks and newlines in between unchanged:

<pre>
import sys

from streamparser import parse_file, reading_to_string

for blank, lu in parse_file(sys.stdin, with_text=True):
    print(blank+" ".join("^{}/{}$".format(lu.wordform, reading_to_string(r))
                         for r in lu.readings),
          end="")
</pre>
Here's a one-liner to print the lemmas of each word:

<pre>
$ echo fisk bank kake | lt-proc nno-nob.automorf.bin | python3 -c 'import sys, streamparser; print("\n".join("\t".join(set(s.baseform for r in lu.readings for s in r)) for lu in streamparser.parse_file(sys.stdin)))'
</pre>
An alternative Python library: https://github.com/krvoje/apertium-transfer-dsl/blob/master/apertium/stream_entities.py and https://github.com/krvoje/apertium-transfer-dsl/blob/master/apertium/stream_reader.py
==Ruby parsing library==

If you're writing a Ruby script that needs to handle the Apertium stream format, you might want to try https://github.com/veer66/reinarb, which seems similar to the Python streamparser.
==Formatted input==
{{see-also|Format handling}}

F = formatted text, T = text to be analysed.

Formatted text is treated as a single whitespace by all stages.

<pre>
[<em>]this is[<\/em> ]a[ <b>]test.[][<\/b>]

|____|       |_______| |____|     |_______|
   |            |        |            |
   F            F        F            F

[<em>]this is[<\/em> ]a[ <b>]test.[][<\/b>]
      |______|        |      |____|
          |           |        |
          T           T        T
</pre>
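The bracketed blocks are superblanks. As a rough sketch (my own, not the official deformatter, and assuming no escaped <code>]</code> occurs inside a superblank), splitting a stream into formatted (F) and analysable (T) segments could look like:

```python
import re

# Rough sketch, not the official deformatter: split a stream into
# F segments (superblanks like "[<em>]") and T segments (plain text).
# Assumes no escaped "]" occurs inside a superblank.
SUPERBLANK = re.compile(r"\[[^\]]*\]")


def segments(stream):
    pos = 0
    for m in SUPERBLANK.finditer(stream):
        if m.start() > pos:
            yield ("T", stream[pos:m.start()])  # text between superblanks
        yield ("F", m.group())                  # the superblank itself
        pos = m.end()
    if pos < len(stream):
        yield ("T", stream[pos:])               # trailing text
```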
   
 
==Analyses==

S = surface form, L = lemma.

<pre>
^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$

   |    | |________|
   S    L    TAGS
        |______|
        ANALYSIS

|_____________________________________________|
          AMBIGUOUS LEXICAL UNIT

^vino<n><m><sg>$

|______________|
 DISAMBIGUATED
  LEXICAL UNIT

^dímelo/decir<vblex><imp><p2><sg>+me<prn><enc><p1><mf><sg>+lo<prn><enc><p3><nt>/decir<vblex><imp><p2><sg>+me<prn><enc><p1><mf><sg>+lo<prn><enc><p3><m><sg>$

                                 |____________________________________________|
                                                JOINED MORPHEMES

^take it away/take<vblex><sep><inf>+prpers<prn><obj><p3><nt><sg># away/take<vblex><sep><pres>+prpers<prn><obj><p3><nt><sg># away$

              |___|                                             |_____|
                |                                                   |
             LEMMA HEAD                                        LEMMA QUEUE
</pre>
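The delimiters above can be pulled apart with a few lines of Python. This sketch (my own, ignoring escaped delimiters) splits one lexical unit into its surface form and readings:

```python
# Sketch only (ignores escaped ^, $ and /): split one lexical unit of the
# form "^surface/reading1/reading2$" into its surface form and readings.
def split_lu(lu):
    assert lu.startswith("^") and lu.endswith("$"), "not a lexical unit"
    surface, *readings = lu[1:-1].split("/")
    return surface, readings
```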
   
 
==Chunks==
{{see-also|Chunking}}

<pre>
^Verbcj<SV><vblex><ifi><p3><sg>{^come<vblex><ifi><p3><sg>$}$ ^pr<PREP>{^to<pr>$}$ ^det_nom<SN><f><sg>{^the<det><def><3>$ ^beach<n><3>$}$

   |   |______________________||__________________________|                                                          |
 CHUNK      CHUNK TAGS              LEXICAL UNITS IN                                                               LINKED
  NAME                                  THE CHUNK                                                                   TAG

   |________________________________________|
                       |
                     CHUNK


^det_nom<SN><f><sg>{^the<det><def><3>$ ^beach<n><3>$}$
        <1> <2> <3>

                                   |______________|
                                          |
                                POINTERS TO CHUNK TAGS
</pre>
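To make the pointer mechanism concrete, here is an illustrative Python sketch (my own names, ignoring escapes and assuming a well-formed chunk) that extracts a chunk's name, tags and lexical units, and resolves numeric tag pointers like <code><nowiki><1></nowiki></code> against the chunk tags:

```python
import re

# Illustrative only (my own code, assumes a well-formed chunk with no
# escaped delimiters): parse ^name<tags>{^lu$ ^lu$}$ and resolve numeric
# tag pointers like <2> against the chunk's own tags.
CHUNK = re.compile(r"\^([^<{]+)((?:<[^>]+>)*)\{(.*)\}\$")


def parse_chunk(chunk):
    name, rawtags, body = CHUNK.match(chunk).groups()
    tags = re.findall(r"<([^>]+)>", rawtags)          # e.g. ["SN", "f", "sg"]
    lus = re.findall(r"\^(.*?)\$", body)              # lexical units in the chunk
    # <1> points at the first chunk tag, <2> at the second, and so on:
    resolved = [re.sub(r"<(\d+)>",
                       lambda m: "<%s>" % tags[int(m.group(1)) - 1], lu)
                for lu in lus]
    return name, tags, resolved
```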
   
==See also==

* [[List of symbols]]
* [[Meaning of symbols * @ and dieze after a translation]]
* [[apertium-cleanstream]], which lets you avoid ad-hoc bash one-liners to get one word per line

[[Category:Documentation]]
[[Category:Formats]]
[[Category:Documentation in English]]
