Apertium stream format
This page describes the stream format used in the Apertium machine translation platform.
Characters
Reserved
Reserved characters should only appear escaped in the input stream unless they are part of a lexical unit, chunk or superblank (see the escaping sketch after the list below).
- The characters ^ and $ are reserved for delimiting lexical units
- The character / is reserved for delimiting analyses in ambiguous lexical units
- The characters < and > are reserved for encapsulating tags
- The characters { and } are reserved for delimiting chunks
- The character \ is the escape character
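As a rough illustration of the escaping rule, here is a minimal Python sketch (my own, not part of any Apertium tool; the character set is simply the list above) that backslash-escapes reserved characters before plain text is placed into a stream:

# Characters reserved by the stream format (see the list above).
RESERVED = set('^$/<>{}\\')

def escape(text):
    # Prefix every reserved character with the escape character '\'.
    return ''.join('\\' + c if c in RESERVED else c for c in text)

print(escape('3 < 4 and $5'))   # prints: 3 \< 4 and \$5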
Special
- Asterisk, '*' -- Unanalysed word.
- At sign, '@' -- Untranslated lemma.
- Hash sign, '#'
  - In morphological generation -- Unable to generate surface form from lexical unit.
  - In morphological analysis -- Start of invariable part of multiword marker.
- Plus symbol, '+' -- Joined lexical units.
- Tilde, '~' -- Word needs treating by post-generator.
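As a small illustration of these markers (a sketch of my own with a made-up function name, not from any Apertium library), a reading's leading symbol can be inspected like this:

def classify(reading):
    # Rough classification of a single reading by its leading marker,
    # following the list above; purely illustrative.
    if reading.startswith('*'):
        return 'unanalysed word'
    if reading.startswith('@'):
        return 'untranslated lemma'
    if reading.startswith('#'):
        return 'generation failure (or invariable multiword part)'
    return 'ordinary reading'

print(classify('*foo'))              # unanalysed word
print(classify('vino<n><m><sg>'))    # ordinary reading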
Python parsing library
If you're writing a Python script that needs to handle the Apertium stream format, try the excellent https://github.com/apertium/streamparser, which lets you do "for lu in parse(file): analyses = lu.readings" and so on without having to worry about superblanks, escaped characters and such :-)
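For instance, something along these lines should work (a sketch based on the call quoted above; the wordform attribute is my recollection of the library's API and worth checking against its README):

from streamparser import parse

sample = '^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$'

for lu in parse(sample):
    analyses = lu.readings   # one entry per analysis between the '/' separators
    print(lu.wordform, '->', len(analyses), 'readings')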
Formatted input
- See also: Format handling
F = formatted text, T = text to be analysed.
Formatted text is treated as a single whitespace by all stages.
[<em>]this is[<\/em> ]a[ <b>]test.[][<\/b>]
|____|       |_______| |____|     |_______|
   |             |       |            |
   F             F       F            F

[<em>]this is[<\/em> ]a[ <b>]test.[][<\/b>]
      |_____|         |      |___|
         |            |        |
         T            T        T
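A rough way to see the F/T split programmatically is the sketch below (my own illustration using a simple regular expression, not Apertium code; it ignores escaped ']' inside superblanks):

import re

line = '[<em>]this is[<\\/em> ]a[ <b>]test.[][<\\/b>]'

# Superblanks are bracketed; the pieces between them are text to be analysed.
for piece in re.split(r'(\[[^\]]*\])', line):
    if piece:
        kind = 'F' if piece.startswith('[') else 'T'
        print(kind, repr(piece))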
Analyses
S = surface form, L = lemma.
^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$
  |    |  |________|
  S    L     TAGS
      |____________|
         ANALYSIS
|_____________________________________________|
            AMBIGUOUS LEXICAL UNIT

^vino<n><m><sg>$
|______________|
DISAMBIGUATED LEXICAL UNIT

^dímelo/decir<vblex><imp><p2><sg>+me<prn><enc><p1><mf><sg>+lo<prn><enc><p3><nt>/decir<vblex><imp><p2><sg>+me<prn><enc><p1><mf><sg>+lo<prn><enc><p3><m><sg>$
                                 |____________________________________________|
                                                JOINED MORPHEMES

^take it away/take<vblex><sep><inf>+prpers<prn><obj><p3><nt><sg># away/take<vblex><sep><pres>+prpers<prn><obj><p3><nt><sg># away$
              |__|                                              |____|
               |                                                  |
          LEMMA HEAD                                         LEMMA QUEUE
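As a naive illustration of this layout (a sketch of my own that ignores escaping and subreading detail; real scripts should use the streamparser library above):

lu = 'vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>'   # contents between '^' and '$'

surface, *readings = lu.split('/')
print('surface form:', surface)
for reading in readings:
    lemma = reading.split('<', 1)[0]
    print('  lemma:', lemma, 'tags:', reading[len(lemma):])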
Chunks
- See also: Chunking
^Verbcj<SV><vblex><ifi><p3><sg>{^come<vblex><ifi><p3><sg>$}$ ^pr<PREP>{^to<pr>$}$ ^det_nom<SN><f><sg>{^the<det><def><3>$ ^beach<n><3>$}$
   |   |______________________||__________________________|                                              |
   |              |                         |                                                            |
 CHUNK       CHUNK TAGS             LEXICAL UNITS IN                                                   LINKED
 NAME                                  THE CHUNK                                                         TAG
|__________________________________________________________|
                             |
                           CHUNK

^det_nom<SN><f><sg>{^the<det><def><3>$ ^beach<n><3>$}$
        |_________|
             |
  POINTERS TO CHUNK TAGS
        <1> <2> <3>
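To make the pointer mechanism concrete, here is a small Python sketch (my own illustration, ignoring escaping; not an Apertium implementation) that resolves the <N> pointers in a chunk's lexical units against the chunk tags:

import re

chunk = '^det_nom<SN><f><sg>{^the<det><def><3>$ ^beach<n><3>$}$'

head, body = chunk[1:-2].split('{', 1)       # drop the outer '^' ... '}$'
chunk_tags = re.findall(r'<([^>]*)>', head)  # ['SN', 'f', 'sg']
print('chunk name:', head.split('<', 1)[0], 'tags:', chunk_tags)

for lu in re.findall(r'\^(.*?)\$', body):
    # Replace each <N> pointer with the N-th chunk tag (1-based).
    resolved = re.sub(r'<(\d+)>',
                      lambda m: '<' + chunk_tags[int(m.group(1)) - 1] + '>',
                      lu)
    print(resolved)   # the<det><def><sg>  /  beach<n><sg>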
See also
- List of symbols
- Meaning of symbols * @ and dieze after a translation
- apertium-cleanstream, which lets you avoid ad-hoc bash one-liners to get one word per line