Difference between revisions of "Apertium stream format"

From Apertium
m (Reverted edits by 81.246.69.108 (Talk) to last revision by Francis Tyers)

Revision as of 10:45, 22 June 2010

This page describes the stream format used in the Apertium machine translation platform.

Characters

Reserved

Reserved characters should only appear escaped in the input stream unless they are part of a lexical unit, chunk or superblank.

  • The characters ^ and $ are reserved for delimiting lexical units
  • The character / is reserved for delimiting analyses in ambiguous lexical units
  • The characters < and > are reserved for encapsulating tags
  • The characters { and } are reserved for delimiting chunks
  • The character \ is the escape character
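As an illustration of the escaping rule above, here is a minimal Python sketch. The function names `escape_reserved` and `unescape_reserved` are hypothetical and not part of Apertium, and the sketch assumes that only the characters listed above need escaping.

```python
import re

# Reserved characters taken from the list above: ^ $ / < > { } and the backslash.
# Assumption: no other characters need escaping.
_RESERVED = re.compile(r"([\^$/<>{}\\])")


def escape_reserved(text: str) -> str:
    """Prefix every reserved character in plain text with a backslash."""
    return _RESERVED.sub(r"\\\1", text)


def unescape_reserved(text: str) -> str:
    """Drop the backslash in front of any escaped character."""
    return re.sub(r"\\(.)", r"\1", text)
```

Plain text such as `3 < 4` would be escaped before entering the stream and unescaped again once it leaves the pipeline.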

Special

  • Asterisk, '*' -- Unanalysed word.
  • At sign, '@' -- Untranslated lemma.
  • Hash sign, '#'
    • In morphological generation -- Unable to generate surface form from lexical unit.
    • In morphological analysis -- Start of invariable part of multiword marker.
  • Plus symbol, '+' -- Joined lexical units.
  • Tilde, '~' -- Word needs treating by the post-generator.
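The special symbols can be read as a small lookup table. The following Python sketch assumes that a marker appears as the first character of a form (for example `*word`); the names `MARKERS` and `leading_marker` are hypothetical. The joining symbol '+' and the multiword use of '#' in analysis occur inside a lexical unit rather than at its start, so they are not covered here.

```python
# Meanings follow the list above. '#' is given its morphological generation
# meaning only. Assumption: the marker is the first character of the form.
MARKERS = {
    "*": "unanalysed word",
    "@": "untranslated lemma",
    "#": "unable to generate surface form",
    "~": "needs treating by post-generator",
}


def leading_marker(form: str):
    """Return (meaning, rest) if the form starts with a marker, else (None, form)."""
    if form and form[0] in MARKERS:
        return MARKERS[form[0]], form[1:]
    return None, form
```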

Formatted input

See also: Superblanks

F = formatted text, T = text to be analysed.

Formatted text is treated as a single whitespace character by all stages.


[<em>]this is[<\/em> ]a[ <b>]test.[][<\/b>]

|____|       |_______| |____|     |_______|
   |            |        |            |
   F            F        F            F
    
[<em>]this is[<\/em> ]a[ <b>]test.[][<\/b>]
      |______|        |      |____|
          |           |        | 
          T           T        T
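The split into F and T segments shown above can be sketched in Python as follows. This is only an illustration, not the Apertium deformatter: the function name `split_formatted` is hypothetical, and the sketch assumes that formatted text is always enclosed in square brackets and that a backslash escapes the character after it.

```python
def split_formatted(stream: str):
    """Return a list of ('F', text) and ('T', text) segments.

    'F' segments are bracketed blocks of formatted text, 'T' segments are
    the text to be analysed. Escaped characters are kept verbatim.
    """
    segments = []
    buf = []
    in_blank = False
    i = 0
    while i < len(stream):
        ch = stream[i]
        if ch == "\\" and i + 1 < len(stream):
            # Keep the backslash and the escaped character together.
            buf.append(stream[i:i + 2])
            i += 2
            continue
        if ch == "[" and not in_blank:
            if buf:
                segments.append(("T", "".join(buf)))
            buf = [ch]
            in_blank = True
        elif ch == "]" and in_blank:
            buf.append(ch)
            segments.append(("F", "".join(buf)))
            buf = []
            in_blank = False
        else:
            buf.append(ch)
        i += 1
    if buf:
        # An unclosed bracket at the end is still reported as formatted text.
        segments.append(("F" if in_blank else "T", "".join(buf)))
    return segments
```

Applied to the example line, the T segments are `this is`, `a` and `test.`. Adjacent formatted blocks such as `[][<\/b>]` are returned as two separate F segments rather than merged into one.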

Analyses

S = surface form, L = lemma.


^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$

   |    | |________|
   S    L    TAGS
        |______|
        ANALYSIS

|_____________________________________________|
          AMBIGUOUS LEXICAL UNIT

^vino<n><m><sg>$

|______________|
 DISAMBIGUATED
  LEXICAL UNIT

^dímelo/decir<vblex><imp><p2><sg>+me<prn><enc><p1><mf><sg>+lo<prn><enc><p3><nt>/decir<vblex><imp><p2><sg>+me<prn><enc><p1><mf><sg>+lo<prn><enc><p3><m><sg>$

                                 |____________________________________________|
                                                JOINED MORPHEMES

^take it away/take<vblex><sep><inf>+prpers<prn><obj><p3><nt><sg># away/take<vblex><sep><pres>+prpers<prn><obj><p3><nt><sg># away$

              |___|                                             |_____|
                |                                                   |
             LEMMA HEAD                                        LEMMA QUEUE
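The structure of an ambiguous lexical unit can be made concrete with a small parser. The Python sketch below is an illustration under simplifying assumptions: the unit contains no escaped reserved characters, the first field is the surface form (as in analyser output), and the names `parse_reading` and `parse_lexical_unit` are hypothetical.

```python
import re

TAG = re.compile(r"<([^<>]+)>")


def parse_reading(reading: str):
    """Split one reading such as 'vino<n><m><sg>' into (lemma, tags)."""
    lemma = reading.split("<", 1)[0]
    return lemma, TAG.findall(reading)


def parse_lexical_unit(unit: str):
    """Parse '^surface/analysis/...$' into (surface, list of analyses)."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("not a lexical unit: %r" % unit)
    surface, *analyses = unit[1:-1].split("/")
    parsed = []
    for analysis in analyses:
        # '#' starts the invariable part of a multiword (the lemma queue).
        body, _, queue = analysis.partition("#")
        # '+' joins the lexical units that make up one analysis.
        readings = [parse_reading(r) for r in body.split("+")]
        parsed.append({"readings": readings, "queue": queue})
    return surface, parsed
```

For the `take it away` example, each analysis yields two joined readings (`take` and `prpers`) and the queue ` away`.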

Chunks

See also: Chunks

^Verbcj<SV><vblex><ifi><p3><sg>{^come<vblex><ifi><p3><sg>$}$ ^pr<PREP>{^to<pr>$}$ ^det_nom<SN><f><sg>{^the<det><def><3>$ ^beach<n><3>$}$

   |   |______________________||__________________________|                                                          |
 CHUNK      CHUNK TAGS              LEXICAL UNITS IN                                                               LINKED
  NAME                                  THE CHUNK                                                                   TAG

   |________________________________________|
                       |
                     CHUNK



^det_nom<SN><f><sg>{^the<det><def><3>$ ^beach<n><3>$}$

                                   |______________|
                                          |
                                POINTERS TO CHUNK TAGS
        <1> <2> <3>     
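The relation between chunk tags and linked tags can be sketched in Python as follows. This is a hedged illustration, not Apertium code: the names `parse_chunk` and `resolve_links` are hypothetical, and the sketch assumes a single chunk without escaped characters in which a numeric tag `<n>` points to the n-th chunk tag, counting from 1.

```python
import re

# ^name<tag>...{ lexical units }$
CHUNK = re.compile(r"\^([^<{]+)((?:<[^<>]+>)*)\{(.*?)\}\$")
TAG = re.compile(r"<([^<>]+)>")
UNIT = re.compile(r"\^(.*?)\$")


def parse_chunk(chunk: str):
    """Return (name, chunk tags, lexical units inside the chunk)."""
    m = CHUNK.fullmatch(chunk)
    if m is None:
        raise ValueError("not a chunk: %r" % chunk)
    name, tag_str, body = m.groups()
    return name, TAG.findall(tag_str), UNIT.findall(body)


def resolve_links(unit: str, chunk_tags):
    """Replace numeric tags such as <3> with the chunk tag at that position."""
    def repl(m):
        return "<" + chunk_tags[int(m.group(1)) - 1] + ">"
    return re.sub(r"<(\d+)>", repl, unit)
```

With the `det_nom` chunk above, the chunk tags are `SN`, `f` and `sg`, so the linked tag `<3>` in `the<det><def><3>` resolves to `<sg>`.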

See also

  • List of symbols