Difference between revisions of "Recursive transfer"

 
(23 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
{{TOCD}}
   
+
==Todo==
+
* <s>Make the parser output optionally original parse tree (SL syntax) and target parse tree (TL syntax).</s>
+
* Attribute structures. These are defined in typical .t1x format with <code>def-attrs</code>
+
* Make the parser robust &mdash; we should never get parse errors, though our trees may be mangled.
+
==Process==
+
The parser has two trees, both are built simultaneously:
+
* The '''source''' tree is parser-internal
+
* The '''target''' tree is the "abstract syntax tree".
+
When a sentence terminal (<code>S</code>) is reached, the target tree is traversed and printed out.

==Deliverables==
===Deliverable 1===
* A program which reads a grammar using bison, parses a sentence and outputs the syntax tree as text, or graphViz or something.
** See: https://svn.code.sf.net/p/apertium/svn/branches/transfer4/format-parse.py
===Deliverable 2===
* Program which takes output of lt-proc -b (biltrans) and applies a grammar, doing only reordering (and "insertion"/"deletion"), no tag changes
** The input would be ^sl/tl$ and the output would be ^tl$
** The grammar can be specified using a simple text-based CFG grammar formalism, converted into bison and compiled.
;Input:
 
<pre>
 
^Hau<prn><dem><sg>/This<prn><dem><sg>$
 
^irabazle<n>/winner<n><ND>$
 
^bat<num><sg>/a<det><ind><sg>$
 
^en<post>/of<pr>$
 
^historia<n>/story<n><ND>$
 
^a<det><art><sg>/the<det><def><sg>$
 
^izan<vbsint><pri><NR_HU>/be<vbser><pri><NR_HU>$
 
^.<sent>/.<sent>$
 
</pre>
 
 
;Output:
 
<pre>
 
^This<prn><dem><sg>$
 
^be<vbser><pri><NR_HU>$
 
^the<det><def><sg>$
 
^story<n><ND>$
 
^of<pr>$
 
^a<det><ind><sg>$
 
^winner<n><ND>$
 
^.<sent>$
 
</pre>
 
 
;Grammar
 
 
<pre>
 
S -> SN SV sent { $1 $2 $3 }
 
SV -> SN v { $2 $1 }
 
SN -> N3 art { $2 $1 } | N3 { $1 }
 
N3 -> SNGen N2 { $2 $1 } | N2 { $1 }
 
N2 -> nom { $1 } | prn { $1 }
 
SNGen -> SN genpost { $2 $1 }
 
sent -> "sent" { $1 }
 
v -> "vbser.*" { $1 } | "vblex.*" { $1 }
 
art -> "det.art.*" { $1 } | "num.sg" { $1 }
 
nom -> "n" { $1 }
 
prn -> "prn.*" { $1 }
 
</pre>
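The grammar above can be tried out without bison. The following is a minimal Python sketch of the reordering step, not the transfer4 implementation: it parses by memoised search over spans rather than with a generated LR parser. Several things here are assumptions and are not in the original: the rule <code>genpost -> "post"</code> (the grammar uses <code>genpost</code> without defining it), the reading of a trailing <code>.*</code> as "zero or more further tags", and the choice to accept a terminal when either the SL or the TL tags fit the pattern (needed because <code>izan</code> is tagged <code>vbsint</code> on the source side). The names <code>RULES</code>, <code>TERMINALS</code> and <code>transfer</code> are illustrative.

```python
import re
from functools import lru_cache

# (left-hand side, right-hand side, output order as 1-based positions)
RULES = [
    ("S", ("SN", "SV", "sent"), (1, 2, 3)),
    ("SV", ("SN", "v"), (2, 1)),
    ("SN", ("N3", "art"), (2, 1)),
    ("SN", ("N3",), (1,)),
    ("N3", ("SNGen", "N2"), (2, 1)),
    ("N3", ("N2",), (1,)),
    ("N2", ("nom",), (1,)),
    ("N2", ("prn",), (1,)),
    ("SNGen", ("SN", "genpost"), (2, 1)),
]
TERMINALS = {
    "sent": ["sent"],
    "v": ["vbser.*", "vblex.*"],
    "art": ["det.art.*", "num.sg"],
    "nom": ["n"],
    "prn": ["prn.*"],
    "genpost": ["post"],  # assumption: not defined in the grammar above
}
UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")  # one ^sl/tl$ lexical unit


def tags_match(pattern, side):
    """Match a pattern such as "det.art.*" against the tags of one side."""
    tags = re.findall(r"<([^>]+)>", side)
    parts = pattern.split(".")
    if parts[-1] == "*":  # assumption: ".*" means zero or more further tags
        return tags[: len(parts) - 1] == parts[:-1]
    return tags == parts


def transfer(text):
    """Return the reordered ^tl$ units of one sentence, or None on parse failure."""
    units = UNIT.findall(text)

    @lru_cache(maxsize=None)
    def parse(sym, i, j):
        # Result: tuple of TL strings for sym covering units[i:j], or None.
        if sym in TERMINALS:
            if j - i == 1 and any(
                tags_match(p, side) for p in TERMINALS[sym] for side in units[i]
            ):
                return (units[i][1],)
            return None
        for lhs, rhs, order in RULES:
            if lhs == sym:
                kids = match_seq(rhs, i, j)
                if kids is not None:
                    # Emit the children in the order given by the rule action.
                    return tuple(tl for k in order for tl in kids[k - 1])
        return None

    def match_seq(rhs, i, j):
        # Split units[i:j] among the symbols of rhs, each covering >= 1 unit.
        if len(rhs) == 1:
            kid = parse(rhs[0], i, j)
            return None if kid is None else [kid]
        for k in range(i + 1, j - len(rhs) + 2):
            head = parse(rhs[0], i, k)
            if head is not None:
                rest = match_seq(rhs[1:], k, j)
                if rest is not None:
                    return [head] + rest
        return None

    result = parse("S", 0, len(units))
    return None if result is None else ["^" + tl + "$" for tl in result]
```

Under these assumptions, calling <code>transfer</code> on the text of the Input listing returns the eight units of the Output listing in the same order. The sketch returns the first parse it finds and does nothing about ambiguity or parse failure.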
 
 
===Deliverable 3===
 
 
* An XML format for the rules, based on the current format, taking into account transfer operations
 
   
 
==Questions==
Line 62: Line 20:
 
* What to do with a parse-fail.
** Implicit glue rules
  +
*** How do we make sure that we never get <code>Syntax error</code> (e.g. really robust glue rules).
** the glue rules would not compute anything, just allow for partial parses
* How about unknown words...
Line 70: Line 29:
 
* How to apply macros in rules which have >1 non-terminal.
* What on earth to do with blanks / formatting...
  +
* Do we try and find syntactic relations in the transfer, or do we pre-annotate (e.g. with CG) then use the tags from CG to constrain the parser?
  +
* Can/should we do unification in the grammar (e.g. to avoid rules like SN -> adj n matching when adj.G and n.G are not the same)?
  +
*: If a language uses CG, the rule SN -> @A→ @N would only match where CG mapped @A→ (and CG can do unification with less trouble, not mapping @A→ where gender differs)
  +
** However, if we are to propagate attributes up the tree as well, it makes sense to have unification as well, so we can say <code>NP[gen=X] -&gt; D[gen=X] N[gen=X]</code>
  +
* Should the transfer allow for >1 possible TL translation ? to allow 'lexical selection' inside transfer as well as outside ?
  +
* Can we learn transfer grammars from aligned treebanks ?
   
 
==Algorithms==
Line 77: Line 42:
 
* [http://en.wikipedia.org/wiki/GLR_parser GLR] (bottom-up)
* [http://en.wikipedia.org/wiki/Earley_parser Earley] (top-down)
  +
  +
==Usage==
  +
  +
<pre>
  +
$ svn co https://svn.code.sf.net/p/apertium/svn/branches/transfer4
  +
  +
$ cd transfer4
  +
  +
$ cd eng-kaz
  +
  +
$ make
  +
</pre>
  +
  +
;Files
  +
  +
* <code>eng-kaz.grammar</code>: Transfer grammar file for English→Kazakh
  +
* <code>eng-kaz.t1x</code>: Categories (terminals) and attributes for English→Kazakh
  +
  +
;Apply the transfer grammar
  +
  +
<pre>
  +
$ cat input/input.01.txt | ./eng-kaz.parser
  +
^you<prn><subj><p2><mf><sp>/сен<prn><pers><subj><p2><mf><sp>$ ^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$ ^to<pr>/$
  +
^go<vblex><past>/бар<v><iv><past>$ ^that<cnjsub>/$ ^I<prn><subj><p1><mf><sg>/Мен<prn><pers><subj><p1><mf><sg>$
  +
^know<vblex><pres>/біл<v><tv><pres>$ ^.<sent>/.<sent>$
  +
</pre>
  +
  +
; Print out the source tree
  +
  +
<pre>
  +
$ cat input/input.01.txt | ./eng-kaz.parser -s -p >/dev/null
  +
(S (S1 (PRNS (subj_pron (^I<prn><subj><p1><mf><sg>/Мен<prn><pers><subj><p1><mf><sg>$)))
  +
(SV (V (pers_verb (^know<vblex><pres>/біл<v><tv><pres>$))))) (Ssub (cnjsub (^that<cnjsub>/$))
  +
(S1 (PRNS (subj_pron (^you<prn><subj><p2><mf><sp>/сен<prn><pers><subj><p2><mf><sp>$)))
  +
(SV (V (pers_verb (^go<vblex><past>/бар<v><iv><past>$))) (SP (prep (^to<pr>/$))
  +
(SN1 (SN (N (nom (^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$))))))))) (X (sent (^.<sent>/.<sent>$))))
  +
</pre>
  +
  +
; Print out the target tree
  +
  +
<pre>
  +
$ cat input/input.01.txt | ./eng-kaz.parser -p >/dev/null
  +
(S (Ssub (S1 (PRNS (subj_pron (^you<prn><subj><p2><mf><sp>/сен<prn><pers><subj><p2><mf><sp>$)))
  +
(SV (SP (SN1 (SN (N (nom (^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$))))) (prep (^to<pr>/$)))
  +
(V (pers_verb (^go<vblex><past>/бар<v><iv><past>$))))) (cnjsub (^that<cnjsub>/$)))
  +
(S1 (PRNS (subj_pron (^I<prn><subj><p1><mf><sg>/Мен<prn><pers><subj><p1><mf><sg>$)))
  +
(SV (V (pers_verb (^know<vblex><pres>/біл<v><tv><pres>$))))) (X (sent (^.<sent>/.<sent>$))))
  +
</pre>
   
 
==References==
Line 83: Line 96:
 
* White (1985) "Characteristics of the METAL machine translation system at Production Stage" (§6)
* Slocum (1982) "The LRC Machine translation system: An application of State-of-the-Art ..." (p.18)
  +
  +
==Further reading==
  +
* [[User:Mlforcada/Robust LR for Transfer]]
  +
* MUHUA ZHU, JINGBO ZHU and HUIZHEN WANG (2013) "Improving shift-reduce constituency parsing with large-scale unlabeled data". ''Natural Language Engineering ''. October 2013, pp. 1--26
  +
* http://www.cs.cmu.edu/~./alavie/papers/thesis.pdf
  +
* http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-743.pdf
   
 
==See also==
Line 89: Line 108:
   
 
==External links==
  +
* [http://smlweb.cpsc.ucalgary.ca/start.html CFG tool]
  +
* [http://erg.delph-in.net/logon LOGON: Parse with the ERG]
[[Category:Development]]
[[Category:Transfer]]
  +
[[Category:Documentation in English]]

Latest revision as of 10:50, 9 February 2015

Todo

  • Make the parser optionally output the original parse tree (SL syntax) and the target parse tree (TL syntax). (done)
  • Attribute structures. These are defined in typical .t1x format with def-attrs
  • Make the parser robust — we should never get parse errors, though our trees may be mangled.
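
One way to read the robustness item, in the spirit of the glue rules raised under Questions: when no rule covers the whole sentence, cover it greedily with the longest spans that do parse and pass everything else through unchanged. The following is a minimal Python sketch under that reading, not the behaviour of the actual parser. The names `glue` and `parse_span` are hypothetical, and the toy rule `swap_pr` only stands in for a real grammar.

```python
def glue(units, parse_span):
    """Cover `units` left to right with the longest parsable spans.

    `parse_span(span)` returns the transferred span, or None if no rule
    applies. Units that no rule covers are passed through unchanged, so
    the function never fails, although the result may be only partly
    reordered.
    """
    out, i = [], 0
    while i < len(units):
        for j in range(len(units), i, -1):  # try the longest span first
            result = parse_span(units[i:j])
            if result is not None:
                out.extend(result)
                i = j
                break
        else:
            out.append(units[i])  # glue: copy one unit and move on
            i += 1
    return out


def swap_pr(span):
    """Toy rule: preposition + noun becomes noun + preposition."""
    if len(span) == 2 and span[0].endswith("<pr>") and span[1].endswith("<n>"):
        return [span[1], span[0]]
    return None
```

With this fallback an unknown word is simply a unit that no rule covers, so it is copied through and the units around it can still be transferred.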

Process

The parser builds two trees simultaneously:

  • The source tree is parser-internal
  • The target tree is the "abstract syntax tree".

When a sentence terminal (S) is reached, the target tree is traversed and printed out.
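
The two-tree idea can be sketched as follows. This is an illustrative Python sketch, not the parser's actual data structures: each rule application builds a source node with the children in source order and a target node with the children in the order given by the rule. The names `Node`, `leaf`, `reduce_rule` and `print_target` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str  # category name, or the lexical unit itself for a leaf
    children: list = field(default_factory=list)

    def bracket(self):
        """Render the tree in the bracketed style shown under Usage."""
        if not self.children:
            return "(" + self.label + ")"
        inner = " ".join(child.bracket() for child in self.children)
        return "(" + self.label + " " + inner + ")"


def leaf(unit):
    """A lexical unit is the same node in both trees."""
    node = Node(unit)
    return node, node


def reduce_rule(label, pairs, order):
    """Build the source and target nodes for one rule application.

    `pairs` holds the (source, target) nodes of the children in source
    order; `order` lists 1-based child positions in target order.
    """
    source = Node(label, [src for src, _ in pairs])
    target = Node(label, [pairs[k - 1][1] for k in order])
    return source, target


def print_target(pair):
    """At S only the target tree is traversed and printed."""
    return pair[1].bracket()
```

For example, reducing a preposition and a noun phrase with order `[2, 1]` leaves `prep` first in the source tree and last in the target tree, which is the difference visible in the `SP` nodes of the two trees under Usage.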

Questions

  • What to do with a parse-fail.
    • Implicit glue rules
      • How do we make sure that we never get Syntax error (e.g. really robust glue rules).
    • the glue rules would not compute anything, just allow for partial parses
  • How about unknown words...
    • They would be some non-terminal UNK that would be glued by the all-encompassing glue rule from above.
  • Ambiguous grammars: can they be automatically disambiguated?
    • Learn shift/reduce decisions using target-language information?
  • Converting right-recursive to left-recursive grammars.
  • How to apply macros in rules which have >1 non-terminal.
  • What on earth to do with blanks / formatting...
  • Do we try to find syntactic relations in the transfer, or do we pre-annotate (e.g. with CG) and then use the tags from CG to constrain the parser?
  • Can/should we do unification in the grammar (e.g. to avoid rules like SN -> adj n matching when adj.G and n.G are not the same)?
    If a language uses CG, the rule SN -> @A→ @N would only match where CG mapped @A→ (and CG can do unification with less trouble, not mapping @A→ where gender differs).
    • However, if we are to propagate attributes up the tree as well, it makes sense to have unification too, so we can say NP[gen=X] -> D[gen=X] N[gen=X].
  • Should the transfer allow for >1 possible TL translation, to allow 'lexical selection' inside transfer as well as outside?
  • Can we learn transfer grammars from aligned treebanks?
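
The unification question can be made concrete with a small sketch. The Python below is one possible reading of NP[gen=X] -> D[gen=X] N[gen=X], not an existing Apertium feature: variables shared between daughters must bind to the same value, and the bound values are propagated to the parent. The function name `unify_rule` and the dictionary representation of attributes are assumptions made for illustration.

```python
def unify_rule(lhs, rhs, children):
    """Apply a rule such as NP[gen=X] -> D[gen=X] N[gen=X].

    `lhs` and each entry of `rhs` map an attribute name to a variable
    name; `children` holds the attribute values of the matched daughters.
    Returns the attributes of the parent, or None if unification fails.
    Every variable used in `lhs` must also occur in `rhs`.
    """
    bindings = {}
    for pattern, feats in zip(rhs, children):
        for attr, var in pattern.items():
            value = feats.get(attr)
            # Fail if the attribute is missing or clashes with an earlier binding.
            if value is None or bindings.setdefault(var, value) != value:
                return None
    # Propagate the bound values up to the parent.
    return {attr: bindings[var] for attr, var in lhs.items()}


# NP[gen=X] -> D[gen=X] N[gen=X]
NP_LHS = {"gen": "X"}
NP_RHS = [{"gen": "X"}, {"gen": "X"}]
```

Here a determiner and noun that agree in gender yield an NP carrying that gender, while a mismatch makes the rule fail to match, which is the behaviour the SN -> adj n example asks for.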

Algorithms

Usage

$ svn co https://svn.code.sf.net/p/apertium/svn/branches/transfer4
$ cd transfer4
$ cd eng-kaz
$ make

Files
  • eng-kaz.grammar: Transfer grammar file for English→Kazakh
  • eng-kaz.t1x: Categories (terminals) and attributes for English→Kazakh

Apply the transfer grammar
$ cat input/input.01.txt | ./eng-kaz.parser 
^you<prn><subj><p2><mf><sp>/сен<prn><pers><subj><p2><mf><sp>$ ^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$ ^to<pr>/$ 
^go<vblex><past>/бар<v><iv><past>$ ^that<cnjsub>/$ ^I<prn><subj><p1><mf><sg>/Мен<prn><pers><subj><p1><mf><sg>$ 
^know<vblex><pres>/біл<v><tv><pres>$ ^.<sent>/.<sent>$

Print out the source tree
$ cat input/input.01.txt | ./eng-kaz.parser -s -p >/dev/null
(S (S1 (PRNS (subj_pron (^I<prn><subj><p1><mf><sg>/Мен<prn><pers><subj><p1><mf><sg>$))) 
(SV (V (pers_verb (^know<vblex><pres>/біл<v><tv><pres>$))))) (Ssub (cnjsub (^that<cnjsub>/$)) 
(S1 (PRNS (subj_pron (^you<prn><subj><p2><mf><sp>/сен<prn><pers><subj><p2><mf><sp>$))) 
(SV (V (pers_verb (^go<vblex><past>/бар<v><iv><past>$))) (SP (prep (^to<pr>/$)) 
(SN1 (SN (N (nom (^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$))))))))) (X (sent (^.<sent>/.<sent>$))))

Print out the target tree
$ cat input/input.01.txt | ./eng-kaz.parser -p >/dev/null
(S (Ssub (S1 (PRNS (subj_pron (^you<prn><subj><p2><mf><sp>/сен<prn><pers><subj><p2><mf><sp>$))) 
(SV (SP (SN1 (SN (N (nom (^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$))))) (prep (^to<pr>/$))) 
(V (pers_verb (^go<vblex><past>/бар<v><iv><past>$))))) (cnjsub (^that<cnjsub>/$))) 
(S1 (PRNS (subj_pron (^I<prn><subj><p1><mf><sg>/Мен<prn><pers><subj><p1><mf><sg>$))) 
(SV (V (pers_verb (^know<vblex><pres>/біл<v><tv><pres>$))))) (X (sent (^.<sent>/.<sent>$))))
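
The bracketed trees are easy to post-process. As a small illustration, the Python sketch below reads the lexical units off a printed tree from left to right, which is one way to compare the leaf order of the source and target trees. It assumes that `$` never occurs inside a unit and that the first `/` separates the SL side from the TL side; the function names are hypothetical.

```python
import re


def leaves(tree):
    """Return the ^sl/tl$ lexical units of a bracketed tree, left to right."""
    return re.findall(r"\^[^$]*\$", tree)


def tl_side(unit):
    """Return the target-language half of a ^sl/tl$ unit (may be empty)."""
    return unit[1:-1].split("/", 1)[1]
```

Applied to the two trees above, `leaves` gives the same set of units in two different orders; units such as `^to<pr>/$` have an empty TL side.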

References

  • Prószéky & Tihanyi (2002) "MetaMorpho: A Pattern-Based Machine Translation System"
  • White (1985) "Characteristics of the METAL machine translation system at Production Stage" (§6)
  • Slocum (1982) "The LRC Machine translation system: An application of State-of-the-Art ..." (p.18)

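Further reading

  • User:Mlforcada/Robust LR for Transfer
  • Muhua Zhu, Jingbo Zhu and Huizhen Wang (2013) "Improving shift-reduce constituency parsing with large-scale unlabeled data". Natural Language Engineering, October 2013, pp. 1-26
  • http://www.cs.cmu.edu/~./alavie/papers/thesis.pdf
  • http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-743.pdf

See also

External links

  • CFG tool: http://smlweb.cpsc.ucalgary.ca/start.html
  • LOGON: Parse with the ERG: http://erg.delph-in.net/logon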