Difference between revisions of "Recursive transfer"
{{TOCD}}

==Todo==
* <s>Make the parser output optionally original parse tree (SL syntax) and target parse tree (TL syntax).</s>
* Attribute structures. These are defined in typical .t1x format with <code>def-attrs</code>
* Make the parser robust: we should never get parse errors, though our trees may be mangled.

==Process==
The parser has two trees, both are built simultaneously:
* The '''source''' tree is parser-internal
* The '''target''' tree is the "abstract syntax tree".
When a sentence terminal (<code>S</code>) is reached, the target tree is traversed and printed out.

===Deliverable 1===
* A program which reads a grammar using bison, parses a sentence and outputs the syntax tree as text, or graphViz or something.
** See: https://svn.code.sf.net/p/apertium/svn/branches/transfer4/format-parse.py

===Deliverable 2===
* Program which takes output of lt-proc -b (biltrans) and applies a grammar, doing only reordering (and "insertion"/"deletion"), no tag changes
** The grammar can be specified using a simple text-based CFG grammar formalism, converted into bison and compiled.
;Input:
<pre>
^Hau<prn><dem><sg>/This<prn><dem><sg>$
^irabazle<n>/winner<n><ND>$
^bat<num><sg>/a<det><ind><sg>$
^en<post>/of<pr>$
^historia<n>/story<n><ND>$
^a<det><art><sg>/the<det><def><sg>$
^izan<vbsint><pri><NR_HU>/be<vbser><pri><NR_HU>$
^.<sent>/.<sent>$
</pre>

;Output:
<pre>
^This<prn><dem><sg>$
^be<vbser><pri><NR_HU>$
^the<det><def><sg>$
^story<n><ND>$
^of<pr>$
^a<det><ind><sg>$
^winner<n><ND>$
^.<sent>$
</pre>

;Grammar
<pre>
S -> SN SV sent { $1 $2 $3 }
SV -> SN v { $2 $1 }
SN -> N3 art { $2 $1 } | N3 { $1 }
N3 -> SNGen N2 { $2 $1 } | N2 { $1 }
N2 -> nom { $1 } | prn { $1 }
SNGen -> SN genpost { $2 $1 }
sent -> "sent" { $1 }
v -> "vbser.*" { $1 } | "vblex.*" { $1 }
art -> "det.art.*" { $1 } | "num.sg" { $1 }
nom -> "n" { $1 }
prn -> "prn.*" { $1 }
</pre>
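The reordering actions in the grammar above can be traced by hand. The following Python sketch is only an illustration, not the bison-based implementation: it assumes the parse shown in the comments, assumes that <code>en</code> is the <code>genpost</code> terminal (the grammar does not define it), and uses bare lemmas instead of full lexical units. <code>Node</code>, <code>leaf</code>, <code>reduce_rule</code> and <code>unary</code> are hypothetical names. Each reduction builds a source node in input order and a target node in the order given by the rule's action.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Node:
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Collect the words under this node, left to right."""
        out: List[str] = []
        for child in self.children:
            out.extend(child.leaves() if isinstance(child, Node) else [child])
        return out


Pair = Tuple[Node, Node]  # (source node, target node)


def leaf(label: str, sl: str, tl: str) -> Pair:
    """A terminal: source-language word in one tree, its translation in the other."""
    return Node(label, [sl]), Node(label, [tl])


def reduce_rule(label: str, parts: List[Pair], action: List[int]) -> Pair:
    """One reduction: source children keep input order, target children follow the action."""
    source = Node(label, [src for src, _ in parts])
    target = Node(label, [parts[i - 1][1] for i in action])
    return source, target


def unary(labels: List[str], pair: Pair) -> Pair:
    """Apply a chain of single-child rules such as N2 -> nom { $1 }."""
    for label in labels:
        pair = reduce_rule(label, [pair], [1])
    return pair


# Hand-built parse of the example input (assumed, not produced by a real parser).
this = unary(["N2", "N3", "SN"], leaf("prn", "Hau", "This"))
a_winner = reduce_rule(
    "SN", [unary(["N2", "N3"], leaf("nom", "irabazle", "winner")), leaf("art", "bat", "a")], [2, 1])
of_a_winner = reduce_rule("SNGen", [a_winner, leaf("genpost", "en", "of")], [2, 1])
story_of = reduce_rule(
    "N3", [of_a_winner, unary(["N2"], leaf("nom", "historia", "story"))], [2, 1])
the_story = reduce_rule("SN", [story_of, leaf("art", "a", "the")], [2, 1])
predicate = reduce_rule("SV", [the_story, leaf("v", "izan", "be")], [2, 1])
source_tree, target_tree = reduce_rule("S", [this, predicate, leaf("sent", ".", ".")], [1, 2, 3])

print(" ".join(source_tree.leaves()))  # Hau irabazle bat en historia a izan .
print(" ".join(target_tree.leaves()))  # This be the story of a winner .
```

Reading the leaves of the target tree gives the same order as the output listed above.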

===Deliverable 3===
* An XML format for the rules, based on the current format, taking into account transfer operations

==Questions==
* What to do with a parse-fail.
** Implicit glue rules
*** How do we make sure that we never get <code>Syntax error</code> (e.g. really robust glue rules).
** the glue rules would not compute anything, just allow for partial parses
* How about unknown words...
* Do we try and find syntactic relations in the transfer, or do we pre-annotate (e.g. with CG) then use the tags from CG to constrain the parser?
* Can/should we do unification in the grammar (e.g. to avoid rules like SN -> adj n matching when adj.G and n.G are not the same)?
*: If a language uses CG, the rule SN -> @A→ @N would only match where CG mapped @A→ (and CG can do unification with less trouble, not mapping @A→ where gender differs)
** However, if we are to propagate attributes up the tree as well, it makes sense to have unification as well, so we can say <code>NP[gen=X] -> D[gen=X] N[gen=X]</code>
* Should the transfer allow for >1 possible TL translation, to allow 'lexical selection' inside transfer as well as outside?
* Can we learn transfer grammars from aligned treebanks?

==Algorithms==
* [http://en.wikipedia.org/wiki/GLR_parser GLR] (bottom-up)
* [http://en.wikipedia.org/wiki/Earley_parser Earley] (top-down)

==Usage==
<pre>
$ svn co https://svn.code.sf.net/p/apertium/svn/branches/transfer4
$ cd transfer4
$ cd eng-kaz
$ make
</pre>

;Files
* <code>eng-kaz.grammar</code>: Transfer grammar file for English→Kazakh
* <code>eng-kaz.t1x</code>: Categories (terminals) and attributes for English→Kazakh

;Apply the transfer grammar
<pre>
$ cat input/input.01.txt | ./eng-kaz.parser
^you<prn><subj><p2><mf><sp>/сен<prn><pers><subj><p2><mf><sp>$ ^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$ ^to<pr>/$
^go<vblex><past>/бар<v><iv><past>$ ^that<cnjsub>/$ ^I<prn><subj><p1><mf><sg>/Мен<prn><pers><subj><p1><mf><sg>$
^know<vblex><pres>/біл<v><tv><pres>$ ^.<sent>/.<sent>$
</pre>

;Print out the source tree
<pre>
$ cat input/input.01.txt | ./eng-kaz.parser -s -p >/dev/null
(S (S1 (PRNS (subj_pron (^I<prn><subj><p1><mf><sg>/Мен<prn><pers><subj><p1><mf><sg>$)))
(SV (V (pers_verb (^know<vblex><pres>/біл<v><tv><pres>$))))) (Ssub (cnjsub (^that<cnjsub>/$))
(S1 (PRNS (subj_pron (^you<prn><subj><p2><mf><sp>/сен<prn><pers><subj><p2><mf><sp>$)))
(SV (V (pers_verb (^go<vblex><past>/бар<v><iv><past>$))) (SP (prep (^to<pr>/$))
(SN1 (SN (N (nom (^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$))))))))) (X (sent (^.<sent>/.<sent>$))))
</pre>

;Print out the target tree
<pre>
$ cat input/input.01.txt | ./eng-kaz.parser -p >/dev/null
(S (Ssub (S1 (PRNS (subj_pron (^you<prn><subj><p2><mf><sp>/сен<prn><pers><subj><p2><mf><sp>$)))
(SV (SP (SN1 (SN (N (nom (^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$))))) (prep (^to<pr>/$)))
(V (pers_verb (^go<vblex><past>/бар<v><iv><past>$))))) (cnjsub (^that<cnjsub>/$)))
(S1 (PRNS (subj_pron (^I<prn><subj><p1><mf><sg>/Мен<prn><pers><subj><p1><mf><sg>$)))
(SV (V (pers_verb (^know<vblex><pres>/біл<v><tv><pres>$))))) (X (sent (^.<sent>/.<sent>$))))
</pre>

==References==

==Further reading==
* [[User:Mlforcada/Robust LR for Transfer]]
* Muhua Zhu, Jingbo Zhu and Huizhen Wang (2013) "Improving shift-reduce constituency parsing with large-scale unlabeled data". ''Natural Language Engineering''. October 2013, pp. 1-26
* http://www.cs.cmu.edu/~./alavie/papers/thesis.pdf
* http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-743.pdf

==See also==

==External links==
* [http://smlweb.cpsc.ucalgary.ca/start.html CFG tool]
* [http://erg.delph-in.net/logon LOGON: Parse with the ERG]

[[Category:Development]]
[[Category:Transfer]]
[[Category:Documentation in English]]

Latest revision as of 10:50, 9 February 2015
Todo
- Make the parser optionally output the original parse tree (SL syntax) and the target parse tree (TL syntax). (Done.)
- Attribute structures. These are defined in typical .t1x format with def-attrs
- Make the parser robust: we should never get parse errors, though our trees may be mangled.
Process
The parser has two trees, both are built simultaneously:
- The source tree is parser-internal
- The target tree is the "abstract syntax tree".
When a sentence terminal (S) is reached, the target tree is traversed and printed out.
Questions
- What to do with a parse-fail.
  - Implicit glue rules
    - How do we make sure that we never get Syntax error (e.g. really robust glue rules).
  - the glue rules would not compute anything, just allow for partial parses
- How about unknown words...
  - they would be some non-terminal UNK that would be glued by the all-encompassing glue rule from above.
- Ambiguous grammars -> can be automatically disambiguated ?
- Learn shift/reduce using target-language information ?
- Converting right-recursive to left-recursive grammars.
- How to apply macros in rules which have >1 non-terminal.
- What on earth to do with blanks / formatting...
- Do we try and find syntactic relations in the transfer, or do we pre-annotate (e.g. with CG) then use the tags from CG to constrain the parser?
- Can/should we do unification in the grammar (e.g. to avoid rules like SN -> adj n matching when adj.G and n.G are not the same)?
  - If a language uses CG, the rule SN -> @A→ @N would only match where CG mapped @A→ (and CG can do unification with less trouble, not mapping @A→ where gender differs)
  - However, if we are to propagate attributes up the tree as well, it makes sense to have unification as well, so we can say NP[gen=X] -> D[gen=X] N[gen=X]
- Should the transfer allow for >1 possible TL translation, to allow 'lexical selection' inside transfer as well as outside?
- Can we learn transfer grammars from aligned treebanks ?
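
The unification question above can be made concrete with a small sketch. This is not part of the transfer4 code; it only shows what a rule such as NP[gen=X] -> D[gen=X] N[gen=X] might check. The function name apply_rule, the dictionary representation of attributes, and the "?X" notation for variables are all assumptions made for this illustration.

```python
from typing import Dict, List, Optional

Features = Dict[str, str]


def apply_rule(parent: Features,
               child_patterns: List[Features],
               children: List[Features]) -> Optional[Features]:
    """Return the parent's attributes if the children unify with the rule, else None.

    Values starting with "?" are variables: every occurrence of the same
    variable must be bound to the same attribute value.
    """
    if len(child_patterns) != len(children):
        return None
    bindings: Dict[str, str] = {}
    for pattern, feats in zip(child_patterns, children):
        for attr, want in pattern.items():
            have = feats.get(attr)
            if have is None:
                return None  # the child lacks a required attribute
            if want.startswith("?"):
                # Bind the variable on first use, then require agreement.
                if bindings.setdefault(want, have) != have:
                    return None
            elif want != have:
                return None  # a constant value does not match
    # Propagate bound values up to the parent node.
    return {attr: bindings.get(val, val) for attr, val in parent.items()}


# NP[gen=X] -> D[gen=X] N[gen=X]
np_parent = {"gen": "?X"}
np_children = [{"gen": "?X"}, {"gen": "?X"}]

print(apply_rule(np_parent, np_children, [{"gen": "f"}, {"gen": "f"}]))  # {'gen': 'f'}
print(apply_rule(np_parent, np_children, [{"gen": "m"}, {"gen": "f"}]))  # None
```

With agreement, the gender is propagated up to the NP; with a mismatch, the rule does not apply, so the parser would have to fall back on another rule.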
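Algorithms
- GLR (bottom-up): http://en.wikipedia.org/wiki/GLR_parser
- Earley (top-down): http://en.wikipedia.org/wiki/Earley_parser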
Usage
$ svn co https://svn.code.sf.net/p/apertium/svn/branches/transfer4
$ cd transfer4
$ cd eng-kaz
$ make

Files
- eng-kaz.grammar: Transfer grammar file for English→Kazakh
- eng-kaz.t1x: Categories (terminals) and attributes for English→Kazakh

Apply the transfer grammar
$ cat input/input.01.txt | ./eng-kaz.parser
^you<prn><subj><p2><mf><sp>/сен<prn><pers><subj><p2><mf><sp>$ ^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$ ^to<pr>/$
^go<vblex><past>/бар<v><iv><past>$ ^that<cnjsub>/$ ^I<prn><subj><p1><mf><sg>/Мен<prn><pers><subj><p1><mf><sg>$
^know<vblex><pres>/біл<v><tv><pres>$ ^.<sent>/.<sent>$

Print out the source tree
$ cat input/input.01.txt | ./eng-kaz.parser -s -p >/dev/null
(S (S1 (PRNS (subj_pron (^I<prn><subj><p1><mf><sg>/Мен<prn><pers><subj><p1><mf><sg>$)))
(SV (V (pers_verb (^know<vblex><pres>/біл<v><tv><pres>$))))) (Ssub (cnjsub (^that<cnjsub>/$))
(S1 (PRNS (subj_pron (^you<prn><subj><p2><mf><sp>/сен<prn><pers><subj><p2><mf><sp>$)))
(SV (V (pers_verb (^go<vblex><past>/бар<v><iv><past>$))) (SP (prep (^to<pr>/$))
(SN1 (SN (N (nom (^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$))))))))) (X (sent (^.<sent>/.<sent>$))))

Print out the target tree
$ cat input/input.01.txt | ./eng-kaz.parser -p >/dev/null
(S (Ssub (S1 (PRNS (subj_pron (^you<prn><subj><p2><mf><sp>/сен<prn><pers><subj><p2><mf><sp>$)))
(SV (SP (SN1 (SN (N (nom (^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$))))) (prep (^to<pr>/$)))
(V (pers_verb (^go<vblex><past>/бар<v><iv><past>$))))) (cnjsub (^that<cnjsub>/$)))
(S1 (PRNS (subj_pron (^I<prn><subj><p1><mf><sg>/Мен<prn><pers><subj><p1><mf><sg>$)))
(SV (V (pers_verb (^know<vblex><pres>/біл<v><tv><pres>$))))) (X (sent (^.<sent>/.<sent>$))))
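
The bracketed trees printed above are easy to read back in for inspection. The sketch below is not part of transfer4; it assumes the format is exactly as shown (a label followed by children, with each lexical unit wrapped in its own pair of parentheses and containing no parentheses itself). The names parse_tree, leaves and target_side are hypothetical.

```python
import re
from typing import List, Union

Tree = Union[str, list]  # a lexical unit string, or [label, child, ...]

# A lexical unit (^...$), a single parenthesis, or a node label.
TOKEN = re.compile(r"\^[^$]*\$|[()]|[^\s()]+")


def parse_tree(text: str) -> Tree:
    """Parse one bracketed tree as printed by the parser's -p option."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node() -> Tree:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        if tokens[pos].startswith("^"):  # leaf: "(^...$)"
            unit = tokens[pos]
            pos += 2  # skip the unit and its closing ")"
            return unit
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(node())
        pos += 1  # closing ")" of this node
        return [label] + children

    return node()


def leaves(tree: Tree) -> List[str]:
    """Lexical units under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [unit for child in tree[1:] for unit in leaves(child)]


def target_side(unit: str) -> str:
    """Part of a bilingual lexical unit after the '/', e.g. the Kazakh side."""
    return unit[1:-1].split("/", 1)[1]


# The SP subtree of the target tree shown above.
sp = ("(SP (SN1 (SN (N (nom (^Kazakhstan<np><top><sg>/Қазақстан<np><top><nom>$))))) "
      "(prep (^to<pr>/$)))")
tree = parse_tree(sp)
print(tree[0])                                   # SP
print([target_side(u) for u in leaves(tree)])    # ['Қазақстан<np><top><nom>', '']
```

In the target tree the noun phrase precedes the preposition, which here has an empty target side.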
References
- Prószéky & Tihanyi (2002) "MetaMorpho: A Pattern-Based Machine Translation System"
- White (1985) "Characteristics of the METAL machine translation system at Production Stage" (§6)
- Slocum (1982) "The LRC Machine translation system: An application of State-of-the-Art ..." (p.18)
Further reading
- User:Mlforcada/Robust LR for Transfer
- Muhua Zhu, Jingbo Zhu and Huizhen Wang (2013) "Improving shift-reduce constituency parsing with large-scale unlabeled data". Natural Language Engineering. October 2013, pp. 1-26
- http://www.cs.cmu.edu/~./alavie/papers/thesis.pdf
- http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-743.pdf