User:Mlforcada/Robust LR for Transfer

From Apertium
Jump to navigation Jump to search

We need a way to implement Apertium4 Bison grammars that are robust.

Here, Bison GRAMMAR is a LALR(1) grammar with actions between braces.

There should be a way (or a procedure) to complete handwritten rules, if possible automatically, to generate a robust parser (and translator). The idea is to take the handwritten Bison grammar and complement it with automatically-generated glue rules in such a way that conflicts are not produced (or are harmless) to produce a new grammar .

One possible way to do so is to model left-to-right restarts as follows:

  • by accepting the longest possible constituent (in the original grammar that cannot be merged with the remaining output and generating some kind of translation for it
  • and treating the remaining output as a complete sentence again

For instance, if the grammar is:

S : NP VP  { write(agree(nom($1),$2)) };
NP : n   { write($1) } ;
VP : v  { write($1) } 
   | v NP  { write(acc($2),$1) };

the sequence could not be parsed and would lead to an error. However, it could be seen as a VP followed by a NP, and a translation could be generated for both separately. One could augment the grammar by systematically adding some rules:

S : NP VP  { write(agree(nom($1),$2)) }
  | NP TryS { write($1); write($2) }      # translate an NP and skip
  | VP TryS { write($1); write($2) }      # translate a VP and skip
  ;
TryS : S  { write($1); }
     | ;                                  # the remaining part may be empty
NP : n   { write($1) } ;
VP : v  { write($1) } 
   | v NP  { write(acc($2),$1) };

However, this may generate conflicts, and there should be a way to avoid them while keeping generality. In fact, many conflicts are caused by ambiguity in the grammar (that is, the sentence is already covered).

One can generate TryS rules for all left-hand-sides incrementally, but maybe order is important, so some ordering of constituents may be useful. Unary productions should be avoided. There is no easy way to avoid that.

Another strategy would just add rules of the form

S : terminal S 

for all terminals, which would (probably) skip one token and try again, but we have to make sure that this does not shadow other rules (although this would be detected by conflicts in the grammar, I assume). This could work a bit like a specialized error token. Not all (terminal S) rules could be added, but the ones added would add a lot of coverage, as they would in principle accept any sequence of words through S → terminal S → terminal terminal S ... etc. However, the problem of conflicts remains. Perhaps just adding rules for those terminals not producing a conflict would work.

Then there is the problem of words which are not recognized by the lexer. We need some kind of default terminal for them, with a default action.