User talk:Irene/proposal

From Apertium
Revision as of 04:07, 30 May 2017 by Irene (talk | contribs) (→‎updated proposal: new section)


  1. Re-examine every multiword to determine whether it can be discontinuous, e.g. call (something) off, cheer (someone) up
    • parsing for multiwords can be done with a language-independent search for the words marked with , but I think determining whether or not a specific word can be discontinuous has to be done by hand.
    • this differs from what I originally proposed.
  2. If a word is separable, tag it as such (introduce a new tag symbol).
    • the tag should record which categories of words (np, vp) can split it. this will be useful when it comes to chunking, and it would remove the need for some of the hacks that we're currently using
    • maybe this calls for creating a section in the paradigm definitions, since many follow the same pattern: call (something) off, cheer (someone) up, take (it) out are all verb-np-preposition
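
A minimal sketch of what such a tag could carry, written in Python rather than in any Apertium dictionary format. The names `SPLIT_PARADIGMS`, `SeparableEntry` and the paradigm label `verb-np-prep` are made up for illustration; the only idea taken from the notes above is that entries sharing a pattern can share one paradigm that lists which categories may split them.

```python
from dataclasses import dataclass

# Hypothetical paradigm table: paradigm name -> categories allowed in the gap.
# "verb-np-prep" stands for the pattern verb + (np) + preposition.
SPLIT_PARADIGMS = {
    "verb-np-prep": frozenset({"np"}),
}


@dataclass(frozen=True)
class SeparableEntry:
    head: str      # first part of the multiword, e.g. "call"
    particle: str  # second part, e.g. "off"
    paradigm: str  # key into SPLIT_PARADIGMS

    def can_be_split_by(self, category: str) -> bool:
        """Return True if a chunk of this category may sit between the parts."""
        return category in SPLIT_PARADIGMS[self.paradigm]


# Hand-checked entries (the notes say this check has to be done by hand).
LEXICON = [
    SeparableEntry("call", "off", "verb-np-prep"),
    SeparableEntry("cheer", "up", "verb-np-prep"),
    SeparableEntry("take", "out", "verb-np-prep"),
]
```

With this layout, adding a new separable verb only means adding one line to `LEXICON`, and changing what may split a whole family of verbs means editing one paradigm.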

stage 2: CHUNKING

  1. if the appropriate "chunk" is sandwiched between the two parts of a separable word, then reorder the sentence accordingly
    • inter-chunk stage
    • maybe this could be done with a grep
    • check for false positives: "take the thing out of the box" does not use the multiword "take out", whereas "take out the trash" does
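
The chunking step could be sketched roughly as below. This is not Apertium transfer code: chunks are plain `(type, text)` pairs, the chunk type names and the `SEPARABLE` table are hypothetical, the verb is assumed to appear as its lemma, and the false-positive guard is only one possible heuristic (skip when the particle is directly followed by a chunk starting with "of", as in "out of the box").

```python
from typing import List, Tuple

Chunk = Tuple[str, str]  # (chunk type, surface text)

# Hypothetical table: (head, particle) -> chunk types allowed to split them.
SEPARABLE = {
    ("take", "out"): {"np"},
    ("call", "off"): {"np"},
}


def reorder_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Join verb + chunk + particle into verb-particle + chunk where allowed."""
    out = list(chunks)
    i = 0
    while i + 2 < len(out):
        (t1, head), (t2, mid), (t3, part) = out[i], out[i + 1], out[i + 2]
        allowed = SEPARABLE.get((head, part))
        is_candidate = (
            t1 == "verb"
            and t3 == "particle"
            and allowed is not None
            and t2 in allowed
        )
        # Heuristic guard: "out of ..." is a preposition, not the particle of "take out".
        followed_by_of = i + 3 < len(out) and out[i + 3][1].split()[:1] == ["of"]
        if is_candidate and not followed_by_of:
            out[i:i + 3] = [("verb", f"{head} {part}"), (t2, mid)]
            i += 2  # continue after the reordered pair
        else:
            i += 1
    return out
```

For example, `reorder_chunks([("verb", "take"), ("np", "the trash"), ("particle", "out")])` joins the verb and particle in front of the noun phrase, while the same sequence followed by `("pp", "of the box")` is left alone.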


Talk:Ideas for Google Summer of Code/Discontiguous multiwords

updated proposal

  1. tagging
    • tag every discontinuous word for what can split it (e.g. "take out" -> can be split by an np/sn)
    • where to insert the tagging?
  2. transfer stage
    • define the sequences that may signal a discontinuous word
  3. pseudocode:
    • if one of the sequences is encountered, then look into tags to see if it could be an instance of a discontinuous word
    • if it is, then check whether or not it is a real discontinuous word
    • if it's real, then do re-ordering. if not, then do nothing.
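
The three checks above might look like this when the tags live on the tokens themselves. Everything here is an assumption made for illustration: the `Token` fields `split_by` and `particle` stand in for the proposed new tag, `TRIGGER_SEQUENCES` stands in for the list of sequences, and "real" is reduced to "the particle found matches the one the verb expects".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    lemma: str
    cat: str                            # coarse category, e.g. "verb", "np", "particle"
    split_by: frozenset = frozenset()   # hypothetical tag: categories that may split this word
    particle: Optional[str] = None      # hypothetical tag: expected second part


# Category sequences that may hide a discontinuous word (illustrative only).
TRIGGER_SEQUENCES = [("verb", "np", "particle")]


def find_and_reorder(tokens: List[Token]) -> List[Token]:
    result = list(tokens)
    for i in range(len(result) - 2):
        first, middle, last = result[i:i + 3]
        # 1. is one of the sequences encountered?
        if (first.cat, middle.cat, last.cat) not in TRIGGER_SEQUENCES:
            continue
        # 2. do the tags say this could be a discontinuous word?
        if middle.cat not in first.split_by:
            continue
        # 3. is it a real one? the particle must be the one this verb expects
        if first.particle != last.lemma:
            continue
        # real: move the particle next to the verb; otherwise nothing happens
        result[i:i + 3] = [first, last, middle]
    return result
```

Keeping the three checks as separate early exits mirrors the list above, so a stricter test for step 3 can replace the particle comparison without touching the rest.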