Vin-ivar/proposal ud apertium

From Apertium
Latest revision as of 08:58, 3 April 2017

Work plan

Week 1: morphological feature conversion

Whilst mapping Apertium POS tags to UD's UPOSTAGs is fairly simple, converting morph features is a lot more annoying (and not completely doable). Use feature embeddings? HMMs? Evaluate approaches and pick one.
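
A rough sketch of what the conversion could look like, with made-up (and very partial) mapping tables; building the real per-language tables is the actual week-1 work:

    # Sketch: map an Apertium tag sequence to a UD (UPOS, FEATS) pair.
    # Both tables are illustrative and incomplete; note that <nt> (neuter) is
    # simply dropped below, which is the "not completely doable" part.

    APERTIUM_TO_UPOS = {
        "n": "NOUN", "vblex": "VERB", "adj": "ADJ", "adv": "ADV", "prn": "PRON",
    }

    APERTIUM_TO_FEATS = {
        "sg": ("Number", "Sing"), "pl": ("Number", "Plur"),
        "def": ("Definite", "Def"), "ind": ("Definite", "Ind"),
        "pres": ("Tense", "Pres"), "past": ("Tense", "Past"),
    }

    def convert(tags_string):
        """Convert e.g. "<n><nt><sg><ind>" to ("NOUN", "Definite=Ind|Number=Sing")."""
        tags = tags_string.strip("<>").split("><")        # ["n", "nt", "sg", "ind"]
        upos = APERTIUM_TO_UPOS.get(tags[0], "X")         # first tag is the POS
        feats = {}
        for tag in tags[1:]:
            if tag in APERTIUM_TO_FEATS:                  # unknown tags are silently dropped
                key, value = APERTIUM_TO_FEATS[tag]
                feats[key] = value
        feats_col = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
        return upos, feats_col

    print(convert("<n><nt><sg><ind>"))   # ('NOUN', 'Definite=Ind|Number=Sing')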

Week 2: build on week one.

Converting morph features: finish off week 1, implement the whole thing, test on multiple languages.

Week 3: soft constraints - 1

If UDPipe has a lemmatisation or a POS tag with a probability below a threshold value, use Apertium's solution instead. Annoyances: hacking UDPipe to expose the softmax probabilities. Add this as a mode to a UDPipe fork, e.g. `udpipe --tag --threshold 0.8 --tagger ../apertium-swe/`. Also allow UDPipe to integrate other popular tokenisers (e.g. the Stanford word segmenter).
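
A sketch of the selection logic only, assuming a patched UDPipe that writes its tagger confidence into the MISC column as TagConf=... (an invented key), and with apertium_analysis() as a placeholder for a real call to the Apertium analyser:

    THRESHOLD = 0.8

    def apertium_analysis(form):
        """Placeholder: return (lemma, upos) from the Apertium analyser, or None."""
        return None   # would shell out to the analyser in ../apertium-swe/ (or similar)

    def merge(conllu_lines):
        """Replace low-confidence lemma/UPOS in UDPipe output with Apertium's answer."""
        for line in conllu_lines:
            if not line.strip() or line.startswith("#"):
                yield line
                continue
            cols = line.rstrip("\n").split("\t")
            misc = dict(kv.split("=", 1) for kv in cols[9].split("|") if "=" in kv)
            if float(misc.get("TagConf", 1.0)) < THRESHOLD:   # TagConf is the invented key
                fallback = apertium_analysis(cols[1])          # cols[1] is FORM
                if fallback is not None:
                    cols[2], cols[3] = fallback                # LEMMA, UPOS
            yield "\t".join(cols)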

Week 4: soft constraints - 2

Continue week 3. GF soft constraints: if certain deprels are assigned low probability, use GF (Grammatical Framework) to parse the chunk, and use the relations GF returns (all of which are hard-coded for different rules).
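
The GF fallback can be sketched the same way as week 3's; gf_relations() below is a hypothetical stand-in for whatever interface ends up exposing GF's parse relations for a chunk, and DeprelConf is again an invented MISC key:

    def gf_relations(forms):
        """Placeholder: parse the chunk with GF and return {token_index: (head, deprel)}."""
        return {}   # would call the GF runtime for the relevant grammar

    def misc_value(cols, key, default=1.0):
        misc = dict(kv.split("=", 1) for kv in cols[9].split("|") if "=" in kv)
        return float(misc.get(key, default))

    def apply_gf_constraints(sentence, threshold=0.5):
        """sentence: list of CoNLL-U column lists for one sentence (word lines only)."""
        unlikely = [i for i, cols in enumerate(sentence)
                    if misc_value(cols, "DeprelConf") < threshold]   # invented MISC key
        if unlikely:
            fixes = gf_relations([cols[1] for cols in sentence])     # FORM column
            for i in unlikely:
                if i in fixes:
                    head, deprel = fixes[i]
                    sentence[i][6], sentence[i][7] = str(head), deprel   # HEAD, DEPREL
        return sentence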

Week 5: POS tagger improvements

A suite that lets you choose a POS tagger for Apertium and a UD treebank; improve the tagger based on everything built so far. Bonus: use treebanks of closely related languages in UDPipe; transfer the lemmas and assume the POS tags remain the same (a reasonable assumption). Use the ATT support from week 6.
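
The core of the suite is just scoring a tagger's CoNLL-U output against the gold UPOS column of the treebank; a minimal sketch, assuming both files share the same tokenisation:

    import sys

    def upos_column(path):
        """Yield the UPOS of every word line in a CoNLL-U file."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip() and not line.startswith("#"):
                    cols = line.rstrip("\n").split("\t")
                    if cols[0].isdigit():        # skip multiword-token and empty-node lines
                        yield cols[3]

    def upos_accuracy(system_path, gold_path):
        pairs = list(zip(upos_column(system_path), upos_column(gold_path)))
        return sum(s == g for s, g in pairs) / len(pairs) if pairs else 0.0

    if __name__ == "__main__":
        print(f"UPOS accuracy: {upos_accuracy(sys.argv[1], sys.argv[2]):.2%}")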

Week 6: stealing Apertium data

Based on the paper Fran and I wrote on parsing under-resourced treebanks using Apertium to compensate: modify UDPipe (and maybe MaltParser with MaltOptimizer) to read ATT files, and add the resulting information to the MISC column or replace existing columns, for use as features where necessary, e.g. `udpipe --train --transducer apertium-es.att --method replace --col 2`, or `udpipe --train --transducer apertium-es.att --method add --col 3` (appends).
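
The column manipulation itself is simple; the hard part is wiring it into UDPipe/MaltParser training. A sketch of roughly what `--method add` would do, with analyse() as a placeholder for a lookup in the ATT transducer:

    def analyse(form):
        """Placeholder: look `form` up in the .att transducer and return its analysis, or None."""
        return None   # would run the compiled transducer (e.g. via HFST) over the form

    def add_apertium_misc(conllu_lines):
        """Append the transducer's analysis to the MISC column as a feature."""
        for line in conllu_lines:
            if not line.strip() or line.startswith("#"):
                yield line
                continue
            cols = line.rstrip("\n").split("\t")
            analysis = analyse(cols[1])                    # FORM
            if analysis:
                extra = "Apertium=" + analysis
                cols[9] = extra if cols[9] == "_" else cols[9] + "|" + extra   # MISC
            yield "\t".join(cols)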

Week 7: writing wrappers

(Complete leftover work from week 6.) It can be annoying to evaluate a hypothesis with multiple parsers because you have to get used to all their idiosyncrasies. For instance, to evaluate MaltParser, you either need to use MaltEval or create a parsed treebank and evaluate it yourself; UDPipe is nicer and does it automatically. This involves writing a "wrapper" that lets you stay blind to the underlying implementation: you give it the required inputs and arguments, and get back parses or evaluations. Important: this does _not_ do anything except provide a unified "language" for _calling_ parsers and running evaluations. No format conversions, no fixes that belong upstream: all it does is spare you from having to learn a dozen different command formats for a dozen different parsers, by hiding the underlying implementations. Parsers to be included in the wrapper must meet a set of necessary conditions, such as support for CoNLL-U.
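
A sketch of what the unified "language" might look like; the class and method names are mine, and the command lines are only indicative of how each backend could be driven:

    import subprocess
    from abc import ABC, abstractmethod

    class Parser(ABC):
        """The unified interface: same calls for every backend, CoNLL-U in, CoNLL-U out."""

        @abstractmethod
        def parse(self, model, conllu_in, conllu_out):
            """Parse conllu_in with a trained model, writing CoNLL-U to conllu_out."""

        @abstractmethod
        def evaluate(self, model, gold):
            """Return a score (e.g. LAS) for the model against a gold treebank."""

    class UDPipe(Parser):
        def parse(self, model, conllu_in, conllu_out):
            with open(conllu_in) as fin, open(conllu_out, "w") as fout:
                subprocess.run(["udpipe", "--parse", model], stdin=fin, stdout=fout, check=True)

        def evaluate(self, model, gold):
            ...  # e.g. run UDPipe's accuracy mode and scrape the score from its output

    class MaltParser(Parser):
        def parse(self, model, conllu_in, conllu_out):
            ...  # java -jar maltparser.jar -c <model> -m parse ..., reading/writing CoNLL-U

        def evaluate(self, model, gold):
            ...  # parse, then score with MaltEval (or an internal scorer)

    PARSERS = {"udpipe": UDPipe, "maltparser": MaltParser}

    def get_parser(name):
        return PARSERS[name]()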

Week 8: extending wrappers

Allow the wrapper to use config files that expose the "unique" features of each parser type, i.e. let you go beyond the bare functionality the wrapper provides. E.g. MaltParser has a config format it can use as an argument set out of the box; make something similar for all the parsers covered. Document it. Document the API for adding new parsers and for maintaining the wrapper in the event of my untimely demise (inshallah no).
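
One possible shape for the config layer: a per-parser config file whose contents the wrapper passes straight through to the backend, so parser-specific options never leak into the wrapper's core. The file name, keys and flags below are illustrative:

    import json

    def load_extra_args(config_path, parser_name):
        """Read e.g. {"maltparser": {"-a": "covnonproj", "-l": "liblinear"}} and flatten it."""
        with open(config_path, encoding="utf-8") as f:
            config = json.load(f)
        extra = []
        for flag, value in config.get(parser_name, {}).items():
            extra += [flag, str(value)]
        return extra

    # Usage: command = base_command + load_extra_args("parsers.json", "maltparser"),
    # so backend-specific options stay out of the wrapper's own interface.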

Week 9: [ESSLLI]

Write plugins for the best editors (¬emacs) to handle UD stuff nicely. Fork the vim highlighting plugin so it highlights the line your deprel points to (i.e. the head token's line). Similar stuff for Sublime.

Week 10: [ESSLLI]

Contribute to the annotator so it can import from and export to HFST format. Check abugida support (if not already done).

Week 11: [Experiment]

Use knowledge of Apertium transfer rules to assist dependency parsing. For instance, a sentence that is non-projective in language xx might be projective (and therefore easier to parse) in yy. Translate xx to yy, parse yy, and translate back to xx (with deprels), forcing two things: (a) the original sentence, by caching the transfer rules used and "reversing" them, and (b) a well-formed tree structure, by using a reparsing algorithm.
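
Deciding which sentences are worth routing through the translate-parse-translate-back pipeline is mechanical; a small sketch of a projectivity check over the HEAD column of one sentence:

    def is_projective(heads):
        """heads[i] is the 1-based head of token i+1; 0 marks the root."""
        arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
        for i, (l1, r1) in enumerate(arcs):
            for l2, r2 in arcs[i + 1:]:
                # two arcs cross iff one starts strictly inside the other and ends outside it
                if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                    return False
        return True

    print(is_projective([2, 0, 2]))      # True: a small projective tree
    print(is_projective([3, 4, 0, 3]))   # False: the arc over tokens 1-3 crosses the arc over 2-4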

Week 12: Documentation.

Paper on weeks 7, 8 and 11 if possible. Tie up loose ends.