Vin-ivar/proposal ud apertium
Work plan
Week 1: morphological feature conversion
Whilst mapping Apertium POS tags to UD's UPOSTAGs is fairly simple, converting morphological features is a lot more annoying (and not completely doable in every case). Candidate approaches: feature embeddings, HMMs. Evaluate them and pick one.
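As a baseline to compare the learned approaches against, the "simple" part can be sketched as a plain lookup table. This is only a sketch: the two tables below are a small illustrative subset chosen for this example, not the project's actual mapping, and the function name `convert` is hypothetical. Tags with no known UD counterpart are returned separately, which is exactly the residue the harder approaches would have to deal with.

```python
# Illustrative subset only; a real table would be larger and partly per-language.
APERTIUM_TO_UPOS = {
    "n": "NOUN", "np": "PROPN", "vblex": "VERB", "adj": "ADJ", "adv": "ADV",
    "pr": "ADP", "det": "DET", "prn": "PRON", "cnjcoo": "CCONJ",
    "cnjsub": "SCONJ", "num": "NUM", "ij": "INTJ",
}
APERTIUM_TO_FEATS = {
    "sg": ("Number", "Sing"), "pl": ("Number", "Plur"),
    "m": ("Gender", "Masc"), "f": ("Gender", "Fem"), "nt": ("Gender", "Neut"),
}


def convert(tags):
    """Convert Apertium tags (POS tag first) to (UPOS, FEATS string, leftovers)."""
    upos = APERTIUM_TO_UPOS.get(tags[0], "X")  # X is UD's "other" category
    feats, leftovers = {}, []
    for tag in tags[1:]:
        if tag in APERTIUM_TO_FEATS:
            name, value = APERTIUM_TO_FEATS[tag]
            feats[name] = value
        else:
            leftovers.append(tag)  # no table entry: needs a smarter approach
    # CoNLL-U lists features sorted by name and uses "_" for an empty set.
    feats_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    return upos, feats_str, leftovers
```

For example, `convert(["n", "f", "pl"])` gives `("NOUN", "Gender=Fem|Number=Plur", [])`, while a tag such as `pri` on a verb ends up in the leftovers because this toy table has no entry for it.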
Week 2: build on week one.
Converting morph features: finish off week 1, implement the whole thing, test on multiple languages.
Week 3: soft constraints - 1
If UDPipe produces a lemma or a POS tag with a probability below a threshold value, use Apertium's analysis instead. Annoyances: hacking UDPipe to get at the softmax bit. Add this as a mode to a UDPipe fork, e.g. `udpipe --tag --threshold 0.8 --tagger ../apertium-swe/`. Also allow UDPipe to integrate other popular tokenisers (e.g. the Stanford word segmenter).
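The fallback logic itself is small; a minimal sketch follows. It assumes the per-token confidence has already been extracted from the tagger (the hard part, and not something this snippet does) and that Apertium's analyses are available as a dictionary from surface form to (lemma, UPOS). The names `Prediction` and `apply_soft_constraint` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    form: str
    lemma: str
    upos: str
    prob: float  # assumed: the tagger's confidence in its top choice, in [0, 1]


def apply_soft_constraint(predictions, fallback, threshold=0.8):
    """Replace low-confidence predictions with the fallback analysis, if any.

    `fallback` maps a surface form to a (lemma, upos) pair, standing in
    for whatever Apertium returns for that form.
    """
    out = []
    for p in predictions:
        alt = fallback.get(p.form)
        if p.prob < threshold and alt is not None:
            lemma, upos = alt
            out.append(Prediction(p.form, lemma, upos, p.prob))
        else:
            out.append(p)  # confident enough, or nothing better available
    return out
```

Tokens that Apertium cannot analyse keep the tagger's prediction even when it is below the threshold, since there is nothing to fall back on.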
Week 4: soft constraints - 2
Continue week 3. GF (Grammatical Framework) soft constraints: if certain deprels are unlikely, use GF to parse the chunk instead, and use the relations GF returns (all of which are hard-coded for different rules).
Week 5: POS tagger improvements
Build a suite that lets you choose a POS tagger for Apertium and a UD treebank. Improve the tagger using everything available so far. Bonus: use treebanks of closely related languages in UDPipe; transfer the lemmas and assume the POS tags remain the same (a reasonable assumption). Use the ATT files from week 6.
Week 6: stealing Apertium data
Based on the paper Fran and I did on parsing underresourced treebanks using Apertium to compensate: modify UDPipe (and maybe MaltParser w/ MaltOptimizer) to read ATT files, and either add information to the MISC column or replace existing columns, for use as features where necessary. E.g. `udpipe --train --transducer apertium-es.att --method replace --col 2`, or `udpipe --train --transducer apertium-es.att --method add --col 3` (appends).
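The two proposed methods can be sketched on a single CoNLL-U token line as follows. This is only an illustration of the intended behaviour, not UDPipe code: it assumes the transducer lookup has already produced a value for the token, that column indices are 0-based (so column 2 is LEMMA and column 9 is MISC), and that "add" joins values with `|`. The function name `merge_column` and the `Apertium=` key in the usage note are hypothetical.

```python
def merge_column(line, value, col, method):
    """Return a CoNLL-U token line with `value` merged into column `col`.

    method="replace" overwrites the column; method="add" appends to it,
    using "|" as the separator (an assumption made for this sketch).
    """
    if method not in ("replace", "add"):
        raise ValueError(f"unknown method: {method}")
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U fields")
    if method == "replace" or fields[col] == "_":
        fields[col] = value  # "_" means empty, so there is nothing to append to
    else:
        fields[col] = fields[col] + "|" + value
    return "\t".join(fields)
```

With a token line whose MISC column is `SpaceAfter=No`, calling `merge_column(line, "Apertium=n.f.pl", 9, "add")` would leave `SpaceAfter=No|Apertium=n.f.pl` in that column.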
Week 7: writing wrappers
(Complete stuff from week 6.) It can be annoying trying to evaluate a hypothesis with multiple parsers because you have to get used to all their idiosyncrasies. For instance, to evaluate MaltParser, you either need to use MaltEval, or create a parsed treebank and evaluate it yourself. UDPipe is nicer and does it automatically. The plan is to write a "wrapper" that lets you be blind to the underlying implementation: you supply it whatever it needs, give it arguments, and get parses or evaluations. Important: this does _not_ do anything except provide a unified "language" for _calling_ parsers and evaluations. No format conversions, no fixes that belong upstream: all it does is hide the underlying implementations, so you do not have to learn a dozen different command formats for a dozen different parsers. To be included in the wrapper, a parser must meet a set of necessary conditions, such as support for CoNLL-U.
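A minimal sketch of what such a wrapper could look like is given below. The class and method names are hypothetical, the MaltParser jar path is a placeholder, and only the basic train and parse invocations of each tool are covered; format and tuning options are deliberately left out. The sketch only builds command lines, in keeping with the "calling only" scope described above.

```python
from abc import ABC, abstractmethod


class ParserWrapper(ABC):
    """Uniform interface: every parser answers the same two questions."""

    @abstractmethod
    def train_cmd(self, model, treebank):
        """Return the command (as an argument list) that trains `model`."""

    @abstractmethod
    def parse_cmd(self, model, infile, outfile):
        """Return the command that parses `infile` into `outfile`."""


class UDPipe(ParserWrapper):
    def __init__(self, binary="udpipe"):
        self.binary = binary

    def train_cmd(self, model, treebank):
        return [self.binary, "--train", model, treebank]

    def parse_cmd(self, model, infile, outfile):
        return [self.binary, "--parse", f"--outfile={outfile}", model, infile]


class MaltParser(ParserWrapper):
    def __init__(self, jar="maltparser.jar"):  # placeholder path
        self.jar = jar

    def train_cmd(self, model, treebank):
        return ["java", "-jar", self.jar, "-c", model, "-i", treebank, "-m", "learn"]

    def parse_cmd(self, model, infile, outfile):
        return ["java", "-jar", self.jar, "-c", model,
                "-i", infile, "-o", outfile, "-m", "parse"]
```

The returned lists can be handed to `subprocess.run` unchanged, so an experiment script can loop over `[UDPipe(), MaltParser()]` without knowing which tool it is driving.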
Week 8: extending wrappers
Allow the wrapper to use config files that let you exploit the "unique" features of each parser type, i.e. go beyond the bare functionality the wrapper provides. E.g. MaltParser has a config format it can use as an argset out of the box. Make something similar for all parsers covered. Document. Document the API for adding new parsers and for maintaining the wrapper in the event of my untimely demise (inshallah no).
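One possible shape for such a config, sketched under assumptions: a single JSON file keyed by parser name, where each entry carries a list of extra arguments that the wrapper passes through untouched. The JSON layout, the `extra_args` key and the function name are all invented for this example; the only real option used is MaltParser's `-f`, which points it at an option file.

```python
import json


def extra_args(config_text, parser_name):
    """Return the pass-through arguments configured for `parser_name`.

    Parsers that are not mentioned in the config get an empty list, so
    the wrapper's bare functionality keeps working without any config.
    """
    config = json.loads(config_text)
    return list(config.get(parser_name, {}).get("extra_args", []))


# Hypothetical config file contents.
EXAMPLE_CONFIG = """
{
  "maltparser": {"extra_args": ["-f", "malt-options.xml"]}
}
"""
```

Here `extra_args(EXAMPLE_CONFIG, "maltparser")` yields `["-f", "malt-options.xml"]`, and any parser without an entry yields `[]`. Where the extra arguments go in the final command line is left to each parser's wrapper, since the tools differ on option placement.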
Week 9: [ESSLLI]
Write plugins for the best editors (¬emacs) to handle UD stuff nicely. Fork the vim highlight plugin to highlight the line which your deprel points to. Similar stuff for Sublime.
Week 10: [ESSLLI]
Contribute to the annotator: import from and export to HFST format. Check abugida support (if not already done).
Week 11: [Experiment]
Use knowledge of Apertium transfer rules to assist dependency parsing. For instance, a sentence that is non-projective in language xx might be projective (and therefore easier to parse) in yy. Translate xx to yy, parse yy, and translate back to xx (with deprels), forcing two things: (a) the original sentence: cache the transfer rules used and "reverse" them; and (b) the tree structure: use a reparsing algorithm to force trees to be well-formed.
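To pick candidate sentences for this experiment, one needs to know which trees are non-projective in the first place. A minimal sketch of that check follows, using the standard crossing-arcs definition with an artificial root at position 0. It only tests projectivity; it is not the reparsing step mentioned above. The function name is hypothetical, and the input is the HEAD column of a CoNLL-U sentence as a list of integers.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the 1-based position of the head of token i+1, with 0
    marking the root, as in the HEAD column of CoNLL-U.
    """
    # Store each arc as (left end, right end), regardless of direction.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a, b in arcs:
        for c, d in arcs:
            # Two arcs cross when exactly one end of the second arc lies
            # strictly inside the first.
            if a < c < b < d:
                return False
    return True
```

For a three-token sentence with heads `[2, 0, 2]` the check passes, while `[3, 4, 0, 3]` fails because the arc between tokens 1 and 3 crosses the arc between tokens 2 and 4. The quadratic loop is fine for sentence-length inputs.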
Week 12: Documentation.
Paper on weeks 7, 8 and 11 if possible. Tie up loose ends.