Difference between revisions of "Vin-ivar/proposal ud apertium"

From Apertium

Revision as of 23:52, 2 April 2017

Work plan:

Week 1: morphological feature conversion
Whilst mapping Apertium POS tags to UD's UPOSTAGs is fairly simple, converting morphological features is a lot more annoying (and not completely doable). Use feature embeddings? HMMs? Evaluate the approaches and pick one.
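The easy half of the conversion can be sketched with a pair of lookup tables; the tag and feature inventories below are illustrative, not a complete mapping:

```python
# Sketch of the Week 1 conversion, assuming hand-made mapping tables.
# The Apertium tags and UD feature names listed here are a small,
# illustrative subset, not the real inventory.

# Apertium POS tag -> UD UPOS (the straightforward part)
UPOS = {"n": "NOUN", "vblex": "VERB", "adj": "ADJ", "adv": "ADV",
        "prn": "PRON", "np": "PROPN", "pr": "ADP", "cnjcoo": "CCONJ"}

# Apertium morph tag -> UD feature (the annoying, partial part)
FEATS = {"sg": ("Number", "Sing"), "pl": ("Number", "Plur"),
         "m": ("Gender", "Masc"), "f": ("Gender", "Fem"),
         "p3": ("Person", "3"), "pres": ("Tense", "Pres")}

def convert(analysis):
    """Convert e.g. 'house<n><pl>' to (lemma, UPOS, 'Number=Plur')."""
    lemma, _, rest = analysis.partition("<")
    tags = rest.rstrip(">").split("><")
    upos = UPOS.get(tags[0], "X")          # X when no mapping exists
    feats = sorted("%s=%s" % FEATS[t] for t in tags[1:] if t in FEATS)
    return lemma, upos, "|".join(feats) or "_"
```

Tags with no entry are simply dropped, which is exactly the "not completely doable" part: anything beyond this table needs one of the learned approaches above.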

Week 2: build on week 1
Converting morph features: finish off week 1, implement the whole thing, and test it on multiple languages.

Week 3: soft constraints - 1
If UDPipe produces a lemma or a POS tag with probability below a threshold value, use Apertium's analysis instead. Annoyances: hacking UDPipe to expose the softmax probabilities. Add this as a mode to a UDPipe fork, e.g. `udpipe --tag --threshold 0.8 --tagger ../apertium-swe/`. Also allow UDPipe to integrate other popular tokenisers (e.g. the Stanford word segmenter).
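The merge step itself is simple once the probabilities are exposed; in this sketch, `udpipe_tags` stands in for UDPipe output augmented with per-token softmax probabilities (the part that needs the UDPipe hack) and `apertium_tags` for Apertium's analysis of the same tokens — both interfaces are assumptions:

```python
# Minimal sketch of the thresholded fallback. Each UDPipe entry is a
# (tag, probability) pair; each Apertium entry is a tag, or None when
# Apertium has no (unambiguous) analysis for that token.

def merge_tags(udpipe_tags, apertium_tags, threshold=0.8):
    """Per token: keep UDPipe's tag unless its probability is below
    the threshold and Apertium has an answer to fall back on."""
    merged = []
    for (tag, prob), fallback in zip(udpipe_tags, apertium_tags):
        if prob < threshold and fallback is not None:
            merged.append(fallback)
        else:
            merged.append(tag)
    return merged
```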

Week 4: soft constraints - 2
Continue week 3. GF soft constraints: if certain deprels are unlikely, use GF (Grammatical Framework) to parse the chunk, and use the relations GF returns (all of which are hard-coded for different rules).

Week 5: POS tagger improvements
Build a suite that lets you choose a POS tagger for Apertium and a UD treebank, and improve the tagger using everything built so far. Bonus: use closely related language treebanks in UDPipe; transfer the lemmas and assume the POS tags stay the same (a reasonable assumption). Use the ATT reading from week 6.
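The "choose a tagger" part boils down to scoring each candidate against the treebank's gold tags and keeping the winner; the tagger names and outputs in this sketch are made up:

```python
# Sketch of tagger selection: given gold tags from a UD treebank and each
# candidate tagger's output on the same tokens, pick the most accurate one.

def accuracy(pred, gold):
    """Fraction of tokens tagged identically to the gold standard."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def best_tagger(outputs, gold):
    """outputs: {tagger_name: [tags]}, aligned token-for-token with gold."""
    return max(outputs, key=lambda name: accuracy(outputs[name], gold))
```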

Week 6: stealing Apertium data
Based on the paper Fran and I wrote on parsing under-resourced treebanks using Apertium to compensate: modify UDPipe (and maybe MaltParser with MaltOptimizer) to read ATT files, and add information to the MISC column or replace existing columns, for use as features where necessary. E.g. `udpipe --train --transducer apertium-es.att --method replace --col 2`, or `udpipe --train --transducer apertium-es.att --method add`.
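The ATT side of this can be prototyped outside UDPipe. An ATT file is tab-separated: transition lines are `from<TAB>to<TAB>input<TAB>output[<TAB>weight]`, and final states appear on lines with just the state number (and optional weight). The sketch below reads one and runs a naive lookup, whose analyses could then be dumped into the MISC column; the epsilon symbol varies between tools (`@0@` in HFST-style ATT), so that check is an assumption:

```python
# Rough sketch of reading an ATT-format transducer and doing a naive
# depth-first lookup of a surface form. No weights, no cycle protection:
# a prototype, not the eventual C++ inside UDPipe.

from collections import defaultdict

def read_att(lines):
    trans, finals = defaultdict(list), set()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 4:                      # transition line
            src, dst, inp, out = parts[:4]
            trans[int(src)].append((int(dst), inp, out))
        elif parts[0]:                           # final-state line
            finals.add(int(parts[0]))
    return trans, finals

def lookup(trans, finals, word, state=0, pos=0, out=""):
    """Return all analyses of `word` accepted by the transducer."""
    results = []
    if pos == len(word) and state in finals:
        results.append(out)
    for dst, inp, o in trans.get(state, []):
        if inp in ("@0@", ""):                   # epsilon input (assumed symbol)
            results += lookup(trans, finals, word, dst, pos, out + o)
        elif pos < len(word) and inp == word[pos]:
            results += lookup(trans, finals, word, dst, pos + 1, out + o)
    return results
```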

Week 7: writing wrappers
Evaluating a hypothesis with multiple parsers is annoying because you have to get used to all their idiosyncrasies. For instance, to evaluate MaltParser you either need MaltEval, or you have to produce a parsed treebank and score it yourself; UDPipe is nicer and does it automatically. This involves writing a "wrapper" that lets you stay blind to the underlying implementation: you supply whatever you need to supply, give it arguments, and get parses. Important: this does _not_ do anything except provide a unified "language" for _calling_ parsers and evaluations. No format conversions, no fixes that belong upstream: all it does is spare you from learning a dozen different command formats for a dozen different parsers.
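The core of such a wrapper is nothing more than command-line construction. The flags below follow the tools' usual CLIs (UDPipe's `--parse`/`--input`/`--outfile`, MaltParser's `-c`/`-m`/`-i`/`-o`), and the jar and model names are placeholders; all of it should be double-checked against the installed versions:

```python
# Sketch of the unified-calling idea: one entry point that builds the
# command line for whichever backend you name, ready for subprocess.run().
# It deliberately does nothing else - no format conversion, no fixes.

def parser_command(backend, model, infile, outfile):
    if backend == "udpipe":
        return ["udpipe", "--parse", "--input=conllu",
                "--outfile=" + outfile, model, infile]
    if backend == "maltparser":
        return ["java", "-jar", "maltparser.jar", "-c", model,
                "-m", "parse", "-i", infile, "-o", outfile]
    raise ValueError("unknown backend: " + backend)
```

The user-facing `--parser udpipe` / `--parser maltparser` switch then just selects which branch fires.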

Week 8: extending wrappers
Allow the wrapper to take config files that expose the "unique" features of each parser type; e.g. MaltParser has a config format it can read out of the box. Make something similar for all the parsers covered. Document everything, including the API for adding new parsers.
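One plausible shape for these config files is plain INI, with one section per backend whose keys are passed through to that parser untouched; the section and key names here are invented for illustration:

```python
# Sketch of per-parser config handling for the wrapper: each backend gets
# its own section, and the wrapper hands the options over verbatim so each
# parser's "unique" features stay reachable.

import configparser

SAMPLE = """
[maltparser]
algorithm = arc-eager
feature_model = swedish.xml

[udpipe]
iterations = 20
"""

def backend_options(text, backend):
    """Return the given backend's options as a dict, empty if absent."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    return dict(cfg[backend]) if cfg.has_section(backend) else {}
```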

Week 9: [Chill] [ESSLLI?]
Write plugins for the best editors (¬emacs) to handle UD stuff nicely. Fork the vim highlighting plugin to also highlight the line your deprel points to. Similar stuff for Sublime.

Week 10: [Chill] [ESS[ential for sanity | LLI]]
Implement decent Apertium and UD support in a decent editor that supports abugidas.

Week 11: TODO