Difference between revisions of "Vin-ivar/proposal ud apertium"
Revision as of 23:52, 2 April 2017
Work plan:
Week 1: morphological feature conversion: Whilst mapping Apertium POS tags to UD's UPOSTAGs is fairly simple, converting morph features is a lot more annoying (and not completely doable). Use feature embeddings? HMMs? Evaluate approaches and pick one.
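As a baseline to evaluate the fancier approaches against, a plain lookup table already covers the easy cases. The sketch below is only an illustration: the table `APERTIUM_TO_UD` and the function `convert_feats` are hypothetical names, the entries are a tiny hand-picked sample, and a real table would be per-language and much larger. Tags with no mapping are returned separately, which is where the "not completely doable" part shows up.

```python
# Hypothetical, partial table from Apertium morphological tags to UD features.
# The entries are illustrative only; a real table would be per-language.
APERTIUM_TO_UD = {
    "sg": ("Number", "Sing"),
    "pl": ("Number", "Plur"),
    "m": ("Gender", "Masc"),
    "f": ("Gender", "Fem"),
    "nt": ("Gender", "Neut"),
    "nom": ("Case", "Nom"),
    "acc": ("Case", "Acc"),
    "past": ("Tense", "Past"),
    "pres": ("Tense", "Pres"),
}


def convert_feats(apertium_tags):
    """Convert a list of Apertium tags into a UD FEATS string.

    Returns the FEATS string and the list of tags that had no mapping.
    """
    feats = {}
    unmapped = []
    for tag in apertium_tags:
        if tag in APERTIUM_TO_UD:
            name, value = APERTIUM_TO_UD[tag]
            feats[name] = value
        else:
            unmapped.append(tag)
    # FEATS are sorted by feature name; "_" stands for "no features".
    feats_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    return feats_str, unmapped
```

For example, `convert_feats(["n", "m", "pl"])` leaves the POS tag `n` in the unmapped list, since POS tags are handled by the separate UPOSTAG mapping.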
Week 2: build on week one. Converting morph features: finish off week 1, implement the whole thing, test on multiple languages.
Week 3: soft constraints - 1: If UDPipe assigns a lemma or a POS tag with a probability below a threshold value, use Apertium's solution instead. Annoyances: hacking UDPipe to figure out the softmax bit. Add this as a mode to a UDPipe fork, e.g. `udpipe --tag --threshold 0.8 --tagger ../apertium-swe/`. Also allow UDPipe to integrate other popular tokenisers (e.g. the Stanford word segmenter).
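The fallback logic itself is small; the hard part is getting a probability out of UDPipe in the first place. A minimal sketch, assuming the fork can attach a confidence to each analysis (the `Analysis` class, its `prob` field and `choose_analysis` are all hypothetical, not part of UDPipe or Apertium):

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    lemma: str
    upos: str
    prob: float  # assumed confidence; getting this out of UDPipe is the hacking part


def choose_analysis(udpipe, apertium, threshold=0.8):
    """Prefer UDPipe's analysis unless its confidence is below the
    threshold and Apertium has an analysis to offer."""
    if udpipe.prob < threshold and apertium is not None:
        return apertium
    return udpipe
```

When Apertium has no analysis for a token (`apertium` is `None`), the sketch keeps UDPipe's low-confidence guess rather than leaving the token unanalysed.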
Week 4: soft constraints - 2: Continue week 3. GF soft constraints: if certain deprels are unlikely, use GF to parse the chunk. Use the relations GF returns (all of which are hard-coded for different rules).
Week 5: POS tagger improvements: Build a suite that lets you choose a POS tagger for Apertium and a UD treebank. Improve the tagger based on all the stuff you have. Bonus: use closely related language treebanks in UDPipe; transfer the lemmas, assume the POS tags remain the same (reasonable assumption). Use ATT from week 6.
Week 6: stealing Apertium data: Based on the paper Fran and I did on parsing under-resourced treebanks using Apertium to compensate - modify UDPipe (and maybe MaltParser w/ MaltOptimizer) to read ATT files, and add information to the MISC column or replace existing columns, for use as features where necessary. e.g. `udpipe --train --transducer apertium-es.att --method replace --col 2`, or `udpipe --train --transducer apertium-es.att --method add`.
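The "add" method could look roughly like the sketch below. It is an assumption-heavy illustration: a plain dict (`analyses`) stands in for looking a word up in the ATT transducer, the MISC key `ApertiumTags` is made up, and input lines are assumed to carry no trailing newline. The only fixed point is the CoNLL-U layout, where MISC is the tenth tab-separated column and `_` marks it as empty.

```python
def add_to_misc(conllu_line, key, value):
    """Append key=value to the MISC column (the 10th) of a CoNLL-U token line."""
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U columns")
    item = f"{key}={value}"
    # "_" means the column is empty; otherwise items are joined with "|".
    cols[9] = item if cols[9] == "_" else f"{cols[9]}|{item}"
    return "\t".join(cols)


def annotate(lines, analyses, key="ApertiumTags"):
    """Annotate token lines whose FORM has an entry in `analyses`.

    `analyses` is a stand-in for a transducer lookup: FORM -> tag string.
    """
    out = []
    for line in lines:
        # Leave comments and blank lines untouched.
        if not line.strip() or line.startswith("#"):
            out.append(line)
            continue
        form = line.split("\t")[1]
        if form in analyses:
            line = add_to_misc(line, key, analyses[form])
        out.append(line)
    return out
```

The "replace" method would overwrite a chosen column instead of appending to MISC; whether `--col` counts from 0 or 1 is left open by the proposal.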
Week 7: writing wrappers: It can be annoying trying to evaluate a hypothesis with multiple parsers because you have to get used to all their idiosyncrasies. For instance, to evaluate MaltParser, you either need to use MaltEval, or create a parsed treebank and evaluate it yourself. UDPipe is nicer and does it automatically. This involves writing a "wrapper" that lets you be blind to the underlying implementation; you supply it whatever it needs, give it arguments, and get parses. Important: this does _not_ do anything except provide a unified "language" for _calling_ parsers and evaluations. No format conversions, no fixes that belong upstream: all it does is save you from learning a dozen different command formats for a dozen different parsers.
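A very rough sketch of the idea: one function signature, one command line per parser behind it. The names `build_command` and `parse` are hypothetical, the jar path is a placeholder, and the UDPipe and MaltParser flags below are written from memory of each tool's command-line documentation, so treat them as assumptions to verify rather than a reference.

```python
import subprocess


def build_command(parser, model, infile, outfile, malt_jar="maltparser.jar"):
    """Return the argv list that parses `infile` with the chosen parser."""
    if parser == "udpipe":
        # Assumed: UDPipe prints the parsed CoNLL-U to stdout,
        # so parse() below redirects it into `outfile`.
        return ["udpipe", "--parse", model, infile]
    if parser == "maltparser":
        # Assumed MaltParser flags: -c model name, -i input, -o output, -m mode.
        return ["java", "-jar", malt_jar, "-c", model,
                "-i", infile, "-o", outfile, "-m", "parse"]
    raise ValueError(f"unknown parser: {parser}")


def parse(parser, model, infile, outfile, **kwargs):
    """Run the chosen parser; the caller never sees the parser-specific flags."""
    cmd = build_command(parser, model, infile, outfile, **kwargs)
    if parser == "udpipe":
        with open(outfile, "w", encoding="utf-8") as out:
            subprocess.run(cmd, stdout=out, check=True)
    else:
        subprocess.run(cmd, check=True)
```

In line with the "no format conversions" rule, the sketch passes files through untouched; it only hides how each parser wants to be called.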
Week 8: extending wrappers: Allow the wrapper to use config files that let you exploit "unique" features of each parser type. e.g. MaltParser has a config format it can take out of the box. Make something similar for all parsers covered. Document. Document API for adding new parsers.
Week 9: [Chill] [ESSLLI?] Write plugins for the best editors (¬emacs) to handle UD stuff nicely. Fork the vim highlight plugin to highlight the line which your deprel points to. Similar stuff for Sublime.
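The core of the "highlight the line your deprel points to" feature is editor-independent, so it can be sketched outside vim. This assumes the plugin hands over one CoNLL-U sentence as a list of lines plus the cursor's line index; the function name `head_line` is hypothetical.

```python
def head_line(sentence_lines, cursor):
    """Return the index of the line holding the head of the token at `cursor`.

    `sentence_lines` is one CoNLL-U sentence as a list of lines.
    Returns None for the root, for non-token lines (e.g. comments),
    and when no line carries the head's ID.
    """
    cols = sentence_lines[cursor].split("\t")
    # Token lines have 10 columns; HEAD is the 7th. "0" marks the root.
    if len(cols) != 10 or cols[6] in ("0", "_"):
        return None
    head_id = cols[6]
    for i, line in enumerate(sentence_lines):
        # The ID is the first column of a token line.
        if line.split("\t")[0] == head_id:
            return i
    return None
```

An editor plugin would then only need to map the returned index back to a buffer line and apply its own highlight.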
Week 10: [Chill] [ESS[ential for sanity | LLI]] Implement decent Apertium and UD support in a decent editor that supports abugidas.
Week 11: TODO