Difference between revisions of "Vin-ivar/proposal ud apertium"
Revision as of 21:54, 2 April 2017
Work plan:
Week 1: morphological feature conversion
Whilst mapping Apertium POS tags to UD's UPOSTAGs is fairly simple, converting morphological features is a lot more annoying, and a complete conversion is not possible.
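A minimal sketch of what the conversion could look like, assuming a dot-separated Apertium tag string such as `n.f.sg` whose first tag is the POS. The two mapping tables are hypothetical and deliberately tiny; a real converter would need per-language tables, and the `unmapped` list is there because some tags have no clean UD equivalent.

```python
# Hypothetical, incomplete mapping tables (assumptions for illustration only).
APERTIUM_TO_UPOS = {
    "n": "NOUN",
    "vblex": "VERB",
    "adj": "ADJ",
    "adv": "ADV",
    "det": "DET",
    "pr": "ADP",
}

APERTIUM_TO_FEATS = {
    "sg": ("Number", "Sing"),
    "pl": ("Number", "Plur"),
    "m": ("Gender", "Masc"),
    "f": ("Gender", "Fem"),
    "past": ("Tense", "Past"),
}


def convert(apertium_tags):
    """Convert e.g. 'n.f.sg' into (UPOS, UD FEATS string, unmapped tags)."""
    tags = apertium_tags.split(".")
    # Fall back to UD's catch-all tag X when the POS tag is unknown.
    upos = APERTIUM_TO_UPOS.get(tags[0], "X")
    feats = {}
    unmapped = []
    for tag in tags[1:]:
        if tag in APERTIUM_TO_FEATS:
            name, value = APERTIUM_TO_FEATS[tag]
            feats[name] = value
        else:
            # Keep track of what could not be converted instead of guessing.
            unmapped.append(tag)
    # UD FEATS are 'Name=Value' pairs sorted by name, '_' when empty.
    feats_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    return upos, feats_str, unmapped
```

For example, `convert("n.f.sg")` gives the UPOS `NOUN` with `Gender=Fem|Number=Sing`, while a tag missing from the table ends up in the third element of the result rather than being silently dropped.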
Week 2: soft constraints - 1:
If UDPipe produces a lemma or a POS tag with a probability below a threshold value, use Apertium's solution instead. Annoyance: hacking UDPipe to get at the softmax probabilities. Add this as a mode to a UDPipe fork, e.g. `udpipe --tag --threshold 0.8 --tagger ../apertium-swe/`. Also allow UDPipe to integrate other popular tokenisers (e.g. the Stanford word segmenter).
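The fallback itself is simple once a confidence value is available. The sketch below assumes the forked tagger can hand back a probability per token, which stock UDPipe does not do out of the box; the `Analysis` class and the `choose` function are hypothetical names, not part of either tool.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    lemma: str
    pos: str


def choose(udpipe, confidence, apertium, threshold=0.8):
    """Keep UDPipe's analysis unless its confidence is below the threshold.

    `confidence` is assumed to be the softmax probability of UDPipe's
    chosen analysis; exposing it is the part that needs the fork.
    """
    if confidence < threshold:
        return apertium
    return udpipe
```

With the threshold at 0.8, a UDPipe analysis scored 0.55 would be replaced by Apertium's, and one scored 0.93 would be kept.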
Week 3: soft constraints - 2:
Continue week 2. GF soft constraints: if certain deprels are unlikely, use GF to parse the chunk. Use the relations GF returns (all of which are hard-coded for different rules).
Week 4: integrate dependencies within lexical selection
Lexical selection currently matches words based on their position in a sentence, which isn't perfect. Add support for writing rules with dependencies. Example:
<rule> <!-- MPs spent almost six hours debating... -->
  <match lemma="spend" tags="vblex.*">
    <select lemma="pasar" tags="vblex.*"/>
  </match>
  <match/>
  <match/>
  <or>
    <match lemma="minute"/>
    <match lemma="hour"/>
    <match lemma="year"/>
  </or>
</rule>
Becomes:
<rule>
  <match lemma="spend" tags="vblex.*">
    <or>
      <dep lemma="minute" deprel="dobj"/>
      <dep lemma="hour" deprel="dobj"/>
      <dep lemma="year" deprel="dobj"/>
    </or>
    <select lemma="pasar" tags="vblex.*"/>
  </match>
</rule>
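To make the intended semantics of the dependency-based rule concrete, here is a rough sketch of how it might be evaluated over a parsed sentence. The `Token` class, the `select_translation` function and the hand-written parse of the example sentence are all assumptions for illustration; this is not how the lexical selection module is implemented.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    lemma: str
    tags: str     # Apertium-style tags, e.g. "vblex.past"
    head: int     # idx of the head token, 0 for the root
    deprel: str


def select_translation(tokens, head_lemma, dep_lemmas, deprel, target):
    """Return {token idx: target lemma} for every matching head.

    A head matches if it is a vblex with the given lemma and has at least
    one dependent with the given deprel whose lemma is in dep_lemmas.
    """
    selected = {}
    for tok in tokens:
        if tok.lemma != head_lemma or not tok.tags.startswith("vblex"):
            continue
        for dep in tokens:
            if (dep.head == tok.idx and dep.deprel == deprel
                    and dep.lemma in dep_lemmas):
                selected[tok.idx] = target
                break
    return selected


# Hand-annotated parse of "MPs spent almost six hours debating" (assumed).
sentence = [
    Token(1, "MP", "n.pl", 2, "nsubj"),
    Token(2, "spend", "vblex.past", 0, "root"),
    Token(3, "almost", "adv", 4, "advmod"),
    Token(4, "six", "num", 5, "nummod"),
    Token(5, "hour", "n.pl", 2, "dobj"),
    Token(6, "debate", "vblex.ger", 2, "xcomp"),
]

result = select_translation(
    sentence, "spend", {"minute", "hour", "year"}, "dobj", "pasar"
)
```

The point of the design is that the two empty `<match/>` placeholders disappear: the rule fires however many words sit between "spend" and its object, as long as the parser attaches the object to the verb.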
Week 5: writing wrappers:
You should be able to choose which parser to use for the stuff done in week 4 (and later); maybe you hate neural networks and are a MaltParser purist (or maybe you just don't have the time to train models with UDPipe). This involves writing a wrapper over popular parsers, so that Apertium pipelines use whichever one you specify. The underlying implementation should be invisible to the user; all they need to specify is "--parser udpipe" or "--parser maltparser". This is also a potential paper.
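One possible shape for such a wrapper is a common interface plus a name-to-backend registry, sketched below. Everything here is hypothetical: the `Parser` interface, the `register` decorator and `get_parser` are illustrative names, and the two backends are placeholders that return trivial trees instead of calling UDPipe or MaltParser.

```python
from abc import ABC, abstractmethod

_REGISTRY = {}


def register(name):
    """Class decorator that makes a backend available under `name`."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


class Parser(ABC):
    @abstractmethod
    def parse(self, tokens):
        """Return a list of (idx, form, head, deprel) tuples."""


@register("udpipe")
class UDPipeBackend(Parser):
    def parse(self, tokens):
        # Placeholder: a real backend would run UDPipe here.
        # This one just attaches every token to the first one.
        return [(i, t, 0 if i == 1 else 1, "root" if i == 1 else "dep")
                for i, t in enumerate(tokens, start=1)]


@register("maltparser")
class MaltParserBackend(Parser):
    def parse(self, tokens):
        # Placeholder: a real backend would run MaltParser here.
        # This one just attaches every token to the last one.
        n = len(tokens)
        return [(i, t, 0 if i == n else n, "root" if i == n else "dep")
                for i, t in enumerate(tokens, start=1)]


def get_parser(name):
    """Look up a backend by the value given to --parser."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown parser: {name!r}") from None
```

The calling code would only ever see `get_parser(name).parse(tokens)`, so swapping parsers would come down to changing the string passed on the command line.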
Week 6: adding "apertium features" to wrapper:
Allow the relevant parser to use features generated by Apertium (largely word translations) as additional features. For MaltParser, for example, this should combine seamlessly with configurations like ArcEager or CovingtonProjective.
Week 7: integrating wrappers within transfer rules:
More precision for reordering stuff: for instance, you could refer to a chunk as the "object" chunk and move that around (todo: example).
Week 8: more of week 7
Week 9: (ESSLLI?): bit more chill. Write plugins to make UD annotation simpler for the better text editors (read: vim).
Week 10: (ESSLLI?): TODO