Difference between revisions of "Vin-ivar/proposal ud apertium"

Revision as of 08:01, 3 April 2017

Work plan

Week 1: morphological feature conversion

Whilst mapping Apertium POS tags to UD's UPOSTAGs is fairly simple, converting morph features is a lot more annoying (and not completely doable). Use feature embeddings? HMMs? Evaluate approaches and pick one.
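The easy half of this, the POS mapping, is essentially a lookup table. Below is a minimal sketch and not the project's converter: the tables cover only a handful of common Apertium tags, the function name `convert_tags` is made up for illustration, and any tag without an entry is handed back separately rather than guessed at, which is where the hard part of the week starts.

```python
# Illustrative subset only: a real table would be per-language and much larger.
APERTIUM_TO_UPOS = {
    "n": "NOUN", "np": "PROPN", "vblex": "VERB", "adj": "ADJ",
    "adv": "ADV", "pr": "ADP", "det": "DET", "prn": "PRON",
    "num": "NUM", "cnjcoo": "CCONJ", "cnjsub": "SCONJ", "ij": "INTJ",
}

# Morphological tags that have an obvious UD feature/value counterpart.
APERTIUM_TO_FEATS = {
    "sg": ("Number", "Sing"), "pl": ("Number", "Plur"),
    "m": ("Gender", "Masc"), "f": ("Gender", "Fem"), "nt": ("Gender", "Neut"),
    "nom": ("Case", "Nom"), "acc": ("Case", "Acc"), "gen": ("Case", "Gen"),
}


def convert_tags(tags):
    """Map a list of Apertium tags to (UPOS, FEATS string, unmapped tags).

    The first tag is treated as the POS tag, the rest as morphological tags.
    """
    if not tags:
        return "X", "_", []
    upos = APERTIUM_TO_UPOS.get(tags[0], "X")
    feats, unmapped = {}, []
    for tag in tags[1:]:
        if tag in APERTIUM_TO_FEATS:
            name, value = APERTIUM_TO_FEATS[tag]
            feats[name] = value
        else:
            # No one-to-one equivalent known: leave it for a smarter method.
            unmapped.append(tag)
    feats_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    return upos, feats_str, unmapped
```

The unmapped list is the interesting output: it is the residue that embeddings, HMMs or whatever approach gets picked would have to deal with.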

Week 2: build on week one.

Converting morph features: finish off week 1, implement the whole thing, test on multiple languages.

Week 3: soft constraints - 1

If UDPipe assigns a lemma or a POS tag with a probability below a threshold value, use Apertium's solution instead. Annoyances: hacking UDPipe to figure out the softmax bit. Add this as a mode to a UDPipe fork, e.g. `udpipe --tag --threshold 0.8 --tagger ../apertium-swe/`. Also allow UDPipe to integrate other popular tokenisers (e.g. the Stanford word segmenter).
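The fallback rule itself is small once a per-token probability is available. A minimal sketch, assuming the probability has already been extracted from UDPipe (the awkward part noted above) and that Apertium may have no analysis at all for a token; `Guess` and `pick` are hypothetical names, not part of either tool.

```python
from typing import NamedTuple, Optional


class Guess(NamedTuple):
    value: str          # a lemma or a POS tag proposed by UDPipe
    probability: float  # model confidence in [0, 1]


def pick(udpipe_guess: Guess, apertium_value: Optional[str],
         threshold: float = 0.8) -> str:
    """Prefer UDPipe unless it is unsure and Apertium has an answer."""
    if udpipe_guess.probability < threshold and apertium_value is not None:
        return apertium_value
    return udpipe_guess.value
```

For example, `pick(Guess("NOUN", 0.55), "VERB")` returns `"VERB"`, while the same guess with no Apertium analysis falls back to `"NOUN"`.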

Week 4: soft constraints - 2

Continue week 3. GF soft constraints: if certain deprels are unlikely, use GF to parse the chunk. Use the relations GF returns (all of which are hard-coded for different rules).

Week 5: POS tagger improvements

A suite that lets you choose a POS tagger for Apertium and a UD treebank. Improve the tagger using everything built in the earlier weeks. Bonus: use closely related language treebanks in UDPipe; transfer the lemmas, assume the POS tags remain the same (reasonable assumption). Use ATT from week 6.

Week 6: stealing Apertium data

Based on the paper Fran and I did on parsing under-resourced treebanks using Apertium to compensate: modify UDPipe (and maybe MaltParser with MaltOptimizer) to read ATT files, and either add information to the MISC column or replace existing columns, for use as features where necessary. E.g. `udpipe --train --transducer apertium-es.att --method replace --col 2`, or `udpipe --train --transducer apertium-es.att --method add --col 3` (appends).
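What "replace" and "add" might do to a single CoNLL-U token line can be sketched as follows. The plan does not pin down the semantics, so everything here is an assumption: `col` is taken as a 0-based index into the ten CoNLL-U fields, "replace" overwrites the field, "add" appends with `|` unless the field is empty, and the `ApertiumTags` key in the usage note is invented for illustration. Looking the word form up in the transducer is left out entirely.

```python
def apply_analysis(token_line, value, method, col):
    """Return a CoNLL-U token line with `value` written into field `col`.

    Assumptions (not fixed by the plan): `col` is a 0-based index into the
    ten CoNLL-U fields, "replace" overwrites the field, and "add" appends
    to it with "|" unless the field is empty ("_").
    """
    if method not in ("replace", "add"):
        raise ValueError(f"unknown method: {method}")
    fields = token_line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U fields")
    if method == "replace" or fields[col] == "_":
        fields[col] = value
    else:
        fields[col] += "|" + value
    return "\t".join(fields)
```

With these assumptions, `apply_analysis(line, "gato", "replace", 2)` fills the LEMMA field, and `apply_analysis(line, "ApertiumTags=n.m.pl", "add", 9)` appends to MISC without disturbing what is already there.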

Week 7: writing wrappers

It can be annoying trying to evaluate a hypothesis with multiple parsers because you have to get used to all their idiosyncrasies. For instance, to evaluate MaltParser, you either need to use MaltEval, or create a parsed treebank and evaluate it yourself. UDPipe is nicer and does it automatically. This week involves writing a "wrapper" that lets you be blind to the underlying implementation: you supply the data and the arguments, and get parses or evaluations back. Important: this does _not_ do anything except provide a unified "language" for _calling_ parsers and evaluations. No format conversions, no fixes that belong upstream: all it does is spare you from learning a dozen different command formats for a dozen different parsers.
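One possible shape for such a wrapper is an abstract interface with one subclass per parser, each of which only knows how to spell the command line. This is a sketch under assumptions: the class and method names are invented here, the MaltParser jar path is whatever the user passes in, and the argument lists should be checked against each tool's manual before being relied on. Nothing is executed; the methods only build argument lists.

```python
from abc import ABC, abstractmethod


class ParserWrapper(ABC):
    """Uniform way of asking for a training or parsing command."""

    @abstractmethod
    def train_command(self, model, treebank):
        """Return the argv list that trains `model` on `treebank`."""

    @abstractmethod
    def parse_command(self, model, infile, outfile):
        """Return the argv list that parses `infile` into `outfile`."""


class UDPipeWrapper(ParserWrapper):
    def __init__(self, binary="udpipe"):
        self.binary = binary

    def train_command(self, model, treebank):
        return [self.binary, "--train", model, treebank]

    def parse_command(self, model, infile, outfile):
        return [self.binary, "--parse", f"--outfile={outfile}", model, infile]


class MaltParserWrapper(ParserWrapper):
    def __init__(self, jar):
        self.jar = jar  # path to the MaltParser jar, supplied by the user

    def train_command(self, model, treebank):
        return ["java", "-jar", self.jar, "-c", model, "-i", treebank,
                "-m", "learn"]

    def parse_command(self, model, infile, outfile):
        return ["java", "-jar", self.jar, "-c", model, "-i", infile,
                "-o", outfile, "-m", "parse"]
```

Calling code then looks the same for every parser, e.g. `subprocess.run(wrapper.train_command("sv", "sv-train.conllu"), check=True)`, and an evaluation method could be added to the interface in the same way.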

Week 8: extending wrappers

Allow the wrapper to use config files that let you exploit the "unique" features of each parser type, e.g. MaltParser has a config format it can take out of the box. Make something similar for all parsers covered. Document. Document the API for adding new parsers.

Week 9: [ESSLLI]

Write plugins for the best editors (¬emacs) to handle UD stuff nicely. Fork the vim highlight plugin to highlight the line which your deprel points to. Similar stuff for Sublime.

Week 10: [ESSLLI]

Contribute to annotator to import/export from HFST format. Check abugida support (if not already done).

Week 11: [Experiment]

Use knowledge of Apertium transfer rules to assist dependency parsing. For instance, a sentence that is non-projective in language xx might be projective (and therefore easier to parse) in yy. Translate xx to yy, parse yy, and translate back to xx (with deprels), forcing two things: (a) the original sentence: cache the transfer rules used and "reverse" them; and (b) the tree structure: use a reparsing algorithm to force trees to be well-formed.
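Deciding which sentences are candidates for this experiment needs a projectivity test. A minimal sketch using the standard definition (an arc is projective if every token between the head and the dependent is dominated by the head); the function name and the list-of-heads input format are choices made here for illustration.

```python
def is_projective(heads):
    """Return True if the dependency tree is projective.

    heads[i] is the 1-based index of the head of token i + 1;
    0 marks the artificial root.
    """
    n = len(heads)

    def dominated_by(token, ancestor):
        # Every token is dominated by the artificial root.
        if ancestor == 0:
            return True
        # Walk up the tree; the loop bound guards against cycles.
        for _ in range(n):
            token = heads[token - 1]
            if token == ancestor:
                return True
            if token == 0:
                return False
        return False

    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        # Every token strictly between head and dependent must sit under head.
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True
```

For instance, `is_projective([2, 0, 2])` holds, while `is_projective([3, 4, 0, 3])` does not, because the arc from token 4 to token 2 spans the root token 3.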

Week 12:

Documentation. Paper on weeks 7, 8 and 11 if possible.

@@ Line 1: / Line 1: @@
-Work plan
+== Work plan ==
-'''Work plan:
+'''Week 1:''' morphological feature conversion
-'''
-Week 1: morphological feature conversion
 Whilst mapping Apertium POS tags to UD's UPOSTAGs is fairly simple, converting morph features is a lot more annoying (and not completely doable). Use feature embeddings? HMMs? Evaluate approaches and pick one.
-Week 2: build on week one. Converting morph features: finish off week 1, implement the whole thing, test on multiple languages.
+'''Week 2:''' build on week one.
+Converting morph features: finish off week 1, implement the whole thing, test on multiple languages.
+'''Week 3:''' soft constraints - 1
-Week 3: soft constraints - 1:
 If UDPipe has a lemmatisation or a POS tag with a probability less than a threshold value, use Apertium's solution instead. Annoyances: hacking UDPipe to figure out the softmax bit. Add this as a mode to a UDPipe fork, eg. `udpipe --tag --threshold 0.8 --tagger ../apertium-swe/`. Also allow UDPipe to integrate other popular tokenisers (eg. the Stanford word segmenter).
-Week 4: soft constraints - 2:
+'''Week 4:''' soft constraints - 2
 Continue week 3. GF soft constraints: if certain deprels are unlikely, use GF to parse the chunk. Use the relations GF returns (all of which are hard-coded for different rules).
-Week 5: POS tagger improvements:
+'''Week 5:''' POS tagger improvements
 Suite that lets you choose a POS tagger for apertium and a UD treebank. Improve the tagger based on all the stuff you have.
 Bonus: use closely related language treebanks in UDPipe; transfer the lemmas, assume the POS tags remain the same (reasonable assumption). Use ATT from week 6.
-Week 6: stealing Apertium data
+'''Week 6:''' stealing Apertium data
-Based on the paper Fran and I did on parsing underresourced treebanks using Apertium to compensate - modify UDPipe (and MaltParser w/ MaltOptimizer maybe) to read ATT files, and add information to the MISC column/replace existing columns, for use as features where necessary. eg: `udpipe --train --transducer apertium-es.att --method replace --col 2`, or `udpipe --train --transducer apertium-es.att --method add`.
+Based on the paper Fran and I did on parsing underresourced treebanks using Apertium to compensate - modify UDPipe (and MaltParser w/ MaltOptimizer maybe) to read ATT files, and add information to the MISC column/replace existing columns, for use as features where necessary. eg: `udpipe --train --transducer apertium-es.att --method replace --col 2`, or `udpipe --train --transducer apertium-es.att --method add --col 3` (appends).
+'''Week 7:''' writing wrappers:
-Week 7: writing wrappers:
 It can be annoying trying to evaluate a hypothesis with multiple parsers because you have to get used to all their idiosyncrasies. For instance, to evaluate MaltParser, you either need to use MaltEval, or create a parsed treebank and evaluate it yourself. UDPipe is nicer and does it automatically. This involves writing a "wrapper" that allows you to be blind to the underlying implementation; you supply it whatever you need to supply it, give it arguments, and get parses or evaluations. Important: this does _not_ do anything except provide a unified "language" for _calling_ parsers, and evaluations. No format conversions, no fixes that belong upstream: all it does is not make you have to learn a dozen different command formats for a dozen different parsers.
-Week 8: extending wrappers:
+'''Week 8:''' extending wrappers:
 Allow the wrapper to use config files that allow you to exploit "unique" features of each parser type. eg. MaltParser has a config format it can take out of the box. Make something similar for all parsers covered. Document. Document API for adding new parsers.
-Week 9: [Chill] [ESSLLI?]
+'''Week 9:''' [ESSLLI]
 Write plugins for the best editors (¬emacs) to handle UD stuff nicely. Fork the vim highlight plugin to highlight the line which your deprel points to. Similar stuff for Sublime.
+'''Week 10:'''[ESSLLI]
-Week 10: [Chill] [ESS[ential for sanity | LLI]]
 Contrib to annotator to import/export from HFST format. Check abugida support (if not already done)
-Week 11: [Experiment]
+'''Week 11:''' [Experiment]
 Use knowledge of Apertium transfer rules to assists dependency parsing. For instance - a sentence that is non-projective in language xx might be projective (and therefore easier to parse) in yy. Translate xx to yy, parse yy, and translate back to yy (with deprels), forcing two things: (a) the original sentence; cache the transfer rules used and "reverse" them and (b) the tree structure: use a reparsing algorithm to force trees to be well-formed.
+'''Week 12:'''
-Week 12: Documentation. Paper on weeks 7, 8 and 11 if possible.
+Documentation. Paper on weeks 7, 8 and 11 if possible.
