User:Vin-ivar/proposal

Name: Vinit Ravishankar

Email: vinit.ravishankar@gmail.com

IRC: vin-ivar

Why is it that you are interested in machine translation?

I’m natively trilingual. Languages are fun, computers are fun - the combination, multilingual stuff, is more fun than the sum of its parts. Machine translation is especially fun because it also involves dealing with interesting humans quite a bit.

Why is it that you are interested in Apertium?

Well-resourced languages can be boring, and don't leave much room for contribution. Rule-based MT is more linguisticky and falls in a more “happy” zone between CS and linguistics than statistical methods do. RBMT is also more useful to me from a pragmatic perspective; statistical methods do really badly on one of my native languages (Marathi).

Which of the published tasks are you interested in? What do you plan to do?

Interfaces between Apertium and Universal Dependencies (UD). I plan to create a suite of tools, scripts and modifications to UD parsers to allow Apertium to use UD data, and UD to use Apertium information more easily; this can be built on for research on improving parsing (for UD), or used to improve translation (for Apertium).

Why should Google and Apertium sponsor it?

Google ought to sponsor it because dependencies are really useful, and the next “big thing”. Google Translate also has a bit of a focus on better-resourced languages; Apertium and UD are both a lot more accepting of moderately resourced and underresourced languages than Google’s systems are. The Universal Dependencies project itself is a fantastic collaboration and standards-setting community that makes it so much easier for linguists/nerds to hack on treebanks for underresourced languages. Apertium has always had a focus on underresourced languages. All of this is open source. Throwing money at this means Google doesn’t need to throw money at Google engineers.

Apertium ought to sponsor it because it can be awfully useful for Apertium to use resources like UD all over the place. UD is a potential path to a solution for reordering long-distance constituents. UD parse trees have a tonne of information that can be used for different stuff, like lexical selection. Further, integrating Apertium within UD and publishing research on aiding parsing with Apertium generates a) research and b) exposure for Apertium.

Work plan

Bonding period: get used to UDPipe’s code.

Week 1: morphological feature conversion

Whilst mapping Apertium POS tags to UD's UPOSTAGs is fairly simple, converting morph features is a lot more annoying (and not completely doable). Use feature embeddings? HMMs? Evaluate approaches and pick one.
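The "fairly simple" POS side, and the simplest possible baseline for features, can be sketched as a table lookup. This is a minimal sketch, not the approach the proposal will settle on: the tables are a small illustrative subset, the function name `convert` is made up for the example, and anything the table cannot handle is returned separately so that the gaps stay visible.

```python
import re

# Illustrative subset of an Apertium-to-UD mapping (an assumption for this
# sketch); a real table would be per-language and much larger.
UPOS = {"n": "NOUN", "np": "PROPN", "vblex": "VERB", "adj": "ADJ",
        "adv": "ADV", "pr": "ADP", "prn": "PRON", "det": "DET",
        "cnjcoo": "CCONJ", "cnjsub": "SCONJ", "num": "NUM", "ij": "INTJ"}
FEATS = {"sg": ("Number", "Sing"), "pl": ("Number", "Plur"),
         "m": ("Gender", "Masc"), "f": ("Gender", "Fem"),
         "nt": ("Gender", "Neut")}


def convert(analysis):
    """Convert e.g. 'casa<n><f><sg>' to (lemma, UPOS, FEATS, unmapped tags)."""
    lemma = analysis.split("<", 1)[0]
    tags = re.findall(r"<([^>]+)>", analysis)
    feats, unmapped = {}, []
    upos = "X"  # UD's catch-all tag when the first tag is unknown or missing
    if tags:
        if tags[0] in UPOS:
            upos = UPOS[tags[0]]
        else:
            unmapped.append(tags[0])
    for tag in tags[1:]:
        if tag in FEATS:
            name, value = FEATS[tag]
            feats[name] = value
        else:
            unmapped.append(tag)  # the "not completely doable" part
    # UD writes features sorted by name and joined with '|', or '_' if empty.
    feat_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    return lemma, upos, feat_str, unmapped
```

Tags that end up in the unmapped list are exactly the cases where something smarter than a lookup would be needed.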

Week 2: build on week one.

Converting morph features: finish off week 1, implement the whole thing, test on multiple languages.

Week 3: allow stealing Apertium data

Based on the paper Fran and I did on parsing underresourced treebanks using Apertium to compensate: modify UDPipe (and maybe MaltParser w/ MaltOptimizer) to read ATT files, and either add information to the MISC column or replace existing columns, for use as features where necessary. E.g. `udpipe --train --transducer apertium-es.att --method replace --col 2`, or `udpipe --train --transducer apertium-es.att --method add --col 3` (appends).
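To make the idea concrete, here is a rough Python sketch of the two halves: reading an ATT file and appending the resulting analyses to the MISC column of a CoNLL-U line. It is not UDPipe code. It assumes the transducer's input side is the surface form and its output side the analysis, that the start state is `0`, and that there are no epsilon loops that keep emitting output (a length cap guards against them). The function names and the `Apertium` MISC key are invented for the example.

```python
from collections import defaultdict

# Epsilon spellings seen in ATT dumps from different tools (assumption).
EPSILON = {"@0@", "ε", "@_EPSILON_SYMBOL_@"}


def read_att(lines):
    """Parse ATT lines: 'src dst in out [weight]' arcs, 'state [weight]' finals."""
    transitions = defaultdict(list)
    finals = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            src, dst, isym, osym = fields[:4]
            transitions[src].append((dst, isym, osym))
        elif fields[0]:
            finals.add(fields[0])
    return transitions, finals


def analyse(form, transitions, finals, start="0", max_out=200):
    """Return every output string the transducer produces for `form`."""
    results, seen = set(), set()
    stack = [(start, 0, "")]
    while stack:
        item = stack.pop()
        state, pos, out = item
        if item in seen or len(out) > max_out:
            continue
        seen.add(item)
        if pos == len(form) and state in finals:
            results.add(out)
        for dst, isym, osym in transitions.get(state, ()):
            emitted = "" if osym in EPSILON else osym
            if isym in EPSILON:
                stack.append((dst, pos, out + emitted))
            elif form.startswith(isym, pos):
                stack.append((dst, pos + len(isym), out + emitted))
    return sorted(results)


def add_to_misc(conllu_line, key, values):
    """Append key=v1,v2 to the MISC column (10th) of a CoNLL-U token line."""
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10 or not values:
        return conllu_line
    extra = f"{key}={','.join(values)}"
    cols[9] = extra if cols[9] == "_" else cols[9] + "|" + extra
    return "\t".join(cols)
```

The "replace" method from the example commands would overwrite a column instead of appending to MISC; the lookup part stays the same.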

Week 4: soft constraints - 1

If UDPipe produces a lemma or a POS tag with a probability below a threshold value, use Apertium's solution instead. Annoyances: hacking UDPipe to figure out the softmax bit. Add this as a mode to a UDPipe fork, e.g. `udpipe --tag --threshold 0.8 --tagger ../apertium-swe/`. Also allow UDPipe to integrate other popular tokenisers (e.g. the Stanford word segmenter).
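The decision rule itself is tiny; the hard part is getting a probability out of UDPipe. The Python sketch below only illustrates the rule, under the assumption that per-label scores are available and can be turned into probabilities with a softmax. The function names are made up and nothing here is UDPipe's actual API.

```python
import math


def softmax(scores):
    """Turn raw per-label scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_constraint(udpipe_value, udpipe_prob, apertium_value, threshold=0.8):
    """Keep UDPipe's answer unless it is unsure and Apertium has one."""
    if udpipe_prob < threshold and apertium_value is not None:
        return apertium_value
    return udpipe_value
```

For example, `soft_constraint("NOUN", 0.55, "VERB")` falls back to Apertium's `"VERB"`, while a UDPipe probability of 0.95 keeps `"NOUN"`.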

---

Deliverable #1 - scripts to quickly convert a massive HFST format block to CoNLL-U, along with converting morph features/POS tags. UDPipe branch that can read ATT and use it, and a soft constraint modification that allows overriding UDPipe if Apertium is better.

---

Week 5: writing wrappers:

(Complete stuff from week 4.) It can be annoying trying to evaluate a hypothesis with multiple parsers because you have to get used to all their idiosyncrasies. For instance, to evaluate MaltParser, you either need to use MaltEval, or create a parsed treebank and evaluate it yourself. UDPipe is nicer and does it automatically. This week involves writing a "wrapper" that allows you to be blind to the underlying implementation; you supply it whatever you need to supply it, give it arguments, and get parses or evaluations. Important: this does _not_ do anything except provide a unified "language" for _calling_ parsers and evaluations. No format conversions, no fixes that belong upstream: all it does is hide the underlying implementations, so that you don't have to learn a dozen different command formats for a dozen different parsers. Parsers must meet a set of necessary conditions to be included in the wrapper, such as support for CoNLL-U.
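A minimal Python sketch of what "a unified language for calling parsers" could look like: one function that turns (parser, action, paths) into a command line and does nothing else. The command templates are illustrative and should be checked against each parser's manual, `maltparser.jar` is a placeholder path, and `build_command` is a name invented for the example.

```python
import shlex

# One template per (parser, action). Illustrative only: verify the flags
# against each parser's own documentation before relying on them.
TEMPLATES = {
    "udpipe": {
        "train": "udpipe --train {model} {train}",
        "parse": "udpipe --parse {model} {input}",
    },
    "maltparser": {
        "train": "java -jar maltparser.jar -c {model} -i {train} -m learn",
        "parse": "java -jar maltparser.jar -c {model} -i {input} "
                 "-o {output} -m parse",
    },
}


def build_command(parser, action, **paths):
    """Return an argv list for the given parser and action."""
    try:
        template = TEMPLATES[parser][action]
    except KeyError:
        raise ValueError(f"unsupported: {parser} {action}") from None
    # Split first, then fill in, so paths containing spaces stay one argument.
    return [part.format(**paths) for part in shlex.split(template)]
```

The returned list can be passed straight to `subprocess.run`; keeping command construction separate from execution makes the wrapper easy to test without any parser installed.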

Week 6: extending wrappers:

Allow the wrapper to use config files that allow you to exploit "unique" features of each parser type, i.e. go beyond the bare functionality the wrapper provides. eg. MaltParser has a config format it can use as an argset out of the box. Make something similar for all parsers covered. Document. Document API for adding new parsers and for maintaining the wrapper.
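One possible shape for such a config file, sketched in Python with the standard `configparser` module: an INI file with one section per parser and one `<action>_extra` option holding parser-specific arguments that the wrapper passes through untouched. The file layout, the option naming and the function name are all assumptions made for this sketch.

```python
import configparser
import shlex


def load_extra_args(config_text, parser, action):
    """Return the extra argv entries configured for a parser/action pair."""
    # interpolation=None so that '%' in an argument is taken literally.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(config_text)
    raw = config.get(parser, f"{action}_extra", fallback="")
    return shlex.split(raw)


EXAMPLE = """
[udpipe]
train_extra = --heldout=dev.conllu
"""
```

With the example above, `load_extra_args(EXAMPLE, "udpipe", "train")` yields `["--heldout=dev.conllu"]`, and any parser or action without an entry yields an empty list, i.e. the bare wrapper behaviour.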

Week 7: [ESSLLI]

Write plugins for the best editors (¬emacs) to handle UD stuff nicely. Fork the vim highlight plugin to highlight the line which your deprel points to. Similar stuff for Sublime.
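The core of the "highlight the head's line" feature is editor-independent: given the buffer and the cursor line, find the line of the token that the HEAD column points to. A small Python sketch, assuming standard ten-column CoNLL-U with blank lines between sentences; `head_line` is a made-up name and the editor glue (vim, Sublime) is left out.

```python
def head_line(lines, cursor):
    """Return the index of the line holding the head of the token at
    `cursor`, or None for root tokens, comments and malformed lines."""
    cols = lines[cursor].split("\t")
    # HEAD is the 7th column; '0' means root, '_' is used for multiword tokens.
    if len(cols) != 10 or not cols[6].isdigit() or cols[6] == "0":
        return None
    head = cols[6]
    # Sentences are separated by blank lines: find this sentence's bounds.
    start = cursor
    while start > 0 and lines[start - 1].strip():
        start -= 1
    end = cursor
    while end + 1 < len(lines) and lines[end + 1].strip():
        end += 1
    for i in range(start, end + 1):
        if lines[i].split("\t", 1)[0] == head:
            return i
    return None
```

Restricting the search to the current sentence matters because token IDs restart at 1 in every sentence.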

Week 8: [ESSLLI] soft constraints - 2

Continue week 4. GF soft constraints: if certain deprels are unlikely, use GF to parse the chunk. Use the relations GF returns (all of which are hard-coded for different rules).

---

Deliverable #2 - Complete wrapper module for at least two parsers. Set of plugins and scripts to interface with GF.

---

Week 9: POS tagger improvements

Suite that lets you choose a POS tagger for apertium and a UD treebank. Improve the tagger based on all the stuff you have. Bonus: use closely related language treebanks in UDPipe; transfer the lemmas, assume the POS tags remain the same (reasonable assumption). Use ATT from week 3.

Week 10: annotator improvements

Contribute import/export for the HFST format to the annotator. Check abugida support (if not already done).

Week 11: [Experiment]

Use knowledge of Apertium transfer rules to assist dependency parsing. For instance, a sentence that is non-projective in language xx might be projective (and therefore easier to parse) in yy. Translate xx to yy, parse yy, and translate back to xx (with deprels), forcing two things: (a) the original sentence: cache the transfer rules used and "reverse" them; and (b) the tree structure: use a reparsing algorithm [1, 2] to force trees to be well-formed.
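Deciding which sentences are worth routing through yy needs a projectivity test. Below is a minimal Python sketch using the crossing-arcs definition, with the artificial root placed at position 0; `heads[i]` is the head of token `i + 1`, as in the CoNLL-U HEAD column. The function name is invented for the example, and this is only the check, not the reparsing step.

```python
from itertools import combinations


def is_projective(heads):
    """True if no two dependency arcs cross (root counted as position 0)."""
    # Store each arc as (left end, right end), ignoring direction.
    arcs = [tuple(sorted((head, dep)))
            for dep, head in enumerate(heads, start=1)]
    for (a, b), (c, d) in combinations(arcs, 2):
        # Two arcs cross when exactly one endpoint of one lies strictly
        # inside the span of the other.
        if a < c < b < d or c < a < d < b:
            return False
    return True
```

For instance, `is_projective([2, 0, 2])` holds, while `is_projective([3, 0, 2])` does not: the arc from token 3 to token 1 spans the root token without dominating it.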

Week 12: Documentation.

Paper on weeks 5, 6 and 11 if possible. Tie up loose ends.

---

Final deliverable - POS tagger improvement scripts. Patch fixes. Documentation for everything designed so far. Results of experiment, if successful. NB: all of this is subject to change if future events result in me adding/scrapping things.

---

List your skills and give evidence of your qualifications:

I’ve been working in language technology for around two years now. I’ve been involved with Apertium and UD for a while, where I’ve worked on several things: a morphological analyser, (WIP) UD treebank, (WIP) dependency parser, standards for Indic languages, transliterator. I mentored a student working on apertium-lint last year. I currently study computational linguistics at the University of Malta.

List any non-Summer-of-Code plans you have for the summer:

I might head to ESSLLI at the end of July if my university fund it. If not, no other major plans as of now.