User:Francis Tyers/Apertium 4
Wish list for Apertium 4:
Software engineering
- Make the different parts of the engine code more coherent in terms of repository (GitHub) modules
- Better test coverage and testing of existing (particularly newer) modules
- Good continuous integration, for both ML-based and rule-based modules.
Linguistic data
- Extract multiwords from lexicons into "separable" FSTs
- Train taggers for all languages using available corpora and a target-language model (TLM)
- At least one language pair that is state of the art (compared with Google Translate), using all available modules.
- All monolingual and bilingual dictionaries weighted.
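A hedged sketch of what a weighted bilingual dictionary entry could look like. The w= attribute is illustrative here (lttoolbox has been gaining support for weighted transducers, but the exact notation should be checked against the current format); the convention assumed is that a lower weight marks the preferred translation:

    <!-- two translations of "bank", weighted (illustrative notation) -->
    <e w="0.0"><p><l>bank<s n="n"/></l><r>banco<s n="n"/></r></p></e>
    <e w="1.5"><p><l>bank<s n="n"/></l><r>orilla<s n="n"/></r></p></e>

The same idea applies to monolingual dictionaries, where weights would rank competing analyses of a surface form.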
Engine
- Better support for Unicode in lttoolbox.
- Use embeddings for morphological disambiguation and lexical selection
- Surface forms in the pipe: pass the surface form through the pipeline as far as transfer, so that modules can look up surface-form embeddings (see the stream sketch after this list)
- Retire the HMM tagger
- We currently have at least 3-4 classifiers (HMM, Perceptron, MaxEnt, ...). This is unnecessary. We should have a single discriminative classifier (an MLP?) for all classification needs; the individual modules should be frontends to it.
  - Tagger, lexical selection, ...
- Weights in the pipeline: carry weights throughout the pipeline
- Be able to train weights for morphological analysis + morphological disambiguation + lexical selection + transfer, end to end.
  - e.g. can we treat the modules of the pipeline as a neural net and train the weights for them via backprop? (see the PyTorch sketch after this list)
- Fully functional recursive transfer
- Per-session state: this could be stored in something like a special blank that can be updated, and might contain things like the domain, etc.
- Incorporating guessing into Apertium (no word left untranslated!)
- Rethinking tokenisation in the pipeline: support for languages that don't write spaces
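A hedged sketch of the "surface forms in the pipe" idea using the standard Apertium stream format. The analyser already emits ^surface/analysis$; the convention sketched here, keeping the surface form in the tagger's output until transfer, is one possible design rather than the settled one:

    analyser output:         ^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$
    tagger output (today):   ^vino<n><m><sg>$
    tagger output (sketch):  ^vino/vino<n><m><sg>$   (surface form kept so later modules can look up surface-form embeddings)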
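A minimal sketch of the single-discriminative-classifier and end-to-end-weights ideas above, in PyTorch. Everything here (feature vectors, label sets, the two heads) is a toy assumption; it only illustrates how one shared MLP could serve both the tagger and lexical selection, with the weights of both trained jointly by backprop:

    # Hypothetical sketch: one shared MLP scores candidates for both
    # morphological disambiguation and lexical selection; gradients flow
    # through both heads, so the weights are trained end to end.
    import torch
    import torch.nn as nn

    class SharedClassifier(nn.Module):
        def __init__(self, n_features, n_tags, n_translations, hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
            self.tag_head = nn.Linear(hidden, n_tags)          # tagger frontend
            self.lex_head = nn.Linear(hidden, n_translations)  # lexical selection frontend

        def forward(self, feats):
            h = self.body(feats)
            return self.tag_head(h), self.lex_head(h)

    # Toy training step over made-up feature vectors and gold labels.
    model = SharedClassifier(n_features=10, n_tags=5, n_translations=3)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    feats = torch.randn(8, 10)               # 8 tokens, 10 features each
    tag_gold = torch.randint(0, 5, (8,))     # gold morphological tags
    lex_gold = torch.randint(0, 3, (8,))     # gold translation choices

    opt.zero_grad()
    tag_scores, lex_scores = model(feats)
    loss = (nn.functional.cross_entropy(tag_scores, tag_gold)
            + nn.functional.cross_entropy(lex_scores, lex_gold))
    loss.backward()                          # both tasks update the shared weights
    opt.step()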
Neural system
- apertium-neural: there should be a basic NMT implementation that functions in the Apertium ecosystem (C++, autotools, bash, APy, html-tools), for communities that want to build their own NMT systems and still take advantage of our ecosystem. We should be a one-stop shop for MT for marginalised languages.
Data creation and curation
- Automatic multiword extraction using parallel corpora
- Recursive rule extraction
- Per-word lexical selection rule extraction (see the sketch after this list)
- Integrate training into the make/test cycle, to avoid annotated data getting stale.
- Get rid of the incubator/nursery/staging/etc. and move to a system of qualifying non-released language pairs by more objective means: coverage, diagnostics, WER, entries, etc.
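A minimal sketch of per-word lexical selection rule extraction. The word-aligned data and the rule format below are toy assumptions; the point is just that, given which translation was chosen for an ambiguous word in each aligned sentence, counting its context words yields candidate per-word rules:

    # Hypothetical sketch: count, for each context word, which translation of
    # the ambiguous word "bank" it co-occurs with in word-aligned data, and
    # keep unanimous, frequent context words as candidate rules.
    from collections import Counter, defaultdict

    # (source sentence, translation chosen for "bank" in the aligned target)
    aligned = [
        ("we sat on the bank of the river", "orilla"),
        ("she works at the bank downtown", "banco"),
        ("the bank raised interest rates", "banco"),
        ("fishing from the river bank is allowed", "orilla"),
    ]

    counts = defaultdict(Counter)
    for sentence, translation in aligned:
        for context_word in sentence.split():
            if context_word != "bank":
                counts[context_word][translation] += 1

    for word, c in counts.items():
        best, n = c.most_common(1)[0]
        if n >= 2 and n == sum(c.values()):
            print(f"if '{word}' in context: bank -> {best}")
    # prints: if 'river' in context: bank -> orilla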
End user
- Format handling
- User-based chunking: better treatment of "no-translate"
- User dictionaries
- Better handling of code-switching/mixed texts and informal text.
- Confidence scores for translations
Research
- Neural MT benefits from decoupling target-language generation from source-language analysis: what it generates is fluent, even if it doesn't correspond closely to the input. RBMT output is often incoherent because we don't take the target language into account. There has been research into using target-language models (TLMs) with RBMT, but mostly for reranking N-best lists. Could we do something more clever? Unsupervised neural rewriting of RBMT output? (A baseline reranking sketch follows.)
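A minimal sketch of the baseline this would improve on, i.e. reranking an N-best list of RBMT outputs with a target-language model. The bigram model and the candidate sentences are toy assumptions:

    # Hypothetical sketch: score RBMT N-best candidates with a tiny bigram
    # target-language model (add-one smoothing) and pick the most fluent one.
    import math
    from collections import Counter

    target_corpus = [
        "the bank of the river was muddy",
        "she walked along the river bank",
        "the river bank was covered in grass",
    ]

    unigrams, bigrams = Counter(), Counter()
    for line in target_corpus:
        tokens = ["<s>"] + line.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = len(unigrams)

    def tlm_logprob(sentence):
        tokens = ["<s>"] + sentence.split()
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(tokens, tokens[1:]))

    # N-best list from a (hypothetical) RBMT system for one input sentence.
    nbest = [
        "the bank of the river was muddy",
        "the money bank of the river was muddy",
    ]
    print(max(nbest, key=tlm_logprob))   # the candidate the TLM prefers

Unsupervised neural rewriting would go further, editing the RBMT output itself rather than just choosing among candidates.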