User:Francis Tyers/Apertium 4

Software engineering

Organise the different parts of the engine code more coherently across repositories (GitHub) and modules
Better test coverage and testing of existing (particularly newer) modules

Linguistic data

Extract multiwords from lexicons into "separable" FSTs
Train taggers for all languages using available corpora and a target-language model (TLM)
At least one state-of-the-art language pair (with respect to Google) using all available modules.
All monolingual dictionaries and bilingual dictionaries weighted.

Engine

Better support for Unicode in lttoolbox.
Use embeddings for morphological disambiguation and lexical selection
Surface forms in the pipe: Pass the surface form along the pipe up to transfer (so that modules can look up surface-form embeddings)
Retire the HMM tagger
We currently have at least 3-4 classifiers (HMM, Perceptron, MaxEnt, ...). This is unnecessary. We should have a single discriminative classifier for all classification needs (MLP?), and the modules should be frontends to it.
- Tagger, lexical selection, ...
Weights throughout the pipeline
Be able to train weights for morphological analysis + morphological disambiguation + lexical selection + transfer end to end.
- e.g. can we treat the modules of the pipeline as a neural net and train the weights for them via backprop?
Fully functional recursive transfer
Per-session state: this could be stored in something like a special blank that can be updated. It might contain things like the domain.
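
As a rough illustration of what "weights throughout the pipeline" could mean, the sketch below treats each module as offering alternatives with a cost and picks the path whose summed cost is lowest. Everything here is hypothetical: the data, the function name `best_path`, and the assumption that weights are additive costs where lower is better. It is not an existing Apertium API, and it does not address how the weights would be trained.

```python
# Hypothetical module outputs. Each alternative carries a cost (lower is better);
# the values are invented for illustration.
analyses = [
    ("house<n><sg>", 0.2),
    ("house<vblex><inf>", 1.5),
]
lexical_choices = {
    "house<n><sg>": [("casa<n><f><sg>", 0.1), ("hogar<n><m><sg>", 0.9)],
    "house<vblex><inf>": [("alojar<vblex><inf>", 0.4)],
}


def best_path(analyses, lexical_choices):
    """Return (analysis, translation, total_cost) with the lowest summed cost.

    Costs from successive modules are simply added, so a cheap analysis
    can still lose if all of its translations are expensive.
    """
    paths = []
    for analysis, analysis_cost in analyses:
        for translation, translation_cost in lexical_choices.get(analysis, []):
            paths.append((analysis, translation, analysis_cost + translation_cost))
    return min(paths, key=lambda path: path[2])


best = best_path(analyses, lexical_choices)
print(best[0], "->", best[1])
```

With weights carried all the way through, a later module can overturn an earlier module's preferred choice, which is not possible when each stage commits to a single output before passing it on.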

Neural system

apertium-neural: There should be a basic NMT implementation that functions in the Apertium ecosystem (C++, autotools, bash, apy, html-tools) for communities that want to build their own NMT systems and still take advantage of our ecosystem. We should be a one-stop shop for MT for marginalised languages.

Data creation

Automatic multiword extraction using parallel corpora
Recursive rule extraction
Per-word lexical selection rule extraction

End user

Format handling
User-based chunking: Better treatment of "no-translate"
User dictionaries
Better handling of code-switching/mixed texts and informal text.
Confidence scores for translations

Research

Neural MT benefits from decoupling TL generation from SL analysis, so that what is generated is fluent even if it does not correspond to the input. RBMT often has incoherent output because we don't take the target language into account. There has been research into using TLMs with RBMT, but mostly for N-best list reranking. Could we do something more clever? Unsupervised neural rewriting of RBMT output(?)

Tanmai Khanna (talk) Isn't that also one of the disadvantages of neural models with strong language models that prefer fluency over adequacy? At least in RBMT we can guarantee not giving up adequacy for fluency.
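
For reference, the N-best list reranking mentioned above can be sketched in a few lines. This is a toy version under stated assumptions: the target-language model is an add-one-smoothed bigram model trained on three invented sentences, and the names `train_bigram_lm`, `log_prob` and `rerank` are hypothetical, not part of any Apertium tool.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count bigrams and their histories over whitespace-tokenised sentences."""
    histories, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])
    # One extra vocabulary slot stands in for unseen words.
    return histories, bigrams, len(vocab) + 1


def log_prob(sentence, lm):
    """Add-one-smoothed bigram log-probability of a sentence."""
    histories, bigrams, vocab_size = lm
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (histories[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def rerank(candidates, lm):
    """Order candidate translations from most to least fluent under the LM."""
    return sorted(candidates, key=lambda c: log_prob(c, lm), reverse=True)


lm = train_bigram_lm(["the cat sleeps", "the dog sleeps", "a cat eats"])
ranked = rerank(["cat the sleeps", "the cat sleeps"], lm)
print(ranked[0])
```

A reranker of this kind only reorders candidates the RBMT system already produced, so it cannot introduce content that is absent from every candidate. Note also that summed log-probabilities favour shorter candidates, so a real setup would need some form of length normalisation.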
