Difference between revisions of "User:Francis Tyers/Apertium 4"


Revision as of 21:27, 21 June 2020

Wish list for Apertium 4:

Software engineering

  • Make the different parts of the engine code more coherent in terms of modules
  • Better test coverage and testing of existing (particularly newer) modules

Linguistic data

  • Extract multiwords from lexicons into "separable" FSTs
  • Train taggers for all languages using available corpora and a TLM
  • At least one language pair with state-of-the-art quality (relative to Google)

Engine

  • Better support for Unicode in lttoolbox.
  • Use embeddings for morphological disambiguation and lexical selection
  • Pass the surface form until transfer (to allow modules to look up surface form embeddings)
  • Retire the HMM tagger
  • Be able to train weights for morphological analysis, morphological disambiguation, lexical selection and transfer end to end
    • For example, could the modules of the pipeline be treated as a neural net, with their weights trained via backprop?
  • Fully functional recursive transfer
  • Per-session state: this could be stored in something like a special blank that could be updated. It might contain things like the domain.
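
The per-session state idea above could be sketched roughly as follows. This is only an illustration: the `[session:key=value;...]` blank syntax and the function names are hypothetical and are not part of the Apertium stream format as it exists today. The only thing taken from the existing format is that blanks are written in square brackets and lexical units between `^` and `$`.

```python
import re

# Hypothetical convention (an assumption, not an existing Apertium feature):
# a blank of the form [session:key=value;key=value] carries the session state.
SESSION_BLANK = re.compile(r"\[session:([^\]]*)\]")


def read_session_state(stream):
    """Return the key/value pairs of the first session blank, or {} if absent."""
    match = SESSION_BLANK.search(stream)
    if match is None:
        return {}
    state = {}
    for item in match.group(1).split(";"):
        if "=" in item:  # silently skip malformed items
            key, value = item.split("=", 1)
            state[key] = value
    return state


def write_session_state(stream, state):
    """Replace the first session blank with `state`, or prepend one if missing."""
    body = ";".join(f"{key}={value}" for key, value in sorted(state.items()))
    blank = f"[session:{body}]"
    if SESSION_BLANK.search(stream):
        # A function replacement avoids backslash processing in `blank`.
        return SESSION_BLANK.sub(lambda _match: blank, stream, count=1)
    return blank + stream


def update_session_state(stream, **updates):
    """Let a module update the session state as the stream passes through it."""
    state = read_session_state(stream)
    state.update(updates)
    return write_session_state(stream, state)
```

A module in the pipeline could then call, for instance, `update_session_state(stream, domain="medical")` and pass the result on, so that later modules can read the domain without any side channel.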

Neural system

  • apertium-neural: There should be a basic NMT implementation that functions in the Apertium ecosystem (C++, autotools, bash, apy, html-tools) for communities that want to build their own NMT systems and still take advantage of our ecosystem. We should be a one-stop shop for MT for marginalised languages.

Data creation

  • Automatic multiword extraction using parallel corpora
  • Recursive rule extraction
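
One simple way to approach multiword extraction from a parallel corpus is sketched below, purely as an illustration and not as the method planned for Apertium. It assumes whitespace-tokenised sentence pairs and uses sentence-level co-occurrence with a Dice score: a source bigram that almost always appears together with one particular target word is proposed as a multiword candidate. The function name, the `min_count` and `threshold` parameters and their defaults are all assumptions; a real system would more likely start from proper word alignments.

```python
from collections import Counter
from itertools import product


def multiword_candidates(pairs, min_count=2, threshold=0.8):
    """Propose (source bigram, target word, Dice score) multiword candidates.

    `pairs` is an iterable of (source sentence, target sentence) strings,
    tokenised on whitespace.
    """
    bigram_count = Counter()  # sentences containing each source bigram
    word_count = Counter()    # sentences containing each target word
    joint_count = Counter()   # sentence pairs containing both

    for source, target in pairs:
        source_tokens = source.split()
        bigrams = set(zip(source_tokens, source_tokens[1:]))
        words = set(target.split())
        bigram_count.update(bigrams)
        word_count.update(words)
        joint_count.update(product(bigrams, words))

    candidates = []
    for (bigram, word), joint in joint_count.items():
        if joint < min_count:
            continue  # too rare to trust
        dice = 2 * joint / (bigram_count[bigram] + word_count[word])
        if dice >= threshold:
            candidates.append((" ".join(bigram), word, dice))

    # Best score first, then alphabetical for a stable order.
    return sorted(candidates, key=lambda c: (-c[2], c[0], c[1]))
```

On a toy corpus where Spanish "sin embargo" is consistently rendered as English "however", the bigram would be returned as a candidate, while "sin" in other contexts would not pull in unrelated bigrams.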

End user

  • Format handling
  • User-based chunking: better treatment of "no-translate"
  • User dictionaries
  • Better handling of code-switched (mixed-language) text and informal text