User:Nikita Medyankin/GSoC 2016 WTR Plan


Community Bonding Period (22nd April–22nd May)

  • Get acquainted with the full rules application workflow through all the .tNx stages.
  • Set up a toy ru->en language pair with a small set of ambiguous rules and a small test corpus in order to test weighting and choosing the rules in a controlled environment.
  • Introduce some rules into my copy of the es->en pair to add ambiguity. This pair will be used for testing on a real language pair.
  • Study the core Apertium code well enough to determine where weighted rule selection should be implemented.

Deliverables

  • CBD1: The toy ru->en pair.
  • CBD2: The modified es->en pair. Victor kindly offered to provide ambiguous rules that he has as a byproduct of his studies.
  • CBD3: A modified version of apertium-transfer that does not apply a rule to an SL chunk if the chunk contains a certain Russian word. This will be one hardcoded word of my choice. I will then translate a large monolingual Russian text with the toy ru->en language pair using both versions of apertium-transfer. The two translations should differ only in the segments that contain the specified word. This will serve as proof that I understand the transfer code, because you only really get to know code when you have to make it do something new.
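
The CBD3 comparison could be automated along the following lines. This is only a minimal sketch, assuming that the source text and both translations are available as aligned lists with one segment per entry. The function name, the sample segments and the chosen word are all hypothetical, and matching on whitespace-separated surface forms is a simplification (inflected forms of the word would not be caught).

```python
def unexpected_differences(source, baseline, modified, word):
    """Return indices of segments that violate the CBD3 expectation.

    A segment is a violation when the two translations differ even though
    the source segment does not contain the hardcoded word.
    """
    if not (len(source) == len(baseline) == len(modified)):
        raise ValueError("all three inputs must have the same number of segments")
    violations = []
    for i, (src, base, mod) in enumerate(zip(source, baseline, modified)):
        # Simplification: exact match on lowercased, whitespace-split tokens.
        contains_word = word in src.lower().split()
        if base != mod and not contains_word:
            violations.append(i)
    return violations


# Hypothetical data: three Russian segments and two English translations.
source = ["я вижу дом", "это кот", "дом большой"]
baseline = ["I see a house", "this is a cat", "the house is big"]
modified = ["I see house", "this is a cat", "house big"]

# Both differing segments contain the word, so nothing is reported.
print(unexpected_differences(source, baseline, modified, "дом"))
```

An empty result means every difference between the two outputs is explained by the hardcoded word; any reported index points to a segment worth inspecting by hand.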

Main Period

Deliverables

  • MCD1: C++ code, integrated into Apertium, that chooses rules based on their weights.
  • MCD2: A standalone training script that computes the weights from a corpus and a set of rules. The design must be straightforward enough that anyone can compute the weights for a given language pair.
  • MCD3: Optimized weights for the es->en pair.
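
The selection step behind MCD1 can be illustrated with a short sketch. The deliverable itself is to be written in C++ inside Apertium; the Python below only shows the intended logic under stated assumptions: several rules match the same pattern, each rule has an identifier, weights are looked up by that identifier, and ties fall back to the order of the rules in the file. The function name, the rule identifiers and the default weight are hypothetical.

```python
def choose_rule(candidates, weights, default_weight=0.0):
    """Pick the highest-weighted rule among rules matching the same pattern.

    candidates: rule ids in the order they appear in the rules file.
    weights: mapping from rule id to weight; rules missing from the
             mapping are treated as having default_weight.
    Ties are resolved in favour of the rule that appears first.
    """
    if not candidates:
        raise ValueError("no candidate rules")
    best = candidates[0]
    for rule_id in candidates[1:]:
        # Strict comparison keeps the earlier rule when weights are equal.
        if weights.get(rule_id, default_weight) > weights.get(best, default_weight):
            best = rule_id
    return best


# Hypothetical weights for two ambiguous rules.
weights = {"adj-noun": 0.2, "noun-adj": 0.7}
print(choose_rule(["adj-noun", "noun-adj"], weights))
```

With no weights at all, the function returns the first candidate, so in this sketch an unweighted pair behaves as if rule order alone decided.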

The order in which the deliverables are implemented was changed at Victor's suggestion:

I think it is safer to change the order of the two main deliverables: we should first produce the C++ code in order to load the weights and choose the rules based on them, and then start working on obtaining these weights automatically. The first objective involves just coding (although it does not mean that it is an easy objective), while the second one is a research problem and carries a certain level of uncertainty: maybe the initially defined strategy does not work well and we need to refine it. By addressing the development task first, it is more likely that we have, at the end of the project, a solid contribution to Apertium.

Week 1 (23rd–29th May)

Design an XML format for the rule weights file and specify it in a DTD or XML Schema. Devise dummy weights for the ru->en pair for testing purposes. Start implementing MCD1.
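
As an illustration of what such a file might look like, here is a minimal sketch with dummy weights and a loader for it. The format has not been designed yet, so the element names (`rule-weights`, `rule`) and attribute names (`id`, `weight`) are purely hypothetical placeholders, as are the rule identifiers and values.

```python
import xml.etree.ElementTree as ET

# Hypothetical weights file: one element per rule, weight as an attribute.
SAMPLE = """\
<rule-weights>
  <rule id="adj-noun" weight="0.2"/>
  <rule id="noun-adj" weight="0.7"/>
</rule-weights>
"""


def load_weights(xml_text):
    """Parse the hypothetical weights format into a dict of id -> weight."""
    root = ET.fromstring(xml_text)
    if root.tag != "rule-weights":
        raise ValueError("unexpected root element: " + root.tag)
    weights = {}
    for rule in root.findall("rule"):
        # A missing attribute raises KeyError, a non-numeric weight ValueError.
        weights[rule.attrib["id"]] = float(rule.attrib["weight"])
    return weights


print(load_weights(SAMPLE))
```

Keeping the weights in a separate file, keyed by rule identifier, would leave the existing rule files untouched; whether that is the right design is one of the questions for this week.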

Week 2 (30th May–5th June)

Work on MCD1 implementation.

Week 3 (6th–12th June)

Work on MCD1 implementation.

Week 4 (13th–19th June)

Work on MCD1 implementation, testing and bug fixing.

Mid-term evaluation

  • MCD1

Week 5 (20th–26th June)

Obtain the weights for the toy pair.

Week 6 (27th June–3rd July)

Implement the pipeline for obtaining the weights.

Week 7 (4th–10th July)

Experiment on obtaining the weights for the real pair.

Week 8 (11th–17th July)

RBMT Summer School in Alacant: working on Apertium User-friendly lexical selection training.

Week 9 (18th–24th July)

RBMT Summer School in Alacant: working on Apertium User-friendly lexical selection training.

Week 10 (25th–31st July)

Experiment on obtaining the weights for the real pair.

Week 11 (1st–7th August)

Experiment on obtaining the weights for the real pair.

Week 12 (8th–14th August)

Experiment on obtaining the weights for the real pair. Testing and bug fixing.

Final week (15th–23rd August)

Final testing and bug fixing.

Final evaluation

  • MCD2
  • MCD3