Difference between revisions of "User:Asfrent/MSc Log"

From Apertium

Revision as of 12:30, 23 July 2014

MSc

Plan, questions, stuff

Ideas

CG style rules

Constraint Grammar uses two types of actions for the rules, SELECT and REMOVE. These actions are applied based on some contextual checks. The two actions will be implemented as follows:

  • SELECT - MIL concept learning. Monadic predicates that check the context around a token and succeed only if the context matches the learned concept (is_noun(State) :- ..., is_sg(State) :- ...). If a certain concept is detected, then all other readings are removed and only the identified concept will be kept.
  • REMOVE - MIL dyadic predicate learning. Dyadic predicates that do state transformation based on the context around a certain token. Each state transformation will remove one or more readings from the token. It is not permitted for a REMOVE rule to eliminate the last reading.

Unambiguous tokens may be passed to the learner, but the rules will not be applied to them (since there is nothing to disambiguate). The two rule types will be applied alternately, with REMOVE eliminating readings so that SELECT will be able to detect concepts. Rules should be as simple as possible, so that learning is fast. One of the challenges will be to provide a good set of examples for learning.
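To make the intended semantics of the two actions concrete, here is a minimal Python sketch. It only models how SELECT and REMOVE would act on a token's readings once a rule has fired; the MIL learning itself and the contextual checks are not modelled. The representation (a reading as a tuple of lemma and tags), the function names and the sample token are all assumptions made for illustration, not the project's actual design.

```python
# Illustrative only: a reading is a tuple (lemma, tag, tag, ...),
# and a token is a list of readings. None of these names come from the project.

def select(readings, tag):
    """SELECT: keep only the readings carrying `tag`; leave the token alone if none match."""
    kept = [r for r in readings if tag in r]
    return kept if kept else readings

def remove(readings, tag):
    """REMOVE: drop the readings carrying `tag`, but never eliminate the last reading."""
    kept = [r for r in readings if tag not in r]
    return kept if kept else readings

def apply_rule(action, readings, tag, context_ok):
    """Apply a rule only to ambiguous tokens whose (externally computed) context check succeeded."""
    if len(readings) < 2 or not context_ok:
        return readings
    return action(readings, tag)

# Made-up ambiguous token with three readings.
token = [("book", "n", "sg"), ("book", "vblex", "inf"), ("book", "vblex", "pres")]

# REMOVE first narrows the token, then SELECT picks the detected concept.
after_remove = apply_rule(remove, token, "inf", context_ok=True)
after_select = apply_rule(select, after_remove, "n", context_ok=True)
```

In this sketch the "never remove the last reading" constraint is enforced inside `remove`, so a badly learned rule can at worst leave a token unchanged.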

Brill tagger style rules

Is it possible?

Short term plan / Pendings

Pending Notes
Write a DCG for the apertium stream format.
Research UTF and Prolog.
Write a simple PoS disambiguator that keeps only the first reading. A random disambiguator would also be useful.
Set up a repository for the project.
  • Sourceforge account, email to Francis.
  • Commit stuff in apertium-branches.
Check licensing of MIL code. No answer by email. :-(
Design internal representation of the input data.
Design rules.
Implement basic predicates.
Learn rules using MIL.
Write a Python script that aligns and tests two outputs (hand-tagged vs disambiguated). At the moment the script outputs only counts and accuracy; it should be extended to compute per-class statistics.

Questions

Log

11.07.2014

  • Read ILP paper from Francis.
  • Got MIL code, did a few tests.
  • Tracked down and downloaded test data from Apertium project for the tagger.
  • Read about tagging, CG and rules.
  • Wrote a Prolog script that reads all the lines from a file.

12.07.2014

  • Started to read the CG docs in order to design the data structures.
  • Did a bit of research on Prolog DCG.
  • Wrote stream tokenizer in Prolog.
  • Wrote token splitter in Prolog.
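For reference, the two steps above (tokenizing the stream, then splitting each token into readings) can be sketched in Python as follows. This is not the Prolog code from the log. It assumes the usual Apertium stream layout `^surface/lemma<tag>.../lemma<tag>...$`, ignores backslash escapes and unknown-word markers, and uses a made-up sample sentence; the function names are hypothetical.

```python
import re

# One lexical unit is everything between '^' and the next '$'.
# Assumption: no escaped '^', '$' or '/' characters occur in the input.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")
TAG = re.compile(r"<([^>]+)>")

def split_reading(reading):
    """Split 'book<n><pl>' into ('book', ['n', 'pl'])."""
    lemma = reading.split("<", 1)[0]
    return lemma, TAG.findall(reading)

def tokenize(stream):
    """Return a list of (surface_form, [(lemma, tags), ...]) pairs."""
    tokens = []
    for unit in LEXICAL_UNIT.findall(stream):
        surface, *readings = unit.split("/")
        tokens.append((surface, [split_reading(r) for r in readings]))
    return tokens

# Made-up sample input, not taken from the test data.
sample = "^the/the<det><def><sp>$ ^books/book<n><pl>/book<vblex><pri><p3><sg>$"
parsed = tokenize(sample)
```

A token is ambiguous in this representation when its list of readings has more than one element.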

13.07.2014

  • Implemented and tested output writer.
  • Implemented a trivial disambiguator that always selects the first reading. The number of mismatches against the hand-tagged version dropped from 107 to 45 (this was mainly a sanity check that the two files are actually aligned and that the output matches somewhat better).
  • Designed the internal structure of the data. We will keep the initial split tokens in lists and remove tags from these lists. Each one of the lists will have a metadata slot allocated somewhere (probably in a metadata list). I should research hashtables, trees, or whatever fast lookup data structure Prolog might have.
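The first-reading disambiguator, together with the random variant listed under Pendings, can be sketched in a few lines of Python. This is an illustration, not the Prolog implementation: it assumes tokens are already available as (surface form, list of readings) pairs, and the function names and sample data are invented.

```python
import random

def disambiguate_first(tokens):
    """Keep only the first reading of every token."""
    return [(surface, readings[:1]) for surface, readings in tokens]

def disambiguate_random(tokens, seed=0):
    """Keep one reading per token, chosen uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return [(surface, [rng.choice(readings)] if readings else [])
            for surface, readings in tokens]

# Made-up sample: one unambiguous and one ambiguous token.
tokens = [("the", ["the<det>"]), ("books", ["book<n><pl>", "book<vblex><pri>"])]
first = disambiguate_first(tokens)
rand = disambiguate_random(tokens)
```

Both functions leave unambiguous tokens unchanged, so any difference in accuracy between them comes only from the ambiguous tokens.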

14.07.2014

  • Wrote the initial version of a diff script that will analyse the differences between two tagged versions.
  • Set up an SF account for the apertium svn repo.

23.07.2014

  • Back in London from a two-week holiday in Romania.
  • Implemented the baseline disambiguator in Python.
 $ ./train.py ../testing/ambiguous/atoms1.ambiguous.txt ../testing/handtagged/atoms1.handtagged.andrei.txt > model
 $ ./predict.py ../testing/ambiguous/atoms1.ambiguous.txt model > prediction
  • Implemented utils/eval.py. It currently does a simple evaluation of disambiguation accuracy and overall correctness:
 $ ./eval.py ../testing/ambiguous/atoms1.ambiguous.txt ../baseline/prediction ../testing/handtagged/atoms1.handtagged.andrei.txt
 token_count: 372
 ambiguous_count: 107
 correctly_disambiguated: 72
 correct_count: 337
 disambiguation_accuracy: 67.29%
 overall_accuracy: 90.59%
  • Stats for disambiguator_first.pl (chooses the first of the readings in ambiguous tokens):
 token_count: 372
 ambiguous_count: 107
 correctly_disambiguated: 62
 correct_count: 327
 disambiguation_accuracy: 57.94%
 overall_accuracy: 87.90%
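The figures reported above are related in a simple way: every unambiguous token counts as correct, so correct_count = token_count - ambiguous_count + correctly_disambiguated, and the two accuracies are ratios over ambiguous tokens and over all tokens respectively. The sketch below recomputes the same fields in Python. It is not the code of utils/eval.py: the function names and the input representation (three aligned lists) are assumptions.

```python
def percentage(part, total):
    """Return part / total as a percentage, or 0.0 when total is zero."""
    return 100.0 * part / total if total else 0.0

def evaluate(ambiguous, predicted, gold):
    """Compare predictions with hand-tagged data.

    ambiguous: list of reading lists, one per token (before disambiguation)
    predicted: list with the reading chosen by the disambiguator for each token
    gold:      list with the hand-tagged reading for each token
    All three lists are assumed to be aligned token by token.
    """
    stats = {"token_count": len(gold), "ambiguous_count": 0,
             "correctly_disambiguated": 0, "correct_count": 0}
    for readings, pred, ref in zip(ambiguous, predicted, gold):
        is_ambiguous = len(readings) > 1
        stats["ambiguous_count"] += is_ambiguous
        if pred == ref:
            stats["correct_count"] += 1
            stats["correctly_disambiguated"] += is_ambiguous
    stats["disambiguation_accuracy"] = percentage(
        stats["correctly_disambiguated"], stats["ambiguous_count"])
    stats["overall_accuracy"] = percentage(
        stats["correct_count"], stats["token_count"])
    return stats

# Tiny made-up example: two ambiguous tokens, one of them resolved correctly.
example = evaluate(
    ambiguous=[["a<det>"], ["b<n>", "b<vblex>"], ["c<n>", "c<adj>"]],
    predicted=["a<det>", "b<n>", "c<n>"],
    gold=["a<det>", "b<n>", "c<adj>"],
)
```

Plugging the reported counts into `percentage` reproduces the percentages in the log, for example `percentage(72, 107)` for the baseline and `percentage(62, 107)` for disambiguator_first.pl.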