MSc
Plan, questions, stuff
Short-term plan / pending items
Pending | Notes
Write a DCG for the Apertium stream format. |
Research UTF and Prolog. |
Write a simple PoS disambiguator that keeps only the first reading. | A random disambiguator would also be useful.
Set up a repository for the project. | SourceForge account, email to Francis. Commit stuff in apertium-branches.
Check licensing of MIL code. | No answer by email. :-(
Design internal representation of the input data. |
Design rules. |
Implement basic predicates. |
Learn rules using MIL. |
Write a Python script that aligns and tests two outputs (hand-tagged vs disambiguated). |
Questions
Log
11.07.2014
- Read ILP paper from Francis.
- Got MIL code, did a few tests.
- Tracked down and downloaded the tagger test data from the Apertium project.
- Read about tagging, CG and rules.
- Wrote a Prolog script that reads all the lines from a file.
12.07.2014
- Started reading the CG docs to inform the design of the data structures.
- Did a bit of research on Prolog DCGs.
- Wrote a stream tokenizer in Prolog.
- Wrote a token splitter in Prolog.
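For reference, the tokenizer and splitter steps can be sketched in Python. This is not the Prolog code mentioned above, only a minimal illustration. It assumes the usual Apertium stream layout, where a lexical unit looks like ^surface/reading1/reading2$ and a backslash escapes special characters. The function names tokenize and split_token are hypothetical.

```python
import re

# One lexical unit: ^ ... $, allowing backslash-escaped characters inside.
LEXICAL_UNIT = re.compile(r"\^((?:\\.|[^\\$])*)\$")


def tokenize(stream):
    """Return the bodies of all ^...$ lexical units in the stream, in order."""
    return [match.group(1) for match in LEXICAL_UNIT.finditer(stream)]


def split_token(body):
    """Split a unit body into (surface_form, [readings]).

    Simplification: a '/' counts as a separator unless the character
    right before it is a backslash.
    """
    parts = re.split(r"(?<!\\)/", body)
    return parts[0], parts[1:]


if __name__ == "__main__":
    line = "^the/the<det><def><sp>$ ^books/book<n><pl>/book<vblex><pri><p3><sg>$"
    for body in tokenize(line):
        print(split_token(body))
```

Text between lexical units (blanks, formatting) is simply skipped here; a real tool would probably need to keep it so the output can be written back unchanged.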
13.07.2014
- Implemented and tested output writer.
- Implemented a trivial disambiguator that always selects the first reading. The number of mismatches against the hand-tagged version dropped from 107 to 45. This was mainly a sanity check that the two files are actually aligned and that the output matches the hand-tagged version somewhat better.
- Designed the internal structure of the data. We will keep the initial split tokens in lists and remove tags from these lists. Each list will have a metadata slot allocated somewhere (probably in a metadata list). I should research hash tables, trees, or whatever fast-lookup data structure Prolog might have.
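A rough Python sketch of the list-based representation and the first-reading disambiguator described above. The real implementation is in Prolog, so everything here is an assumption made for illustration: the token layout [surface, reading1, reading2, ...], the parallel metadata list, the field name initial_readings, and the helper names.

```python
# Each token is a plain list: [surface, reading1, reading2, ...] (assumed layout).
tokens = [
    ["the", "the<det><def><sp>"],
    ["books", "book<n><pl>", "book<vblex><pri><p3><sg>"],
]

# Parallel metadata list: metadata[i] belongs to tokens[i].
# The field name is hypothetical.
metadata = [{"initial_readings": len(token) - 1} for token in tokens]


def keep_first_reading(token):
    """Trivial disambiguator: keep the surface form and the first reading only."""
    return token[:2]


def to_stream(token):
    """Write a token back as a ^surface/reading/...$ lexical unit."""
    return "^" + "/".join(token) + "$"


if __name__ == "__main__":
    print(" ".join(to_stream(keep_first_reading(t)) for t in tokens))
```

Keeping the metadata in a separate list indexed by token position leaves the token lists themselves free to shrink as readings are removed.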
14.07.2014
- Wrote the initial version of a diff script that will analyse the differences between two tagged versions.
- Set up a SourceForge account for the Apertium SVN repo.
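One possible shape for such a diff script, sketched with the standard-library difflib module. This is not the actual script, only an illustration of aligning two tagged streams unit by unit and listing the regions that differ. The names units and diff_units are hypothetical, and the regular expression assumes ^...$ lexical units with backslash escapes.

```python
import difflib
import re

# One lexical unit: ^ ... $, allowing backslash-escaped characters inside.
LEXICAL_UNIT = re.compile(r"\^((?:\\.|[^\\$])*)\$")


def units(stream):
    """Return every ^...$ lexical unit in the stream, delimiters included."""
    return [match.group(0) for match in LEXICAL_UNIT.finditer(stream)]


def diff_units(left, right):
    """Align two streams and return (tag, left_units, right_units)
    for every region where they do not match."""
    a, b = units(left), units(right)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    handtagged = "^the/the<det><def><sp>$ ^books/book<n><pl>$"
    disambiguated = "^the/the<det><def><sp>$ ^books/book<vblex><pri><p3><sg>$"
    for region in diff_units(handtagged, disambiguated):
        print(region)
```

Aligning with SequenceMatcher rather than a plain zip keeps the comparison usable when one file has an extra or missing unit, at the cost of some speed on large files.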
23.07.2014
- Back in London after a two-week holiday in Romania.
- Implemented the baseline disambiguator in Python:
$ ./train.py ../testing/ambiguous/atoms1.ambiguous.txt ../testing/handtagged/atoms1.handtagged.andrei.txt > model
$ ./predict.py ../testing/ambiguous/atoms1.ambiguous.txt model > prediction
- Implemented utils/eval.py. It currently reports two simple measures, disambiguation accuracy and overall accuracy:
$ ./eval.py ../testing/ambiguous/atoms1.ambiguous.txt ../baseline/prediction ../testing/handtagged/atoms1.handtagged.andrei.txt
token_count: 372
ambiguous_count: 107
correctly_disambiguated: 72
correct_count: 337
disambiguation_accuracy: 67.29%
overall_accuracy: 90.59%
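The counters above fit the following reading, sketched in Python. This is an assumption about what the counters mean, not the code of eval.py: a token counts as ambiguous when the input offers more than one reading, and as correct when the predicted reading equals the hand-tagged one. The function name evaluate and its list-based inputs are hypothetical.

```python
def evaluate(ambiguous, predicted, gold):
    """Compute the counters for three token-aligned sequences.

    ambiguous: list of reading lists, one per token (the tagger input)
    predicted: list of chosen readings, one per token
    gold:      list of hand-tagged readings, one per token
    """
    token_count = len(gold)
    ambiguous_count = 0
    correctly_disambiguated = 0
    correct_count = 0
    for readings, chosen, reference in zip(ambiguous, predicted, gold):
        is_ambiguous = len(readings) > 1
        is_correct = chosen == reference
        if is_ambiguous:
            ambiguous_count += 1
        if is_correct:
            correct_count += 1
        if is_ambiguous and is_correct:
            correctly_disambiguated += 1
    return {
        "token_count": token_count,
        "ambiguous_count": ambiguous_count,
        "correctly_disambiguated": correctly_disambiguated,
        "correct_count": correct_count,
        # Percentages; guard against empty input.
        "disambiguation_accuracy": (
            100.0 * correctly_disambiguated / ambiguous_count if ambiguous_count else 0.0
        ),
        "overall_accuracy": 100.0 * correct_count / token_count if token_count else 0.0,
    }
```

With the reported counts, 72 / 107 gives 67.29% and 337 / 372 gives 90.59%. Also, 372 - 107 + 72 = 337, which is consistent with every unambiguous token being counted as correct.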