User:Asfrent/MSc Log

MSc

Plan, questions, stuff

Ideas

CG-style rules

Constraint Grammar rules use two types of actions, SELECT and REMOVE, which are applied based on contextual checks. The two actions will be implemented as follows:

  • SELECT - MIL concept learning. Monadic predicates that check the context around a token and succeed only if the context matches the learned concept (is_noun(State) :- ..., is_sg(State) :- ...). If a concept is detected, all readings that do not match it are removed and only the identified concept is kept.
  • REMOVE - MIL dyadic predicate learning. Dyadic predicates that perform state transformations based on the context around a token. Each state transformation removes one or more readings from the token. A REMOVE rule is not permitted to eliminate the last remaining reading.

Unambiguous tokens may be passed to the learner, but the rules will not be applied to them (there is nothing to disambiguate). The two rule types will be applied alternately, with REMOVE eliminating readings so that SELECT can detect concepts. Rules should be as simple as possible so that learning stays fast. One of the challenges will be providing a good set of examples for learning.
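
A minimal sketch of how the two action types might look in Prolog. All predicate names and the state layout here are hypothetical (not the final design): a state is assumed to be a context [Pre, Token, Post], a token is [Word|Readings], and a reading is a list of tags.

 % Sketch only: hypothetical names, simplified representation.

 % SELECT-style monadic concept predicate: succeeds if the context suggests
 % the central token is a noun (here: the previous token is an unambiguous
 % determiner).
 is_noun([Pre, _Token, _Post]) :-
     unambiguous(Pre),
     has_tag(Pre, det).

 % REMOVE-style dyadic state transformation: drops one verb reading from the
 % central token, but never the last remaining reading.
 remove_reading([Pre, [Word|Readings], Post], [Pre, [Word|Rest], Post]) :-
     select(Bad, Readings, Rest),
     member(vblex, Bad),
     Rest = [_|_].              % never eliminate the last reading

 unambiguous([_Word, _OnlyReading]).
 has_tag([_Word, Reading|_], Tag) :- member(Tag, Reading).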

Brill-tagger-style rules

Is it possible?

How to get examples for the MIL learner?

The first step is to implement a context miner and some filters. At the moment it is unclear how examples will be fed into the MIL learner.

Short-term plan / pending items

  • Write a DCG for the Apertium stream format.
  • Research UTF support in Prolog.
  • Write a simple PoS disambiguator that keeps only the first reading. A random disambiguator would also be useful.
  • Set up a repository for the project.
    • SourceForge account, email to Francis.
    • Commit work to apertium-branches.
  • Check the licensing of the MIL code. No answer by email yet. :-(
  • Design the internal representation of the input data.
  • Design rules.
  • Implement basic predicates.
  • Learn rules using MIL.
  • Write a Python script that aligns and tests two outputs (hand-tagged vs. disambiguated). At the moment the script outputs only counts and accuracy; it should be extended to compute per-class statistics.

Questions

Log

11.07.2014

  • Read ILP paper from Francis.
  • Got MIL code, did a few tests.
  • Tracked down and downloaded test data for the tagger from the Apertium project.
  • Read about tagging, CG and rules.
  • Wrote a Prolog script that reads all the lines from a file.
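
For reference, one way to read all lines in SWI-Prolog (a minimal sketch; the actual script may differ), returning each line as a list of character codes:

 % Read all lines of a file as code lists (SWI-Prolog, library(readutil)).
 read_lines(File, Lines) :-
     setup_call_cleanup(
         open(File, read, Stream),
         read_all(Stream, Lines),
         close(Stream)).

 read_all(Stream, Lines) :-
     read_line_to_codes(Stream, Line),
     (   Line == end_of_file
     ->  Lines = []
     ;   Lines = [Line|Rest],
         read_all(Stream, Rest)
     ).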

12.07.2014

  • Started reading the CG docs in order to design the data structures.
  • Did a bit of research on Prolog DCG.
  • Wrote stream tokenizer in Prolog.
  • Wrote token splitter in Prolog.
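
A rough sketch of what a DCG over the Apertium stream format (^surface/reading1/.../readingN$) might look like, using SWI-Prolog code lists; the predicate names are made up and delimiter handling is simplified:

 % Parse one Apertium stream token into [Word|Readings], all as code lists.
 token([Word|Readings]) -->
     "^", chunk(Word), readings(Readings), "$".

 readings([R|Rs]) --> "/", chunk(R), readings(Rs).
 readings([])     --> [].

 % A chunk is a non-empty run of characters other than ^, / and $.
 chunk([C|Cs]) --> non_delim(C), chunk(Cs).
 chunk([C])    --> non_delim(C).

 non_delim(C) --> [C], { \+ member(C, `^/$`) }.

Example query: ?- phrase(token(T), `^lines/line<n><pl>/line<vblex><pri><p3><sg>$`).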

13.07.2014

  • Implemented and tested output writer.
  • Implemented a trivial disambiguator that always selects the first reading (a minimal sketch appears below). The number of mismatches against the hand-tagged version dropped from 107 to 45 (this was mostly a sanity check that the two files are actually aligned and now match somewhat better).
  • Designed the internal structure of the data. We will keep the initial split tokens in lists and remove tags from these lists. Each of the lists will have a metadata slot allocated somewhere (probably in a metadata list). I should research hash tables, trees, or whatever fast lookup data structures Prolog might have.
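
The first-reading strategy over this representation, as a minimal sketch (hypothetical predicate name; the committed code may differ):

 % Keep only the first reading of every token; a token is [Word|Readings].
 disambiguate_first([], []).
 disambiguate_first([[Word, First|_]|Tokens], [[Word, First]|Rest]) :-
     disambiguate_first(Tokens, Rest).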

14.07.2014

  • Wrote the initial version of a diff script that will analyse the differences between two tagged versions.
  • Set up an SF account for the Apertium SVN repo.

23.07.2014

  • Back in London from a two-week holiday in Romania.
  • Implemented the baseline disambiguator in Python.
 $ ./train.py ../testing/ambiguous/atoms1.ambiguous.txt ../testing/handtagged/atoms1.handtagged.andrei.txt > model
 $ ./predict.py ../testing/ambiguous/atoms1.ambiguous.txt model > prediction
  • Implemented utils/eval.py. It currently does a simple evaluation of the disambiguation and of the overall accuracy of the part-of-speech tagging. Here are the stats for the baseline method:
 $ ./eval.py ../testing/ambiguous/atoms1.ambiguous.txt ../baseline/prediction ../testing/handtagged/atoms1.handtagged.andrei.txt
 token_count: 372
 ambiguous_count: 107
 correctly_disambiguated: 72
 correct_count: 337
 disambiguation_accuracy: 67.29%
 overall_accuracy: 90.59%
  • Stats for disambiguator_first.pl (which chooses the first reading of ambiguous tokens):
 token_count: 372
 ambiguous_count: 107
 correctly_disambiguated: 62
 correct_count: 327
 disambiguation_accuracy: 57.94%
 overall_accuracy: 87.90%
  • Wrote a basic version of context mining for building examples for the MIL learner. The idea behind the context miner is to parse the ambiguous and hand-tagged files and extract contexts (a few tokens around a central token). These contexts will then be filtered by certain conditions (e.g. all nouns in the hand-tagged file) and transformed into Prolog lists to be fed into the MIL learner (one possible encoding is sketched after the example below). This way, we can isolate and sample some nouns, then build examples and use them to learn the concept of a noun.
  • Implemented the following filters for contexts: ctxs_filter_sample.py, ctxs_filter_ambiguous.py, ctxs_filter_by_tag.py. Example usage (say we want to sample 2 contexts from the atoms file which were ambiguous and have been hand-tagged as plural nouns):
 $ ./ctxs_mine_contexts.py ../testing/ambiguous/atoms1.ambiguous.txt ../testing/handtagged/atoms1.handtagged.andrei.txt |\
   ./ctxs_filter_ambiguous.py |\
   ./ctxs_filter_by_tag.py "<n>" |\
   ./ctxs_filter_by_tag.py "<pl>" |\
   ./ctxs_filter_sample.py 2 |\
   ./ctxs_translate.py
 
 TOKEN: [u'^lines/line<n><pl>/line<vblex><pri><p3><sg>$', u'^lines/line<n><pl>$']
 PRE_CONTEXT:
   AMBIGUOUS : [u'^extra/extra<adj>/extra<n><sg>$', u'^emission/emission<n><sg>$']
   HANDTAGGED: [u'^extra/extra<adj>$', u'^emission/emission<n><sg>$']
 POST_CONTEXT:
   AMBIGUOUS : [u'^,/,<cm>$', u'^but/but<cnjcoo>/but<pr>$']
   HANDTAGGED: [u'^,/,<cm>$', u'^but/but<cnjcoo>$']
 
 TOKEN: [u'^terms/term<n><pl>/term<vblex><pri><p3><sg>$', u'^terms/term<n><pl>$']
 PRE_CONTEXT:
   AMBIGUOUS : [u'^atom/atom<n><sg>$', u'^in/in<pr>$']
   HANDTAGGED: [u'^atom/atom<n><sg>$', u'^in/in<pr>$']
 POST_CONTEXT:
   AMBIGUOUS : [u'^of/of<pr>$', u'^probabilities/probability<n><pl>$']
   HANDTAGGED: [u'^of/of<pr>$', u'^probabilities/probability<n><pl>$']
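
How these contexts will be encoded for the MIL learner is still open; one plausible encoding (purely illustrative, every name below is made up) would turn the first mined context above into a Prolog fact:

 % Illustrative only: the first mined context above as a Prolog fact, with
 % each token as tok(Word, Readings) and each reading as a list of tags.
 example(concept_n,
         ctx([tok(extra,    [[adj], [n,sg]]),
              tok(emission, [[n,sg]])],                    % pre-context
             tok(lines,     [[n,pl], [vblex,pri,p3,sg]]),  % ambiguous token
             [tok(',',      [[cm]]),
              tok(but,      [[cnjcoo], [pr]])])).          % post-context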

24.07.2014

  • Added a DCG for extracting tags from a reading.
  • Started to manually write some rules for the new disambiguator.
  • Metadata add / remove predicates preprocess the stream of tokens so we don't need to make expensive parsing calls at a later point. The current internal format is as follows (copied from a code file):
 % Internal representation without metadata:
 %   Stream -> [Token1, Token2, ...].
 %   Token -> [Word, Reading1, Reading2, ...].
 %   Word, Reading -> string codes (list of ints).
 %
 % Internal representation with metadata:
 %   Stream -> [Token1, Token2, ...].
 %   Token -> [Word, MetaReading1, MetaReading2, ...].
 %   MetaReading -> [Reading, Tags].
 %   Word, Reading -> string codes (list of ints).
  • Short description of how the first version of the disambiguator will work:
 % This disambiguator works by concept identification. Concept identification
 % means the disambiguator tries to identify a certain concept and then removes
 % every reading that does not match the identified concept (e.g. <n> is
 % identified based on the context, then any reading that does not contain <n>
 % is removed).
 % The general workflow is as follows:
 %   * Read the file, tokenize, split and add metadata.
 %   * Process every context in the stream. A context is a list of three
 %     tokens, [Pre, Token, Post].
 %   * Every context is checked for ambiguity. If the analysed token is not
 %     ambiguous, then we move on and leave it as is.
 %   * Ambiguous contexts are passed to a predicate that tries to identify a
 %     concept. This predicate checks against every concept it knows, and a
 %     rule is applied only when exactly one concept holds.
 %   * In the case of the current disambiguator, the rule is always the same:
 %     remove all readings that do not match the identified concept. For
 %     instance, if ^mici/mici<n><pl>/mici<n><sg>/mici<adj>$ has been identified
 %     as <n>, then the last reading is removed, because it does not match the
 %     <n> concept (as the first two do).
 %   * There are two conditions that MUST hold every time a concept is found.
 %     First, the tag associated with the concept must appear among the tags
 %     of the analysed token. This ensures we will not identify an <adv>
 %     concept in the previous example, which would remove all readings. The
 %     second condition is that the specific tag must not appear in all
 %     readings of the token. Indeed, if the tag appears in all the readings,
 %     then the associated rule will remove nothing, so no disambiguation is
 %     done.
 %   * This processing of the stream is repeated until no changes occur, which
 %     means that the list of tokens cannot be further disambiguated.
 %   * Remove metadata, assemble tokens, write result.
  • Fully implemented, with one test concept that selects <n> if the previous token is an unambiguous <adj>.
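
A condensed sketch of the core step described above, assuming the metadata representation from the previous code block (predicate names are made up; the committed code may differ):

 % Disambiguate one context [Pre, Token, Post]; a token is
 % [Word|MetaReadings] and a MetaReading is [Reading, Tags].
 disambiguate_context([Pre, [Word|MetaReadings], Post], [Word|Kept]) :-
     MetaReadings = [_, _|_],           % only ambiguous tokens are touched
     identify_concept([Pre, [Word|MetaReadings], Post], Tag),
     include(reading_has_tag(Tag), MetaReadings, Kept),
     Kept \= [],                        % the tag appears in some reading...
     Kept \= MetaReadings.              % ...but not in all of them

 reading_has_tag(Tag, [_Reading, Tags]) :- member(Tag, Tags).

 % The one test concept: <n> if the previous token is an unambiguous <adj>
 % (matching a two-element token enforces that it has a single reading).
 identify_concept([[_, [_, PreTags]], _, _], n) :-
     member(adj, PreTags).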

26.07.2014

  • Added a directory for auto-generated background knowledge. We will have two types of background knowledge: hand-crafted (predicates for processing a context, etc.) and auto-generated (predicates that define frequent words or tags for use in the MIL learner).
  • After some trouble with the metagol_d code, I have the first automatically learned rule, produced from a not-too-great set of examples:
FINAL HYPOTHESIS FOR EPISODE: concept_n, BOUND: 1

concept_n(A) :- pre_token(A,B), is_adj(B).
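
For reference, the background predicates used by this rule could be defined roughly as follows (hypothetical definitions, not the committed background knowledge; a state is assumed to be a context [Pre, Token, Post] with the metadata format above):

 % Hypothetical background knowledge backing the learned rule above.
 pre_token([Pre, _Token, _Post], Pre).
 is_adj([_Word, [_Reading, Tags]]) :- member(adj, Tags).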

27.07.2014

  • Refactored a bit and committed the apertium.pl file that contains the MIL learner.
  • Utility for listing all tags from a tagged file.

28.07.2014 - 03.08.2014

  • Generated random examples from the ambiguous / handtagged files.
  • Random examples are able to produce rules of the kind above because of their simplicity (all rules have the form concept_X(A) :- PRE/POST_token(A,B), is_Y(B)).
  • After generating enough random rules, the next step is to filter them. To get an idea of how much we can get out of this combination of rule format and training, I implemented a genetic algorithm that selects rule sets. The best set of rules reaches about 67% disambiguation accuracy (91% overall). Note that the tokens left ambiguous by the rules were disambiguated randomly (I also tried disambiguating them with the baseline, with the same result; this means the rule-based disambiguator incorporates all the "knowledge" of the baseline, since applying random or baseline selection after it makes no significant difference).

04.08.2014

  • Made a stub of the MSc thesis in order to present a draft of its structure.
  • Designed the new format of the rules (next steps include implementing the required metarules in the MIL framework):
Disambiguate(C1, C2) :- Test(C1), RemovePart(C1, C2)
Disambiguate(C1, C2) :- Test(C1), SelectPart(C1, C2)
Disambiguate(C1, C2) :- RemovePart(C1, C2), Test(C2)
Disambiguate(C1, C2) :- SelectPart(C1, C2), Test(C2)
Disambiguate(C1, C3) :- Disambiguate1(C1, C2), Disambiguate2(C2, C3)
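
As purely illustrative instances of these formats (none of these are learned rules; all names are made up), a concrete rule set might look like:

 % Made-up instances of the rule shapes above; C1, C2, C3 are contexts.
 disambiguate_a(C1, C2) :- test_prev_is_adj(C1), select_n(C1, C2).
 disambiguate_b(C1, C2) :- remove_vblex(C1, C2), test_next_is_cm(C2).
 disambiguate_c(C1, C3) :- disambiguate_a(C1, C2), disambiguate_b(C2, C3).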

05.08.2014

  • Great day!
  • Spoke to spectie on IRC about MIL / Prolog and came to an agreement on the second type of rules, which will hopefully be learnable by feeding in [semi]random examples. We'll meet again in London :-).
  • Met with Jim, discussed the structure of the thesis and the demo.
  • Refactored the code for the first type of rules. Two things are still missing: the genetic algorithm for rule selection and the acquisition of rules from gen_rnd_hyp.py. Otherwise, everything seems to work fine.
  • Committed everything into the SVN repo.
  • Thesis can be found here.