Assimilation Evaluation Toolkit

From Apertium
Revision as of 10:23, 26 May 2014 by Sereni (talk | contribs) (Week 1 summary: keywords and gaps)

Project description

This page describes a work in progress [1]. The Assimilation Evaluation Toolkit is a set of programs that generate tasks for human evaluation of machine translation. Each task consists of sentences in the original language, a reference translation with keywords omitted, and the machine translation of those sentences. Each task also includes an answer key, which is used to check the evaluator's answers and is not shown to the evaluator.

Keyword extraction

Keywords are extracted from the text with the method described in [2]. This method favors longer keywords, which does not suit text gapping, so keywords containing more than two words are filtered out. The algorithm requires a list of stopwords. To build it, we run the Apertium POS tagger and treat as stopwords the words that carry any of the following tags (list to be refined): 'pr', 'vbser', 'def', 'ind', 'cnjcoo', 'det', 'rel', 'vaux', 'vbhaver', 'prn', 'itg'
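The stopword selection and the length filter can be sketched as follows. This is a minimal illustration, not the toolkit's actual code: the function names are hypothetical, the tagged input is assumed to be a stream of units of the form ^surface/lemma<tag1><tag2>$, and only the candidate-splitting step is shown (the scoring step of [2] is left out).

```python
import re

# Tags listed on this page; a word carrying any of them is treated as a stopword.
STOP_TAGS = {"pr", "vbser", "def", "ind", "cnjcoo", "det",
             "rel", "vaux", "vbhaver", "prn", "itg"}

# Assumed shape of one tagged unit: ^surface/lemma<tag1><tag2>$
UNIT = re.compile(r"\^([^/$]+)/([^<$]*)((?:<[^>]+>)*)\$")


def stopwords_from_tagged(tagged):
    """Collect lowercased surface forms whose analysis has a stop tag."""
    stopwords = set()
    for surface, _lemma, tags in UNIT.findall(tagged):
        if STOP_TAGS & set(re.findall(r"<([^>]+)>", tags)):
            stopwords.add(surface.lower())
    return stopwords


def candidate_keywords(text, stopwords, max_words=2):
    """Split text at punctuation and stopwords; keep runs of at most max_words."""
    candidates = []
    for fragment in re.split(r"[.,;:!?()\[\]\"]+", text.lower()):
        phrase = []
        # The trailing None flushes the last phrase of the fragment.
        for word in fragment.split() + [None]:
            if word is None or word in stopwords:
                if 0 < len(phrase) <= max_words:
                    candidates.append(" ".join(phrase))
                phrase = []
            else:
                phrase.append(word)
    return candidates


tagged = ("^The/the<det><def><sp>$ ^corpora/corpus<n><pl>$ "
          "^are/be<vbser><pres>$ ^large/large<adj>$ "
          "^collections/collection<n><pl>$ ^of/of<pr>$ ^texts/text<n><pl>$")
stopwords = stopwords_from_tagged(tagged)
print(candidate_keywords("The corpora are large collections of texts.", stopwords))
```

Deriving the stopword list from tags rather than from a fixed word list means the same procedure could be reused for any language with an Apertium tagger, provided the tag set is adjusted.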

Task generation

Task generation requires four input files: the original text, the machine translation, the reference translation, and a POS-tagged version of the reference translation. After the keywords have been extracted, they are removed from the reference translation. Gap density can be varied; compare the output with gap densities of 30% and 70%:

Corpora in { gap } are large collections of texts enhanced with special markup. They allow linguists to search the texts by various { gap } in order to discover phenomena and patterns in the natural language.

Corpora in { gap } are large collections of { gap } with { gap }. They allow linguists to { gap } the { gap } by various parameters in order to { gap } and { gap } in the natural language.
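The gapping step described above might look roughly like the sketch below. It rests on assumptions that this page does not confirm: the function name make_gaps is hypothetical, density is interpreted as the fraction of extracted keywords to remove, and the keyword list is assumed to be ordered from most to least important. The function returns both the gapped text and the answer key.

```python
import re


def make_gaps(reference, keywords, density, marker="{ gap }"):
    """Replace a share of the keywords in the reference with a gap marker.

    Assumption: density is the fraction of keywords to remove, and
    keywords are ordered by importance, so the first ones are gapped.
    Returns the gapped text and the removed words in order of appearance.
    """
    count = round(len(keywords) * density)
    # Longer keywords first, so a two-word keyword wins over its parts.
    selected = sorted(keywords[:count], key=len, reverse=True)
    if not selected:
        return reference, []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in selected) + r")\b",
        re.IGNORECASE,
    )
    key = []

    def blank(match):
        key.append(match.group(0))  # remember the answer for the key
        return marker

    return pattern.sub(blank, reference), key


reference = "They allow linguists to search the texts by various parameters."
keywords = ["parameters", "texts", "linguists", "search"]
gapped, key = make_gaps(reference, keywords, 0.5)
print(gapped)
print(key)
```

Keeping the removed words in order of appearance gives the answer key the same order as the gaps, which makes checking an evaluator's answers a simple position-by-position comparison.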

References and links

  1. Current code on GitHub
  2. Rose, Stuart, et al. "Automatic keyword extraction from individual documents." Text Mining (2010): 1-20.