Assimilation Evaluation Toolkit
Revision as of 15:06, 13 March 2015
Project description
This page describes a work in progress[1]. The Assimilation Evaluation Toolkit is a set of programs that generate tasks for human evaluation of machine translation. Each task consists of sentences in the original language, a reference translation with keywords omitted, and the machine translation of these sentences. Tasks also contain an answer key, which is not shown to the evaluator. Tasks may be generated as standalone text files with automated checking, or as XML files for integration into the Appraise evaluation system.
Keyword extraction
Keywords are extracted from the text with the method described in [2]. This method favors longer keywords, which are not suitable for text gapping, so keywords containing more than two words are filtered out. The algorithm requires a list of stopwords. To build it, we use the Apertium POS tagger and treat as stopwords the words carrying any of the following tags (list to be refined): 'pr', 'vbser', 'def', 'ind', 'cnjcoo', 'det', 'rel', 'vaux', 'vbhaver', 'prn', 'itg'.
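The extraction step can be sketched roughly as follows. This is a minimal illustration of the scoring idea from [2] (word score = degree / frequency, phrase score = sum of word scores), not the toolkit's actual code. The input format (a list of `(surface_form, set_of_tags)` pairs), the function name `extract_keywords` and the punctuation tags in `PUNCT_TAGS` are assumptions made for the example.

```python
from collections import defaultdict

# Stopword tags taken from the list above.
STOP_TAGS = {"pr", "vbser", "def", "ind", "cnjcoo", "det", "rel",
             "vaux", "vbhaver", "prn", "itg"}
# Assumption: punctuation tokens carry one of these tags.
PUNCT_TAGS = {"sent", "cm"}


def extract_keywords(tagged_tokens, max_len=2):
    """Return (keyword, score) pairs, best first.

    tagged_tokens: list of (surface_form, set_of_tags) pairs (hypothetical format).
    """
    # 1. Split the token stream into candidate phrases at stopwords and punctuation.
    phrases, current = [], []
    for word, tags in tagged_tokens:
        if tags & STOP_TAGS or tags & PUNCT_TAGS:
            if current:
                phrases.append(tuple(current))
            current = []
        else:
            current.append(word.lower())
    if current:
        phrases.append(tuple(current))

    # 2. Score each word as degree / frequency.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}

    # 3. Score phrases and drop those longer than max_len words.
    scores = {}
    for phrase in phrases:
        if len(phrase) <= max_len:
            scores[" ".join(phrase)] = sum(word_score[w] for w in phrase)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because multi-word phrases accumulate the scores of all their words, longer candidates tend to rank higher, which is why the length filter in step 3 is needed for gapping.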
The toolkit also features a non-keyword gap generation mode, in which words are omitted at random regardless of their significance for the text.
Task generation
Task generation requires four input files: the original text, the machine translation, the reference translation, and a POS-tagged version of the reference translation. After the keywords have been extracted, they are removed from the reference translation. Gap density can be varied; compare the output with gap densities of 30% and 70%:
Corpora in { gap } are large collections of texts enhanced with special markup. They allow linguists to search the texts by various { gap } in order to discover phenomena and patterns in the natural language.
Corpora in { gap } are large collections of { gap } with { gap }. They allow linguists to { gap } the { gap } by various parameters in order to { gap } and { gap } in the natural language.
Gap density can be specified in relation to the number of keywords, or to the total number of words in the text. In addition, the user may adjust gap contents by specifying parts of speech to be removed.
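The two ways of specifying density can be illustrated with a small sketch. It assumes that candidate gap positions are already known and ordered by priority (for example by keyword score). The function name `make_gaps` and the `relative_to` parameter are hypothetical and do not reflect the toolkit's real interface.

```python
def make_gaps(words, candidate_positions, density, relative_to="keywords"):
    """Replace some candidate words with '{ gap }' and return (text, answer_key).

    words: list of tokens of the reference translation.
    candidate_positions: indices of removable words, highest priority first.
    density: fraction between 0 and 1.
    relative_to: "keywords" (share of candidates) or "words" (share of all words).
    """
    base = len(candidate_positions) if relative_to == "keywords" else len(words)
    # Never remove more words than there are candidates.
    n_gaps = min(len(candidate_positions), round(density * base))
    chosen = set(candidate_positions[:n_gaps])

    gapped, key = [], []
    for i, word in enumerate(words):
        if i in chosen:
            gapped.append("{ gap }")
            key.append(word)  # the key keeps answers in text order
        else:
            gapped.append(word)
    return " ".join(gapped), key
```

Restricting gaps to certain parts of speech would then amount to filtering `candidate_positions` before calling the function.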
Optionally, users may choose to display the lemmas of omitted words in the gaps. In this case, the evaluators are required to supply the correct grammatical forms of the given words. This may help to show how well the MT system handles grammar in translation.
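A lemma gap differs from a plain gap only in what is displayed. The sketch below assumes tokens are available as `(surface_form, lemma)` pairs and renders a gap as `{ lemma }`. Both the input format and the rendering are assumptions for illustration.

```python
def gap_with_lemmas(tokens, positions):
    """Show the lemma in each gap; the answer key holds the inflected forms.

    tokens: list of (surface_form, lemma) pairs (hypothetical format).
    positions: indices of the words to be gapped.
    """
    chosen = set(positions)
    shown, key = [], []
    for i, (surface, lemma) in enumerate(tokens):
        if i in chosen:
            shown.append("{ " + lemma + " }")
            key.append(surface)
        else:
            shown.append(surface)
    return " ".join(shown), key
```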
Multiple choice gaps
An additional task generation mode offers multiple choice options for gaps. Each omitted word is assigned a list of similar words for the evaluator to choose from. The choices are picked from the same text by part of speech and grammatical tags: a choice must be the same part of speech as the original word and should share as many grammatical features with it as possible. The approach is described in [3] and [4]. An example of choices generated for the two sentences above (the number of choices can be specified):
special / large / natural, linguists / parameters / patterns, search / make / discover.
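The selection rule described above could look roughly like this. The sketch assumes each word is represented as a `(word, part_of_speech, set_of_grammar_tags)` triple. The function name `pick_choices` and this representation are hypothetical. The correct word is returned first so that the output is deterministic; a real task would presumably shuffle the options before showing them.

```python
def pick_choices(target, pool, n_choices=3):
    """Return the target word plus up to n_choices - 1 distractors.

    target: (word, pos, tags) for the omitted word.
    pool: list of (word, pos, tags) triples taken from the same text.
    """
    word, pos, tags = target
    seen = {word.lower()}
    candidates = []
    for other, other_pos, other_tags in pool:
        # A distractor must have the same part of speech and be a different word.
        if other_pos == pos and other.lower() not in seen:
            seen.add(other.lower())
            candidates.append((len(tags & other_tags), other))
    # Prefer words sharing more grammatical tags; the sort is stable,
    # so ties keep their order of appearance in the text.
    candidates.sort(key=lambda c: -c[0])
    return [word] + [w for _, w in candidates[: n_choices - 1]]
```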
Experiments
As part of the testing process, the toolkit has been used to evaluate the Basque-Spanish, English-Kazakh and Tatar-Russian Apertium language pairs.
The Basque-Spanish experiment features 36 sentence pairs, each over 10 words long, randomly drawn from a parallel corpus [5] of legal texts. The sentences are divided into three groups, and 10, 20 and 30 percent of the words are removed from the Spanish sentences in each group respectively. The removed words belong to the following parts of speech: noun, proper noun, adverb, adjective and lexical verb.
Four groups of evaluators (11 people in total) are asked to fill the gaps in the Spanish sentences with appropriate words. Each group is given 9 sentences with different hints, so that every sentence is evaluated in different modes, but no evaluator sees the same sentence in more than one mode. The hint modes are the following: no assistance, the source Basque sentence, the machine translation of the Basque sentence into Spanish produced by Apertium, and both the source Basque sentence and its machine translation.
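One common way to obtain such a distribution is a Latin square, sketched below. This is only an illustration under the assumption that the 36 sentences are split into 4 blocks of 9 and that each group sees each block in a different mode. The scheme actually used in the experiment may differ, and the name `assign_modes` is hypothetical.

```python
MODES = ["no assistance", "source", "machine translation",
         "source + machine translation"]


def assign_modes(n_sentences=36, n_groups=4, modes=MODES):
    """Map (group, sentence) to a hint mode using a Latin-square rotation.

    Assumes n_sentences is divisible by len(modes) and n_groups == len(modes).
    """
    block_size = n_sentences // len(modes)
    plan = {}
    for group in range(n_groups):
        for sentence in range(n_sentences):
            block = sentence // block_size
            # Rotating by the group index gives each group a different
            # mode for the same block of sentences.
            plan[(group, sentence)] = modes[(group + block) % len(modes)]
    return plan
```

With these defaults, every group sees 9 sentences per mode, every sentence is covered in all four modes across groups, and no group sees a sentence twice.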
Source data and experiment results can be found in the Apertium SVN repository[6]. The results of the evaluation are presented in tables. For tables 2 and 4, the evaluators' answers have been reviewed to determine whether any of the words that did not match the answer key could still be accepted in each question. The list of candidate synonyms has been created automatically from the submitted answers: to be included in the list, a word must have been submitted by two or more evaluators. The list has been reviewed by a native speaker of Spanish, and the relevant synonyms have been counted as correct answers in the results.
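The automatic part of this procedure can be sketched as below. The input format (triples of evaluator, gap and submitted word) and the function name `synonym_candidates` are assumptions. Lowercasing and whitespace stripping are also choices made for the example rather than documented behaviour. The output is only a candidate list, which still needs the manual review described above.

```python
from collections import defaultdict


def synonym_candidates(answers, key, min_evaluators=2):
    """Collect non-key answers that were given by enough distinct evaluators.

    answers: iterable of (evaluator_id, gap_id, word) triples (hypothetical format).
    key: dict mapping gap_id to the correct word.
    """
    supporters = defaultdict(set)
    for evaluator, gap, word in answers:
        word = word.strip().lower()
        if word and word != key[gap].lower():
            # A set ensures each evaluator is counted once per word.
            supporters[(gap, word)].add(evaluator)

    result = defaultdict(list)
    for (gap, word), who in sorted(supporters.items()):
        if len(who) >= min_evaluators:
            result[gap].append(word)
    return dict(result)
```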
A similar set of data is provided for the English-Kazakh evaluations [7]. Nine Kazakh speakers evaluated 36 sentences in 4 different modes, analogous to the Basque-Spanish experiment. A synonym list was built with the help of a native speaker of Kazakh.
Another experiment was conducted for the Tatar-Russian language pair [8]. A total of 28 evaluators participated, each filling in a set of 36 sentences divided into 3 groups with 10, 20 and 30% of words removed, in 4 different assistance modes, as in the Basque-Spanish experiment. The texts for this experiment belong to three domains: casual conversations, legal texts and news extracts. The results were calculated for each set separately, along with the combined results for all of them. A synonym list, built by a native Russian speaker from the evaluators' answers, was also used.
Progress
This section lists project progress according to the GSoC proposal.
Week 1: Created an algorithm for keyword extraction in simple gaps. Made a program that creates a gapped reference translation with an answer key, given the reference translation text and its tagged version.
Week 2: Multiple choice gaps: created an algorithm for finding similar words for multiple choice based on pos and grammar tags. Updated code to create tasks with multiple choice gaps. Added variable gap density. Added random word deletion (without keyword determination). Fixed bugs.
Week 3: Added task generation with lemmas in place of gaps. Added an option to select parts of speech to be removed. Adjusted keyword removal algorithm to calculate scores based on lemmas (thus account for different word forms).
Week 4: Added modules to generate text-based task sets with original text, reference translation with all supported gap types and optional machine translation. Added a module that calculates the number of correct answers from tasks filled out by evaluators.
Week 5: Created the command-line interface for task generation and answer checking.
Weeks 6-7: Integrated gisting evaluation tasks into Appraise. Created command-line interface for XML generation.
Week 8: Modified XML generation with additional options.
Week 9: Added TSV generation script. Fixed bugs. Changed the algorithm to spread gaps evenly across the sentences rather than the whole text.
Week 10: Recruited volunteers for Tatar-Russian evaluations. Added scripts to determine possible synonyms based on evaluators' input and to present evaluation results in TeX format. Hosted the evaluation system on the server.
Week 11: Added scripts to prepare and distribute the tasks automatically given a number of evaluators, number of sentences and evaluation modes. Prepared material for Basque-Spanish evaluations.
Week 12: Collected the results of Basque-Spanish evaluations. Prepared material for and collected results of English-Kazakh and Tatar-Russian evaluations.
References and links
- ↑ Current code on GitHub
- ↑ Rose, Stuart, et al. "Automatic keyword extraction from individual documents." Text Mining (2010): 1-20.
- ↑ Trond Trosterud, Kevin Brubeck Unhammer. Evaluating North Sámi to Norwegian assimilation RBMT. Proceedings of the Third International Workshop on Free/Open-Source Rule-Based Machine Translation (FreeRBMT 2012); 06/2012
- ↑ Jim O'Regan and Mikel L. Forcada (2013) "Peeking through the language barrier: the development of a free/open-source gisting system for Basque to English based on apertium.org". Procesamiento del Lenguaje Natural 51, 15-22.
- ↑ Memorias de traducción del Servicio Oficial de Traductores del IVAP [1]
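- ↑ Basque-Spanish experiment data: https://svn.code.sf.net/p/apertium/svn/branches/papers/2015-eamt-assim/eu-es
- ↑ English-Kazakh experiment data: https://svn.code.sf.net/p/apertium/svn/branches/papers/2015-eamt-assim/eng-kaz
- ↑ Tatar-Russian experiment data: https://svn.code.sf.net/p/apertium/svn/branches/papers/2015-eamt-assim/tat-rus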