
Helsinki Apertium Workshop/Session 7


Now that all of the basic aspects of creating a new MT system in Apertium have been covered, we come to the final, and possibly most important one. This session will cover the question of why we need data consistency, what we mean by quality and how to perform an evaluation. The practical will involve working with some of the methods that we use to assure consistency and quality in Apertium. It will also cover quality evaluation.

Theory

Consistency

Self-contained system

In contrast to many other types of natural language processing systems, such as morphological analysers and part-of-speech taggers, a machine translation system designed and developed with Apertium is self-contained. For any input, it should produce one predictable, deterministic output.

Put another way, every lexical unit in the source language morphological analyser should have a corresponding entry in the transfer lexicon, and subsequently an entry in the morphological generator for the target language. Lexical units added in the transfer stage should also have entries in the target language generator. This must hold at the level of both lemmas and tags: for each source language lexical form, there should be a corresponding entry in the bilingual dictionary.
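For illustration, a bilingual dictionary (bidix) entry in Apertium's XML format pairs a source lemma and its tags with the target equivalents. The lemmas below are taken from the Turkish-Chuvash examples in this session; any real pair's entry may differ in detail:

```xml
<!-- Turkish "kadın" (woman) paired with Chuvash "арӑм".
     The <s n="n"/> tag must match what the source analyser
     outputs and what the target generator expects. -->
<e><p><l>kadın<s n="n"/></l><r>арӑм<s n="n"/></r></p></e>
```

If either side of such an entry is missing, or its tags disagree with the monolingual dictionaries, the translator emits a diagnostic symbol instead of a translation.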

When this is not the case, things go wrong: you get diagnostic symbols (@, #) in your output, and your translation looks like it has been dragged through a hedge backwards. This is the principal difference between the statuses of language pairs. We can see this clearly if we compare a pair with trunk status (nearly always used for released pairs which have been quality controlled, that is, they do not produce diagnostic symbols) and a pair with nursery status (for pairs which have only undergone basic development and are pending quality control).

Inconsistency

This is demonstrated if we try to translate the Turkish sentences below with two different translators in Apertium:

Original (Turkish): 22 yaşındaki Çavuş Sandra Radovanoviç, Sırp Ordusu'nun Super Galep G4'ünü uçuran ilk kadın oldu.
Trunk (Turkish → Kyrgyz): 22 жашындагы Сержант Сандра Радованович, Серб Аскеринин *Super Galep Г4үндү учкан биринчи аял болду.
Incubator (Turkish → Chuvash): 22 @yaş @Çavuş @Sandra @Radovanoviç, @Sırp Ordusu *Super Galep *G4'#чап @uç #пӗрремӗш #арӑм #пул.

Original (Turkish): Yaşıtları daha araba kullanmayı yeni öğrenirken, Radovanoviç bir savaş uçağına 4.000m irtifada, 700km/sa hızla manevra yaptırıyor.
Trunk (Turkish → Kyrgyz): Теңтуштары дагы араба колдонууну эми үйрөнөт, Радованович бир согуш учагына *4.000m бийиктикте, *700km/*sa ылдамдык менен манёвр кылдырат.
Incubator (Turkish → Chuvash): #Тантӑш #ӗнтӗ #ӑйӑ @kullan @yeni #вӗрен @0, @Radovanoviç пӗр #вӑрҫӑ @uçak #4.000 #ҫӳл, *700km/*sa #хӑвӑртлӑх @manevra @yap.

Original (Turkish): Radovanoviç, Sırbistan'ın ilk kadın pilotu olarak tarihe geçti.
Trunk (Turkish → Kyrgyz): Радованович, Сербиянын биринчи аял пилотту боло тарыхка өттү.
Incubator (Turkish → Chuvash): @Radovanoviç, @Sırbistan #пӗрремӗш #арӑм #лётчик #пул #хисеп @geç.

Original (Turkish): Pilot, Belgrad'daki harp okulu Havacılık Okulu'nda son sınıf öğrencisi olarak okuyor.
Trunk (Turkish → Kyrgyz): Пилот, Белграддагы согуш мектеби Абаачылык Мектеби'*nda акыркы класс окуучусу боло окуйт.
Incubator (Turkish → Chuvash): #Лётчик, @Belgrad @harp шкулӗ #Авиаци Шкулӗ'*nda @son #курс #студент #пул @oku.

Original (Turkish): Üç yıl önce, hayatında ilk defa bir uçağa bindi.
Trunk (Turkish → Kyrgyz): Үч жыл мурда, жашоосунда биринчи жолу бир учакка минди.
Incubator (Turkish → Chuvash): #Виҫ #ҫул #ӗлӗк, #чӗрлӗх #пӗрремӗш #рас пӗр @uçak @bin.

The diagnostic symbols @ and # mark errors in lexical transfer and in morphological generation, respectively.

  • A missing lemma in the bilingual dictionary will produce an @ followed by the missing lemma. For example, in @Belgrad, the word Belgrad does not appear in the bilingual dictionary.
  • If the lemma exists in the bilingual dictionary, but not with the given part-of-speech tag, an @ will also be produced. For example, in @uç, the word only appears with the tag <n>, but the tagger chooses <v>.
  • If the lemma and the part-of-speech tag exist in the bilingual dictionary, but some other tags do not match, an @ will be produced.
  • If a lemma does not exist in the target language dictionary, then a # will be produced. For example: kadın → #арӑм.
  • If a lemma exists in the target language morphological dictionary, but there is a mismatch in the morphological tags between the output of transfer and the morphological dictionary, then a # will be produced. For example: ol → #пул; the target dictionary does not have the form <past><p3><sg> for the lemma пул.
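Since the diagnostic symbols are literal characters in the output stream, a rough tally can be scripted. This is a simplified illustration in Python, not part of the Apertium toolchain:

```python
def diagnostic_counts(translation):
    """Count tokens flagged as lexical-transfer errors (@),
    generation errors (#) or unknown words (*)."""
    counts = {"@": 0, "#": 0, "*": 0}
    for token in translation.split():
        for flag in counts:
            if flag in token:
                counts[flag] += 1
    return counts

# The Chuvash output of the last sentence in the table above.
sample = "#Виҫ #ҫул #ӗлӗк, #чӗрлӗх #пӗрремӗш #рас пӗр @uçak @bin."
print(diagnostic_counts(sample))  # {'@': 2, '#': 6, '*': 0}
```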

In principle, before any translator is released, a full test of the dictionaries must be performed (colloquially called a testvoc, from "test vocabulary"), and no diagnostic symbols should be present in the translation. In practice, some errors may remain.

One of the reasons that Apertium avoids modelling unrestricted derivational processes is that they may not be equivalent in both languages. If they are not equivalent, and the transfer rules are not in place, then debugging and testing the translator is much more difficult. Also, if the morphological transducers are cyclic (allow unrestricted derivation), it is impossible to perform a vocabulary test.

Quality

Quality—you know what it is, yet you don’t know what it is. But that’s self-contradictory. But some things are better than others, that is, they have more quality. But when you try to say what the quality is, apart from the things that have it, it all goes poof! There’s nothing to talk about. But if you can’t say what Quality is, how do you know what it is, or how do you know that it even exists? If no one knows what it is, then for all practical purposes it doesn’t exist at all. But for all practical purposes it really does exist. What else are the grades based on? Why else would people pay fortunes for some things and throw others in the trash pile? Obviously some things are better than others—but what’s the "betterness"? -- So round and round you go, spinning mental wheels and nowhere finding anyplace to get traction. What the hell is Quality? What is it? — Zen and the Art of Motorcycle Maintenance

System quality

While quality might be a difficult thing to define or quantify, we can sidestep this philosophical question by assessing it in terms of concrete questions. For example:

  • How much of the two languages is covered by the dictionaries of the system in the desired domain?
  • Is the morphological disambiguation performed well enough both to choose the right translations of words and to make the transfer rules effective?
  • Are there inconsistencies in the dictionaries leading to diagnostic symbols?
  • Is the system laid out in a way that makes it easy, or at least feasible, to modify?

Translation quality

What does it mean to have "good quality" translations? It really depends on what the system is going to be used for, and comes down to the following:

  • For producing draft translations:
    • According to the person performing the linguistic revision, is it quicker and more efficient to post-edit the draft translations produced by the system than to translate from scratch?
  • For producing gisting translations:
    • Does the system produce translations which are sufficiently intelligible to make human translation unnecessary in some cases for the particular task at hand?

Evaluation

Vocabulary coverage

The coverage of a system is an indication of how much of the vocabulary it covers in a given corpus or domain. For an idea of what this means, we will try translating a sentence from Turkish to Bashkir with different levels of coverage:

Original (Turkish): Ahmet çabukça eski büyük bir ağaca koşuyor, arkasına Ana'dan saklanıyor.

 10%: Ahmet çabukça eski büyük бер ağaca koşuyor, arkasına Ana'dan saklanıyor.
 20%: Ahmet çabukça eski ҙур бер ağaca koşuyor, arkasına Ana'dan saklanıyor.
 30%: Ahmet çabukça иҫке ҙур бер ağaca koşuyor, arkasına Ana'dan saklanıyor.
 40%: Ahmet çabukça иҫке ҙур бер ağaca koşuyor, arkasına Ананан saklanıyor.
 50%: Әхмәт çabukça иҫке ҙур бер ağaca koşuyor, arkasına Ананан saklanıyor.
 60%: Әхмәт çabukça иҫке ҙур бер ağaca koşuyor, arkasına Ананан йәшеренә.
 70%: Әхмәт çabukça иҫке ҙур бер ağaca koşuyor, артына Ананан йәшеренә.
 80%: Әхмәт çabukça иҫке ҙур бер ağaca сабырға, артына Ананан йәшеренә.
 90%: Әхмәт çabukça иҫке ҙур бер ағасҡа сабырға, артына Ананан йәшеренә.
100%: Әхмәт тиҙ иҫке ҙур бер ағасҡа сабырға, артына Ананан йәшеренә.

Usually, coverage is given over a set of sentences, or corpus, rather than over a single sentence. In Apertium, the baseline coverage for releasing a new prototype translator is around 80%, i.e. 2 unknown words in 10 for a given corpus. This is not enough to make revision practical, except in the case of closely related languages. However, it is usually enough to produce translations which are intelligible.
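Computed naively, coverage is just the fraction of lexical units the analyser recognised. A minimal sketch, assuming one lexical unit per line in the Apertium stream format, where an unanalysed token carries a '*':

```python
def coverage(units):
    """Fraction of lexical units with at least one analysis;
    a '*' in a unit marks an unknown word."""
    if not units:
        return 0.0
    known = sum(1 for unit in units if "*" not in unit)
    return known / len(units)

# Hypothetical analyser output: 9 of 10 units analysed.
units = ["^бер/бер<num>$"] * 9 + ["^Ahmet/*Ahmet$"]
print(coverage(units))  # 0.9
```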

Error rate

Coverage gives you an idea of how many words you will have to change in the best case, that is, assuming the rest of the translation is correct. A more accurate indication of how many words you will have to change when using the translator is given by the post-edition word error rate (WER). This is given as the percentage of changes (insertions, deletions, substitutions) between a machine-translated sentence and a sentence which has been revised by a human translator.

Taking the example above:

Original (Turkish): Ahmet çabukça eski büyük bir ağaca koşuyor, arkasına Ana'dan saklanıyor.
Machine translation: Әхмәт тиҙ иҫке ҙур бер ағасҡа саба, артына Гөлнаранан йәшеренә.
After 1 substitution: Әхмәт тиҙ иҫке ҙур бер ағасҡа йөгөрә, артына Гөлнаранан йәшеренә. (1/10)
After 2 insertions: Әхмәт тиҙ генә иҫке ҙур бер ағасҡа йөгөрә, уның артына Гөлнаранан йәшеренә. (2/10)
Deletions: none required (0/10)
Revised: Әхмәт тиҙ генә иҫке ҙур бер ағасҡа йөгөрә, уның артына Гөлнаранан йәшеренә. (3/10 = 30% WER)

As with coverage, error rate evaluation is usually carried out on a corpus of sentences, giving an indication of how many words you are likely to have to change in a typical sentence.
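The word error rate can be computed as a word-level Levenshtein distance between the machine translation and the revised (reference) sentence, divided by the number of reference words. A minimal sketch; apertium-eval-translator (used below) implements this evaluation properly:

```python
def word_error_rate(hypothesis, reference):
    """Word-level edit distance (substitutions, insertions,
    deletions) divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # dist[i][j]: edits needed to turn hyp[:i] into ref[:j].
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i
    for j in range(len(ref) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(hyp)][len(ref)] / len(ref)

# Two insertions against a six-word reference: WER = 2/6.
print(round(word_error_rate("the cat sat mat",
                            "the cat sat on the mat"), 2))  # 0.33
```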

When calculated over an appropriate corpus of the target translation domain, the combination of word error rate and coverage can give an idea of the usefulness of a machine translation system for a specific task. Of course, to determine if a system is useful for translators, a more thorough and case-specific evaluation needs to be made.

Practice

Consistency

Testvoc

To get an idea of how to run a testvoc script, and what the output looks like, go to the apertium-tt-ba directory. Make sure the translator is compiled (e.g. type make) and then enter the dev/ subdirectory.

In order to perform the testvoc you need to run the testvoc.sh command:

$ sh testvoc.sh 

dl gen  9 13:54:22 GMT 2012
===============================================
POS	Total	Clean	With @	With #	Clean %
v	117457	 117457	0	0	100
n	74148	 74148	0	0	100
num	7564	 7564	0	0	100
cnjcoo	2487	 2487	0	0	100
prn	954	 954	0	0	100
adj	361	 361	0	0	100
np	62	 62	0	0	100
adv	33	 33	0	0	100
post	11	 11	0	0	100
postadv	4	 4	0	0	100
det	3	 3	0	0	100
guio	2	 2	0	0	100
cm	1	 1	0	0	100
ij	0	 0	0	0	100
===============================================

dl gen  9 14:07:22 GMT 2012
===============================================
POS	Total	Clean	With @	With #	Clean %
v	188860	 188860	0	0	100
n	105840	 105588	252	0	99.76
num	7560	 7560	0	0	100
cnjcoo	1844	 1844	0	0	100
prn	1068	 1068	0	0	100
adj	306	 306	0	0	100
np	96	 96	0	0	100
adv	41	 41	0	0	100
post	10	 10	0	0	100
det	5	 5	0	0	100
postadv	4	 4	0	0	100
guio	2	 2	0	0	100
cm	1	 1	0	0	100
ij	0	 0	0	0	100
===============================================

The whole test will take around 20-25 minutes to run, and will generate around 300-400 MB of output.
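The per-POS tables above are essentially tallies over every expanded form passed through the translator. A simplified Python sketch of the counting step; the line format assumed here ('analysis --> translation') is an illustration, and real testvoc scripts differ between pairs:

```python
import re

def tally(lines):
    """Per-POS counts of clean forms vs forms whose translation
    contains a lexical-transfer (@) or generation (#) error."""
    table = {}
    for line in lines:
        m = re.search(r"<([a-z]+)>", line)  # first tag = part of speech
        if not m:
            continue
        row = table.setdefault(m.group(1), {"total": 0, "@": 0, "#": 0})
        row["total"] += 1
        output = line.split("-->", 1)[-1]
        if "@" in output:
            row["@"] += 1
        elif "#" in output:
            row["#"] += 1
    return table

# Hypothetical expanded forms and their translations.
lines = [
    "^алма<n><nom>$ --> алма",
    "^алма<n><dat>$ --> #алма",
    "^бар<v><inf>$ --> барырга",
]
print(tally(lines))
```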

Evaluation

Coverage

The easiest way to calculate the coverage of your morphological analyser is the following, demonstrated again with the Tatar and Bashkir pair:


$ cat tt.txt | apertium-destxt | hfst-proc tt-ba.automorf.hfst  | apertium-retxt  | sed 's/\$\W*\^/$\n^/g' > tt-cov.txt

$ cat tt-cov.txt | wc -l
384

$ cat tt-cov.txt | grep -v '\*' | wc -l
359

$ calc 359/384
	~0.93489583333333333333

This gives a coverage of around 93.5% for the Tatar morphological analyser on the example text.

Word error rate

Apertium has a tool for calculating the word error rate between a reference translation and a machine translation. The tool is called apertium-eval-translator and can be found in the trunk of the Apertium SVN repository. The objective of this practical is to try it out on the system you have created.

You will need two reference translations. The first will be the "original" text in the target language, created without post-editing. The second will be a post-edited version of the machine-translated text. When you are creating the post-edited version, take care to make only the minimal changes required to produce an adequate translation.

Here is an example for Bashkir to Tatar. Presuming the example text is in a file called ba.txt, run the command:

$ cat ba.txt  | apertium -d . ba-tt > ba-tt.txt

Check that the file has been created properly, then run:

$ apertium-eval-translator -r tt.txt -t ba-tt.txt 
Test file: 'ba-tt.txt'
Reference file 'tt.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 311
Number of words in test: 313
Number of unknown words (marked with a star) in test: 
Percentage of unknown words: 0.00 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 42
Word error rate (WER): 13.42 %
Number of position-independent word errors: 42
Position-independent word error rate (PER): 13.42 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 42
Word Error Rate (WER): 13.42 %
Number of position-independent word errors: 42
Position-independent word error rate (PER): 13.42 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0%

This gives the word error rate, along with some other statistics about the two files. The -r argument gives the reference translation; in this case tt.txt is the file containing the example text in Tatar. The -t argument gives the test text, i.e. the output of the machine translation system.

Now make a copy of the file ba-tt.txt called tt2.txt and edit it so that it becomes an adequate translation, then rerun the above commands, substituting tt.txt with tt2.txt, and compare the results.
