Unsupervised tagger training

From Apertium
See also: Tagger training

First, make a directory called <lang>-tagger-data. Put your corpus in it under a name like <lang>.crp.txt. Make sure the corpus is in raw text format.
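As a minimal sketch of this step, the helper below creates the directory and copies a raw-text corpus into place. The function name and its argument order are only illustrative, not part of Apertium.

```shell
# Hypothetical helper: create <lang>-tagger-data and copy a raw-text corpus
# into it as <lang>.crp.txt.
# Usage: setup_tagger_data en /path/to/corpus.txt
setup_tagger_data() {
    lang=$1
    corpus=$2
    mkdir -p "${lang}-tagger-data" &&
        cp "$corpus" "${lang}-tagger-data/${lang}.crp.txt"
}
```

For English this leaves the corpus in en-tagger-data/en.crp.txt, which is the path the training output further down reads from.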

Once you have your corpus in there you need a Makefile that specifies how to generate the probability file. You can grab one from another language package. For apertium-en-af I took the Makefile from apertium-en-ca. The file that you need is called en-ca-unsupervised.make.

Copy it into your main language pair directory under an appropriate name, then edit it and change the variables at the top of the file, BASENAME, LANG1, and LANG2. Everything else should be fine.
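The copy-and-edit step could be scripted roughly as follows. This is a sketch that assumes the donor Makefile sets each of the three variables on its own line in the form NAME=value; if your donor file uses another assignment style, edit it by hand instead. The function name is hypothetical.

```shell
# Hypothetical helper: copy a donor *-unsupervised.make and rewrite its
# BASENAME, LANG1 and LANG2 assignments for a new language pair.
# Assumes each variable is assigned on its own line as NAME=value.
# Usage: adapt_makefile en-ca-unsupervised.make en af
adapt_makefile() {
    donor=$1
    l1=$2
    l2=$3
    sed -e "s/^BASENAME[[:space:]]*=.*/BASENAME=apertium-${l1}-${l2}/" \
        -e "s/^LANG1[[:space:]]*=.*/LANG1=${l1}/" \
        -e "s/^LANG2[[:space:]]*=.*/LANG2=${l2}/" \
        "$donor" > "${l1}-${l2}-unsupervised.make"
}
```

All other lines are copied unchanged, in keeping with the advice that everything else should be fine.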

Now run:

$ make -f en-af-unsupervised.make

and wait... you should get some output like:

Generating en-tagger-data/en.dic
This may take some time. Please, take a cup of coffee and come back later.
apertium-validate-dictionary apertium-en-af.en.dix
apertium-validate-tagger apertium-en-af.en.tsx
lt-expand apertium-en-af.en.dix | grep -v "__REGEXP__" | grep -v ":<:" |\
        awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' | apertium-destxt >en.dic.expanded
lt-proc -a en-af.automorf.bin <en.dic.expanded | \
        apertium-filter-ambiguity apertium-en-af.en.tsx > en-tagger-data/en.dic
rm en.dic.expanded;
apertium-destxt < en-tagger-data/en.crp.txt | lt-proc en-af.automorf.bin > en-tagger-data/en.crp
apertium-validate-tagger apertium-en-af.en.tsx
apertium-tagger -t 8 \
                           en-tagger-data/en.dic \
                           en-tagger-data/en.crp \
                           apertium-en-af.en.tsx \
                           en-af.prob;
Calculating ambiguity classes...
Kupiec's initialization of transition and emission probabilities...
Applying forbid and enforce rules...
Training (Baum-Welch)...
Applying forbid and enforce rules...

After this you should have an en-af.prob file, which can be used with the apertium-tagger module.


Some questions and answers about unsupervised tagger training

Q: How big a dictionary do I need?

A: For English and Esperanto we had approximately 13,000 entries, and about half of the training sentences contained an unknown word. With this we got very poor tagger performance. Then we added 7,000 proper nouns, giving 20,000 entries. That made the quality acceptable.

Q: My dix is not big enough, and approximately half of the training sentences have an unknown word. Can't I just grep these sentences away and train on the rest?

A: No. Unknown words get a special category, so you also need adequate representation of unknown words in your training set.
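To see how many unknown words your corpus actually contains, you can count them in the analysed corpus (the <lang>.crp file that lt-proc produces). The sketch below assumes the usual Apertium stream format, where each lexical unit is written as ^surface/analysis$ and an unknown word is analysed as ^surface/*surface$. It counts lexical units, not sentences, and the function name is hypothetical.

```shell
# Print "<unknown> <total>" lexical-unit counts for an analysed corpus.
# Assumes Apertium stream format: ^surface/analysis...$ per lexical unit,
# with unknown words analysed as ^surface/*surface$.
unknown_counts() {
    # grep -o prints one match per line, so wc -l counts the matches;
    # the arithmetic expansion strips any padding that wc adds.
    total=$(( $(grep -o '\^[^$]*\$' "$1" | wc -l) ))
    unknown=$(( $(grep -o '\^[^/$]*/\*[^$]*\$' "$1" | wc -l) ))
    echo "$unknown $total"
}
```

For example, unknown_counts en-tagger-data/en.crp prints the two counts for the English corpus used above. A ratio near zero suggests the training set under-represents unknown words, and a very high ratio suggests the dictionary is still too small.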

Q: In which circumstances can I just copy a tagger .prob file (or a .tsx file) from another project?

A: You must make sure that the symbols are exactly the same. For example, eo-en uses the symbols have<vblex><pres><p3><sg> and es-en uses have<pri><pres><p3><sg>, so a file from one will not work with the other.
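A rough first check is to compare the tag symbols that two dictionaries produce. The sketch below assumes you have saved the lt-expand output of each dictionary to a file. It lists the tags that occur in the first file but not in the second. An empty result in both directions is necessary but not sufficient, since the categories defined in the .tsx file must match as well. The function name is hypothetical.

```shell
# Print the tags that occur in the first expanded dictionary but not in
# the second. Each argument is a file holding lt-expand output.
tags_only_in_first() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # Collect every <tag> once per file, sorted so that comm can compare.
    grep -o '<[^<>]*>' "$1" | sort -u > "$a"
    grep -o '<[^<>]*>' "$2" | sort -u > "$b"
    # Column 1 of comm holds the lines unique to the first file.
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

Run it in both directions, swapping the two arguments, to see the symbols that each side lacks.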

Q: I changed a paradigm which is often used and now a lot of the words that use that paradigm are tagged differently!

A: Yes. You will need to retrain your tagger because the probabilities have changed. For example, if you remove the imperative (which in English has the same form as the infinitive) from a verb paradigm, the tagger will distribute its probability among the remaining possibilities.

Q: Can I make the tagger distinguish between surface forms that are the same in all circumstances?

A: Probably not very well. For example, in English the imperative has the same form as the infinitive. Unless you write some extremely clever TSX rules, the tagger has no chance of distinguishing the two forms and will select between them more or less randomly. Such things are much better detected and handled in transfer.

Q: What does apertium-tagger-apply-new-rules do?

A: It applies the forbid and enforce rules from a new TSX file to an existing .prob file, with no need to retrain. The categories must remain the same. It is a quick solution for small changes; if you modify the TSX file a lot, it is recommended to retrain the tagger.

Q: I was told that taggers work at 99% or more for English. That does not seem to be the case in Apertium. Was it just a tale, or is the Apertium tagger not complex enough?

A: The best tagger works at 99%. Humans generally have 98% agreement and our tagger works at around 93-95%.

Why our English tagger does worse than the best taggers:

  1. the best taggers have many hand-written disambiguation rules
  2. the best HMM taggers use trigrams (we use bigrams -- for speed)
  3. the best taggers use hand-tagged corpora to train with (we use untagged corpora -- for English)

So, to improve the performance, you'd need to either: 1) write better disambiguation rules, 2) adapt the tagger to use trigrams, 3) hand-tag a training corpus -- or convert one that is already tagged.

Q: The tagger takes very little CPU anyway; it is the transfer that is CPU-intensive. So why bother with CPU constraints?

A: The tagger was designed and implemented when we had 1-stage transfer (but you are welcome to re-write the tagger to use trigrams :-)

Improving the tagger performance

Q: My tagger is performing poorly. What can I do?

A: Assuming that your TSX file is OK, the best thing you can do is to add words to your dix so that fewer words (but still some) are unknown. You can also try another corpus.

Q: Can't I just tag a corpus with the tagger, correct the tags in places where it has selected the wrong possibility, and retrain on that file?

A: Yes, you can. This is called supervised training: training on a manually disambiguated corpus. You will need about 25,000 words to obtain good results.

Q: Can I improve my unsupervised training with selected hand-disambiguated examples?

A: With the option --retrain you can run further training iterations that start from the probabilities of a previous training. The categories, and the .tsx file, must be the same.

The expert here is Felipe. He said:

The option --retrain is used to retrain the tagger. In each iteration of Baum-Welch, the probabilities of the Markov model are re-estimated using the probabilities obtained in the previous iteration. With --retrain you tell the tagger to read the probabilities from a file and re-estimate them with the training corpus; in other words, to add one or more iterations. For example, training with 6 iterations and retraining with 2 is equivalent to training with 8 iterations from the beginning (supposing that the same corpus is used, of course).

A way to mix supervised and unsupervised training is to first train in supervised mode with a manually tagged (disambiguated) corpus and afterwards retrain (--retrain) with a bigger untagged corpus.
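The two steps could look roughly like the commands printed by the sketch below. It only prints the commands so that you can review them before running anything. The argument order shown for --supervised and --retrain is an assumption and should be checked against the usage message of your apertium-tagger version. The .tagged and .untagged file names and the function name are hypothetical placeholders.

```shell
# Print (rather than run) the two commands for mixed training.
# Assumed argument order; verify with the apertium-tagger usage message:
#   --supervised N DIC CRP TSX PROB HTAG UNTAG
#   --retrain    N CRP PROB
# The .tagged and .untagged file names are hypothetical placeholders.
# Usage: mixed_training_plan en en-af 0 2
mixed_training_plan() {
    lang=$1
    pair=$2
    sup_iters=$3
    re_iters=$4
    data=${lang}-tagger-data
    # Step 1: supervised training on the hand-tagged corpus.
    echo "apertium-tagger --supervised $sup_iters $data/$lang.dic $data/$lang.untagged apertium-$pair.$lang.tsx $pair.prob $data/$lang.tagged $data/$lang.untagged"
    # Step 2: extra Baum-Welch iterations on the bigger untagged corpus.
    echo "apertium-tagger --retrain $re_iters $data/$lang.crp $pair.prob"
}
```

The second command reads and overwrites the same .prob file, which matches the description of --retrain as adding iterations to an existing model.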