Unsupervised tagger training
Latest revision as of 08:34, 8 October 2014

In French: Créer un tagueur en mode automatique

See also: Tagger training

First, make a directory called <lang>-tagger-data. Put your corpus there with a name like <lang>.crp.txt. Make sure the corpus is in raw text format.
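The steps above can be sketched in the shell; the language code `en` and the two sample sentences are stand-ins for your real language code and corpus:

```shell
# Create the tagger data directory for a hypothetical language code "en"
mkdir -p en-tagger-data

# The corpus must be raw text; two sample sentences stand in for a real corpus
printf '%s\n' "This is a sentence." "Here is another one." > en-tagger-data/en.crp.txt

# A quick sanity check on the corpus size
wc -l < en-tagger-data/en.crp.txt
```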

Once you have your corpus in there you need a Makefile that specifies how to generate the probability file. You can grab one from another language package. For apertium-en-af I took the Makefile from apertium-en-ca. The file that you need is called en-ca-unsupervised.make.

Copy it into your main language pair directory under an appropriate name, then edit it and change the variables at the top of the file, BASENAME, LANG1, and LANG2. Everything else should be fine.
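For example, the top of a hypothetical en-af-unsupervised.make (adapted from en-ca-unsupervised.make) would be edited to read something like this; the values shown are illustrative for the apertium-en-af pair:

```make
# Illustrative values for the apertium-en-af pair; everything below
# these variables can usually be left as copied.
BASENAME=apertium-en-af
LANG1=en
LANG2=af
```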

Now run:

$ make -f en-af-unsupervised.make

and wait... you should get some output like:

Generating en-tagger-data/en.dic
This may take some time. Please, take a cup of coffee and come back later.
apertium-validate-dictionary apertium-en-af.en.dix
apertium-validate-tagger apertium-en-af.en.tsx
lt-expand apertium-en-af.en.dix | grep -v "__REGEXP__" | grep -v ":<:" |\
        awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' | apertium-destxt >en.dic.expanded
lt-proc -a en-af.automorf.bin <en.dic.expanded | \
        apertium-filter-ambiguity apertium-en-af.en.tsx > en-tagger-data/en.dic
rm en.dic.expanded;
apertium-destxt < en-tagger-data/en.crp.txt | lt-proc en-af.automorf.bin > en-tagger-data/en.crp
apertium-validate-tagger apertium-en-af.en.tsx
apertium-tagger -t 8 \
                           en-tagger-data/en.dic \
                           en-tagger-data/en.crp \
                           apertium-en-af.en.tsx \
                           en-af.prob;
Calculating ambiguity classes...
Kupiec's initialisation of transition and emission probabilities...
Applying forbid and enforce rules...
Training (Baum-Welch)...
Applying forbid and enforce rules...

And after this you should have a en-af.prob file, which can be used with the apertium-tagger module.


Troubleshooting

NAN(3) gamma[230] = -nan alpha[2][230]= 3.59646e-06 beta[1][230] = 0 prob = 0 previous gamma = 0

If you get an error like this, then probably your TSX file does not have any categories which are not marked with closed=true. Make sure you have at least one open category before training.
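For illustration, a TSX fragment with one open and one closed category might look like this; the label names and tag patterns are hypothetical, not taken from any real pair:

```xml
<!-- Open category: no closed="true", so it can also represent unknown words -->
<def-label name="NOM">
  <tags-item tags="n.*"/>
</def-label>

<!-- Closed category: marked closed="true"; at least one category must NOT be -->
<def-label name="PREP" closed="true">
  <tags-item tags="pr"/>
</def-label>
```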

Some questions and answers about unsupervised tagger training

Q: How big a dictionary do I need?

A: For English and Esperanto we had approx 13,000 entries. Approx half of the training sentences had an unknown word. With this we got very poor tagger performance. Then we added 7,000 proper nouns, so we had 20,000 entries. That made the quality acceptable.

Q: My dix is not big enough, and approx half of the training sentences have an unknown word. Can't I just grep these sentences away, and then train on the rest?

A: No. Unknown words get a special category, so you also need some adequate representation of unknown words in your training set.

Q: In which circumstances can I just copy a tagger .prob file (or a .tsx file) from another project?

A: You must make sure that the symbols are exactly the same. For example eo-en uses symbols have<vblex><pres><p3><sg> and es-en uses have<vblex><pri><p3><sg>, so they will not work.

Q: I changed a paradigm which is often used, and now a lot of the words that use that paradigm are tagged differently!

A: Yes. You will need to retrain your tagger because the probabilities have changed. If, for example, you remove the imperative (which in English has the same form as the infinitive) from a verb paradigm, the tagger will distribute the probabilities among the other possibilities.

Q: Can I make the tagger distinguish between surface forms that are the same in all circumstances?

A: Probably not very well. For example, in English the imperative has the same form as the infinitive. Unless you write some extremely clever TSX rules, the tagger has no chance of distinguishing the two forms and will select between them more or less randomly. Such things are much better detected and handled in transfer.

Q: What does apertium-tagger-apply-new-rules do?

A: It applies forbid and enforce rules from a new TSX file to an existing .prob file, with no need to retrain. The categories must remain the same. It is a quick solution for small changes; if you modify the TSX file a lot, it is recommended to retrain the tagger.

Q: I was told that taggers work at 99% or more for English. That seems not to be the case in Apertium. Was it just a tale, or is the Apertium tagger not complex enough?

A: The best taggers work at 99%. Humans generally have 98% agreement, and our tagger works at around 93-95%.

Why our English tagger performs comparatively badly:

  1. the best taggers have many hand-written disambiguation rules
  2. the best HMM taggers use trigrams (we use bigrams -- for speed)
  3. the best taggers use hand-tagged corpora to train with (we use untagged corpora -- for English)

So, to improve the performance, you'd need to either: 1) write better disambiguation rules, 2) adapt the tagger to use trigrams, 3) hand-tag a training corpus -- or convert one that is already tagged.

Q: The tagger takes very little CPU anyway; it's the transfer that is CPU-intensive. So why bother with CPU constraints?

A: The tagger was designed and implemented when we had 1-stage transfer (but you are welcome to re-write the tagger to use trigrams :-)

Improving the tagger performance

Q: My tagger is performing poorly. What can I do?

A: Assuming that your TSX file is OK, the best thing you can do is to add words to your dix so that fewer words (but still some) are unknown. You can also try with another corpus.

Q: Can't I just tag a corpus with the tagger, correct the tags in places where it has selected the wrong possibility, and retrain on that file?

A: Yes you can. This is called supervised training: using a manually disambiguated corpus. You will need about 25,000 words to obtain good results.

Q: Can I improve my unsupervised training with selected by-hand disambiguated examples?

A: You can train for additional iterations, taking the probabilities from the previous training, with the option --retrain. The categories must be the same, and the .tsx file must be the same.

The expert here is Felipe. He said:

The option --retrain is used to retrain the tagger: in each iteration of Baum-Welch, the probabilities of the Markov model are re-estimated using the probabilities obtained in the previous iteration. With --retrain, what you are telling the tagger is to read the probabilities from a file and re-estimate them with the training corpus; in other words, to add one or more iterations. For example, training with 6 iterations and retraining with 2 is equivalent to training with 8 iterations from the beginning (assuming the same corpus, of course).
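As a sketch of that equivalence, reusing the file names from the example output above (the --retrain invocation is an assumption about the command's syntax; check it against apertium-tagger --help before relying on it):

```shell
# Training with 8 iterations from the beginning...
apertium-tagger -t 8 en-tagger-data/en.dic en-tagger-data/en.crp \
                apertium-en-af.en.tsx en-af.prob

# ...is equivalent to training with 6 iterations...
apertium-tagger -t 6 en-tagger-data/en.dic en-tagger-data/en.crp \
                apertium-en-af.en.tsx en-af.prob

# ...and then retraining the resulting .prob file for 2 more on the same corpus.
apertium-tagger -r 2 en-tagger-data/en.crp en-af.prob
```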

A way to mix supervised and unsupervised training is to train supervisedly with a manually tagged (disambiguated) corpus and afterwards re-train (--retrain) with a bigger untagged corpus.