Tagger training

En français

Once your dictionaries are of a reasonable size, say perhaps 3,000 lemmata in total, it is worth training the tagger. To do this you'll need two things: a decent-sized corpus, either tagged or untagged, and a .tsx file. The basic instructions may be found below.

Creating a corpus

Wikipedia

A basic corpus can be retrieved from a Wikipedia dump (see http://download.wikimedia.org/backup-index.html) as follows:

$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | sed 's/$/\n/g' | sed 's/\[\[.*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' | sed 's/&.*;/ /g' > mycorpus.txt
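
It is worth sanity-checking the result before going further; for unsupervised training you want hundreds of thousands of words. A quick look, assuming the file name used above:

$ wc -l -w mycorpus.txt    # number of lines (roughly sentences) and words
$ head -n 5 mycorpus.txt   # eyeball a few lines for leftover wiki markup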

Another option for stripping Wikipedia, which will probably result in a higher-quality corpus, is as follows. First download the Wikipedia extractor script from http://wiki.apertium.org/wiki/Wikipedia_Extractor, then:

$ bzcat enwiki-20081008-pages-articles.xml.bz2.part > enwiki.xml

Now use the script above to get enwiki.txt, and then:

$ cat enwiki-20091001-pages-articles.txt | grep -v "''" | grep -v http | grep -v "#" | grep -v "@" |\
grep -e '................................................' | sort -fiu | sort -R | nl -s ". " > enwiki.crp.txt

The last three commands are not strictly necessary: they sort the lines keeping only unique ones (case-insensitively), shuffle them randomly (mixing the sentences) and add line numbers. The grep -e pattern made of a long run of dots simply discards lines shorter than that many characters.
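
As a toy illustration of those last stages (not part of the recipe; sort -R assumes GNU coreutils, and the sample file name is made up):

$ printf 'The cat sat.\nthe cat sat.\nA dog ran.\n' > sample.txt
$ sort -fiu sample.txt                           # case-insensitive unique sort: the duplicate sentence is dropped
$ sort -fiu sample.txt | sort -R | nl -s ". "    # shuffle the remaining lines and number them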

Other sources

Some pre-processed corpora can be found here and here.

Writing a TSX file

See also: TSX format

A .tsx file is a tag definition file: it turns the fine tags from the morphological analyser into coarse tags for the tagger. The DTD is in tagger.dtd, although it is probably easier to take a look at one of the pre-written ones in other language pairs.

The file should be in the language pair directory and be called, for example in English-Afrikaans, apertium-en-af.en.tsx for the English tagger and apertium-en-af.af.tsx for the Afrikaans tagger.

The TSX file defines a set of "coarse tags" for groups of "fine tags"; this is done because the POS tagging module does not need as much information as is encoded in the fine tags. It also allows the user to apply a set of restrictions or enforcements, for example to forbid a relative adverb at the start of a sentence (SENT RELADV), or to forbid a pronoun after a noun (NOM PRNPERS).

You can also write lexical rules. For example, in Afrikaans the word "deur" is ambiguous: one meaning is "by" (as a preposition) and the other is "door" (as a noun). So we can define two coarse tags, DEURNOM and DEURPR, and then a forbid rule to say "forbid 'door' before 'the'".

It is worth considering this file carefully and probably also consulting with a linguist, as the tagger can make a big difference to the quality of the final translation. The example below gives the basic structure of the file:

<?xml version="1.0" encoding="UTF-8"?>
<tagger name="afrikaans">
  <tagset>
    <def-label name="DEURNOM" closed="true">
      <tags-item lemma="deur" tags="n.*"/>
    </def-label> 
    <def-label name="DEURPR" closed="true">
      <tags-item lemma="deur" tags="pr"/>
    </def-label>     
    <def-label name="NOM">
      <tags-item tags="n.*"/>
    </def-label> 
    <def-label name="PRNPERS" closed="true">
      <tags-item tags="prpers.*"/>
    </def-label> 
    <def-label name="DET" closed="true">
      <tags-item tags="det.*"/>
    </def-label> 
  </tagset>
  <forbid>
    <label-sequence>
      <label-item label="NOM"/>
      <label-item label="PRNPERS"/>
    </label-sequence>
    <label-sequence>
      <label-item label="DEURNOM"/>
      <label-item label="DET"/>
    </label-sequence>
  </forbid>
</tagger>

You will need enough coarse tags to cover all the fine tags in your dictionaries.
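
Two quick checks help here (a sketch using the English-Afrikaans file names from above): validate the file against the DTD, and list the distinct fine-tag strings produced by the dictionary so you can compare them against your <tags-item> patterns by eye.

$ apertium-validate-tagger apertium-en-af.en.tsx
$ # rough list of fine-tag strings in the expanded dictionary (assumed file names)
$ lt-expand apertium-en-af.en.dix | grep -v "__REGEXP__" | sed 's/[^<]*//' | sort -u | less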

Training the tagger

A brief note on the various kinds of training that you can do:

  • Unsupervised — This uses a large (hundreds of thousands of words) untagged corpus and the iterative Baum-Welch algorithm in a wholly unsupervised manner. This is the least effective way of training the tagger, but is also the cheapest in terms of time and resources.
  • Supervised — This uses a medium-sized (minimum 30,000 words) tagged corpus.
  • Using apertium-tagger-trainer — This uses a large untagged corpus in the target language, a previously trained .prob file and an existing translator. It performs as well as supervised training without the need of hand-tagging a corpus, at the expense of being a bit tricky to set up.

At the moment apertium-tagger-trainer only works with apertium 1, so it's not an option for most pairs.--Jacob Nordfalk 06:15, 17 September 2008 (UTC) (Clarification: it only works with one-stage transfer, so Apertium 3 pairs which only have t1x can still use it.)

Unsupervised

Main article: Unsupervised tagger training
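
For orientation, the core of the unsupervised recipe looks roughly like this (a sketch with the English-Afrikaans file names used above; the *-unsupervised.make file of an existing pair, and the page linked above, have the authoritative version):

$ apertium-validate-dictionary apertium-en-af.en.dix
$ apertium-validate-tagger apertium-en-af.en.tsx
$ # expand the dictionary and filter it into the tagger's .dic file (assumed file names)
$ lt-expand apertium-en-af.en.dix | grep -v "__REGEXP__" | grep -v ":<:" |\
  awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' | apertium-destxt > en.dic.expanded
$ lt-proc -a en-af.automorf.bin < en.dic.expanded |\
  apertium-filter-ambiguity apertium-en-af.en.tsx > en-tagger-data/en.dic
$ # analyse the raw corpus
$ apertium-destxt < en-tagger-data/en.crp.txt | lt-proc en-af.automorf.bin > en-tagger-data/en.crp
$ # Baum-Welch training, here 8 iterations, producing the probability file
$ apertium-tagger -t 8 en-tagger-data/en.dic en-tagger-data/en.crp apertium-en-af.en.tsx en-af.prob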

Supervised

Main article: Supervised tagger training
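
For comparison only, the supervised run uses the same files plus a hand-tagged version of the corpus. The sketch below assumes the English-Afrikaans names used above and the classic Makefile-style invocation; check apertium-tagger --help and the page linked above for the exact arguments:

$ # assumed file names; en.tagged is the hand-tagged corpus, en.untagged its untagged counterpart
$ apertium-tagger -s 0 \
    en-tagger-data/en.dic \
    en-tagger-data/en.crp \
    apertium-en-af.en.tsx \
    en-af.prob \
    en-tagger-data/en.tagged \
    en-tagger-data/en.untagged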

Target language tagger training

Main article: Target language tagger training

There is a package called apertium-tagger-training-tools that trains taggers based on both source and target language information. The resulting probability files are as good as supervised training for machine translation purposes, but much quicker to produce, and with less effort.

See also

  • Apertium and Constraint Grammar

Further reading

  • Felipe Sánchez-Martínez, Juan Antonio Pérez-Ortiz, Mikel L. Forcada (2008). "Using target-language information to train part-of-speech taggers for machine translation". Machine Translation, volume 22, numbers 1-2, pp. 29-66. http://www.springerlink.com/content/m452802q3536044v/?p=61e26194c87e4a5780c77303b3210210&pi=2
  • Felipe Sánchez-Martínez (2008). "Using unsupervised corpus-based methods to build rule-based machine translation systems". PhD thesis, Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, Spain. http://www.dlsi.ua.es/~fsanchez/pub/thesis/thesis-sin.pdf
  • Felipe Sánchez-Martínez, Carme Armentano-Oller, Juan Antonio Pérez-Ortiz, Mikel L. Forcada (2007). "Training part-of-speech taggers to build machine translation systems for less-resourced language pairs". Procesamiento del Lenguaje Natural nº 39 (XXIII Congreso de la Sociedad Española de Procesamiento del Lenguaje Natural), pp. 257-264. http://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez07b.pdf
  • Felipe Sánchez-Martínez, Juan Antonio Pérez-Ortiz, Mikel L. Forcada (2004). "Cooperative unsupervised training of the part-of-speech taggers in a bidirectional machine translation system". Proceedings of TMI, The Tenth Conference on Theoretical and Methodological Issues in Machine Translation, pp. 135-144. http://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez04a.pdf