Unigram tagger

apertium-tagger from m5w/apertium[1] supports all the unigram models from "A set of open-source tools for Turkish natural language processing"[2].

Installation

First, install all prerequisites. See "If you want to add language data / do more advanced stuff"[3].

Then, replace <directory> with the directory you'd like to clone m5w/apertium into and clone the repository.

git clone https://github.com/m5w/apertium.git <directory>

Then, configure your environment[4] and finally configure, build, and install[5] m5w/apertium.
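
For an autotools project like this one, the build and install step is typically the sequence below; the linked instructions are authoritative if they differ.

$ cd <directory>
$ ./autogen.sh
$ make
$ make install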

Usage

See apertium-tagger -h.

Training a Model on a Hand-Tagged Corpus

First, get a hand-tagged corpus as one would for all other models.

$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$

Example 2.1.1: handtagged.txt, a Hand-Tagged Corpus for apertium-tagger

Then, replace MODEL with the number of the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use (1, 2, or 3), replace SERIALISED_BASIC_TAGGER with the filename to which you'd like to write the model, and train the tagger.

$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt
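
For example, to train model 1 and write it to a file named model1.ser (an illustrative filename), one would run:

$ apertium-tagger -s 0 -u 1 model1.ser handtagged.txt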

Disambiguation

Either write input to a file or pipe it to the tagger.

$ cat raw.txt
^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$

Example 2.2.1: raw.txt, Input for apertium-tagger

Replace MODEL with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace SERIALISED_BASIC_TAGGER with the file to which you wrote the unigram model, and disambiguate the input.

$ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
^a/a<b>$
^aa/a<b>+a<b>$
$ echo '^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
^a/a<b>$
^aa/a<b>+a<b>$

Unigram Models

See section 5.3 of "A set of open-source tools for Turkish natural language processing"[2].

Model 1

See section 5.3.1 of "A set of open-source tools for Turkish natural language processing."

This model assigns each analysis string a score of

f(analysis string) + 1

with additive smoothing, where f counts the analysis string's occurrences in the training corpus.
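
In other words, a frequency map with add-one smoothing. The following minimal C++ sketch illustrates the idea; the names are illustrative, not apertium-tagger's actual internals.

<pre>
#include <cstddef>
#include <map>
#include <string>

// Model-1 score: the analysis string's corpus frequency plus one
// (additive smoothing), so unknown analysis strings score one.
std::size_t score(const std::map<std::string, std::size_t> &frequency,
                  const std::string &analysis) {
  const auto it = frequency.find(analysis);
  return (it == frequency.end() ? 0 : it->second) + 1;
}
</pre>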

Consider the following corpus.

$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$

Example 3.1.1: handtagged.txt, a Hand-Tagged Corpus for apertium-tagger

Given the lexical unit ^a/a<a>/a<b>/a<c>$, the tagger assigns the analysis string a<a> a score of

f(a<a>) + 1 = 1 + 1 = 2

The tagger then assigns the analysis string a<b> a score of

f(a<b>) + 1 = 2 + 1 = 3

and the unknown analysis string a<c> a score of

f(a<c>) + 1 = 0 + 1 = 1

If ./autogen.sh is passed the option --enable-debug, the tagger prints such calculations to standard error.

$ ./autogen.sh --enable-debug
$ make
$ echo '^a/a<a>/a<b>/a<c>$' | apertium-tagger -gu 1 SERIALISED_BASIC_TAGGER


score("a<a>") ==
  2 ==
  2.000000000000000000
score("a<b>") ==
  3 ==
  3.000000000000000000
score("a<c>") ==
  1 ==
  1.000000000000000000
^a<b>$

Training on Corpora with Ambiguous Lexical Units

Consider the following corpus.

$ cat handtagged.txt
^a/a<a>$
^a/a<a>/a<b>$
^a/a<b>$
^a/a<b>$

Example 3.1.1.1: handtagged.txt, a Hand-Tagged Corpus for apertium-tagger

The probabilities of a<a> and a<b> are each one half for the second, ambiguous lexical unit, ^a/a<a>/a<b>$. However, all unigram models store frequencies as std::size_t, an unsigned integer type, so the tagger can't simply increment each analysis's frequency by one half.

To account for this, the tagger stores the LCM of all lexical units' sizes, where a lexical unit's size is the size of its analysis vector. It initializes this value to one, expecting unambiguous lexical units. The size of this corpus' first lexical unit, ^a/a<a>$, one, is divisible by the LCM, one, so the tagger increments the frequency of its analysis, a<a>, by

LCM / size = 1 / 1 = 1

The size of the next lexical unit, ^a/a<a>/a<b>$, two, isn't divisible by the LCM, one. Therefore, the tagger first multiplies the LCM, one, by the size, two, to yield two. Then, the tagger multiplies the frequency of a<a>, one, by the size as well, also yielding two. Finally, the tagger increments the frequency of each of this lexical unit's analyses, a<a> and a<b>, by

LCM / size = 2 / 2 = 1

The frequency of a<a> is then three, and the frequency of a<b> is one.

The tagger then increments the frequency of the next lexical unit's analysis, a<b>, by

LCM / size = 2 / 1 = 2

After doing the same for the last lexical unit, the frequency of a<a> is three and the frequency of a<b> is five.

Each model implements functions to increment analyses and multiply previous ones, so this method works for models 2 and 3 as well.
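
The following self-contained C++ sketch of this bookkeeping reproduces the walkthrough above; the class and method names are illustrative, not apertium-tagger's actual internals.

<pre>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Illustrative sketch of the frequency bookkeeping described above.
class FrequencyTable {
public:
  // Count one lexical unit, given its vector of analysis strings.
  void count(const std::vector<std::string> &analyses) {
    const std::size_t size = analyses.size();
    if (lcm_ % size != 0) {
      // Scale the stored multiplier and all previous frequencies so
      // that the per-analysis increment below remains an integer.
      lcm_ *= size;
      for (auto &entry : frequency_)
        entry.second *= size;
    }
    for (const std::string &analysis : analyses)
      frequency_[analysis] += lcm_ / size;
  }

  const std::map<std::string, std::size_t> &frequencies() const {
    return frequency_;
  }

private:
  std::size_t lcm_ = 1;  // initialized to one, expecting unambiguous units
  std::map<std::string, std::size_t> frequency_;
};

int main() {
  // The corpus of Example 3.1.1.1.
  FrequencyTable table;
  table.count({"a<a>"});
  table.count({"a<a>", "a<b>"});
  table.count({"a<b>"});
  table.count({"a<b>"});
  // Prints a<a> 3 and a<b> 5, matching the walkthrough above.
  for (const auto &entry : table.frequencies())
    std::cout << entry.first << ' ' << entry.second << '\n';
}
</pre>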

TODO: If one passes the -d option to apertium-tagger, the tagger prints warnings about ambiguous analyses in corpora to stderr.

$ apertium-tagger -ds 0 -u 1 handtagged.txt
apertium-tagger: handtagged.txt: 2:13: unexpected analysis "a<b>" following analysis "a<a>"
^a/a<a>/a<b>$
            ^

Model 2

See section 5.3.2 of "A set of open-source tools for Turkish natural language processing."

Consider Example 3.1.1.

The tag string <b> is twice as frequent as <a>. However, model 1 scores b<a> and b<b> equally because neither analysis string appears in the corpus.

This model splits each analysis string into a root, r, and the part of the analysis string that isn't the root, a. An analysis string's root is its first lemma. The r of a<b>+c<d> is a; its a is <b>+c<d>. The tagger assigns each analysis string a score of

score(ra) = (f(a) + 1) * (f(ra) + 1) / (f(a) + n(a))

with additive smoothing, where f(ra) is the frequency of the whole analysis string, f(a) is the frequency of its tag string a, and n(a) is the number of distinct analysis strings with the tag string a, counting the analysis string being scored (see [2]; without additive smoothing, this model would be the same as model 1). The tagger assigns higher scores to unknown analysis strings with frequent a's than to unknown analysis strings with infrequent a's.

Given the lexical unit ^b/b<a>/b<b>$, the tagger assigns the analysis string b<a> a score of

(f(<a>) + 1) * (f(b<a>) + 1) / (f(<a>) + n(<a>)) = (1 + 1) * (0 + 1) / (1 + 2) = 2/3

Note that n(a) counts the analysis string being scored. For example, the tagger would assign the known analysis string a<a> a score of

(1 + 1) * (1 + 1) / (1 + 1) = 2

since a<a> is the only analysis string with the tag string <a> in the corpus, so n(<a>) is one.

The tagger assigns the analysis string b<b> a score of

(2 + 1) * (0 + 1) / (2 + 2) = 3/4

Since 3/4 > 2/3, the tagger prefers b<b>, reflecting that <b> is the more frequent tag string.

Notes

1. https://github.com/m5w/apertium
2. http://coltekin.net/cagri/papers/trmorph-tools.pdf
3. Installation#If you want to add language data / do more advanced stuff