Unigram tagger
Revision as of 03:56, 16 January 2016
apertium-tagger from “m5w/apertium” (https://github.com/m5w/apertium) supports all the unigram models from “A set of open-source tools for Turkish natural language processing” (http://coltekin.net/cagri/papers/trmorph-tools.pdf).
Installation
First, install all prerequisites. See “If you want to add language data / do more advanced stuff[3].”
Then, replace <directory> with the directory you'd like to clone “m5w/apertium” into and clone the repository.
git clone https://github.com/m5w/apertium.git <directory>
Then, configure your environment[4] and, finally, configure, build, and install[5] m5w/apertium.
Usage
See apertium-tagger -h.
Training a Model on a Hand-Tagged Corpus
First, get a hand-tagged corpus as one would for all other models.
$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
Example 2.1.1: handtagged.txt, a Hand-Tagged Corpus for apertium-tagger
Then, replace MODEL with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace SERIALISED_BASIC_TAGGER with the filename to which you'd like to write the model, and train the tagger.
$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt
Disambiguation
Either write input to a file or pipe it to the tagger.
$ cat raw.txt
^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$
Example 2.2.1: raw.txt, Input for apertium-tagger
Replace MODEL with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace SERIALISED_BASIC_TAGGER with the file to which you wrote the unigram model, and disambiguate the input.
$ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
^a/a<b>$ ^aa/a<b>+a<b>$
$ echo '^a/a<a>/a<b>/a<c>$ ^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
^a/a<b>$ ^aa/a<b>+a<b>$
Unigram Models
See section 5.3 of "A set of open-source tools for Turkish natural language processing."
Model 1
See section 5.3.1 of "A set of open-source tools for Turkish natural language processing."
This model assigns each analysis string a score of n + 1, where n is the analysis string's frequency in the corpus: its frequency with additive (add-one) smoothing.
Consider the following corpus.
$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
Example 3.1.1: handtagged.txt, a Hand-Tagged Corpus for apertium-tagger
Given the lexical unit ^a/a<a>/a<b>/a<c>$, the tagger assigns the analysis string a<a> a score of two: its frequency, one, plus one. The tagger then assigns the analysis string a<b> a score of three (two plus one) and the unknown analysis string a<c> a score of one (zero plus one).
If ./autogen.sh is passed the option --enable-debug, the tagger prints such calculations to standard error.
$ ./autogen.sh --enable-debug
$ make
$ echo '^a/a<a>/a<b>/a<c>$' | apertium-tagger -gu 1 SERIALISED_BASIC_TAGGER
score("a<a>") == 2 == 2.000000000000000000
score("a<b>") == 3 == 3.000000000000000000
score("a<c>") == 1 == 1.000000000000000000
^a<b>$
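These scores are simple enough to reproduce. The following Python sketch is a hypothetical re-implementation for illustration, not the tagger's actual C++ code; it mirrors the add-one scoring shown in the debug output.

```python
from collections import Counter

def train_model1(corpus):
    """Count each analysis string in a hand-tagged corpus.

    `corpus` is a list of analysis strings such as "a<a>"."""
    return Counter(corpus)

def score_model1(frequencies, analysis):
    """Model 1 score: the analysis string's corpus frequency plus one."""
    return frequencies[analysis] + 1

# Example 3.1.1: a<a> appears once, a<b> twice, a<c> never.
frequencies = train_model1(["a<a>", "a<b>", "a<b>"])
for analysis in ["a<a>", "a<b>", "a<c>"]:
    print(f'score("{analysis}") == {score_model1(frequencies, analysis)}')
# → score("a<a>") == 2, score("a<b>") == 3, score("a<c>") == 1
```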
Training on Corpora with Ambiguous Lexical Units
Consider the following corpus.
$ cat handtagged.txt
^a/a<a>$
^a/a<a>/a<b>$
^a/a<b>$
^a/a<b>$
Example 3.1.1.1: a Hand-Tagged Corpus for apertium-tagger
The probabilities of a<a> and a<b> are each one half for the ambiguous second lexical unit. However, all unigram models store frequencies as std::size_t.
To account for this, the tagger stores the LCM of all lexical units' sizes, where a lexical unit's size is the size of its analysis vector. It initializes this value to one, expecting unambiguous lexical units. The LCM, one, is divisible by the size of this corpus's first lexical unit, ^a/a<a>$, one, so the tagger increments the frequency of its analysis, a<a>, by the LCM divided by the size: one.
The LCM, one, isn't divisible by the size of the next lexical unit, ^a/a<a>/a<b>$, two. Therefore, the tagger first multiplies the LCM, one, by the size, two, to yield two. Then, the tagger multiplies the existing frequency of a<a> by the size as well, also yielding two. Finally, the tagger increments the frequency of each of this lexical unit's analyses, a<a> and a<b>, by the new LCM divided by the size: one.
The frequency of a<a> is then three, and the frequency of a<b> is one.
The tagger then increments the frequency of the next lexical unit's analysis, a<b>, by the LCM divided by the size: two.
After doing the same for the last lexical unit, the frequency of a<a> is three, and the frequency of a<b> is five.
Each model implements functions to increment analyses and multiply previous ones, so this method works for models 2 and 3 as well.
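The LCM bookkeeping above can be sketched in Python. This is a hypothetical re-implementation for illustration only; the real tagger is C++ and stores its frequencies as std::size_t.

```python
from collections import Counter

def train(corpus):
    """Count analyses with integer frequencies, rescaling by an LCM so
    that each ambiguous lexical unit's analyses share one unit of weight.

    `corpus` is a list of lexical units; each lexical unit is the list of
    its analysis strings (its "analysis vector")."""
    lcm = 1  # LCM of all lexical units' sizes seen so far
    frequencies = Counter()
    for analyses in corpus:
        size = len(analyses)
        if lcm % size != 0:
            # The LCM isn't divisible by this size: multiply the LCM by
            # the size, and rescale all existing frequencies to match.
            lcm *= size
            for analysis in frequencies:
                frequencies[analysis] *= size
        # Each analysis gets an equal integer share of one unit of weight.
        for analysis in analyses:
            frequencies[analysis] += lcm // size
    return frequencies, lcm

# Example 3.1.1.1: the second lexical unit is ambiguous.
frequencies, lcm = train([["a<a>"], ["a<a>", "a<b>"], ["a<b>"], ["a<b>"]])
# → frequencies: a<a> is 3, a<b> is 5, with an LCM of 2
```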
TODO: If one passes the -d option to apertium-tagger, the tagger prints warnings about ambiguous analyses in corpora to stderr.
$ apertium-tagger -ds 0 -u 1 handtagged.txt
apertium-tagger: handtagged.txt: 2:13: unexpected analysis "a<b>" following analysis "a<a>"
^a/a<a>/a<b>$
            ^
Model 2
See section 5.3.2 of "A set of open-source tools for Turkish natural language processing."
Consider Example 3.1.1. The tag string <b> is twice as frequent as <a>. However, model 1 scores b<a> and b<b> equally because neither analysis string appears in the corpus.
This model splits each analysis string into a root, r, and the part of the analysis string that isn't the root, a. An analysis string's root is its first lemma. The r of a<b>+c<d> is a; its a is <b>+c<d>. The tagger assigns each analysis string a score derived from the frequencies of its r and of its a, with additive smoothing. (See "A set of open-source tools for Turkish natural language processing." Without additive smoothing, this model would be the same as model 1.) The tagger assigns higher scores to unknown analysis strings with frequent a than to unknown analysis strings with infrequent a.
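Splitting off the root can be illustrated with a short Python helper. This is a hypothetical sketch, not the tagger's code; it assumes each lemma is immediately followed by <...> tag groups, as in the examples here.

```python
def split_analysis(analysis):
    """Split an analysis string into its root (the first lemma) and the
    rest of the string, e.g. "a<b>+c<d>" -> ("a", "<b>+c<d>")."""
    root_end = analysis.index("<")  # the first lemma ends where tags begin
    return analysis[:root_end], analysis[root_end:]

print(split_analysis("a<b>+c<d>"))  # → ('a', '<b>+c<d>')
```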
Given the lexical unit ^b/b<a>/b<b>$, the tagger assigns the analysis string b<a> a score based on the frequency of its a, <a>, which appears once in the corpus. Note that this frequency counts the analysis string being scored; for example, when the tagger scores the known analysis string a<a>, the count of its a, <a>, includes a<a> itself. The tagger assigns the analysis string b<b> a higher score, since its a, <b>, appears twice in the corpus.