Unigram tagger
apertium-tagger from m5w/apertium supports all the unigram models from "A set of open-source tools for Turkish natural language processing".
Install
First, install all prerequisites. See Installation#If you want to add language data / do more advanced stuff.
Then, replace <directory> with the directory you'd like to clone m5w/apertium into and clone the repository.

git clone https://github.com/m5w/apertium.git <directory>
Then, see Minimal installation from SVN#Set up environment. Finally, configure, build, and install m5w/apertium. See Minimal installation from SVN#Configure, build, and install.
Usage
See apertium-tagger -h.
Train a Model on a Hand-Tagged Corpus
First, get a hand-tagged corpus as one would for all other models.
$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
Example 1: a Hand-Tagged Corpus for apertium-tagger
Then, replace MODEL
with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace SERIALISED_BASIC_TAGGER
with the filename to which you'd like to write the model, and train the tagger.
$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt
Disambiguate
Either write input to a file or pipe it.
$ cat raw.txt
^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$
Example 2: Input for apertium-tagger
Replace MODEL with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace SERIALISED_BASIC_TAGGER with the file to which you wrote the unigram model, and disambiguate the input.
$ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
^a/a<b>$
^aa/a<b>+a<b>$
$ echo '^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
^a/a<b>$
^aa/a<b>+a<b>$
Unigram Models
See section 5.3 of "A set of open-source tools for Turkish natural language processing."
Model 1
See section 5.3.1 of "A set of open-source tools for Turkish natural language processing." This model scores each analysis string in proportion to its frequency with add-one smoothing. Consider the following corpus.
^a/a<a>$
^a/a<b>$
^a/a<b>$
Passed the lexical unit ^a/a<a>/a<b>/a<c>$, the tagger assigns the analysis string a<a> a score of

<math style="padding-left:1cm;">
\mathrm{score} = \mathrm{tokenCount} + 1 = 1 + 1 = 2~\text{.}
</math>

The tagger assigns the analysis string a<b> a score of

<math style="padding-left:1cm;">
\mathrm{score} = 2 + 1 = 3~\text{.}
</math>

The unknown analysis string a<c> is assigned a score of

<math style="padding-left:1cm;">
\mathrm{score} = 0 + 1 = 1~\text{.}
</math>
If reconfigured with --enable-debug, the tagger prints such calculations to stderr.

score("a<a>") == 2 == 2.000000000000000000
score("a<b>") == 3 == 3.000000000000000000
score("a<c>") == 1 == 1.000000000000000000
^a<b>$
Training on Corpora with Ambiguous Lexical Units
Consider the following corpus.
$ cat handtagged.txt
^a/a<a>$
^a/a<a>/a<b>$
^a/a<b>$
^a/a<b>$
For the first lexical unit, the probabilities of a<a> and a<b> are each one half. However, all unigram models store frequencies as std::size_t, so the tagger can't simply add one half to each analysis's frequency.
To account for this, the tagger stores the LCM of all lexical units' sizes, where a lexical unit's size is the size of its analysis vector. It initializes this value to one, expecting unambiguous lexical units. The size of this corpus's first lexical unit, ^a/a<a>$, is one, which is divisible by the LCM, one, so the tagger increments the frequency of its analysis, a<a>, by one.
The size of the next lexical unit, ^a/a<a>/a<b>$, is two, which isn't divisible by the LCM, one. Therefore, the tagger first multiplies the LCM, one, by the size, two, to yield two. Then, the tagger multiplies the existing frequency of a<a> by the size as well, also yielding two. Finally, the tagger increments the frequency of each of this lexical unit's analyses, a<a> and a<b>, by one: the LCM, two, divided by the size, two.
The frequency of a<a> is then three, and the frequency of a<b> is one.
The tagger then increments the frequency of the next lexical unit's analysis, a<b>, by two: the LCM, two, divided by the lexical unit's size, one.
After doing the same for the last lexical unit, the frequency of a<a> is three and the frequency of a<b> is five.
Each model implements functions to increment analyses and multiply previous ones, so this method works for models 2 and 3 as well.
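As an illustration, the update scheme above can be sketched in a few lines of C++17. This is not the tagger's code; the names are hypothetical, and it uses std::lcm rather than plain multiplication so that the stored value stays the LCM of all sizes seen so far.

#include <cstddef>
#include <iostream>
#include <map>
#include <numeric>  // std::lcm
#include <string>
#include <vector>

int main() {
  // Each lexical unit is represented by its vector of analysis strings.
  std::vector<std::vector<std::string>> corpus{
      {"a<a>"}, {"a<a>", "a<b>"}, {"a<b>"}, {"a<b>"}};
  std::map<std::string, std::size_t> frequency;
  std::size_t lcm = 1;  // LCM of the sizes of all lexical units seen so far
  for (const auto &lexicalUnit : corpus) {
    const std::size_t size = lexicalUnit.size();
    if (lcm % size != 0) {
      // Rescale the LCM and all stored frequencies so that each analysis
      // of this lexical unit can receive an integral increment.
      const std::size_t newLcm = std::lcm(lcm, size);
      for (auto &entry : frequency)
        entry.second *= newLcm / lcm;
      lcm = newLcm;
    }
    for (const auto &analysis : lexicalUnit)
      frequency[analysis] += lcm / size;
  }
  for (const auto &entry : frequency)
    std::cout << entry.first << ": " << entry.second << '\n';
  // Prints a<a>: 3 and a<b>: 5, matching the walkthrough above.
}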
TODO: If one passes the -d option to apertium-tagger, the tagger prints warnings about ambiguous analyses in corpora to stderr.

$ apertium-tagger -ds 0 -u 1 handtagged.txt
apertium-tagger: handtagged.txt: 2:13: unexpected analysis "a<b>" following analysis "a<a>"
^a/a<a>/a<b>$
            ^
Model 2
See section 5.3.2 of "A set of open-source tools for Turkish natural language processing."
Consider the same corpus from Unigram tagger#Model 1.
The tag string <b> is twice as frequent as <a>. However, model 1 scores b<a> and b<b> equally because neither analysis appears in the corpus.
This model splits each analysis string into a root, r, and the part of the analysis string that isn't the root, a. An analysis string's root is its first lemma. For s<t>, r is s and a is <t>; for s<t>+u<v>, r is s and a is <t>+u<v>. The tagger scores each analysis string in proportion to the product of the probability of r given a, with additive smoothing, and the frequency of a. This model scores unknown analysis strings with frequent tag strings higher than unknown analysis strings with infrequent tag strings.
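The split itself is a one-liner. Here is a minimal C++17 sketch with a hypothetical split() helper, not part of the tagger's API, that takes the root to be everything before the first tag delimiter.

#include <iostream>
#include <string>
#include <utility>

// Split an analysis string into its root r (the first lemma) and the
// remainder a, e.g. "s<t>+u<v>" into r = "s" and a = "<t>+u<v>".
std::pair<std::string, std::string> split(const std::string &analysis) {
  const std::string::size_type i = analysis.find('<');
  if (i == std::string::npos) return {analysis, ""};
  return {analysis.substr(0, i), analysis.substr(i)};
}

int main() {
  for (const std::string &analysis : {"s<t>", "s<t>+u<v>", "b<a>"}) {
    const auto [r, a] = split(analysis);
    std::cout << analysis << ": r = " << r << ", a = " << a << '\n';
  }
}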
Passed the lexical unit ^b/b<a>/b<b>$, the tagger assigns the analysis string b<a> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} & = \frac{(\mathrm{tokenCount\_r\_a} + 1)(\mathrm{tokenCount\_a} + 1)}{\mathrm{tokenCount\_a} + 1 + \mathrm{typeCount\_a}} \\
& = \frac{(0 + 1)(1 + 1)}{1 + 1 + 2} \\
& = \frac{(1)(2)}{4} \\
& = \frac{1}{2}~\text{,}
\end{align}
</math>

<math style="padding-left:1cm;">
\begin{align}
\text{where}~&\mathrm{tokenCount\_r\_a}~\text{is the frequency of}~r,a~\text{in the corpus,} \\
&\mathrm{tokenCount\_a}~\text{is the frequency of}~a~\text{in the corpus,} \\
\text{and}~&\mathrm{typeCount\_a}~\text{is the size of the parameter vector of all}~r~\text{preceding}~a~\text{.}
\end{align}
</math>
Note that <math>\mathrm{typeCount\_a}</math> counts the analysis string being scored. The tagger would assign the known analysis string a<a> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} & = \frac{(1 + 1)(1 + 1)}{1 + 1 + 1} \\
& = \frac{(2)(2)}{3} \\
& = \frac{4}{3}~\text{.}
\end{align}
</math>
The tagger assigns the analysis string b<b> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} & = \frac{(0 + 1)(2 + 1)}{2 + 1 + 2} \\
& = \frac{(1)(3)}{5} \\
& = \frac{3}{5}~\text{.}
\end{align}
</math>
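These calculations can be checked with a minimal C++ sketch of the formula above; score() is a hypothetical helper, not part of apertium-tagger's API, and the hard-coded counts come from the example corpus.

// Model 2's smoothed score, following the formula above.
#include <iostream>

double score(double tokenCount_r_a, double tokenCount_a, double typeCount_a) {
  return (tokenCount_r_a + 1) * (tokenCount_a + 1) /
         (tokenCount_a + 1 + typeCount_a);
}

int main() {
  // Counts from the corpus ^a/a<a>$ ^a/a<b>$ ^a/a<b>$.
  std::cout << score(0, 1, 2) << '\n';  // b<a>: 0.5
  std::cout << score(1, 1, 1) << '\n';  // a<a>: 1.33333...
  std::cout << score(0, 2, 2) << '\n';  // b<b>: 0.6
}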