Unigram tagger
Revision as of 06:04, 16 January 2016
<code>apertium-tagger</code> from "m5w/apertium[1]" supports all the unigram models from "A set of open-source tools for Turkish natural language processing[2]."
==Installation==
First, install all prerequisites. See “If you want to add language data / do more advanced stuff[3].”
Then, replace <code><directory></code> with the directory you'd like to clone "m5w/apertium[1]" into and clone the repository.

<pre>
git clone https://github.com/m5w/apertium.git <directory>
</pre>
Then, configure your environment[4] and finally configure, build, and install[5] “m5w/apertium[1].”
==Usage==
See <code>apertium-tagger -h</code>.
===Training a Model on a Hand-Tagged Corpus===
First, get a hand-tagged corpus as one would for all other models.
<pre>
$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
</pre>
''Example 2.1.1:'' <code>handtagged.txt</code>'': a Hand-Tagged Corpus for'' <code>apertium-tagger</code>
Then, replace <code>MODEL</code> with the unigram model from "A set of open-source tools for Turkish natural language processing[2]" you'd like to use, replace <code>SERIALISED_BASIC_TAGGER</code> with the filename to which you'd like to write the model, and train the tagger.

<pre>
$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt
</pre>
===Disambiguation===
Either write input to a file or pipe it to the tagger.
<pre>
$ cat raw.txt
^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$
</pre>
''Example 2.2.1:'' <code>raw.txt</code>'': Input for'' <code>apertium-tagger</code>
Replace <code>MODEL</code> with the unigram model from "A set of open-source tools for Turkish natural language processing[2]" you'd like to use, replace <code>SERIALISED_BASIC_TAGGER</code> with the file to which you wrote the unigram model, and disambiguate the input.

<pre>
$ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
^a/a<b>$
^aa/a<b>+a<b>$
$ echo '^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
^a/a<b>$
^aa/a<b>+a<b>$
</pre>
==Unigram Models==
See section 5.3 of "A set of open-source tools for Turkish natural language processing."[2]
===Model 1===
See section 5.3.1 of "A set of open-source tools for Turkish natural language processing."[2] This model assigns each analysis string a score of

<math>
\mathrm{score} = f(T)~\text{,}
</math>

where <math>T</math> is the analysis string, with additive smoothing.
Consider the following corpus.
<pre>
$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
</pre>
''Example 3.1.1:'' <code>handtagged.txt</code>'': A Hand-Tagged Corpus for'' <code>apertium-tagger</code>
Given the lexical unit <code>^a/a<a>/a<b>/a<c>$</code>, the tagger assigns the analysis string <code>a<a></code> a score of

<math>
\mathrm{score} = \mathrm{tokenCount\_T} + 1 = 1 + 1 = 2~\text{,}
</math>

where <math>\mathrm{tokenCount\_T}</math> is the frequency of <math>T</math> in the corpus.
The tagger then assigns the analysis string <code>a<b></code> a score of

<math>
\mathrm{score} = 2 + 1 = 3
</math>
and the unknown analysis string <code>a<c></code> a score of

<math>
\mathrm{score} = 0 + 1 = 1~\text{.}
</math>
If <code>./autogen.sh</code> is passed the option <code>--enable-debug</code>, the tagger prints such calculations to standard error.

<pre>
$ ./autogen.sh --enable-debug
$ make
$ echo '^a/a<a>/a<b>/a<c>$' | apertium-tagger -gu 1 SERIALISED_BASIC_TAGGER
score("a<a>") == 2 == 2.000000000000000000
score("a<b>") == 3 == 3.000000000000000000
score("a<c>") == 1 == 1.000000000000000000
^a<b>$
</pre>
====Training on Corpora with Ambiguous Lexical Units====
Consider the following corpus.
<pre>
$ cat handtagged.txt
^a/a<a>$
^a/a<a>/a<b>$
^a/a<b>$
^a/a<b>$
</pre>
''Example 3.1.1.1:'' <code>handtagged.txt</code>'': a Hand-Tagged Corpus for'' <code>apertium-tagger</code>
The tagger expects lexical units of 1 analysis string, or lexical units of size 1. However, the size of the lexical unit <code>^a/a<a>/a<b>$</code> is 2. For this lexical unit,

<math>
P(\texttt{a<a>}) = P(\texttt{a<b>}) = \frac12~\text{;}
</math>

the tagger must effectively increment the frequency of both analysis strings by <code>0.500000000000000000</code>. However, the tagger can't increment the analysis strings' frequencies by a non-integral number because model 1 represents analysis strings' frequencies as <code>std::size_t</code>[6].
Instead, the tagger multiplies all the stored analysis strings' frequencies by this lexical unit's size and increments the frequency of each of this lexical unit's analysis strings by 1.

<math>
\begin{align}
f(\texttt{a<a>}) &= (1)(2) + 1 = 3\\
f(\texttt{a<b>}) &= (0)(2) + 1 = 1
\end{align}
</math>
The tagger could then increment the analysis strings’ frequencies of another lexical unit of size 2 without multiplying any of the stored analysis strings’ frequencies. To account for this, the tagger stores the least common multiple of all lexical units’ sizes; only if the LCM isn't divisible by a lexical unit’s size does the tagger multiply all the analysis strings’ frequencies.
After incrementing the analysis strings' frequencies of the lexical unit <code>^a/a<a>/a<b>$</code>, the tagger increments the frequency of the analysis string <code>a<b></code> of the lexical unit <code>^a/a<b>$</code> by

<math>
\frac{\mathrm{LCM}}{\mathrm{TheLexicalUnit.size}} = \frac{2}{1} = 2~\text{.}
</math>
If the tagger gets another lexical unit of size 2, it would increment the frequency of each of the lexical unit's analysis strings by

<math>
\frac{\mathrm{LCM}}{\mathrm{TheLexicalUnit.size}} = \frac{2}{2} = 1~\text{,}
</math>
and if it gets a lexical unit of size 3, it would multiply all the analysis strings' frequencies by 3 and then increment the frequency of each of the lexical unit's analysis strings by

<math>
\frac{\mathrm{LCM}}{\mathrm{TheLexicalUnit.size}} = \frac{6}{3} = 2~\text{.}
</math>
Each model supports functions to increment all its stored analysis strings' frequencies, so models 2 and 3 support this algorithm as well.
'''TODO''': If one passes the <code>-d</code> option to <code>apertium-tagger</code>, the tagger prints warnings about ambiguous analyses in corpora to stderr.
<pre>
$ apertium-tagger -ds 0 -u 1 handtagged.txt
apertium-tagger: handtagged.txt: 2:13: unexpected analysis "a<b>" following analysis "a<a>"
^a/a<a>/a<b>$
            ^
</pre>
===Model 2===
See section 5.3.2 of "A set of open-source tools for Turkish natural language processing."[2]
Consider Example 3.1.1.
The tag string <code><b></code> is twice as frequent as <code><a></code>. However, model 1 scores <code>b<a></code> and <code>b<b></code> equally because neither analysis string appears in the corpus.
This model splits each analysis string into a root, <math>r</math>, and the part of the analysis string that isn't the root, <math>a</math>. An analysis string's root is its first lemma. The <math>r</math> of <code>a<b>+c<d></code> is <code>a</code>; its <math>a</math> is <code><b>+c<d></code>. The tagger assigns each analysis string a score of <math>P(r|a)P(a)</math> with additive smoothing. (See [1]. Without additive smoothing, this model would be the same as model 1.) The tagger assigns higher scores to unknown analysis strings with frequent <math>a</math> than to unknown analysis strings with infrequent <math>a</math>.
Given the lexical unit <code>^b/b<a>/b<b>$</code>, the tagger assigns the analysis string <code>b<a></code> a score of
Note that counts the analysis string being scored. For example, the tagger would assign the known analysis string <code>a<a></code> a score of
The tagger assigns the analysis string <code>b<b></code> a score of
==Notes==
1. https://github.com/m5w/apertium
2. http://coltekin.net/cagri/papers/trmorph-tools.pdf
3. Installation#If you want to add language data / do more advanced stuff
4. Minimal installation from SVN#Set up environment
5. Minimal installation from SVN#Configure, build, and install
6. http://en.cppreference.com/w/cpp/types/size_t