Unigram tagger
apertium-tagger from m5w/apertium supports all the unigram models from "A set of open-source tools for Turkish natural language processing".
Install
First, install all prerequisites. See Installation#If you want to add language data / do more advanced stuff.
Then, replace <directory> with the directory you'd like to clone m5w/apertium into and clone the repository.

git clone https://github.com/m5w/apertium.git <directory>
Then, see Minimal installation from SVN#Set up environment. Finally, configure, build, and install m5w/apertium. See Minimal installation from SVN#Configure, build, and install.
Usage
See apertium-tagger -h.
Train a Model on a Hand-Tagged Corpus
First, get a hand-tagged corpus as one would for all other models.
$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
Example 1: a Hand-Tagged Corpus for apertium-tagger
Then, replace MODEL
with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace SERIALISED_BASIC_TAGGER
with the filename to which you'd like to write the model, and train the tagger.
$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt
Disambiguate
Either write input to a file or pipe it.
$ cat raw.txt
^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$
Example 2: Input for apertium-tagger
Replace MODEL with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace SERIALISED_BASIC_TAGGER with the file to which you wrote the unigram model, and disambiguate the input.
$ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
^a/a<b>$
^aa/a<b>+a<b>$
$ echo '^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
^a/a<b>$
^aa/a<b>+a<b>$
Unigram Models
See section 5.3 of "A set of open-source tools for Turkish natural language processing."
Model 1
See section 5.3.1 of "A set of open-source tools for Turkish natural language processing." This model scores each analysis string in proportion to its frequency with add-one smoothing. Consider the following corpus.
^a/a<a>$
^a/a<b>$
^a/a<b>$
Passed the lexical unit ^a/a<a>/a<b>/a<c>$, the tagger assigns the analysis string a<a> a score of

<math style="padding-left:1cm;">
\mathrm{score} = \mathrm{tokenCount} + 1 = 1 + 1 = 2~\text{.}
</math>

The tagger assigns the analysis string a<b> a score of

<math style="padding-left:1cm;">
\mathrm{score} = 2 + 1 = 3~\text{.}
</math>

The unknown analysis string a<c> is assigned a score of

<math style="padding-left:1cm;">
\mathrm{score} = 0 + 1 = 1~\text{.}
</math>
If reconfigured with --enable-debug, the tagger prints such calculations to stderr.

score("a<a>") == 2 == 2.000000000000000000
score("a<b>") == 3 == 3.000000000000000000
score("a<c>") == 1 == 1.000000000000000000
^a<b>$
Training on Corpora with Ambiguous Lexical Units
Consider the following corpus.
$ cat handtagged.txt
^a/a<a>$
^a/a<a>/a<b>$
^a/a<b>$
^a/a<b>$
For the first lexical unit, the probabilities of a<a> and a<b> are each one half. However, all unigram models store frequencies as std::size_t, so the tagger can't simply add one half to each analysis's frequency.
To account for this, the tagger stores the LCM of all lexical units' sizes, where a lexical unit's size is the size of its analysis vector. It initializes this value to one, expecting unambiguous lexical units. The size of this corpus's first lexical unit, ^a/a<a>$, is one, which is divisible by the LCM, one, so the tagger increments the frequency of its analysis, a<a>, by one.
The size of the next lexical unit, ^a/a<a>/a<b>$, is two, which isn't divisible by the LCM, one. Therefore, the tagger first multiplies the LCM, one, by the size, two, to yield two. Then, the tagger multiplies the existing frequency of a<a> by the size as well, also yielding two. Finally, the tagger increments the frequency of each of this lexical unit's analyses, a<a> and a<b>, by one: the LCM, two, divided by the size, two.
The frequency of a<a> is then three, and the frequency of a<b> is one.
The tagger then increments the frequency of the next lexical unit's analysis, a<b>, by two: the LCM, two, divided by the lexical unit's size, one.
After doing the same for the last lexical unit, the frequency of a<a> is three and the frequency of a<b> is five.
Each model implements functions to increment analyses and multiply previous ones, so this method works for models 2 and 3 as well.
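As an illustration, the update scheme above can be sketched in a few lines of C++17. This is not the tagger's code; the names are hypothetical, and it uses std::lcm rather than plain multiplication so that the stored value stays the LCM of all sizes seen so far.

#include <cstddef>
#include <iostream>
#include <map>
#include <numeric>  // std::lcm
#include <string>
#include <vector>

int main() {
  // Each lexical unit is represented by its vector of analysis strings.
  std::vector<std::vector<std::string>> corpus{
      {"a<a>"}, {"a<a>", "a<b>"}, {"a<b>"}, {"a<b>"}};
  std::map<std::string, std::size_t> frequency;
  std::size_t lcm = 1;  // LCM of the sizes of all lexical units seen so far
  for (const auto &lexicalUnit : corpus) {
    const std::size_t size = lexicalUnit.size();
    if (lcm % size != 0) {
      // Rescale the LCM and all stored frequencies so that each analysis
      // of this lexical unit can receive an integral increment.
      const std::size_t newLcm = std::lcm(lcm, size);
      for (auto &entry : frequency)
        entry.second *= newLcm / lcm;
      lcm = newLcm;
    }
    for (const auto &analysis : lexicalUnit)
      frequency[analysis] += lcm / size;
  }
  for (const auto &entry : frequency)
    std::cout << entry.first << ": " << entry.second << '\n';
  // Prints a<a>: 3 and a<b>: 5, matching the walkthrough above.
}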
TODO: If one passes the -d option to apertium-tagger, the tagger prints warnings about ambiguous analyses in corpora to stderr.

$ apertium-tagger -ds 0 -u 1 handtagged.txt
apertium-tagger: handtagged.txt: 2:13: unexpected analysis "a<b>" following analysis "a<a>"
^a/a<a>/a<b>$
            ^
Model 2
See section 5.3.2 of "A set of open-source tools for Turkish natural language processing."
Consider the same corpus from Unigram tagger#Model 1.
The tag string <b> is twice as frequent as <a>. However, model 1 scores b<a> and b<b> equally because neither analysis appears in the corpus.
This model splits each analysis string into a root, r, and the part of the analysis string that isn't the root, a. An analysis string's root is its first lemma. For s<t>, r is s and a is <t>; for s<t>+u<v>, r is s and a is <t>+u<v>. The tagger scores each analysis string in proportion to the product of the probability of r given a, with additive smoothing, and the frequency of a. This model scores unknown analysis strings with frequent tag strings higher than unknown analysis strings with infrequent tag strings.
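The split itself is a one-liner. Here is a minimal C++17 sketch with a hypothetical split() helper, not part of the tagger's API, that takes the root to be everything before the first tag delimiter.

#include <iostream>
#include <string>
#include <utility>

// Split an analysis string into its root r (the first lemma) and the
// remainder a, e.g. "s<t>+u<v>" into r = "s" and a = "<t>+u<v>".
std::pair<std::string, std::string> split(const std::string &analysis) {
  const std::string::size_type i = analysis.find('<');
  if (i == std::string::npos) return {analysis, ""};
  return {analysis.substr(0, i), analysis.substr(i)};
}

int main() {
  for (const std::string &analysis : {"s<t>", "s<t>+u<v>", "b<a>"}) {
    const auto [r, a] = split(analysis);
    std::cout << analysis << ": r = " << r << ", a = " << a << '\n';
  }
}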
Passed the lexical unit ^b/b<a>/b<b>$, the tagger assigns the analysis string b<a> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} & = \frac{(\mathrm{tokenCount\_r\_a} + 1)(\mathrm{tokenCount\_a} + 1)}{\mathrm{tokenCount\_a} + 1 + \mathrm{typeCount\_a}} \\
& = \frac{(0 + 1)(1 + 1)}{1 + 1 + 2} \\
& = \frac{(1)(2)}{4} \\
& = \frac{1}{2}~\text{,}
\end{align}
</math>

<math style="padding-left:1cm;">
\begin{align}
\text{where}~&\mathrm{tokenCount\_r\_a}~\text{is the frequency of}~r,a~\text{in the corpus,} \\
&\mathrm{tokenCount\_a}~\text{is the frequency of}~a~\text{in the corpus,} \\
\text{and}~&\mathrm{typeCount\_a}~\text{is the size of the parameter vector of all}~r~\text{preceding}~a~\text{.}
\end{align}
</math>
Note that <math>\mathrm{typeCount\_a}</math> counts the analysis string being scored. The tagger would assign the known analysis string a<a> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} & = \frac{(1 + 1)(1 + 1)}{1 + 1 + 1} \\
& = \frac{(2)(2)}{3} \\
& = \frac{4}{3}~\text{.}
\end{align}
</math>
The tagger assigns the analysis string b<b> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} & = \frac{(0 + 1)(2 + 1)}{2 + 1 + 2} \\
& = \frac{(1)(3)}{5} \\
& = \frac{3}{5}~\text{.}
\end{align}
</math>
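These calculations can be checked with a minimal C++ sketch of the formula above; score() is a hypothetical helper, not part of apertium-tagger's API, and the hard-coded counts come from the example corpus.

// Model 2's smoothed score, following the formula above.
#include <iostream>

double score(double tokenCount_r_a, double tokenCount_a, double typeCount_a) {
  return (tokenCount_r_a + 1) * (tokenCount_a + 1) /
         (tokenCount_a + 1 + typeCount_a);
}

int main() {
  // Counts from the corpus ^a/a<a>$ ^a/a<b>$ ^a/a<b>$.
  std::cout << score(0, 1, 2) << '\n';  // b<a>: 0.5
  std::cout << score(1, 1, 1) << '\n';  // a<a>: 1.33333...
  std::cout << score(0, 2, 2) << '\n';  // b<b>: 0.6
}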