Unigram tagger
Revision as of 21:40, 15 January 2016
<code>apertium-tagger</code> from [https://github.com/m5w/apertium m5w/apertium] supports all the unigram models from [http://coltekin.net/cagri/papers/trmorph-tools.pdf A set of open-source tools for Turkish natural language processing].
==Installation==
First, install all prerequisites. See [[Installation#If you want to add language data / do more advanced stuff]].
Then, replace <code><directory></code> with the directory you'd like to clone m5w/apertium into and clone the repository.

<pre>
git clone https://github.com/m5w/apertium.git <directory>
</pre>
Then, see [[Minimal installation from SVN#Set up environment]]. Finally, configure, build, and install m5w/apertium. See [[Minimal installation from SVN#Configure, build, and install]].
==Usage==
See <code>apertium-tagger -h</code>.
===Training a Model on a Hand-Tagged Corpus===
First, get a hand-tagged corpus as one would for all other models.
<pre>
$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
</pre>
''Example 2.1.1: a Hand-Tagged Corpus for'' <code>apertium-tagger</code>
Then, replace <code>MODEL</code> with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace <code>SERIALISED_BASIC_TAGGER</code> with the filename to which you'd like to write the model, and train the tagger.

<pre>
$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt
</pre>
===Disambiguate===
Either write input to a file or pipe it to the tagger.
<pre>
$ cat raw.txt
^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$
</pre>
''Example 2.2.1: Input for'' <code>apertium-tagger</code>
Replace <code>MODEL</code> with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace <code>SERIALISED_BASIC_TAGGER</code> with the file to which you wrote the unigram model, and disambiguate the input.
<pre>
$ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
^a/a<b>$
^aa/a<b>+a<b>$
$ echo '^a/a<a>/a<b>/a<c>$ ^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
^a/a<b>$ ^aa/a<b>+a<b>$
</pre>
==Unigram Models==
See section 5.3 of "A set of open-source tools for Turkish natural language processing."
===Model 1===
See section 5.3.1 of "A set of open-source tools for Turkish natural language processing."
This model assigns each analysis string a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} &= f(T)~\text{,}
\end{align}
</math>

<math style="padding-left:1cm;">
\begin{align}
f(T) &= \mathrm{tokenCount\_T} + 1~\text{,}
\end{align}
</math>

with additive smoothing, where <math>\mathrm{tokenCount\_T}</math> is the frequency of the analysis string <math>T</math> in the corpus.
Consider the following corpus.
<pre>
$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
</pre>
''Example 3.1.1:'' <code>handtagged.txt</code> '': A Hand-Tagged Corpus for'' <code>apertium-tagger</code>
Given the lexical unit <code>^a/a<a>/a<b>/a<c>$</code>, the tagger assigns the analysis string <code>a<a></code> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} &= \mathrm{tokenCount\_T} + 1 \\
&= 1 + 1 \\
&= 2~\text{,}
\end{align}
</math>

<math style="padding-left:1cm;">
\begin{align}
\text{where}~&\mathrm{tokenCount\_T}~\text{is the frequency of}~T~\text{in the corpus}~\text{.}
\end{align}
</math>
The tagger then assigns the analysis string <code>a<b></code> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} &= 2 + 1 \\
&= 3
\end{align}
</math>
and the unknown analysis string <code>a<c></code> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} &= 0 + 1 \\
&= 1~\text{.}
\end{align}
</math>
If <code>./autogen.sh</code> is passed the option <code>--enable-debug</code>, the tagger prints such calculations to standard error.
<pre>
$ ./autogen.sh --enable-debug
$ make
$ echo '^a/a<a>/a<b>/a<c>$' | apertium-tagger -gu 1 SERIALISED_BASIC_TAGGER
score("a<a>") == 2 == 2.000000000000000000
score("a<b>") == 3 == 3.000000000000000000
score("a<c>") == 1 == 1.000000000000000000
^a<b>$
</pre>
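The scoring rule above is simple enough to sketch in a few lines. The following is a hypothetical Python re-implementation of model 1's additive smoothing (not code from <code>apertium-tagger</code> itself); the corpus frequencies are taken from Example 3.1.1.

```python
# Hypothetical sketch of model 1: score(T) = tokenCount_T + 1, where
# tokenCount_T is the frequency of the analysis string T in the corpus.
from collections import Counter

# Analysis-string frequencies from Example 3.1.1 (^a/a<a>$ ^a/a<b>$ ^a/a<b>$).
frequencies = Counter({"a<a>": 1, "a<b>": 2})

def score(analysis: str) -> int:
    # Additive smoothing: an unknown analysis string still scores 0 + 1 = 1.
    return frequencies[analysis] + 1

# The lexical unit ^a/a<a>/a<b>/a<c>$ offers three analysis strings.
for analysis in ("a<a>", "a<b>", "a<c>"):
    print(f'score("{analysis}") == {score(analysis)}')
```

As in the debug transcript, the scores come out as 2, 3, and 1, so the tagger selects <code>a<b></code>.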
====Training on Corpora with Ambiguous Lexical Units====
Consider the following corpus.
<pre>
$ cat handtagged.txt
^a/a<a>$
^a/a<a>/a<b>$
^a/a<b>$
^a/a<b>$
</pre>
''Example 3.1.1.1: a Hand-Tagged Corpus for'' <code>apertium-tagger</code>
The probabilities of <code>a<a></code> and <code>a<b></code> are both one half for the ambiguous lexical unit <code>^a/a<a>/a<b>$</code>. However, all unigram models store frequencies as [http://en.cppreference.com/w/cpp/types/size_t <code>std::size_t</code>], an unsigned integer type.
To account for this, the tagger stores the LCM of all lexical units' sizes, where a lexical unit's size is the size of its analysis vector. It initializes this value to one, expecting unambiguous lexical units. The LCM, one, is divisible by the size of this corpus's first lexical unit, <code>^a/a<a>$</code>, one, so the tagger increments the frequency of its analysis, <code>a<a></code>, by

<math style="padding-left:1cm;">
\begin{align}
\frac{\mathrm{LCM}}{\mathrm{size}} &= \frac{1}{1} \\
&= 1~\text{.}
\end{align}
</math>
The LCM, one, isn't divisible by the size of the next lexical unit, <code>^a/a<a>/a<b>$</code>, two. Therefore, the tagger first multiplies the LCM, one, by the size, two, to yield two. Then, the tagger multiplies the frequency of <code>a<a></code> by the size, also yielding two. Finally, the tagger increments the frequency of each of this lexical unit's analyses, <code>a<a></code> and <code>a<b></code>, by

<math style="padding-left:1cm;">
\begin{align}
\frac{\mathrm{LCM}}{\mathrm{size}} &= \frac{2}{2} \\
&= 1~\text{.}
\end{align}
</math>
The frequency of <code>a<a></code> is then three, and the frequency of <code>a<b></code> is one.
The tagger then increments the frequency of the next lexical unit's analysis, <code>a<b></code>, by

<math style="padding-left:1cm;">
\begin{align}
\frac{\mathrm{LCM}}{\mathrm{size}} &= \frac{2}{1} \\
&= 2~\text{.}
\end{align}
</math>
After doing the same for the last lexical unit, the frequency of <code>a<a></code> is three and the frequency of <code>a<b></code> is five.
Each model implements functions to increment analyses and multiply previous ones, so this method works for models 2 and 3 as well.
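The bookkeeping above can be sketched as follows. This is a hypothetical Python rendering of the LCM-scaled counting, not <code>apertium-tagger</code>'s actual implementation; it reproduces the frequencies derived in the walkthrough.

```python
# Hypothetical sketch of LCM-scaled frequency counting: frequencies stay
# integral (std::size_t in apertium-tagger) even for ambiguous lexical units.
from collections import Counter
from math import gcd

def train(corpus):
    lcm = 1                  # LCM of the sizes of all lexical units seen so far
    frequencies = Counter()
    for analyses in corpus:  # one lexical unit = its vector of analysis strings
        size = len(analyses)
        if lcm % size != 0:
            new_lcm = lcm * size // gcd(lcm, size)
            scale = new_lcm // lcm
            for analysis in frequencies:      # rescale all previous frequencies
                frequencies[analysis] *= scale
            lcm = new_lcm
        for analysis in analyses:             # each analysis gets an equal share
            frequencies[analysis] += lcm // size
    return frequencies

# Example 3.1.1.1: ^a/a<a>$ ^a/a<a>/a<b>$ ^a/a<b>$ ^a/a<b>$
corpus = [["a<a>"], ["a<a>", "a<b>"], ["a<b>"], ["a<b>"]]
print(train(corpus))  # Counter({'a<b>': 5, 'a<a>': 3})
```

The final frequencies, three for <code>a<a></code> and five for <code>a<b></code>, match the walkthrough above.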
If one passes the <code>-d</code> option to <code>apertium-tagger</code>, the tagger prints warnings about ambiguous analyses in corpora to stderr.
<pre>
$ apertium-tagger -ds 0 -u 1 handtagged.txt
apertium-tagger: handtagged.txt: 2:13: unexpected analysis "a<b>" following analysis "a<a>"
^a/a<a>/a<b>$
            ^
</pre>
===Model 2===
See section 5.3.2 of "A set of open-source tools for Turkish natural language processing."
Consider Example 3.1.1.
The tag string <code><b></code> is twice as frequent as <code><a></code>. However, model 1 scores <code>b<a></code> and <code>b<b></code> equally because neither analysis string appears in the corpus.
This model splits each analysis string into a root, <math>r</math>, and the part of the analysis string that isn't the root, <math>a</math>. An analysis string's root is its first lemma. The <math>r</math> of <code>a<b>+c<d></code> is <code>a</code>; its <math>a</math> is <code><b>+c<d></code>. The tagger assigns each analysis string a score of <math>P(r|a)P(a)</math> with additive smoothing. (See [1]. Without additive smoothing, this model would be the same as model 1.) The tagger assigns higher scores to unknown analysis strings with frequent <math>a</math> than to unknown analysis strings with infrequent <math>a</math>.
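The root split can be sketched as a hypothetical Python helper, assuming the root is simply everything before the first tag:

```python
# Hypothetical sketch of model 2's split of an analysis string into its
# root r (the first lemma) and the remainder a (everything after it).
def split_analysis(analysis: str) -> tuple[str, str]:
    i = analysis.index("<")  # the root ends where the first tag begins
    return analysis[:i], analysis[i:]

print(split_analysis("a<b>+c<d>"))  # ('a', '<b>+c<d>')
print(split_analysis("b<a>"))       # ('b', '<a>')
```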
Given the lexical unit <code>^b/b<a>/b<b>$</code>, the tagger assigns the analysis string <code>b<a></code> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} &= P(r|a)P(a) \\
&= P(\texttt{b}|\texttt{<a>})P(\texttt{<a>})~\text{,}
\end{align}
</math>

with both probabilities estimated from the corpus with additive smoothing.
Note that the smoothed frequency counts the analysis string being scored. For example, the tagger would assign the known analysis string <code>a<a></code> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} &= P(\texttt{a}|\texttt{<a>})P(\texttt{<a>})~\text{.}
\end{align}
</math>
The tagger assigns the analysis string <code>b<b></code> a score of