Unigram tagger
Revision as of 21:40, 15 January 2016
<code>apertium-tagger</code> from [https://github.com/m5w/apertium m5w/apertium] supports all the unigram models from [http://coltekin.net/cagri/papers/trmorph-tools.pdf A set of open-source tools for Turkish natural language processing].
==Installation==
First, install all prerequisites. See [[Installation#If you want to add language data / do more advanced stuff]].
Then, replace <code><directory></code> with the directory you'd like to clone m5w/apertium into and clone the repository.

<pre>
git clone https://github.com/m5w/apertium.git <directory>
</pre>
Then, see [[Minimal installation from SVN#Set up environment]]. Finally, configure, build, and install m5w/apertium. See [[Minimal installation from SVN#Configure, build, and install]].
==Usage==
See <code>apertium-tagger -h</code>.
===Training a Model on a Hand-Tagged Corpus===
First, get a hand-tagged corpus as one would for all other models.
<pre>
$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
</pre>
''Example 2.1.1: a Hand-Tagged Corpus for'' <code>apertium-tagger</code>
Then, replace <code>MODEL</code> with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace <code>SERIALISED_BASIC_TAGGER</code> with the filename to which you'd like to write the model, and train the tagger.

<pre>
$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt
</pre>
===Disambiguate===
Either write input to a file or pipe it to the tagger.
<pre>
$ cat raw.txt
^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$
</pre>
''Example 2.2.1: Input for'' <code>apertium-tagger</code>
Replace <code>MODEL</code> with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace <code>SERIALISED_BASIC_TAGGER</code> with the file to which you wrote the unigram model, and disambiguate the input.
<pre>
$ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
^a/a<b>$
^aa/a<b>+a<b>$
$ echo '^a/a<a>/a<b>/a<c>$ ^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
^a/a<b>$ ^aa/a<b>+a<b>$
</pre>
==Unigram Models==
See section 5.3 of "A set of open-source tools for Turkish natural language processing."
===Model 1===
See section 5.3.1 of "A set of open-source tools for Turkish natural language processing."
This model assigns each analysis string a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} &= f(T)~\text{,}
\end{align}
</math>

<math style="padding-left:1cm;">
\begin{align}
f(T) &= \mathrm{tokenCount\_T} + 1~\text{,}
\end{align}
</math>

with additive smoothing, where <math>\mathrm{tokenCount\_T}</math> is the frequency of the analysis string <math>T</math> in the corpus.
Consider the following corpus.
<pre>
$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
</pre>
''Example 3.1.1:'' <code>handtagged.txt</code> '': A Hand-Tagged Corpus for'' <code>apertium-tagger</code>
Given the lexical unit <code>^a/a<a>/a<b>/a<c>$</code>, the tagger assigns the analysis string <code>a<a></code> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} &= \mathrm{tokenCount\_T} + 1 \\
&= 1 + 1 \\
&= 2~\text{,}
\end{align}
</math>

<math style="padding-left:1cm;">
\begin{align}
\text{where}~&\mathrm{tokenCount\_T}~\text{is the frequency of}~T~\text{in the corpus}~\text{.}
\end{align}
</math>
The tagger then assigns the analysis string <code>a<b></code> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} &= 2 + 1 \\
&= 3
\end{align}
</math>
and the unknown analysis string <code>a<c></code> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} &= 0 + 1 \\
&= 1~\text{.}
\end{align}
</math>
If <code>./autogen.sh</code> is passed the option <code>--enable-debug</code>, the tagger prints such calculations to standard error.
<pre>
$ ./autogen.sh --enable-debug
$ make
$ echo '^a/a<a>/a<b>/a<c>$' | apertium-tagger -gu 1 SERIALISED_BASIC_TAGGER
score("a<a>") == 2 == 2.000000000000000000
score("a<b>") == 3 == 3.000000000000000000
score("a<c>") == 1 == 1.000000000000000000
^a<b>$
</pre>
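The scoring rule above is simple enough to sketch in a few lines. The following is a hypothetical Python re-implementation of model 1's additive smoothing (not code from <code>apertium-tagger</code> itself); the corpus frequencies are taken from Example 3.1.1.

```python
# Hypothetical sketch of model 1: score(T) = tokenCount_T + 1, where
# tokenCount_T is the frequency of the analysis string T in the corpus.
from collections import Counter

# Analysis-string frequencies from Example 3.1.1 (^a/a<a>$ ^a/a<b>$ ^a/a<b>$).
frequencies = Counter({"a<a>": 1, "a<b>": 2})

def score(analysis: str) -> int:
    # Additive smoothing: an unknown analysis string still scores 0 + 1 = 1.
    return frequencies[analysis] + 1

# The lexical unit ^a/a<a>/a<b>/a<c>$ offers three analysis strings.
for analysis in ("a<a>", "a<b>", "a<c>"):
    print(f'score("{analysis}") == {score(analysis)}')
```

As in the debug transcript, the scores come out as 2, 3, and 1, so the tagger selects <code>a<b></code>.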
====Training on Corpora with Ambiguous Lexical Units====
Consider the following corpus.
<pre>
$ cat handtagged.txt
^a/a<a>$
^a/a<a>/a<b>$
^a/a<b>$
^a/a<b>$
</pre>
''Example 3.1.1.1: a Hand-Tagged Corpus for'' <code>apertium-tagger</code>
The probabilities of <code>a<a></code> and <code>a<b></code> are both one half for the ambiguous lexical unit <code>^a/a<a>/a<b>$</code>. However, all unigram models store frequencies as [http://en.cppreference.com/w/cpp/types/size_t <code>std::size_t</code>], an unsigned integer type.
To account for this, the tagger stores the LCM of all lexical units' sizes, where a lexical unit's size is the size of its analysis vector. It initializes this value to one, expecting unambiguous lexical units. The LCM, one, is divisible by the size of this corpus's first lexical unit, <code>^a/a<a>$</code>, one, so the tagger increments the frequency of its analysis, <code>a<a></code>, by

<math style="padding-left:1cm;">
\begin{align}
\frac{\mathrm{LCM}}{\mathrm{size}} &= \frac{1}{1} \\
&= 1~\text{.}
\end{align}
</math>
The LCM, one, isn't divisible by the size of the next lexical unit, <code>^a/a<a>/a<b>$</code>, two. Therefore, the tagger first multiplies the LCM, one, by the size, two, to yield two. Then, the tagger multiplies the frequency of <code>a<a></code> by the size, also yielding two. Finally, the tagger increments the frequency of each of this lexical unit's analyses, <code>a<a></code> and <code>a<b></code>, by

<math style="padding-left:1cm;">
\begin{align}
\frac{\mathrm{LCM}}{\mathrm{size}} &= \frac{2}{2} \\
&= 1~\text{.}
\end{align}
</math>
The frequency of <code>a<a></code> is then three, and the frequency of <code>a<b></code> is one.
The tagger then increments the frequency of the next lexical unit's analysis, <code>a<b></code>, by

<math style="padding-left:1cm;">
\begin{align}
\frac{\mathrm{LCM}}{\mathrm{size}} &= \frac{2}{1} \\
&= 2~\text{.}
\end{align}
</math>
After doing the same for the last lexical unit, the frequency of <code>a<a></code> is three and the frequency of <code>a<b></code> is five.
Each model implements functions to increment analyses and multiply previous ones, so this method works for models 2 and 3 as well.
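The bookkeeping above can be sketched as follows. This is a hypothetical Python rendering of the LCM-scaled counting, not <code>apertium-tagger</code>'s actual implementation; it reproduces the frequencies derived in the walkthrough.

```python
# Hypothetical sketch of LCM-scaled frequency counting: frequencies stay
# integral (std::size_t in apertium-tagger) even for ambiguous lexical units.
from collections import Counter
from math import gcd

def train(corpus):
    lcm = 1                  # LCM of the sizes of all lexical units seen so far
    frequencies = Counter()
    for analyses in corpus:  # one lexical unit = its vector of analysis strings
        size = len(analyses)
        if lcm % size != 0:
            new_lcm = lcm * size // gcd(lcm, size)
            scale = new_lcm // lcm
            for analysis in frequencies:      # rescale all previous frequencies
                frequencies[analysis] *= scale
            lcm = new_lcm
        for analysis in analyses:             # each analysis gets an equal share
            frequencies[analysis] += lcm // size
    return frequencies

# Example 3.1.1.1: ^a/a<a>$ ^a/a<a>/a<b>$ ^a/a<b>$ ^a/a<b>$
corpus = [["a<a>"], ["a<a>", "a<b>"], ["a<b>"], ["a<b>"]]
print(train(corpus))  # Counter({'a<b>': 5, 'a<a>': 3})
```

The final frequencies, three for <code>a<a></code> and five for <code>a<b></code>, match the walkthrough above.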
If one passes the <code>-d</code> option to <code>apertium-tagger</code>, the tagger prints warnings about ambiguous analyses in corpora to stderr.
<pre>
$ apertium-tagger -ds 0 -u 1 handtagged.txt
apertium-tagger: handtagged.txt: 2:13: unexpected analysis "a<b>" following analysis "a<a>"
^a/a<a>/a<b>$
            ^
</pre>
===Model 2===
See section 5.3.2 of "A set of open-source tools for Turkish natural language processing."
Consider Example 3.1.1.
The tag string <code><b></code> is twice as frequent as <code><a></code>. However, model 1 scores <code>b<a></code> and <code>b<b></code> equally because neither analysis string appears in the corpus.
This model splits each analysis string into a root, <math>r</math>, and the part of the analysis string that isn't the root, <math>a</math>. An analysis string's root is its first lemma. The <math>r</math> of <code>a<b>+c<d></code> is <code>a</code>; its <math>a</math> is <code><b>+c<d></code>. The tagger assigns each analysis string a score of <math>P(r|a)P(a)</math> with additive smoothing. (See [1]. Without additive smoothing, this model would be the same as model 1.) The tagger assigns higher scores to unknown analysis strings with frequent <math>a</math> than to unknown analysis strings with infrequent <math>a</math>.
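The root split can be sketched as a hypothetical Python helper, assuming the root is simply everything before the first tag:

```python
# Hypothetical sketch of model 2's split of an analysis string into its
# root r (the first lemma) and the remainder a (everything after it).
def split_analysis(analysis: str) -> tuple[str, str]:
    i = analysis.index("<")  # the root ends where the first tag begins
    return analysis[:i], analysis[i:]

print(split_analysis("a<b>+c<d>"))  # ('a', '<b>+c<d>')
print(split_analysis("b<a>"))       # ('b', '<a>')
```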
Given the lexical unit <code>^b/b<a>/b<b>$</code>, the tagger assigns the analysis string <code>b<a></code> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} &= P(r|a)P(a) \\
&= P(\texttt{b}|\texttt{<a>})P(\texttt{<a>})~\text{,}
\end{align}
</math>

with both probabilities estimated from the corpus with additive smoothing.
Note that the smoothed frequency counts the analysis string being scored. For example, the tagger would assign the known analysis string <code>a<a></code> a score of

<math style="padding-left:1cm;">
\begin{align}
\mathrm{score} &= P(\texttt{a}|\texttt{<a>})P(\texttt{<a>})~\text{.}
\end{align}
</math>
The tagger assigns the analysis string <code>b<b></code> a score of