Unigram tagger

apertium-tagger from “m5w/apertium”^[1] supports all the unigram models from “A set of open-source tools for Turkish natural language processing.”^[2]

Installation

First, install all prerequisites. See “If you want to add language data / do more advanced stuff.”^[3]

Then, replace <directory> with the directory you’d like to clone “m5w/apertium”^[1] into and clone the repository.

git clone https://github.com/m5w/apertium.git <directory>

Then, configure your environment^[4] and finally configure, build, and install^[5] “m5w/apertium.”^[1]

Usage

See apertium-tagger -h .

Training a Model on a Hand-Tagged Corpus

First, get a hand-tagged corpus as you would for any non-unigram model.

$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$

Example 2.1.1: handtagged.txt : a Hand-Tagged Corpus for apertium-tagger

Then, replace MODEL with the unigram model from “A set of open-source tools for Turkish natural language processing”^[2] you’d like to use, replace SERIALISED_BASIC_TAGGER with the filename to which you’d like to write the model, and train the tagger.

$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt

Disambiguation

Either write your input to a file or pipe it to the tagger.

$ cat raw.txt
^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$

Example 2.2.1: raw.txt : Input for apertium-tagger

Replace MODEL with the unigram model from “A set of open-source tools for Turkish natural language processing”^[2] you’d like to use, replace SERIALISED_BASIC_TAGGER with the file to which you wrote the unigram model, and disambiguate the input.

$ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
^a/a<b>$
^aa/a<b>+a<b>$
$ echo '^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
^a/a<b>$
^aa/a<b>+a<b>$
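The input format shown above can be read with a few lines of code. The following is a minimal sketch, not part of apertium-tagger: the function name parse_lexical_unit is hypothetical, and the sketch assumes that no surface form or analysis contains an escaped / , ^ or $ .

```python
def parse_lexical_unit(text):
    """Split '^surface/analysis1/analysis2$' into (surface, [analyses]).

    Hypothetical helper; assumes no escaped '/', '^' or '$' in the input.
    """
    text = text.strip()
    if not (text.startswith("^") and text.endswith("$")):
        raise ValueError("expected a lexical unit of the form ^.../...$")
    # The first field is the surface form; the rest are candidate analyses.
    surface, *analyses = text[1:-1].split("/")
    return surface, analyses


surface, analyses = parse_lexical_unit("^a/a<a>/a<b>/a<c>$")
print(surface, analyses)
```

A lexical unit with exactly one analysis is unambiguous; the disambiguation step reduces every lexical unit to that form.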

Unigram Models

See section 5.3 of “A set of open-source tools for Turkish natural language processing.”^[2]

Model 1

See section 5.3.1 of “A set of open-source tools for Turkish natural language processing.”^[2]

This model assigns each analysis string a score of

${\begin{aligned}\mathrm {score} &=f(T)~{\text{,}}\end{aligned}}$

${\begin{aligned}{\text{where}}~&T~{\text{is the analysis string}}\end{aligned}}$

with additive smoothing.

Consider the following corpus.

$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$

Example 3.1.1: handtagged.txt : A Hand-Tagged Corpus for apertium-tagger

Given the lexical unit ^a/a<a>/a<b>/a<c>$ , the tagger assigns the analysis string a<a> a score of

${\begin{aligned}\mathrm {score} &=\mathrm {tokenCount\_T} +1\\&=1+1\\&=2~{\text{,}}\end{aligned}}$

${\begin{aligned}{\text{where}}~&\mathrm {tokenCount\_T} ~{\text{is the frequency of}}~T~{\text{in the corpus}}~{\text{.}}\end{aligned}}$

The tagger then assigns the analysis string a<b> a score of

${\begin{aligned}\mathrm {score} &=2+1\\&=3\end{aligned}}$

and the unknown analysis string a<c> a score of

${\begin{aligned}\mathrm {score} &=0+1\\&=1~{\text{.}}\end{aligned}}$

If ./autogen.sh is passed the option --enable-debug , the tagger prints such calculations to standard error.

$ ./autogen.sh --enable-debug
$ make
$ echo '^a/a<a>/a<b>/a<c>$' | apertium-tagger -gu 1 SERIALISED_BASIC_TAGGER


score("a<a>") ==
  2 ==
  2.000000000000000000
score("a<b>") ==
  3 ==
  3.000000000000000000
score("a<c>") ==
  1 ==
  1.000000000000000000
^a<b>$
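The model 1 scoring described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the tagger's C++ implementation: the function names are hypothetical, the corpus is assumed to contain only unambiguous lexical units, and escaping in the stream format is ignored.

```python
from collections import Counter


def train_model_1(corpus_lines):
    """Count how often each analysis string occurs in a hand-tagged corpus."""
    counts = Counter()
    for line in corpus_lines:
        # '^a/a<b>$' -> surface 'a', analysis 'a<b>'
        _surface, analysis = line.strip()[1:-1].split("/", 1)
        counts[analysis] += 1
    return counts


def score_model_1(counts, analysis):
    """Add-one smoothed frequency: tokenCount_T + 1."""
    return counts[analysis] + 1


def disambiguate(counts, lexical_unit):
    """Keep only the highest-scoring analysis of a lexical unit."""
    surface, *analyses = lexical_unit.strip()[1:-1].split("/")
    best = max(analyses, key=lambda a: score_model_1(counts, a))
    return f"^{surface}/{best}$"


counts = train_model_1(["^a/a<a>$", "^a/a<b>$", "^a/a<b>$"])
for analysis in ("a<a>", "a<b>", "a<c>"):
    print(analysis, score_model_1(counts, analysis))
print(disambiguate(counts, "^a/a<a>/a<b>/a<c>$"))
```

With the three-line corpus of Example 3.1.1 the sketch reproduces the scores 2, 3 and 1 derived above.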

Training on Corpora with Ambiguous Lexical Units

Consider the following corpus.

$ cat handtagged.txt
^a/a<a>$
^a/a<a>/a<b>$
^a/a<b>$
^a/a<b>$

Example 3.1.1.1: handtagged.txt : a Hand-Tagged Corpus for apertium-tagger

The tagger expects lexical units of 1 analysis string, or lexical units of size 1. However, the size of the lexical unit ^a/a<a>/a<b>$ is 2. For this lexical unit,

$P({\texttt {a<a>}})=P({\texttt {a<b>}})={\frac {1}{2}}~{\text{;}}$

the tagger must effectively increment the frequency of both analysis strings by 0.500000000000000000 . However, the tagger can’t increment the analysis strings’ frequencies by a non-integral number because model 1 represents analysis strings’ frequencies as std::size_t ^[6].

Instead, the tagger multiplies all the stored analysis strings’ frequencies by this lexical unit’s size and increments the frequency of each of this lexical unit’s analysis strings by 1.

${\begin{aligned}f({\texttt {a<a>}})&=(1)(2)+1&f({\texttt {a<b>}})&=(0)(2)+1\\&=2+1&&=0+1\\&=3&&=1\end{aligned}}$

The tagger could then increment the analysis strings’ frequencies of another lexical unit of size 2 without multiplying any of the stored analysis strings’ frequencies. To account for this, the tagger stores the least common multiple of all lexical units’ sizes; only if the LCM isn’t divisible by a lexical unit’s size does the tagger multiply all the analysis strings’ frequencies.

After incrementing the analysis strings’ frequencies of the lexical unit ^a/a<a>/a<b>$, the tagger increments the frequency of the analysis string a<b> of the lexical unit ^a/a<b>$ by

${\begin{aligned}{\frac {\mathrm {LCM} }{\mathrm {TheLexicalUnit.size} }}={\frac {2}{1}}=2~{\text{.}}\end{aligned}}$

If the tagger gets another lexical unit of size 2, it would increment the frequency of each of the lexical unit’s analysis strings by

${\begin{aligned}{\frac {\mathrm {LCM} }{\mathrm {TheLexicalUnit.size} }}={\frac {2}{2}}=1~{\text{,}}\end{aligned}}$

and if it gets a lexical unit of size 3, it would multiply all the analysis strings’ frequencies by 3 and then increment the frequency of each of the lexical unit’s analysis strings by

${\begin{aligned}{\frac {\mathrm {LCM} }{\mathrm {TheLexicalUnit.size} }}={\frac {6}{3}}=2~{\text{.}}\end{aligned}}$

Each model supports functions to increment all their stored analysis strings’ frequencies, so models 2 and 3 support this algorithm as well.
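The integer-only bookkeeping described above can be sketched as follows. This is a sketch under assumptions, not the tagger's code: the class name IntegerFrequencies is hypothetical, and when the stored LCM is not divisible by a lexical unit's size the sketch multiplies the stored frequencies by new LCM / old LCM. That factor equals the lexical unit's size in every example in this section; whether the tagger uses the same factor in other cases is an assumption.

```python
from collections import Counter
from math import gcd


class IntegerFrequencies:
    """Fractional counts stored as integers scaled by a running LCM."""

    def __init__(self):
        self.lcm = 1              # LCM of all lexical unit sizes seen so far
        self.counts = Counter()   # analysis string -> scaled frequency

    def add_lexical_unit(self, analyses):
        size = len(analyses)
        if self.lcm % size != 0:
            # Rescale every stored frequency so that 1/size is an integer step.
            new_lcm = self.lcm * size // gcd(self.lcm, size)
            factor = new_lcm // self.lcm
            for analysis in self.counts:
                self.counts[analysis] *= factor
            self.lcm = new_lcm
        # Each analysis of this unit receives weight 1/size, i.e. LCM/size.
        for analysis in analyses:
            self.counts[analysis] += self.lcm // size


model = IntegerFrequencies()
for unit in (["a<a>"], ["a<a>", "a<b>"], ["a<b>"], ["a<b>"]):
    model.add_lexical_unit(unit)
print(model.lcm, dict(model.counts))
```

Dividing each stored frequency by the LCM recovers the fractional counts: for the corpus of Example 3.1.1.1, 3/2 for a<a> and 5/2 for a<b> .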

TODO: If one passes the -d option to apertium-tagger , the tagger prints warnings about ambiguous analyses in corpora to stderr.

$ apertium-tagger -ds 0 -u 1 handtagged.txt
apertium-tagger: handtagged.txt: 2:13: unexpected analysis "a<b>" following analysis "a<a>"
^a/a<a>/a<b>$
            ^

File Format

The tagger represents this model as std::map<Analysis, std::size_t> .^[7]^[6]^[8]

Given the hand-tagged corpus Example 2.1.1: handtagged.txt , the tagger represents the model as

${\begin{aligned}&{\texttt {std::map<Analysis,std::size\_t>Model}}\\&\qquad {\begin{aligned}&{\texttt {a<a>}}&1\\&{\texttt {a<b>}}&2\\&{\texttt {a<a>+a<a>}}&1\\&{\texttt {a<a>+a<b>}}&2\\&{\texttt {a<b>+a<a>}}&3\\&{\texttt {a<b>+a<b>}}&4~{\text{.}}\end{aligned}}\end{aligned}}$

The tagger then serialises the model as

0x01 // size of the number of unique analysis strings in bytes
0x06 // number of unique analysis strings
0x01 // size of the number of morphemes in the analysis string a<a> in bytes
0x01 // number of morphemes in the analysis string a<a>
0x01 // size of the length of the lemma of the first morpheme of the analysis string a<a> in bytes
0x01 // length of the lemma of the first morpheme of the analysis string a<a>
0x01 // size of the first character of the lemma of the first morpheme of the analysis string a<a> in bytes
0x61 // first character of the lemma of the first morpheme of the analysis string a<a>
0x01 // size of the number of tags in the first morpheme of the analysis string a<a> in bytes
0x01 // number of tags in the first morpheme of the analysis string a<a>
0x01 // size of the length of the first tag of the first morpheme of the analysis string a<a> in bytes
0x01 // length of the first tag of the first morpheme of the analysis string a<a>
0x01 // size of the first character of the first tag of the first morpheme of the analysis string a<a> in bytes
0x61 // first character of the first tag of the first morpheme of the analysis string a<a>
0x01 // size of the frequency of the analysis string a<a> in bytes
0x01 // frequency of the analysis string a<a>

0x01 // size of the number of morphemes in the analysis string a<a>+a<a> in bytes
0x02 // number of morphemes in the analysis string a<a>+a<a>
. . .

or, more concisely, as

0000000: 0106 0101 0101 0161 0101 0101 0161 0101  .......a.....a..
0000010: 0102 0101 0161 0101 0101 0161 0101 0161  .....a.....a...a
0000020: 0101 0101 0161 0101 0102 0101 0161 0101  .....a.......a..
0000030: 0101 0161 0101 0161 0101 0101 0162 0102  ...a...a.....b..
0000040: 0101 0101 0161 0101 0101 0162 0102 0102  .....a.....b....
0000050: 0101 0161 0101 0101 0162 0101 0161 0101  ...a.....b...a..
0000060: 0101 0161 0103 0102 0101 0161 0101 0101  ...a.......a....
0000070: 0162 0101 0161 0101 0101 0162 0104       .b...a.....b..

.
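The byte layout above can be reproduced with a short script. This is a sketch inferred from the annotated listing and the hex dump, not the tagger's serialiser: all function names are hypothetical, integers wider than one byte are assumed to be written most significant byte first, and entries are assumed to be ordered by comparing their (lemma, tags) sequences, which matches the order seen in the dump.

```python
def encode_int(value):
    """Write an integer as <size in bytes> followed by its bytes.

    Assumption: big-endian, at least one byte (the dump only shows 1-byte values).
    """
    size = max(1, (value.bit_length() + 7) // 8)
    return bytes([size]) + value.to_bytes(size, "big")


def encode_string(text):
    """Write a length-prefixed string, one size-prefixed code point per character."""
    out = encode_int(len(text))
    for character in text:
        out += encode_int(ord(character))
    return out


def parse_analysis(analysis):
    """'a<a>+a<b>' -> (('a', ('a',)), ('a', ('b',)))."""
    morphemes = []
    for morpheme in analysis.split("+"):
        lemma, _, rest = morpheme.partition("<")
        tags = tuple(rest[:-1].split("><")) if rest else ()
        morphemes.append((lemma, tags))
    return tuple(morphemes)


def serialise(model):
    """Serialise a {analysis string: frequency} mapping."""
    entries = sorted((parse_analysis(a), f) for a, f in model.items())
    out = encode_int(len(entries))
    for morphemes, frequency in entries:
        out += encode_int(len(morphemes))
        for lemma, tags in morphemes:
            out += encode_string(lemma)
            out += encode_int(len(tags))
            for tag in tags:
                out += encode_string(tag)
        out += encode_int(frequency)
    return out


model = {"a<a>": 1, "a<b>": 2, "a<a>+a<a>": 1,
         "a<a>+a<b>": 2, "a<b>+a<a>": 3, "a<b>+a<b>": 4}
data = serialise(model)
print(len(data), data.hex())
```

For the model trained on Example 2.1.1 the sketch yields 126 bytes that match the dump above.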

Model 2

See section 5.3.2 of “A set of open-source tools for Turkish natural language processing.”^[2]

Consider Example 3.1.1: handtagged.txt .

The tag string <b> is twice as frequent as <a>. However, model 1 scores b<a> and b<b> equally because neither analysis string appears in the corpus.

This model splits each analysis string into a root, $r$ , and the part of the analysis string that isn’t the root, $a$ . An analysis string’s root is its first lemma. The $r$ of a<b>+c<d> is a , and its $a$ is <b>+c<d> . The tagger assigns each analysis string a score of $P(r|a)f(a)$ with add-one smoothing. (Without additive smoothing, this model would be the same as model 1.)^[9] The tagger assigns higher scores to unknown analysis strings with frequent $a$ than to unknown analysis strings with infrequent $a$ .

Given the lexical unit ^b/b<a>/b<b>$, the tagger assigns the analysis string b<a> a score of

${\begin{aligned}\mathrm {score} &={\frac {(\mathrm {tokenCount\_r\_a} +1)(\mathrm {tokenCount\_a} +1)}{\mathrm {tokenCount\_a} +1+\mathrm {typeCount\_a} }}\\&={\frac {(0+1)(1+1)}{1+1+2}}\\&={\frac {(1)(2)}{4}}\\&={\frac {1}{2}}~{\text{,}}\end{aligned}}$

${\begin{aligned}{\text{where}}~&\mathrm {tokenCount\_r\_a} ~{\text{is the frequency of the}}~r,a~{\text{in the corpus,}}\\&\mathrm {tokenCount\_a} ~{\text{is the frequency of the}}~a~{\text{in the corpus,}}\\{\text{and}}~&\mathrm {typeCount\_a} ~{\text{is the size of the parameter vector of all}}~r~{\text{preceding the}}~a~{\text{.}}\end{aligned}}$

Note that $\mathrm {typeCount\_a}$ counts the analysis string being scored. For example, the tagger would assign the known analysis string a<a> a score of

${\begin{aligned}\mathrm {score} &={\frac {(1+1)(1+1)}{1+1+1}}\\&={\frac {(2)(2)}{3}}\\&={\frac {4}{3}}~{\text{.}}\end{aligned}}$

The tagger assigns the analysis string b<b> a score of

${\begin{aligned}\mathrm {score} &={\frac {(0+1)(2+1)}{2+1+2}}\\&={\frac {(1)(3)}{5}}\\&={\frac {3}{5}}~{\text{.}}\end{aligned}}$
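The model 2 score can be checked with exact fractions. The following is a sketch, not the tagger's implementation: the names split_root and Model2 are hypothetical, the root is assumed to end at the first < or + , and typeCount_a is computed as the number of distinct roots seen with the given $a$ , including the root being scored.

```python
import re
from collections import Counter, defaultdict
from fractions import Fraction


def split_root(analysis):
    """'a<b>+c<d>' -> ('a', '<b>+c<d>'): first lemma and everything after it."""
    root = re.match(r"[^<+]*", analysis).group(0)
    return root, analysis[len(root):]


class Model2:
    def __init__(self, analyses):
        self.root_rest = Counter()         # tokenCount_r_a
        self.rest = Counter()              # tokenCount_a
        self.roots_of = defaultdict(set)   # roots observed before each a
        for analysis in analyses:
            root, rest = split_root(analysis)
            self.root_rest[root, rest] += 1
            self.rest[rest] += 1
            self.roots_of[rest].add(root)

    def score(self, analysis):
        root, rest = split_root(analysis)
        # typeCount_a also counts the root of the analysis being scored.
        type_count = len(self.roots_of.get(rest, set()) | {root})
        numerator = (self.root_rest[root, rest] + 1) * (self.rest[rest] + 1)
        return Fraction(numerator, self.rest[rest] + 1 + type_count)


model = Model2(["a<a>", "a<b>", "a<b>"])
for analysis in ("b<a>", "a<a>", "b<b>"):
    print(analysis, model.score(analysis))
```

Trained on Example 3.1.1, the sketch gives 1/2 for b<a> , 4/3 for a<a> and 3/5 for b<b> , in agreement with the worked scores above.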

File Format

Model 3

See section 5.3.3 of “A set of open-source tools for Turkish natural language processing.”^[2]

Consider Example 3.1.1: handtagged.txt .

The morpheme a<b> is twice as frequent as the morpheme a<a> . However, model 2 scores the analysis strings a<a>+a<a> and a<b>+a<a> equally because the $a$ of neither appears in the corpus.

This model splits each analysis string into a root, $r~{\text{,}}$ a first inflection, $i_{0}~{\text{,}}$ and a sequence of derivation-inflection pairs, $(d_{1},i_{1})...(d_{n},i_{n})~{\text{.}}$ The $r$ of the analysis string a<b>+c<d> is a , its $i_{0}$ is <b> , and its $(d_{1},i_{1})...(d_{n},i_{n})$ is c<d> , where its $d_{1}$ is c , and its $i_{1}$ is <d> . The tagger assigns each analysis string a score of $P(r|i_{0})f(i_{0})\prod _{i=1}^{n}P(d_{i}|i_{i-1})P(i_{i}|d_{i})$ with add-one smoothing. The tagger assigns higher scores to unknown analysis strings with frequent $r,i_{0}$ than to unknown analysis strings with infrequent $r,i_{0}~{\text{.}}$

Given the lexical unit ^aa/a<a>+a<a>/a<b>+a<a>$ , the tagger assigns the analysis string a<a>+a<a> a score of

${\begin{aligned}\mathrm {score} =\;&{\frac {(\mathrm {tokenCount\_r\_i\_0} +1)(\mathrm {tokenCount\_i\_0} +1)}{\mathrm {tokenCount\_i\_0} +1+\mathrm {typeCount\_i\_0} }}\\&{\begin{aligned}\prod _{i=1}^{n}\,&{\frac {\mathrm {tokenCount\_d\_i} (d_{n},i_{n-1})+1}{\mathrm {tokenCount\_i} (i_{n-1})+1+\mathrm {typeCount\_i} (i_{n-1},d_{n})}}\\&{\frac {\mathrm {tokenCount\_i\_d} (i_{n},d_{n})+1}{\mathrm {tokenCount\_d} (d_{n})+1+\mathrm {typeCount\_d} (d_{n},i_{n})}}\end{aligned}}\\=\;&{\frac {(1+1)(1+1)}{1+1+1}}{\frac {0+1}{0+1+1}}{\frac {0+1}{0+1+1}}\\=\;&{\frac {(2)(2)}{3}}{\frac {1}{2}}{\frac {1}{2}}\\=\;&{\frac {4}{3}}{\frac {1}{4}}\\=\;&{\frac {1}{3}}~{\text{,}}\end{aligned}}$

${\begin{aligned}{\text{where}}~&\mathrm {tokenCount\_r\_i\_0} ~{\text{is the frequency of the}}~r,i_{0}~{\text{in the corpus}}~{\text{,}}\\&\mathrm {tokenCount\_i\_0} ~{\text{is the frequency of the}}~i_{0}~{\text{in the corpus}}~{\text{,}}\\&\mathrm {typeCount\_i\_0} ~{\text{is the size of the parameter vector of}}~r~{\text{preceding the}}~i_{0}~{\text{,}}\\&\mathrm {tokenCount\_d\_i} (d_{n},i_{n-1})~{\text{is the frequency of the}}~d_{n}~{\text{following the}}~i_{n-1}~{\text{in the corpus}}~{\text{,}}\\&\mathrm {tokenCount\_i} (i_{n-1})~{\text{is the frequency of non-final}}~i_{n-1}~{\text{in the corpus}}~{\text{,}}\\&\mathrm {typeCount\_i} (i_{n-1},d_{n})~{\text{is the size of the parameter vector of}}~d~{\text{following the}}~i_{n-1}~{\text{,}}\\&\mathrm {tokenCount\_i\_d} (i_{n},d_{n})~{\text{is the frequency of the}}~i_{n}~{\text{following the}}~d_{n}~{\text{in the corpus}}~{\text{,}}\\&\mathrm {tokenCount\_d} (d_{n})~{\text{is the frequency of the}}~d_{n}~{\text{in the corpus}}~{\text{,}}\\{\text{and}}~&\mathrm {typeCount\_d} (d_{n},i_{n})~{\text{is the size of the parameter vector of}}~i~{\text{following the}}~d_{n}~{\text{.}}\end{aligned}}$

The tagger assigns the analysis string a<b>+a<a> a score of

${\begin{aligned}\mathrm {score} =\;&{\frac {(2+1)(2+1)}{2+1+1}}{\frac {0+1}{0+1+1}}{\frac {0+1}{0+1+1}}\\=\;&{\frac {(3)(3)}{4}}{\frac {1}{2}}{\frac {1}{2}}\\=\;&{\frac {9}{4}}{\frac {1}{4}}\\=\;&{\frac {9}{16}}~{\text{.}}\end{aligned}}$
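The model 3 score can be reproduced with exact fractions. This is a sketch under assumptions, not the tagger's implementation: the names split_analysis, smoothed and Model3 are hypothetical, morphemes are assumed to be separated by + with each lemma ending at its first < , the derivation counts are assumed to exclude roots (as the zero counts in the worked example imply), and every type count includes the item being scored.

```python
import re
from collections import Counter, defaultdict
from fractions import Fraction


def split_analysis(analysis):
    """'a<b>+c<d>' -> ('a', '<b>', [('c', '<d>')]), i.e. r, i_0, [(d_k, i_k)]."""
    parts = [re.match(r"([^<]*)(.*)", m).groups() for m in analysis.split("+")]
    (root, first_inflection), pairs = parts[0], parts[1:]
    return root, first_inflection, pairs


def smoothed(count, total, seen, candidate):
    """(count + 1) / (total + 1 + number of types, counting the candidate)."""
    return Fraction(count + 1, total + 1 + len(seen | {candidate}))


class Model3:
    def __init__(self, analyses):
        self.r_i0, self.i0, self.roots_of = Counter(), Counter(), defaultdict(set)
        self.d_after_i, self.nonfinal_i, self.ds_after = Counter(), Counter(), defaultdict(set)
        self.i_after_d, self.d, self.is_after = Counter(), Counter(), defaultdict(set)
        for analysis in analyses:
            r, i0, pairs = split_analysis(analysis)
            self.r_i0[r, i0] += 1
            self.i0[i0] += 1
            self.roots_of[i0].add(r)
            previous = i0
            for d, i in pairs:
                self.d_after_i[previous, d] += 1   # d following the previous inflection
                self.nonfinal_i[previous] += 1     # previous inflection was non-final
                self.ds_after[previous].add(d)
                self.i_after_d[d, i] += 1          # inflection following this derivation
                self.d[d] += 1
                self.is_after[d].add(i)
                previous = i

    def score(self, analysis):
        r, i0, pairs = split_analysis(analysis)
        # P(r | i_0) f(i_0), both add-one smoothed.
        result = smoothed(self.r_i0[r, i0], self.i0[i0],
                          self.roots_of.get(i0, set()), r) * (self.i0[i0] + 1)
        previous = i0
        for d, i in pairs:
            result *= smoothed(self.d_after_i[previous, d], self.nonfinal_i[previous],
                               self.ds_after.get(previous, set()), d)
            result *= smoothed(self.i_after_d[d, i], self.d[d],
                               self.is_after.get(d, set()), i)
            previous = i
        return result


model = Model3(["a<a>", "a<b>", "a<b>"])
for analysis in ("a<a>+a<a>", "a<b>+a<a>"):
    print(analysis, model.score(analysis))
```

Trained on Example 3.1.1, the sketch gives 1/3 for a<a>+a<a> and 9/16 for a<b>+a<a> , matching the two derivations above.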

File Format

Notes

[1] https://github.com/m5w/apertium
[2] http://coltekin.net/cagri/papers/trmorph-tools.pdf
[3] Installation#If you want to add language data / do more advanced stuff
[4] Minimal installation from SVN#Set up environment
[5] Minimal installation from SVN#Configure, build, and install
[6] http://en.cppreference.com/w/cpp/types/size_t
[7] http://en.cppreference.com/w/cpp/container/map
[8] https://github.com/m5w/apertium/blob/master/apertium/analysis.h
[9] ${\begin{aligned}\mathrm {score} &={\frac {(\mathrm {tokenCount\_r\_a} )(\mathrm {tokenCount\_a} )}{\mathrm {tokenCount\_a} }}\\&=\mathrm {tokenCount\_r\_a} =\mathrm {tokenCount\_T} \end{aligned}}$

