
Unigram tagger

apertium-tagger from “m5w/apertium”[1] supports all the unigram models from “A set of open-source tools for Turkish natural language processing.”[2]

Installation

First, install all prerequisites. See “If you want to add language data / do more advanced stuff.”[3]

Then, replace <directory> with the directory you’d like to clone “m5w/apertium”[1] into and clone the repository.

git clone https://github.com/m5w/apertium.git <directory>

Then, configure your environment[4] and finally configure, build, and install[5] “m5w/apertium.”[1]

Usage

See apertium-tagger --help.

Training a Model on a Hand-Tagged Corpus

First, get a hand-tagged corpus as you would for any non-unigram model.

$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$

Example 2.1.1: handtagged.txt: a Hand-Tagged Corpus for apertium-tagger

Then, replace MODEL with the unigram model from “A set of open-source tools for Turkish natural language processing”[2] you’d like to use, replace SERIALISED_BASIC_TAGGER with the filename to which you’d like to write the model, and train the tagger.

$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt
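
For example, to train model 1 and write it to a hypothetical file named model1:

$ apertium-tagger -s 0 -u 1 model1 handtagged.txt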

Disambiguation

Either write your input to a file or pipe it to the tagger.

$ cat raw.txt
^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$

Example 2.2.1: raw.txt: Input for apertium-tagger

Replace MODEL with the unigram model from “A set of open-source tools for Turkish natural language processing”[2] you’d like to use, replace SERIALISED_BASIC_TAGGER with the file to which you wrote the unigram model, and disambiguate the input.

$ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
^a/a<b>$
^aa/a<b>+a<b>$
$ echo '^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
^a/a<b>$
^aa/a<b>+a<b>$

Unigram Models

See section 5.3 of “A set of open-source tools for Turkish natural language processing.”[2]

Model 1

See section 5.3.1 of “A set of open-source tools for Turkish natural language processing.”[2]

This model assigns each analysis string a score of


\begin{align}
\mathrm{score} &= f(T)~\text{,}
\end{align}


\begin{align}
\text{where}~&T~\text{is the analysis string}
\end{align}

with additive smoothing.

Consider the following corpus.

$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$

Example 3.1.1: handtagged.txt: a Hand-Tagged Corpus for apertium-tagger

Given the lexical unit ^a/a<a>/a<b>/a<c>$, the tagger assigns the analysis string a<a> a score of


\begin{align}
\mathrm{score} &= \mathrm{tokenCount\_T} + 1\\
&= 1 + 1\\
&= 2~\text{,}
\end{align}


\begin{align}
\text{where}~&\mathrm{tokenCount\_T}~\text{is the frequency of}~T~\text{in the corpus}~\text{.}
\end{align}

The tagger then assigns the analysis string a<b> a score of


\begin{align}
\mathrm{score} &= 2 + 1\\
&= 3
\end{align}

and the unknown analysis string a<c> a score of


\begin{align}
\mathrm{score} &= 0 + 1\\
&= 1~\text{.}
\end{align}

If ./autogen.sh is passed the option --enable-debug, the tagger prints such calculations to standard error.

$ ./autogen.sh --enable-debug
$ make
$ echo '^a/a<a>/a<b>/a<c>$' | apertium-tagger -gu 1 SERIALISED_BASIC_TAGGER


score("a<a>") ==
2 ==
  2.000000000000000000
  score("a<b>") ==
3 ==
  3.000000000000000000
  score("a<c>") ==
1 ==
  1.000000000000000000
  ^a<b>$

Training on Corpora with Ambiguous Lexical Units

Consider the following corpus.

$ cat handtagged.txt
^a/a<a>$
^a/a<a>/a<b>$
^a/a<b>$
^a/a<b>$

Example 3.1.1.1: handtagged.txt: a Hand-Tagged Corpus for apertium-tagger

The tagger expects lexical units with exactly one analysis string each, that is, lexical units of size 1. However, the size of the lexical unit ^a/a<a>/a<b>$ is 2. For this lexical unit,


P(\texttt{a<a>}) = P(\texttt{a<b>}) = \frac12~\text{;}

the tagger must effectively increment the frequency of each analysis string by 0.5. However, it can’t increment frequencies by a non-integral number, because model 1 represents analysis strings’ frequencies as std::size_t.[6]

Instead, the tagger multiplies all the stored analysis strings’ frequencies by this lexical unit’s size and increments the frequency of each of this lexical unit’s analysis strings by 1.


\begin{align}
f(\texttt{a<a>}) &= (1)(2) + 1 = 2 + 1 = 3\\
f(\texttt{a<b>}) &= (0)(2) + 1 = 0 + 1 = 1
\end{align}

If the tagger then saw another lexical unit of size 2, it could increment that unit’s analysis strings’ frequencies without multiplying the stored frequencies again. To keep track of this, the tagger stores the least common multiple of all lexical units’ sizes seen so far; only if the LCM isn’t divisible by a lexical unit’s size does the tagger multiply all the analysis strings’ frequencies.
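
A sketch of this bookkeeping, again over plain std::wstring keys (the real tagger does this inside its model classes, and the variable names here are illustrative):

#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Frequencies = std::map<std::wstring, std::size_t>;

// LCM is the least common multiple of the sizes of all lexical units
// seen so far; every stored frequency is implicitly scaled by it.
void train(Frequencies &f, std::size_t &LCM,
           const std::vector<std::wstring> &TheAnalyses) {
  const std::size_t size = TheAnalyses.size(); // this lexical unit's size
  if (LCM % size != 0) {
    // The stored frequencies can't represent an increment of LCM / size,
    // so first rescale everything by this lexical unit's size.
    for (auto &pair : f)
      pair.second *= size;
    LCM *= size;
  }
  for (const auto &analysis : TheAnalyses)
    f[analysis] += LCM / size;
}

Initialised with LCM = 1, this reproduces the arithmetic above: the lexical unit ^a/a<a>/a<b>$ rescales the stored frequencies by 2 and adds 1 to each of its analysis strings, and the following ^a/a<b>$ adds 2/1 = 2.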

After incrementing the analysis strings’ frequencies of the lexical unit ^a/a<a>/a<b>$, the tagger increments the frequency of the analysis string a<b> of the lexical unit ^a/a<b>$ by


\begin{align}
\frac{\mathrm{LCM}}{\mathrm{TheLexicalUnit.size}} = \frac{2}{1} = 2~\text{.}
\end{align}

If the tagger gets another lexical unit of size 2, it would increment the frequency of each of the lexical unit’s analysis strings by


\begin{align}
\frac{\mathrm{LCM}}{\mathrm{TheLexicalUnit.size}} = \frac{2}{2} = 1~\text{,}
\end{align}

and if it gets a lexical unit of size 3, it would multiply all the analysis strings’ frequencies by 3 and then increment the frequency of each of the lexical unit’s analysis strings by


\begin{align}
\frac{\mathrm{LCM}}{\mathrm{TheLexicalUnit.size}} = \frac{6}{3} = 2~\text{.}
\end{align}

Each model supports functions to scale all of its stored analysis strings’ frequencies, so models 2 and 3 support this algorithm as well.

If one passes the -d option to apertium-tagger, the tagger prints warnings about ambiguous analyses in corpora to standard error.

$ apertium-tagger -ds 0 -u 1 handtagged.txt
apertium-tagger: handtagged.txt: 2:13: unexpected analysis "a<b>" following analysis "a<a>"
^a/a<a>/a<b>$
^

File Format

The tagger represents this model as std::map<Analysis, std::size_t> Model.[7][6][8][9]

It first serialises Model.size(), the size of the parameter vector of analysis strings, which is of type std::size_t,[6] followed by the analysis string-frequency pairs.

To reduce file size, it writes only the non-zero bytes of a std::size_t,[6] preceded by the number of bytes to read.

[. . . .]

([. . .]).serialise(0x00000000, [. . .]); // 00
([. . .]).serialise(0x000000ff, [. . .]); // 01ff
([. . .]).serialise(0x0000ffff, [. . .]); // 02ffff
([. . .]).serialise(0x00ffffff, [. . .]); // 03ffffff
([. . .]).serialise(0xffffffff, [. . .]); // 04ffffffff

[. . . .]

Example 3.2.1: std::size_t[6] Serialisation
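
A sketch of a serialiser producing the bytes of Example 3.2.1, assuming (from the example alone; all-0xff values don't reveal byte order) that the non-zero bytes are written most-significant first:

#include <climits>
#include <cstddef>
#include <ostream>

// Write value as one count byte followed by its non-zero bytes,
// e.g. 0 -> 00, 0xff -> 01 ff, 0xffff -> 02 ff ff.
void serialise(std::size_t value, std::ostream &os) {
  unsigned char bytes[sizeof value];
  std::size_t n = 0;
  while (value != 0) { // collect bytes, least-significant first
    bytes[n++] = static_cast<unsigned char>(value & 0xff);
    value >>= CHAR_BIT;
  }
  os.put(static_cast<char>(n)); // the number of bytes to read
  while (n != 0)                // emit most-significant first (assumed)
    os.put(static_cast<char>(bytes[--n]));
}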

The tagger serialises the analysis string-frequency pairs, which are of type std::pair<Analysis, std::size_t>.[10] For each pair, it first serialises the analysis string, followed by the frequency, which is of type std::size_t.[6] To serialise the analysis string, it serialises TheMorphemes,[11] the sequence of morphemes, which is of type std::vector<Morpheme>:[12][13] first TheMorphemes.size(),[11] the size of the sequence of morphemes, which is of type std::size_t,[6] followed by the morphemes themselves. For each morpheme, it first serialises the lemma, followed by the sequence of tags. The lemma, TheLemma,[14] is of type std::wstring;[15] the tagger first serialises TheLemma.size(), the length of the lemma, which is of type std::size_t,[6] followed by the lemma itself.

The tagger then serialises TheTags.size(),[16] the size of the sequence of tags, which is of type std::size_t,[6] followed by the tag sequence. For each tag, it first serialises TheTag.size(),[17] the length of the tag, followed by the tag itself.
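
Putting these two paragraphs together, the nesting can be sketched as follows. The struct definitions are illustrative stand-ins for the tagger's Analysis, Morpheme, and Tag classes, and serialising each wide character through the std::size_t encoding is an assumption consistent with the dump below:

#include <cstddef>
#include <ostream>
#include <string>
#include <vector>

struct Tag { std::wstring TheTag; };
struct Morpheme { std::wstring TheLemma; std::vector<Tag> TheTags; };
struct Analysis { std::vector<Morpheme> TheMorphemes; };

void serialise(std::size_t value, std::ostream &os); // as sketched above

// A string is written as its length followed by its characters.
void serialise(const std::wstring &s, std::ostream &os) {
  serialise(s.size(), os);
  for (wchar_t c : s)
    serialise(static_cast<std::size_t>(c), os);
}

// An analysis is written as the morpheme count, then, per morpheme,
// the lemma followed by the tag count and the tags.
void serialise(const Analysis &a, std::ostream &os) {
  serialise(a.TheMorphemes.size(), os);
  for (const Morpheme &m : a.TheMorphemes) {
    serialise(m.TheLemma, os);
    serialise(m.TheTags.size(), os);
    for (const Tag &t : m.TheTags)
      serialise(t.TheTag, os);
  }
}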

Given the corpus

^a/a<b>$

the tagger writes[18]

0000000: 0101 0101 0101 0161 0101 0101 0162 0101  .......a.....b..
0000010: 0a                                       .

Model 2

See section 5.3.2 of “A set of open-source tools for Turkish natural language processing.”[2]

Consider Example 3.1.1: handtagged.txt.

The tag string <b> is twice as frequent as <a>. However, model 1 scores b<a> and b<b> equally because neither analysis string appears in the corpus.

This model splits each analysis string into a root, r, and the part of the analysis string that isn’t the root, a. An analysis string’s root is its first lemma: the r of a<b>+c<d> is a, and its a is <b>+c<d>. The tagger assigns each analysis string a score of P(r | a)f(a) with add-one smoothing. (Without additive smoothing, this model would be the same as model 1.)[19] The tagger assigns higher scores to unknown analysis strings with a frequent a than to unknown analysis strings with an infrequent a.

Given the lexical unit ^b/b<a>/b<b>$, the tagger assigns the analysis string b<a> a score of


\begin{align}
\mathrm{score} & = \frac{(\mathrm{tokenCount\_r\_a} + 1)(\mathrm{tokenCount\_a} + 1)}{\mathrm{tokenCount\_a} + 1 + \mathrm{typeCount\_a}} \\
& = \frac{(0 + 1)(1 + 1)}{1 + 1 + 2} \\
& = \frac{(1)(2)}{4} \\
& = \frac{1}{2}~\text{,}
\end{align}


\begin{align}
\text{where}~&\mathrm{tokenCount\_r\_a}~\text{is the frequency of the}~r,a~\text{in the corpus}~\text{,} \\
&\mathrm{tokenCount\_a}~\text{is the frequency of the}~a~\text{in the corpus}~\text{,} \\
\text{and}~&\mathrm{typeCount\_a}~\text{is the size of the parameter vector of all}~r~\text{preceding the}~a~\text{.}
\end{align}

Note that typeCount_a counts the r of the analysis string being scored, even if that r doesn’t appear in the corpus. For example, the tagger would assign the known analysis string a<a> a score of


\begin{align}
\mathrm{score} & = \frac{(1 + 1)(1 + 1)}{1 + 1 + 1} \\
& = \frac{(2)(2)}{3} \\
& = \frac{4}{3}~\text{.}
\end{align}

The tagger assigns the analysis string b<b> a score of


\begin{align}
\mathrm{score} & = \frac{(0 + 1)(2 + 1)}{2 + 1 + 2} \\
& = \frac{(1)(3)}{5} \\
& = \frac{3}{5}~\text{.}
\end{align}
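
The whole calculation fits in a short function. This sketch flattens the model to plain std::wstring keys (the tagger keys on its a and Lemma classes) and computes tokenCount_a and typeCount_a on the fly rather than caching them:

#include <cstddef>
#include <map>
#include <string>

// Model[a][r] = f(r, a), the frequency of root r with non-root part a.
using Model2 = std::map<std::wstring, std::map<std::wstring, std::size_t>>;

double score(const Model2 &model, const std::wstring &r,
             const std::wstring &a) {
  std::size_t tokenCount_r_a = 0; // frequency of this r with this a
  std::size_t tokenCount_a = 0;   // frequency of this a
  std::size_t typeCount_a = 0;    // distinct r seen with this a
  const auto a_it = model.find(a);
  if (a_it != model.end()) {
    typeCount_a = a_it->second.size();
    for (const auto &pair : a_it->second)
      tokenCount_a += pair.second;
    const auto r_it = a_it->second.find(r);
    if (r_it != a_it->second.end())
      tokenCount_r_a = r_it->second;
    else
      ++typeCount_a; // the unseen r being scored still counts as a type
  } else {
    typeCount_a = 1; // only the r being scored
  }
  return static_cast<double>(tokenCount_r_a + 1) * (tokenCount_a + 1) /
         (tokenCount_a + 1 + typeCount_a);
}

Trained on Example 3.1.1, this returns 4/3 for a<a>, 1/2 for b<a>, and 3/5 for b<b>, matching the calculations above.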

File Format

The tagger represents this model as std::map<a, std::map<Lemma, std::size_t> > Model.[7][20][21][6][22]

See the file format of model 1.

For each a,[20] the tagger first serialises TheTags,[23] the tag sequence, which is of type std::vector<Tag>,[12][24] followed by TheMorphemes,[25] the morpheme sequence, which is of type std::vector<Morpheme>.[12][13] For each Lemma,[21] the tagger serialises TheLemma,[26] the lemma, which is of type std::wstring.[15]

Given the corpus

^a/a<b>+c<d>$

the tagger writes[18]

0000000: 0101 0101 0101 0162 0101 0101 0163 0101  .......b.....c..
0000010: 0101 0164 0101 0101 0161 0101 0a         ...d.....a...

Model 3

See section 5.3.3 of “A set of open-source tools for Turkish natural language processing.”[2]

Consider Example 3.1.1: handtagged.txt.

The morpheme a<b> is twice as frequent as the morpheme a<a>. However, model 2 scores the analysis strings a<a>+a<a> and a<b>+a<a> equally because the a of neither appears in the corpus.

This model splits each analysis string into a root, r, a first inflection, i_0, and a sequence of derivation-inflection pairs, (d_1, i_1)...(d_n, i_n). The r of the analysis string a<b>+c<d> is a, its i_0 is <b>, and its (d_1, i_1)...(d_n, i_n) is c<d>, where its d_1 is c and its i_1 is <d>. The tagger assigns each analysis string a score of P(r|i_0)f(i_0)\prod_{i = 1}^n P(d_i|i_{i-1})P(i_i|d_i) with add-one smoothing. The tagger assigns higher scores to unknown analysis strings with a frequent r, i_0 than to unknown analysis strings with an infrequent r, i_0.
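
A hypothetical splitter over plain strings illustrates the decomposition (the tagger works on its parsed Analysis objects instead):

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// r, i_0, and the derivation-inflection pairs of an analysis string.
struct SplitAnalysis {
  std::wstring r, i_0;
  std::vector<std::pair<std::wstring, std::wstring>> pairs; // (d, i)
};

// Split e.g. "a<b>+c<d>" into r = "a", i_0 = "<b>", and
// pairs = { ("c", "<d>") }. Assumes a well-formed analysis string.
SplitAnalysis split(const std::wstring &analysis) {
  SplitAnalysis s;
  std::size_t pos = 0;
  bool first = true;
  while (pos < analysis.size()) {
    const std::size_t tags = analysis.find(L'<', pos); // the lemma ends here
    std::size_t end = analysis.find(L'+', pos);        // the morpheme ends here
    if (end == std::wstring::npos)
      end = analysis.size();
    const std::wstring lemma = analysis.substr(pos, tags - pos);
    const std::wstring inflection = analysis.substr(tags, end - tags);
    if (first) {
      s.r = lemma;
      s.i_0 = inflection;
      first = false;
    } else {
      s.pairs.emplace_back(lemma, inflection);
    }
    pos = end + 1;
  }
  return s;
}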

Given the lexical unit ^aa/a<a>+a<a>/a<b>+a<a>$, the tagger assigns the analysis string a<a>+a<a> a score of


\begin{align}
\mathrm{score} =\;&\frac{(\mathrm{tokenCount\_r\_i\_0} + 1)(\mathrm{tokenCount\_i\_0} + 1)}{\mathrm{tokenCount\_i\_0} + 1 + \mathrm{typeCount\_i\_0}}\\
&\begin{align}\prod_{i = 1}^n\,&\frac{\mathrm{tokenCount\_d\_i}(d_i, i_{i - 1}) + 1}{\mathrm{tokenCount\_i}(i_{i - 1}) + 1 + \mathrm{typeCount\_i}(i_{i - 1}, d_i)}\\
&\frac{\mathrm{tokenCount\_i\_d}(i_i, d_i) + 1}{\mathrm{tokenCount\_d}(d_i) + 1 + \mathrm{typeCount\_d}(d_i, i_i)}\end{align}\\
=\;&\frac{(1 + 1)(1 + 1)}{1 + 1 + 1}\frac{0 + 1}{0 + 1 + 1}\frac{0 + 1}{0 + 1 + 1}\\
=\;&\frac{(2)(2)}3\frac12\frac12\\
=\;&\frac43\frac14\\
=\;&\frac13~\text{,}
\end{align}


\begin{align}
\text{where}~&\mathrm{tokenCount\_r\_i\_0}~\text{is the frequency of the}~r,i_0~\text{in the corpus}~\text{,}\\
&\mathrm{tokenCount\_i\_0}~\text{is the frequency of the}~i_0~\text{in the corpus}~\text{,}\\
&\mathrm{typeCount\_i\_0}~\text{is the size of the parameter vector of}~r~\text{preceding the}~i_0~\text{,}\\
&\mathrm{tokenCount\_d\_i}(d_i, i_{i - 1})~\text{is the frequency of the}~d_i~\text{following the}~i_{i - 1}~\text{in the corpus}~\text{,}\\
&\mathrm{tokenCount\_i}(i_{i - 1})~\text{is the frequency of non-final}~i_{i - 1}~\text{in the corpus}~\text{,}\\
&\mathrm{typeCount\_i}(i_{i - 1}, d_i)~\text{is the size of the parameter vector of}~d~\text{following the}~i_{i - 1}~\text{,}\\
&\mathrm{tokenCount\_i\_d}(i_i, d_i)~\text{is the frequency of the}~i_i~\text{following the}~d_i~\text{in the corpus}~\text{,}\\
&\mathrm{tokenCount\_d}(d_i)~\text{is the frequency of the}~d_i~\text{in the corpus}~\text{,}\\
\text{and}~&\mathrm{typeCount\_d}(d_i, i_i)~\text{is the size of the parameter vector of}~i~\text{following the}~d_i~\text{.}
\end{align}

The tagger assigns the analysis string a<b>+a<a> a score of


\begin{align}
\mathrm{score} =\;&\frac{(2 + 1)(2 + 1)}{2 + 1 + 1}\frac{0 + 1}{0 + 1 + 1}\frac{0 + 1}{0 + 1 + 1}\\
=\;&\frac{(3)(3)}{4}\frac12\frac12\\
=\;&\frac94\frac14\\
=\;&\frac9{16}~\text{.}
\end{align}

File Format

The tagger represents this model as std::pair<std::map<i, std::map<Lemma, std::size_t> >, std::pair<std::map<i, std::map<Lemma, std::size_t> >, std::map<Lemma, std::map<i, std::size_t> > > > Model.[10][7][27][21][6]
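
Reading the nested type outward, the three component maps line up with the three smoothed factors of the score. This correspondence is inferred from the formulas above rather than stated in the header:

#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Placeholder key types standing in for the tagger's i (a tag sequence,
// apertium/i.h) and Lemma (apertium/lemma.h) classes.
using i = std::wstring;
using Lemma = std::wstring;

// f(r, i_0), for the P(r | i_0)f(i_0) factor:
using RootCounts = std::map<i, std::map<Lemma, std::size_t>>;
// f(d_i, i_{i-1}), for the P(d_i | i_{i-1}) factors:
using DerivationCounts = std::map<i, std::map<Lemma, std::size_t>>;
// f(i_i, d_i), for the P(i_i | d_i) factors:
using InflectionCounts = std::map<Lemma, std::map<i, std::size_t>>;

using Model = std::pair<RootCounts,
                        std::pair<DerivationCounts, InflectionCounts>>;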

See the file format of model 1.

For each i,[27] the tagger serialises TheTags,[28] the tag sequence, which is of type std::vector<Tag>.[12][24]

Given the corpus

^a/a<b>+c<d>$

the tagger writes[18]

0000000: 0101 0101 0101 0162 0101 0101 0161 0101  .......b.....a..
0000010: 0101 0101 0101 0162 0101 0101 0163 0101  .......b.....c..
0000020: 0101 0101 0163 0101 0101 0101 0164 0101  .....c.......d..
0000030: 0a                                       .

Notes

  1. https://github.com/m5w/apertium
  2. http://coltekin.net/cagri/papers/trmorph-tools.pdf
  3. Installation#If you want to add language data / do more advanced stuff
  4. Minimal installation from SVN#Set up environment
  5. Minimal installation from SVN#Configure, build, and install
  6. http://en.cppreference.com/w/cpp/types/size_t
  7. http://en.cppreference.com/w/cpp/container/map
  8. https://github.com/m5w/apertium/blob/master/apertium/analysis.h
  9. https://github.com/m5w/apertium/blob/master/apertium/basic_5_3_1_tagger.h#L28
  10. http://en.cppreference.com/w/cpp/utility/pair
  11. https://github.com/m5w/apertium/blob/master/apertium/analysis.h#L33
  12. http://en.cppreference.com/w/cpp/container/vector
  13. https://github.com/m5w/apertium/blob/master/apertium/morpheme.h
  14. https://github.com/m5w/apertium/blob/master/apertium/morpheme.h#L30
  15. http://en.cppreference.com/w/cpp/string/basic_string
  16. https://github.com/m5w/apertium/blob/master/apertium/morpheme.h#L31
  17. https://github.com/m5w/apertium/blob/master/apertium/tag.h#L27
  18. http://linux.die.net/man/1/xxd
  19. \begin{align}\mathrm{score} &= \frac{(\mathrm{tokenCount\_r\_a})(\mathrm{tokenCount\_a})}{\mathrm{tokenCount\_a}}\\&= \mathrm{tokenCount\_r\_a} = \mathrm{tokenCount\_T}\end{align}
  20. https://github.com/m5w/apertium/blob/master/apertium/a.h
  21. https://github.com/m5w/apertium/blob/master/apertium/lemma.h
  22. https://github.com/m5w/apertium/blob/master/apertium/basic_5_3_2_tagger.h#L29
  23. https://github.com/m5w/apertium/blob/master/apertium/a.h#L32
  24. https://github.com/m5w/apertium/blob/master/apertium/tag.h
  25. https://github.com/m5w/apertium/blob/master/apertium/a.h#L33
  26. https://github.com/m5w/apertium/blob/master/apertium/lemma.h#L32
  27. https://github.com/m5w/apertium/blob/master/apertium/i.h
  28. https://github.com/m5w/apertium/blob/master/apertium/i.h#L34