Unigram tagger

apertium-tagger from “m5w/apertium”^[1] supports all the unigram models from “A set of open-source tools for Turkish natural language processing.”^[2]

Installation

First, install all prerequisites. See “If you want to add language data / do more advanced stuff.”^[3]

Then, replace <directory> with the directory you’d like to clone “m5w/apertium”^[1] into and clone the repository.

git clone https://github.com/m5w/apertium.git <directory>

Then, configure your environment^[4] and finally configure, build, and install^[5] “m5w/apertium.”^[1]

Usage

See apertium-tagger --help.

Training a Model on a Hand-Tagged Corpus

First, get a hand-tagged corpus as you would for any non-unigram model.

$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$

Example 2.1.1: handtagged.txt : a Hand-Tagged Corpus for apertium-tagger

Then, replace MODEL with the unigram model from “A set of open-source tools for Turkish natural language processing”^[2] you’d like to use, replace SERIALISED_BASIC_TAGGER with the filename to which you’d like to write the model, and train the tagger.

$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt

Disambiguation

Either write your input to a file or pipe it to the tagger.

$ cat raw.txt
^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$

Example 2.2.1: raw.txt : Input for apertium-tagger

Replace MODEL with the unigram model from “A set of open-source tools for Turkish natural language processing”^[2] you’d like to use, replace SERIALISED_BASIC_TAGGER with the file to which you wrote the unigram model, and disambiguate the input.

$ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
^a/a<b>$
^aa/a<b>+a<b>$
$ echo '^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
^a/a<b>$
^aa/a<b>+a<b>$
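The lexical units in the examples above follow a simple layout: a surface form, then one analysis string per slash, wrapped in ^ and $. As a rough illustration, the C++ sketch below splits such a unit into its parts. The names LexicalUnit and parseLexicalUnit are hypothetical and are not taken from apertium-tagger, and the sketch assumes the input contains no escaped characters.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for one lexical unit; not the tagger's own type.
struct LexicalUnit {
  std::string surface;                // text before the first '/'
  std::vector<std::string> analyses;  // one entry per analysis string
};

// Splits "^surface/analysis1/analysis2$" on '/' characters.
// Simplifying assumption: backslash escapes are not handled.
LexicalUnit parseLexicalUnit(const std::string &text) {
  if (text.size() < 2 || text.front() != '^' || text.back() != '$')
    throw std::invalid_argument("expected a unit of the form ^...$");
  const std::string body = text.substr(1, text.size() - 2);
  LexicalUnit unit;
  std::size_t start = 0;
  bool first = true;
  while (true) {
    const std::size_t slash = body.find('/', start);
    const std::size_t length =
        slash == std::string::npos ? std::string::npos : slash - start;
    const std::string field = body.substr(start, length);
    if (first) {
      unit.surface = field;
      first = false;
    } else {
      unit.analyses.push_back(field);
    }
    if (slash == std::string::npos) break;
    start = slash + 1;
  }
  return unit;
}
```

Calling parseLexicalUnit on the first line of raw.txt gives the surface form a and the three analysis strings a<a>, a<b>, and a<c>.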

Unigram Models

See section 5.3 of “A set of open-source tools for Turkish natural language processing.”^[2]

Model 1

See section 5.3.1 of “A set of open-source tools for Turkish natural language processing.”^[2]

This model assigns each analysis string a score of

${\begin{aligned}\mathrm {score} &=f(T)~{\text{,}}\end{aligned}}$

${\begin{aligned}{\text{where}}~&T~{\text{is the analysis string}}\end{aligned}}$

with additive smoothing.

Consider the following corpus.

$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$

Example 3.1.1: handtagged.txt : A Hand-Tagged Corpus for apertium-tagger

Given the lexical unit ^a/a<a>/a<b>/a<c>$, the tagger assigns the analysis string a<a> a score of

${\begin{aligned}\mathrm {score} &=\mathrm {tokenCount\_T} +1\\&=1+1\\&=2~{\text{,}}\end{aligned}}$

${\begin{aligned}{\text{where}}~&\mathrm {tokenCount\_T} ~{\text{is the frequency of}}~T~{\text{in the corpus}}~{\text{.}}\end{aligned}}$

The tagger then assigns the analysis string a<b> a score of

${\begin{aligned}\mathrm {score} &=2+1\\&=3\end{aligned}}$

and the unknown analysis string a<c> a score of

${\begin{aligned}\mathrm {score} &=0+1\\&=1~{\text{.}}\end{aligned}}$
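The arithmetic above amounts to a frequency table with one added to every lookup. The following C++ sketch illustrates this under simplifying assumptions: analysis strings are plain std::string keys rather than the tagger's Analysis type, and Model1, trainModel1, scoreModel1, and disambiguateModel1 are hypothetical names. Breaking ties in favour of the earliest candidate is a choice made for the sketch, not documented tagger behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the stored model: analysis string -> frequency.
typedef std::map<std::string, std::size_t> Model1;

// Counts each hand-tagged analysis string once per occurrence.
Model1 trainModel1(const std::vector<std::string> &taggedAnalyses) {
  Model1 model;
  for (const std::string &analysis : taggedAnalyses) ++model[analysis];
  return model;
}

// Add-one smoothed score: tokenCount_T + 1, so unknown strings score 1.
std::size_t scoreModel1(const Model1 &model, const std::string &analysis) {
  const Model1::const_iterator found = model.find(analysis);
  return (found == model.end() ? 0 : found->second) + 1;
}

// Returns the highest-scoring candidate; ties keep the earliest candidate
// (an assumption of this sketch).
std::string disambiguateModel1(const Model1 &model,
                               const std::vector<std::string> &candidates) {
  std::string best;
  std::size_t bestScore = 0;
  for (const std::string &candidate : candidates) {
    const std::size_t score = scoreModel1(model, candidate);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

Trained on the three analysis strings of Example 3.1.1, the sketch scores a<a>, a<b>, and a<c> as 2, 3, and 1, and so picks a<b>.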

If ./autogen.sh is passed the option --enable-debug, the tagger prints such calculations to standard error.

$ ./autogen.sh --enable-debug
$ make
$ echo '^a/a<a>/a<b>/a<c>$' | apertium-tagger -gu 1 SERIALISED_BASIC_TAGGER


score("a<a>") ==
  2 ==
  2.000000000000000000
score("a<b>") ==
  3 ==
  3.000000000000000000
score("a<c>") ==
  1 ==
  1.000000000000000000
^a<b>$

Training on Corpora with Ambiguous Lexical Units

Consider the following corpus.

$ cat handtagged.txt
^a/a<a>$
^a/a<a>/a<b>$
^a/a<b>$
^a/a<b>$

Example 3.1.1.1: handtagged.txt : a Hand-Tagged Corpus for apertium-tagger

The tagger expects each lexical unit in the corpus to contain exactly 1 analysis string, that is, to have size 1. However, the size of the lexical unit ^a/a<a>/a<b>$ is 2. For this lexical unit,

$P({\texttt {a<a>}})=P({\texttt {a<b>}})={\frac {1}{2}}~{\text{;}}$

the tagger must effectively increment the frequency of both analysis strings by 0.500000000000000000. However, the tagger can’t increment the analysis strings’ frequencies by a non-integral number because model 1 represents analysis strings’ frequencies as std::size_t.^[6]

Instead, the tagger multiplies all the stored analysis strings’ frequencies by this lexical unit’s size and increments the frequency of each of this lexical unit’s analysis strings by 1.

${\begin{aligned}f({\texttt {a<a>}})&=(1)(2)&f({\texttt {a<b>}})&=(0)(2)\\&+1=2+1=3&&+1=0+1=1\end{aligned}}$

The tagger could then increment the analysis strings’ frequencies of another lexical unit of size 2 without multiplying any of the stored analysis strings’ frequencies. To account for this, the tagger stores the least common multiple of all lexical units’ sizes; only if the LCM isn’t divisible by a lexical unit’s size does the tagger multiply all the analysis strings’ frequencies.

After incrementing the analysis strings’ frequencies of the lexical unit ^a/a<a>/a<b>$, the tagger increments the frequency of the analysis string a<b> of the lexical unit ^a/a<b>$ by

${\begin{aligned}{\frac {\mathrm {LCM} }{\mathrm {TheLexicalUnit.size} }}={\frac {2}{1}}=2~{\text{.}}\end{aligned}}$

If the tagger gets another lexical unit of size 2, it would increment the frequency of each of the lexical unit’s analysis strings by

${\begin{aligned}{\frac {\mathrm {LCM} }{\mathrm {TheLexicalUnit.size} }}={\frac {2}{2}}=1~{\text{,}}\end{aligned}}$

and if it gets a lexical unit of size 3, it would multiply all the analysis strings’ frequencies by 3 and then increment the frequency of each of the lexical unit’s analysis strings by

${\begin{aligned}{\frac {\mathrm {LCM} }{\mathrm {TheLexicalUnit.size} }}={\frac {6}{3}}=2~{\text{.}}\end{aligned}}$
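The bookkeeping described in this section can be sketched in a few lines of C++. This is only an illustration: ScaledCounts, greatestCommonDivisor, and addLexicalUnit are hypothetical names, and analysis strings are plain std::string keys. When the stored multiple isn't divisible by a unit's size, the sketch scales by size / gcd(LCM, size). That equals the unit's size in every case worked above and keeps the stored value a least common multiple, but it is an assumption about how the multiplication is done.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical integer-only frequency table with a shared scale factor.
struct ScaledCounts {
  std::map<std::string, std::size_t> frequency;
  std::size_t lcm = 1;  // least common multiple of all unit sizes seen so far
};

std::size_t greatestCommonDivisor(std::size_t a, std::size_t b) {
  while (b != 0) {
    const std::size_t remainder = a % b;
    a = b;
    b = remainder;
  }
  return a;
}

// Adds one lexical unit whose analyses share a total weight of 1.
void addLexicalUnit(ScaledCounts &counts,
                    const std::vector<std::string> &analyses) {
  const std::size_t size = analyses.size();
  if (size == 0) return;
  if (counts.lcm % size != 0) {
    // Rescale every stored frequency so that 1/size becomes an integer step.
    const std::size_t factor = size / greatestCommonDivisor(counts.lcm, size);
    for (auto &entry : counts.frequency) entry.second *= factor;
    counts.lcm *= factor;
  }
  const std::size_t increment = counts.lcm / size;
  for (const std::string &analysis : analyses)
    counts.frequency[analysis] += increment;
}
```

Feeding the four lexical units of Example 3.1.1.1 to addLexicalUnit leaves the stored multiple at 2 with frequencies 3 for a<a> and 5 for a<b>, which stand for the fractional counts 1.5 and 2.5.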

Each model supports functions to increment all their stored analysis strings’ frequencies, so models 2 and 3 support this algorithm as well.

TODO: If one passes the -d option to apertium-tagger , the tagger prints warnings about ambiguous analyses in corpora to stderr.

$ apertium-tagger -ds 0 -u 1 handtagged.txt
apertium-tagger: handtagged.txt: 2:13: unexpected analysis "a<b>" following analysis "a<a>"
^a/a<a>/a<b>$
^

File Format

The tagger represents this model as std::map<Analysis, std::size_t> Model; .^[7]^[6]^[8]^[9]

It first serialises Model.size() , the size of the parameter vector of analysis strings, which is of type std::size_t ,^[6] followed by analysis string-frequency pairs.

To reduce file size, it writes only the significant bytes of a std::size_t,^[6] omitting leading zero bytes, preceded by the number of bytes to read.

[. . . .]

([. . .]).serialise(0x00000000, [. . .]); // 00
([. . .]).serialise(0x000000ff, [. . .]); // 01ff
([. . .]).serialise(0x0000ffff, [. . .]); // 02ffff
([. . .]).serialise(0x00ffffff, [. . .]); // 03ffffff
([. . .]).serialise(0xffffffff, [. . .]); // 04ffffffff

[. . . .]

Example 3.2.1: std::size_t ^[6] Serialisation
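To make the encoding concrete, here is a minimal C++ sketch of a writer and reader for this compressed integer form. The function names are hypothetical and the code is not taken from the tagger's serialiser. The example above uses only ff bytes, so it doesn't show the byte order, and the sketch assumes the most significant byte comes first.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes needed to hold value; zero needs none.
unsigned char significantBytes(std::size_t value) {
  unsigned char count = 0;
  for (; value != 0; value >>= 8) ++count;
  return count;
}

// Appends the byte count, then the significant bytes.
// Assumption of this sketch: most significant byte first.
void serialiseSize(std::size_t value, std::string &out) {
  const unsigned char count = significantBytes(value);
  out.push_back(static_cast<char>(count));
  for (unsigned char i = count; i != 0; --i)
    out.push_back(static_cast<char>((value >> (8 * (i - 1))) & 0xff));
}

// Reads back a value written by serialiseSize and advances position.
std::size_t deserialiseSize(const std::string &in, std::size_t &position) {
  const unsigned char count = static_cast<unsigned char>(in.at(position++));
  std::size_t value = 0;
  for (unsigned char i = 0; i < count; ++i)
    value = (value << 8) | static_cast<unsigned char>(in.at(position++));
  return value;
}
```

With this sketch, 0 is written as the single byte 00 and 0xffff as the three bytes 02 ff ff, matching the comments in Example 3.2.1.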

The tagger serialises the analysis string-frequency pairs, which are of type std::pair<Analysis, std::size_t> .^[10] For each analysis string-frequency pair, it first serialises the analysis string, followed by the frequency, which is of type std::size_t .^[6] It serialises TheMorphemes ,^[11] the sequence of morphemes, which is of type std::vector<Morpheme> .^[12]^[13] The tagger first serialises TheMorphemes.size() ,^[11] the size of the sequence of morphemes, which is of type std::size_t ,^[6] followed by the morphemes. For each morpheme, it first serialises the lemma, followed by the sequence of tags. The tagger serialises TheLemma ,^[14] the lemma, which is of type std::wstring .^[15] It first serialises TheLemma.size() , the length of the lemma, which is of type std::size_t ,^[6] followed by the lemma itself.

The tagger then serialises TheTags.size() ,^[16] the size of the sequence of tags, which is of type std::size_t ,^[6] followed by the tag sequence. For each tag, it first serialises TheTag.size() ,^[17] the length of the tag, followed by the tag itself.

Given the corpus

^a/a<b>$

the tagger writes ^[18]

0000000: 0101 0101 0101 0161 0101 0101 0162 0101  .......a.....b..
0000010: 0a                                       .
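The nesting described above (model size, then for each entry the morphemes, lemma, tags, and frequency) can be traced with a small C++ sketch. It makes several assumptions: lemmas and tags are std::string rather than std::wstring, each character is written as a compressed integer, the model is a plain vector of pairs instead of a std::map, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the tagger's types (assumptions of this sketch).
struct Morpheme {
  std::string lemma;
  std::vector<std::string> tags;  // tag names without angle brackets
};
typedef std::vector<Morpheme> Analysis;
typedef std::vector<std::pair<Analysis, std::size_t>> Model;

// Writes a byte count followed by the significant bytes of value.
void putSize(std::size_t value, std::string &out) {
  unsigned char count = 0;
  for (std::size_t rest = value; rest != 0; rest >>= 8) ++count;
  out.push_back(static_cast<char>(count));
  for (unsigned char i = count; i != 0; --i)
    out.push_back(static_cast<char>((value >> (8 * (i - 1))) & 0xff));
}

// Writes the length, then each character as its own compressed integer.
void putString(const std::string &text, std::string &out) {
  putSize(text.size(), out);
  for (unsigned char character : text) putSize(character, out);
}

void putAnalysis(const Analysis &analysis, std::string &out) {
  putSize(analysis.size(), out);
  for (const Morpheme &morpheme : analysis) {
    putString(morpheme.lemma, out);
    putSize(morpheme.tags.size(), out);
    for (const std::string &tag : morpheme.tags) putString(tag, out);
  }
}

std::string serialiseModel(const Model &model) {
  std::string out;
  putSize(model.size(), out);
  for (const auto &entry : model) {
    putAnalysis(entry.first, out);  // the analysis string
    putSize(entry.second, out);     // its frequency
  }
  return out;
}
```

For a model holding the single analysis a<b> with frequency 1, the sketch produces the sixteen bytes 01 01 01 01 01 01 01 61 01 01 01 01 01 62 01 01. It does not model the final 0a byte shown in the dump.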

Model 2

See section 5.3.2 of “A set of open-source tools for Turkish natural language processing.”^[2]

Consider Example 3.1.1: handtagged.txt .

The tag string <b> is twice as frequent as <a>. However, model 1 scores b<a> and b<b> equally because neither analysis string appears in the corpus.

This model splits each analysis string into a root, $r$, and the part of the analysis string that isn’t the root, $a$. An analysis string’s root is its first lemma. The $r$ of a<b>+c<d> is a, and its $a$ is <b>+c<d>. The tagger assigns each analysis string a score of $P(r|a)f(a)$ with add-one smoothing. (Without additive smoothing, this model would be the same as model 1.)^[19] The tagger assigns higher scores to unknown analysis strings with frequent $a$ than to unknown analysis strings with infrequent $a$.
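As an illustration of the split into $r$ and $a$, the C++ sketch below cuts an analysis string at the end of its first lemma. The function name splitRoot is hypothetical, and the sketch assumes that a lemma ends at the first < or + character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Returns (r, a): the first lemma and everything after it.
// Assumption: the first lemma ends at the first '<' or '+'.
std::pair<std::string, std::string> splitRoot(const std::string &analysis) {
  const std::size_t end = analysis.find_first_of("<+");
  if (end == std::string::npos)
    return std::make_pair(analysis, std::string());
  return std::make_pair(analysis.substr(0, end), analysis.substr(end));
}
```

For example, splitRoot("a<b>+c<d>") yields the root a and the remainder <b>+c<d>.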

Given the lexical unit ^b/b<a>/b<b>$, the tagger assigns the analysis string b<a> a score of

${\begin{aligned}\mathrm {score} &={\frac {(\mathrm {tokenCount\_r\_a} +1)(\mathrm {tokenCount\_a} +1)}{\mathrm {tokenCount\_a} +1+\mathrm {typeCount\_a} }}\\&={\frac {(0+1)(1+1)}{1+1+2}}\\&={\frac {(1)(2)}{4}}\\&={\frac {1}{2}}~{\text{,}}\end{aligned}}$

${\begin{aligned}{\text{where}}~&\mathrm {tokenCount\_r\_a} ~{\text{is the frequency of the}}~r,a~{\text{in the corpus ,}}\\&\mathrm {tokenCount\_a} ~{\text{is the frequency of the}}~a~{\text{in the corpus ,}}\\{\text{and}}~&\mathrm {typeCount\_a} ~{\text{is the size of the parameter vector of all}}~r~{\text{preceding the}}~a~{\text{.}}\end{aligned}}$

Note that $\mathrm {typeCount\_a}$ counts the analysis string being scored. For example, the tagger would assign the known analysis string a<a> a score of

${\begin{aligned}\mathrm {score} &={\frac {(1+1)(1+1)}{1+1+1}}\\&={\frac {(2)(2)}{3}}\\&={\frac {4}{3}}~{\text{.}}\end{aligned}}$

The tagger assigns the analysis string b<b> a score of

${\begin{aligned}\mathrm {score} &={\frac {(0+1)(2+1)}{2+1+2}}\\&={\frac {(1)(3)}{5}}\\&={\frac {3}{5}}~{\text{.}}\end{aligned}}$
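The three worked scores can be reproduced with a short C++ sketch. It keeps the result as an unreduced fraction so that everything stays in integer arithmetic. Model2, Score, and scoreModel2 are hypothetical names, and the table is keyed by plain strings. Following the note above, the sketch adds one to typeCount_a whenever the root has not been seen with that $a$.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Mirrors the shape of the stored model: a -> root -> frequency.
typedef std::map<std::string, std::map<std::string, std::size_t>> Model2;

// Unreduced fraction holding the score.
struct Score {
  std::size_t numerator;
  std::size_t denominator;
  double value() const { return double(numerator) / double(denominator); }
};

Score scoreModel2(const Model2 &model, const std::string &root,
                  const std::string &rest) {
  std::size_t tokenCountRA = 0;  // frequency of (root, rest)
  std::size_t tokenCountA = 0;   // frequency of rest with any root
  std::size_t typeCountA = 0;    // distinct roots seen with rest
  const Model2::const_iterator row = model.find(rest);
  if (row != model.end()) {
    typeCountA = row->second.size();
    for (const auto &entry : row->second) tokenCountA += entry.second;
    const auto known = row->second.find(root);
    if (known != row->second.end()) tokenCountRA = known->second;
  }
  // typeCount_a also counts the analysis string being scored.
  if (tokenCountRA == 0) ++typeCountA;
  const Score score = {(tokenCountRA + 1) * (tokenCountA + 1),
                       tokenCountA + 1 + typeCountA};
  return score;
}
```

With the counts from Example 3.1.1 (root a seen once with <a> and twice with <b>), the sketch gives 2/4 for b<a>, 4/3 for a<a>, and 3/5 for b<b>.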

File Format

The tagger represents this model as std::map<a, std::map<Lemma, std::size_t> > Model; .^[7]^[20]^[21]^[6]^[22]

See section 3.1.2.

For each a ,^[20] the tagger first serialises TheTags ,^[23] the tag sequence, which is of type std::vector<Tag> ,^[12]^[24] followed by TheMorphemes ,^[25] the morpheme sequence, which is of type std::vector<Morpheme> .^[12]^[13] For each Lemma ,^[21] the tagger serialises TheLemma ,^[26] the lemma, which is of type std::wstring .^[15]

Given the corpus

^a/a<b>+c<d>$

the tagger writes ^[18]

0000000: 0101 0101 0101 0162 0101 0101 0163 0101 .......b.....c..
0000010: 0101 0164 0101 0101 0161 0101 0a        ...d.....a...

Model 3

See section 5.3.3 of “A set of open-source tools for Turkish natural language processing.”^[2]

Consider Example 3.1.1: handtagged.txt .

The morpheme a<b> is twice as frequent as the morpheme a<a>. However, model 2 scores the analysis strings a<a>+a<a> and a<b>+a<a> equally because the $a$ of neither appears in the corpus.

This model splits each analysis string into a root, $r$, a first inflection, $i_{0}$, and a sequence of derivation-inflection pairs, $(d_{1},i_{1})...(d_{n},i_{n})$. The $r$ of the analysis string a<b>+c<d> is a, its $i_{0}$ is <b>, and its $(d_{1},i_{1})...(d_{n},i_{n})$ is c<d>, where its $d_{1}$ is c and its $i_{1}$ is <d>. The tagger assigns each analysis string a score of $P(r|i_{0})f(i_{0})\prod _{i=1}^{n}P(d_{i}|i_{i-1})P(i_{i}|d_{i})$ with add-one smoothing. The tagger assigns higher scores to unknown analysis strings with frequent $r,i_{0}$ than to unknown analysis strings with infrequent $r,i_{0}$.
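The decomposition into a root, a first inflection, and derivation-inflection pairs can be sketched in C++ as follows. SplitAnalysis, splitMorpheme, and splitAnalysis are hypothetical names, and the sketch assumes that morphemes are separated by + and that a morpheme's tags begin at its first < character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: r, i_0, and the pairs (d_1, i_1) ... (d_n, i_n).
struct SplitAnalysis {
  std::string root;
  std::string firstInflection;
  std::vector<std::pair<std::string, std::string>> derivations;
};

// Splits one morpheme into (lemma, tag string) at its first '<'.
std::pair<std::string, std::string> splitMorpheme(const std::string &morpheme) {
  const std::size_t tagStart = morpheme.find('<');
  if (tagStart == std::string::npos)
    return std::make_pair(morpheme, std::string());
  return std::make_pair(morpheme.substr(0, tagStart),
                        morpheme.substr(tagStart));
}

SplitAnalysis splitAnalysis(const std::string &analysis) {
  SplitAnalysis result;
  std::size_t start = 0;
  bool first = true;
  while (true) {
    const std::size_t plus = analysis.find('+', start);
    const std::size_t length =
        plus == std::string::npos ? std::string::npos : plus - start;
    const std::pair<std::string, std::string> parts =
        splitMorpheme(analysis.substr(start, length));
    if (first) {
      result.root = parts.first;
      result.firstInflection = parts.second;
      first = false;
    } else {
      result.derivations.push_back(parts);
    }
    if (plus == std::string::npos) break;
    start = plus + 1;
  }
  return result;
}
```

Applied to a<b>+c<d>, the sketch returns the root a, the first inflection <b>, and the single pair (c, <d>).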

Given the lexical unit ^aa/a<a>+a<a>/a<b>+a<a>$, the tagger assigns the analysis string a<a>+a<a> a score of

${\begin{aligned}\mathrm {score} =\;&{\frac {(\mathrm {tokenCount\_r\_i\_0} +1)(\mathrm {tokenCount\_i\_0} +1)}{\mathrm {tokenCount\_i\_0} +1+\mathrm {typeCount\_i\_0} }}\\&{\begin{aligned}\prod _{i=1}^{n}\,&{\frac {\mathrm {tokenCount\_d\_i} (d_{n},i_{n-1})+1}{\mathrm {tokenCount\_i} (i_{n-1})+1+\mathrm {typeCount\_i} (i_{n-1},d_{n})}}\\&{\frac {\mathrm {tokenCount\_i\_d} (i_{n},d_{n})+1}{\mathrm {tokenCount\_d} (d_{n})+1+\mathrm {typeCount\_d} (d_{n},i_{n})}}\end{aligned}}\\=\;&{\frac {(1+1)(1+1)}{1+1+1}}{\frac {0+1}{0+1+1}}{\frac {0+1}{0+1+1}}\\=\;&{\frac {(2)(2)}{3}}{\frac {1}{2}}{\frac {1}{2}}\\=\;&{\frac {4}{3}}{\frac {1}{4}}\\=\;&{\frac {1}{3}}~{\text{,}}\end{aligned}}$

${\begin{aligned}{\text{where}}~&\mathrm {tokenCount\_r\_i\_0} ~{\text{is the frequency of the}}~r,i_{0}~{\text{in the corpus}}~{\text{,}}\\&\mathrm {tokenCount\_i\_0} ~{\text{is the frequency of the}}~i_{0}~{\text{in the corpus}}~{\text{,}}\\&\mathrm {typeCount\_i\_0} ~{\text{is the size of the parameter vector of}}~r~{\text{preceding the}}~i_{0}~{\text{,}}\\&\mathrm {tokenCount\_d\_i} (d_{n},i_{n-1})~{\text{is the frequency of the}}~d_{n}~{\text{following the}}~i_{n-1}~{\text{in the corpus}}~{\text{,}}\\&\mathrm {tokenCount\_i} (i_{n-1})~{\text{is the frequency of non-final}}~i_{n-1}~{\text{in the corpus}}~{\text{,}}\\&\mathrm {typeCount\_i} (i_{n-1},d_{n})~{\text{is the size of the parameter vector of}}~d~{\text{following the}}~i_{n-1}~{\text{,}}\\&\mathrm {tokenCount\_i\_d} (i_{n},d_{n})~{\text{is the frequency of the}}~i_{n}~{\text{following the}}~d_{n}~{\text{in the corpus}}~{\text{,}}\\&\mathrm {tokenCount\_d} (d_{n})~{\text{is the frequency of the}}~d_{n}~{\text{in the corpus}}~{\text{,}}\\{\text{and}}~&\mathrm {typeCount\_d} (d_{n},i_{n})~{\text{is the size of the parameter vector of}}~i~{\text{following the}}~d_{n}~{\text{.}}\end{aligned}}$

The tagger assigns the analysis string a<b>+a<a> a score of

${\begin{aligned}\mathrm {score} =\;&{\frac {(2+1)(2+1)}{2+1+1}}{\frac {0+1}{0+1+1}}{\frac {0+1}{0+1+1}}\\=\;&{\frac {(3)(3)}{4}}{\frac {1}{2}}{\frac {1}{2}}\\=\;&{\frac {9}{4}}{\frac {1}{4}}\\=\;&{\frac {9}{16}}~{\text{.}}\end{aligned}}$
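The two scores above can be checked with a C++ sketch that keeps three count tables, one per conditional probability in the formula. Model3, Counts, lookUp, smoothed, and scoreModel3 are hypothetical names, and the tables use plain string keys. As in the worked examples, the sketch adds one to each type count when the pair being scored hasn't been seen. It takes the analysis string already split into its root, first inflection, and derivation-inflection pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::map<std::string, std::map<std::string, std::size_t>> Table;

// Hypothetical stand-in for the stored model.
struct Model3 {
  Table rootGivenFirstInflection;   // i_0 -> r -> frequency
  Table derivationGivenInflection;  // i_{n-1} -> d_n -> frequency
  Table inflectionGivenDerivation;  // d_n -> i_n -> frequency
};

struct Counts {
  std::size_t joint;  // frequency of (outer, inner)
  std::size_t total;  // frequency of outer with any inner
  std::size_t types;  // distinct inner values, plus the one being scored
};

Counts lookUp(const Table &table, const std::string &outer,
              const std::string &inner) {
  Counts counts = {0, 0, 0};
  const Table::const_iterator row = table.find(outer);
  if (row != table.end()) {
    counts.types = row->second.size();
    for (const auto &entry : row->second) counts.total += entry.second;
    const auto known = row->second.find(inner);
    if (known != row->second.end()) counts.joint = known->second;
  }
  if (counts.joint == 0) ++counts.types;  // count the unseen pair itself
  return counts;
}

// Add-one smoothed conditional probability.
double smoothed(const Counts &counts) {
  return double(counts.joint + 1) / double(counts.total + 1 + counts.types);
}

double scoreModel3(
    const Model3 &model, const std::string &root,
    const std::string &firstInflection,
    const std::vector<std::pair<std::string, std::string>> &derivations) {
  const Counts first =
      lookUp(model.rootGivenFirstInflection, firstInflection, root);
  double score = smoothed(first) * double(first.total + 1);  // P(r|i_0) f(i_0)
  std::string previousInflection = firstInflection;
  for (const auto &step : derivations) {
    score *= smoothed(lookUp(model.derivationGivenInflection,
                             previousInflection, step.first));
    score *= smoothed(lookUp(model.inflectionGivenDerivation, step.first,
                             step.second));
    previousInflection = step.second;
  }
  return score;
}
```

With only the first table filled from Example 3.1.1 (root a seen once with <a> and twice with <b>), the sketch returns 1/3 for a<a>+a<a> and 9/16 for a<b>+a<a>.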

File Format

The tagger represents this model as std::pair<std::map<i, std::map<Lemma, std::size_t> >, std::pair<std::map<i, std::map<Lemma, std::size_t> >, std::map<Lemma, std::map<i, std::size_t> > > > Model; .^[10]^[7]^[27]^[21]^[6]

See section 3.1.2.

For each i ,^[27] the tagger serialises TheTags ,^[28] the tag sequence, which is of type std::vector<Tag> .^[12]^[24]

Given the corpus

^a/a<b>+c<d>$

the tagger writes ^[18]

0000000: 0101 0101 0101 0162 0101 0101 0161 0101 .......b.....a..
0000010: 0101 0101 0101 0162 0101 0101 0163 0101 .......b.....c..
0000020: 0101 0101 0163 0101 0101 0101 0164 0101 .....c.......d..
0000030: 0a                                      .

Notes

[1] https://github.com/m5w/apertium
[2] http://coltekin.net/cagri/papers/trmorph-tools.pdf
[3] Installation#If you want to add language data / do more advanced stuff
[4] Minimal installation from SVN#Set up environment
[5] Minimal installation from SVN#Configure, build, and install
[6] http://en.cppreference.com/w/cpp/types/size_t
[7] http://en.cppreference.com/w/cpp/container/map
[8] https://github.com/m5w/apertium/blob/master/apertium/analysis.h
[9] https://github.com/m5w/apertium/blob/master/apertium/basic_5_3_1_tagger.h#L28
[10] http://en.cppreference.com/w/cpp/utility/pair
[11] https://github.com/m5w/apertium/blob/master/apertium/analysis.h#L33
[12] http://en.cppreference.com/w/cpp/container/vector
[13] https://github.com/m5w/apertium/blob/master/apertium/morpheme.h
[14] https://github.com/m5w/apertium/blob/master/apertium/morpheme.h#L30
[15] http://en.cppreference.com/w/cpp/string/basic_string
[16] https://github.com/m5w/apertium/blob/master/apertium/morpheme.h#L31
[17] https://github.com/m5w/apertium/blob/master/apertium/tag.h#L27
[18] http://linux.die.net/man/1/xxd
[19] ${\begin{aligned}\mathrm {score} &={\frac {(\mathrm {tokenCount\_r\_a} )(\mathrm {tokenCount\_a} )}{\mathrm {tokenCount\_a} }}\\&=\mathrm {tokenCount\_r\_a} =\mathrm {tokenCount\_T} \end{aligned}}$
[20] https://github.com/m5w/apertium/blob/master/apertium/a.h
[21] https://github.com/m5w/apertium/blob/master/apertium/lemma.h
[22] https://github.com/m5w/apertium/blob/master/apertium/basic_5_3_2_tagger.h#L29
[23] https://github.com/m5w/apertium/blob/master/apertium/a.h#L32
[24] https://github.com/m5w/apertium/blob/master/apertium/tag.h
[25] https://github.com/m5w/apertium/blob/master/apertium/a.h#L33
[26] https://github.com/m5w/apertium/blob/master/apertium/lemma.h#L32
[27] https://github.com/m5w/apertium/blob/master/apertium/i.h
[28] https://github.com/m5w/apertium/blob/master/apertium/i.h#L34