# Unigram tagger

`apertium-tagger` from “m5w/apertium”^{[1]} supports all the unigram models from “A set of open-source tools for Turkish natural language processing.”^{[2]}


## Installation

First, install all prerequisites. See “If you want to add language data / do more advanced stuff.”^{[3]}

Then, replace `<directory>` with the directory you’d like to clone “m5w/apertium”^{[1]} into, and clone the repository.

    git clone https://github.com/m5w/apertium.git <directory>

Then, configure your environment^{[4]} and finally configure, build, and install^{[5]} “m5w/apertium.”^{[1]}

## Usage

See `apertium-tagger --help`.

### Training a Model on a Hand-Tagged Corpus

First, get a hand-tagged corpus as you would for any non-unigram model.

    $ cat handtagged.txt
    ^a/a<a>$
    ^a/a<b>$
    ^a/a<b>$
    ^aa/a<a>+a<a>$
    ^aa/a<a>+a<b>$
    ^aa/a<a>+a<b>$
    ^aa/a<b>+a<a>$
    ^aa/a<b>+a<a>$
    ^aa/a<b>+a<a>$
    ^aa/a<b>+a<b>$
    ^aa/a<b>+a<b>$
    ^aa/a<b>+a<b>$
    ^aa/a<b>+a<b>$

*Example 2.1.1:* `handtagged.txt`*: a Hand-Tagged Corpus for* `apertium-tagger`

Then, replace `MODEL` with the unigram model from “A set of open-source tools for Turkish natural language processing”^{[2]} you’d like to use, replace `SERIALISED_BASIC_TAGGER` with the filename to which you’d like to write the model, and train the tagger.

    $ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt

### Disambiguation

Either write your input to a file or pipe it to the tagger.

    $ cat raw.txt
    ^a/a<a>/a<b>/a<c>$
    ^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$

*Example 2.2.1:* `raw.txt`*: Input for* `apertium-tagger`

Replace `MODEL` with the unigram model from “A set of open-source tools for Turkish natural language processing”^{[2]} you’d like to use, replace `SERIALISED_BASIC_TAGGER` with the file to which you wrote the unigram model, and disambiguate the input.

    $ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
    ^a/a<b>$ ^aa/a<b>+a<b>$
    $ echo '^a/a<a>/a<b>/a<c>$ ^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
      apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
    ^a/a<b>$ ^aa/a<b>+a<b>$

## Unigram Models

See section 5.3 of “A set of open-source tools for Turkish natural language processing.”^{[2]}

### Model 1

See section 5.3.1 of “A set of open-source tools for Turkish natural language processing.”^{[2]}

This model assigns each analysis string a score equal to its frequency in the training corpus plus one, that is, its frequency with additive (add-one) smoothing.

Consider the following corpus.

    $ cat handtagged.txt
    ^a/a<a>$
    ^a/a<b>$
    ^a/a<b>$

*Example 3.1.1:* `handtagged.txt`*: A Hand-Tagged Corpus for* `apertium-tagger`

Given the lexical unit `^a/a<a>/a<b>/a<c>$`, the tagger assigns the analysis string `a<a>` a score of 1 + 1 = 2. The tagger then assigns the analysis string `a<b>` a score of 2 + 1 = 3 and the unknown analysis string `a<c>` a score of 0 + 1 = 1.
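The same lookup can be sketched in a few lines of C++. This is only an illustration of the add-one-smoothed scoring, not the actual `apertium-tagger` code, and it uses a plain `std::wstring` key where the tagger uses its `Analysis` class.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Sketch of model-1 scoring: an analysis string's score is its training-corpus
// frequency plus one (add-one smoothing), so an unknown analysis string such as
// a<c> still receives a score of 1.
std::size_t score(const std::map<std::wstring, std::size_t> &Model,
                  const std::wstring &analysis) {
  std::map<std::wstring, std::size_t>::const_iterator i = Model.find(analysis);
  return (i == Model.end() ? 0 : i->second) + 1;
}
```

With the frequencies from Example 3.1.1 (`a<a>`: 1, `a<b>`: 2), this yields exactly the scores 2, 3, and 1 shown in the debug output below.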

If `./autogen.sh` is passed the option `--enable-debug`, the tagger prints such calculations to standard error.

    $ ./autogen.sh --enable-debug
    $ make
    $ echo '^a/a<a>/a<b>/a<c>$' | apertium-tagger -gu 1 SERIALISED_BASIC_TAGGER
    score("a<a>") == 2 == 2.000000000000000000
    score("a<b>") == 3 == 3.000000000000000000
    score("a<c>") == 1 == 1.000000000000000000
    ^a<b>$

#### Training on Corpora with Ambiguous Lexical Units

Consider the following corpus.

    $ cat handtagged.txt
    ^a/a<a>$
    ^a/a<a>/a<b>$
    ^a/a<b>$
    ^a/a<b>$

*Example 3.1.1.1:* `handtagged.txt`*: a Hand-Tagged Corpus for* `apertium-tagger`

The tagger expects lexical units with one analysis string each, that is, lexical units of size 1. However, the size of the lexical unit `^a/a<a>/a<b>$` is 2. For this lexical unit, the tagger must effectively increment the frequency of both analysis strings by `0.500000000000000000`. However, the tagger can’t increment the analysis strings’ frequencies by a non-integral number because model 1 represents analysis strings’ frequencies as `std::size_t`.^{[6]}

Instead, the tagger multiplies all the stored analysis strings’ frequencies by this lexical unit’s size and then increments the frequency of each of this lexical unit’s analysis strings by 1.

The tagger could then increment the analysis strings’ frequencies of another lexical unit of size 2 without multiplying any of the stored analysis strings’ frequencies. To account for this, the tagger stores the least common multiple of all lexical units’ sizes; only if the LCM isn’t divisible by a lexical unit’s size does the tagger multiply all the analysis strings’ frequencies.

After incrementing the analysis strings’ frequencies of the lexical unit `^a/a<a>/a<b>$`, the tagger increments the analysis string `a<b>` of the lexical unit `^a/a<b>$` by 2, the stored least common multiple. If the tagger gets another lexical unit of size 2, it would increment the frequency of each of the lexical unit’s analysis strings by 1, and if it gets a lexical unit of size 3, it would multiply all the analysis strings’ frequencies by 3 and then increment the frequency of each of the lexical unit’s analysis strings by 2.

Each model supports functions to increment all its stored analysis strings’ frequencies, so models 2 and 3 support this algorithm as well.
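The bookkeeping above can be sketched as follows. This is only an illustration, assuming a model-1-style frequency map keyed by plain `std::wstring`; the real tagger spreads the equivalent logic over its model classes.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// scale holds the least common multiple of all lexical-unit sizes seen so far.
// Every stored frequency is implicitly divided by scale, so a lexical unit of
// size n credits each of its analysis strings with scale / n, always an integer.
static std::size_t gcd(std::size_t a, std::size_t b) {
  while (b != 0) {
    const std::size_t t = a % b;
    a = b;
    b = t;
  }
  return a;
}

void observe(std::map<std::wstring, std::size_t> &Model, std::size_t &scale,
             const std::vector<std::wstring> &analyses) {
  const std::size_t n = analyses.size();
  if (scale % n != 0) {
    // Rescale the stored frequencies so that scale becomes lcm(scale, n).
    const std::size_t factor = n / gcd(scale, n);
    for (std::map<std::wstring, std::size_t>::iterator i = Model.begin();
         i != Model.end(); ++i)
      i->second *= factor;
    scale *= factor;
  }
  const std::size_t credit = scale / n;
  for (std::size_t i = 0; i < n; ++i)
    Model[analyses[i]] += credit;
}
```

Starting from `scale == 1`, feeding in Example 3.1.1.1 leaves `a<a>` at 3 and `a<b>` at 5 with `scale == 2`, that is, effective frequencies of 1.5 and 2.5.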

**TODO**: *If one passes the `-d` option to `apertium-tagger`, the tagger prints warnings about ambiguous analyses in corpora to stderr.*

    $ apertium-tagger -ds 0 -u 1 handtagged.txt
    apertium-tagger: handtagged.txt: 2:13: unexpected analysis "a<b>" following analysis "a<a>"
    ^a/a<a>/a<b>$
                ^

#### File Format

The tagger represents this model as `std::map<Analysis, std::size_t> Model;`.^{[7]}^{[6]}^{[8]}^{[9]}

It first serialises `Model.size()`, the size of the parameter vector of analysis strings, which is of type `std::size_t`,^{[6]} followed by the analysis string-frequency pairs.

To reduce file size, it writes only the non-zero bytes of a `std::size_t`,^{[6]} preceded by the number of bytes to read.

    [. . . .]
    ([. . .]).serialise(0x00000000, [. . .]); // 00
    ([. . .]).serialise(0x000000ff, [. . .]); // 01ff
    ([. . .]).serialise(0x0000ffff, [. . .]); // 02ffff
    ([. . .]).serialise(0x00ffffff, [. . .]); // 03ffffff
    ([. . .]).serialise(0xffffffff, [. . .]); // 04ffffffff
    [. . . .]

*Example 3.2.1:* `std::size_t`^{[6]} *Serialisation*
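The encoding in Example 3.2.1 can be sketched as below. This is a standalone illustration of the byte layout (a count byte, then the value’s significant bytes from most to least significant), not the serialisation code in m5w/apertium.

```cpp
#include <cstddef>
#include <ostream>

// Write a std::size_t as shown in Example 3.2.1: one byte giving the number of
// significant bytes, then those bytes from most to least significant.  Small
// values such as sizes and frequencies of 1 therefore occupy only two bytes.
void serialise_size_t(std::size_t value, std::ostream &out) {
  unsigned char bytes[sizeof(std::size_t)];
  unsigned char count = 0;
  while (value != 0) {
    bytes[count++] = static_cast<unsigned char>(value & 0xff);
    value >>= 8;
  }
  out.put(static_cast<char>(count));
  for (unsigned char i = count; i != 0; --i)
    out.put(static_cast<char>(bytes[i - 1]));
}
```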

The tagger serialises the analysis string-frequency pairs, which are of type `std::pair<Analysis, std::size_t>`.^{[10]} For each analysis string-frequency pair, it first serialises the analysis string, followed by the frequency, which is of type `std::size_t`.^{[6]} It serialises `TheMorphemes`,^{[11]} the sequence of morphemes, which is of type `std::vector<Morpheme>`.^{[12]}^{[13]} The tagger first serialises `TheMorphemes.size()`,^{[11]} the size of the sequence of morphemes, which is of type `std::size_t`,^{[6]} followed by the morphemes. For each morpheme, it first serialises the lemma, followed by the sequence of tags. The tagger serialises `TheLemma`,^{[14]} the lemma, which is of type `std::wstring`.^{[15]} It first serialises `TheLemma.size()`, the length of the lemma, which is of type `std::size_t`,^{[6]} followed by the lemma itself.

The tagger then serialises `TheTags.size()`,^{[16]} the size of the sequence of tags, which is of type `std::size_t`,^{[6]} followed by the tag sequence. For each tag, it first serialises `TheTag.size()`,^{[17]} the length of the tag, followed by the tag itself.

Given the corpus

    ^a/a<b>$

the tagger writes ^{[18]}

    0000000: 0101 0101 0101 0161 0101 0101 0162 0101  .......a.....b..
    0000010: 0a                                       .
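Read with the serialisation order described above, the dump breaks down as follows. This is a hand annotation, assuming the final `0a` is simply the file’s trailing newline.

    01 01   Model.size() == 1
    01 01   TheMorphemes.size() == 1
    01 01   TheLemma.size() == 1
    01 61   'a', the lemma
    01 01   TheTags.size() == 1
    01 01   TheTag.size() == 1
    01 62   'b', the tag
    01 01   the frequency, 1
    0a      trailing newline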

### Model 2

See section 5.3.2 of “A set of open-source tools for Turkish natural language processing.”^{[2]}

Consider Example 3.1.1: `handtagged.txt`. The tag string `<b>` is twice as frequent as `<a>`. However, model 1 scores `b<a>` and `b<b>` equally because neither analysis string appears in the corpus.

This model splits each analysis string into a root, *r*, and the part of the analysis string that isn’t the root, *a*. An analysis string’s root is its first lemma. The *r* of `a<b>+c<d>` is `a`, and its *a* is `<b>+c<d>`. The tagger assigns each analysis string a score of *P*(*r* | *a*)*f*(*a*) with add-one smoothing. (Without additive smoothing, this model would be the same as model 1.)^{[19]} The tagger assigns higher scores to unknown analysis strings with frequent *a* than to unknown analysis strings with infrequent *a*.
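The split itself is simple to illustrate. The sketch below operates on a plain wide string rather than the tagger’s `Analysis` and `Morpheme` classes, so it is only an approximation of the idea.

```cpp
#include <string>
#include <utility>

// Split an analysis string into its root r (the first lemma) and a (everything
// after the first lemma).  E.g. L"a<b>+c<d>" yields r == L"a", a == L"<b>+c<d>".
std::pair<std::wstring, std::wstring> splitRoot(const std::wstring &analysis) {
  const std::wstring::size_type firstTag = analysis.find(L'<');
  if (firstTag == std::wstring::npos)
    return std::make_pair(analysis, std::wstring());
  return std::make_pair(analysis.substr(0, firstTag), analysis.substr(firstTag));
}
```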

Given the lexical unit `^b/b<a>/b<b>$`, the tagger assigns the analysis string `b<a>` a score of *P*(`b` | `<a>`)*f*(`<a>`), computed with add-one smoothing. Note that typeCount_a counts the analysis string being scored. For example, the tagger would assign the known analysis string `a<a>` a score of *P*(`a` | `<a>`)*f*(`<a>`). The tagger assigns the analysis string `b<b>` a score of *P*(`b` | `<b>`)*f*(`<b>`); because `<b>` is more frequent than `<a>` in the corpus, `b<b>` outscores `b<a>`.

#### File Format

The tagger represents this model as `std::map<a, std::map<Lemma, std::size_t> > Model;`.^{[7]}^{[20]}^{[21]}^{[6]}^{[22]}

See section 3.1.2.

For each `a`,^{[20]} the tagger first serialises `TheTags`,^{[23]} the tag sequence, which is of type `std::vector<Tag>`,^{[12]}^{[24]} followed by `TheMorphemes`,^{[25]} the morpheme sequence, which is of type `std::vector<Morpheme>`.^{[12]}^{[13]} For each `Lemma`,^{[21]} the tagger serialises `TheLemma`,^{[26]} the lemma, which is of type `std::wstring`.^{[15]}

Given the corpus

    ^a/a<b>+c<d>$

the tagger writes ^{[18]}

    0000000: 0101 0101 0101 0162 0101 0101 0163 0101  .......b.....c..
    0000010: 0101 0164 0101 0101 0161 0101 0a         ...d.....a...
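Read against the description above, a hand annotation of the dump looks like this, assuming the morpheme `c<d>` inside the `a` is serialised as in model 1 and that the final `0a` is the file’s trailing newline.

    01 01   Model.size() == 1
    01 01   TheTags.size() == 1      (the a of a<b>+c<d>, i.e. <b>+c<d>)
    01 01   TheTag.size() == 1
    01 62   'b'
    01 01   TheMorphemes.size() == 1
    01 01   TheLemma.size() == 1
    01 63   'c'
    01 01   TheTags.size() == 1
    01 01   TheTag.size() == 1
    01 64   'd'
    01 01   the inner map's size, 1
    01 01   TheLemma.size() == 1     (the root, r)
    01 61   'a'
    01 01   the frequency, 1
    0a      trailing newline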

### Model 3

See section 5.3.3 of “A set of open-source tools for Turkish natural language processing.”^{[2]}

Consider Example 3.1.1: `handtagged.txt`. The morpheme `a<b>` is twice as frequent as the morpheme `a<a>`. However, model 2 scores the analysis strings `a<a>+a<a>` and `a<b>+a<a>` equally because the *a* of neither appears in the corpus.

This model splits each analysis string into a root, *r*, a first inflection, *i*_{0}, and a sequence of derivation-inflection pairs, (*d*_{1},*i*_{1})...(*d*_{n},*i*_{n}). The *r* of the analysis string `a<b>+c<d>` is `a`, its *i*_{0} is `<b>`, and its (*d*_{1},*i*_{1})...(*d*_{n},*i*_{n}) is `c<d>`, where its *d*_{1} is `c` and its *i*_{1} is `<d>`. The tagger assigns each analysis string a score computed from these parts with add-one smoothing. The tagger assigns higher scores to unknown analysis strings with frequent *r*,*i*_{0} than to unknown analysis strings with infrequent *r*,*i*_{0}.
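As with model 2, the decomposition can be illustrated on plain wide strings; the sketch below is only an approximation of the idea, since the tagger works on its structured `Morpheme` and `Tag` objects rather than on raw strings.

```cpp
#include <string>
#include <utility>
#include <vector>

// Split an analysis string into its root r (the first lemma), its first
// inflection i_0 (the first lemma's tags), and the remaining
// derivation-inflection pairs (d_k, i_k).
struct SplitAnalysis {
  std::wstring r;
  std::wstring i0;
  std::vector<std::pair<std::wstring, std::wstring> > di;
};

SplitAnalysis split(const std::wstring &analysis) {
  SplitAnalysis result;
  std::wstring::size_type begin = 0;
  bool first = true;
  for (;;) {
    const std::wstring::size_type next = analysis.find(L'+', begin);
    const std::wstring morpheme = analysis.substr(
        begin, next == std::wstring::npos ? next : next - begin);
    const std::wstring::size_type firstTag = morpheme.find(L'<');
    const std::wstring lemma = morpheme.substr(0, firstTag);
    const std::wstring tags =
        firstTag == std::wstring::npos ? std::wstring() : morpheme.substr(firstTag);
    if (first) {
      result.r = lemma;
      result.i0 = tags;
      first = false;
    } else {
      result.di.push_back(std::make_pair(lemma, tags));
    }
    if (next == std::wstring::npos)
      break;
    begin = next + 1;
  }
  return result;
}

// split(L"a<b>+c<d>") yields r == L"a", i0 == L"<b>", and di == {(L"c", L"<d>")}.
```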

Given the lexical unit `^aa/a<a>+a<a>/a<b>+a<a>$`, the tagger scores the analysis strings `a<a>+a<a>` and `a<b>+a<a>`. Because the *r*,*i*_{0} `a<b>` is more frequent in the corpus than `a<a>`, the tagger assigns `a<b>+a<a>` the higher score.

#### File Format

The tagger represents this model as `std::pair<std::map<i, std::map<Lemma, std::size_t> >, std::pair<std::map<i, std::map<Lemma, std::size_t> >, std::map<Lemma, std::map<i, std::size_t> > > > Model;`.^{[10]}^{[7]}^{[27]}^{[21]}^{[6]}

See section 3.1.2.

For each `i`,^{[27]} the tagger serialises `TheTags`,^{[28]} the tag sequence, which is of type `std::vector<Tag>`.^{[12]}^{[24]}

Given the corpus

    ^a/a<b>+c<d>$

the tagger writes ^{[18]}

    0000000: 0101 0101 0101 0162 0101 0101 0161 0101  .......b.....a..
    0000010: 0101 0101 0101 0162 0101 0101 0163 0101  .......b.....c..
    0000020: 0101 0101 0163 0101 0101 0101 0164 0101  .....c.......d..
    0000030: 0a                                       .
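Read against the `Model` type above, a hand annotation of the dump, assuming the maps are written in declaration order and the final `0a` is a trailing newline:

    0101 0101 0101 0162 0101 0101 0161 0101   first map:  { <b> -> { a -> 1 } }
    0101 0101 0101 0162 0101 0101 0163 0101   second map: { <b> -> { c -> 1 } }
    0101 0101 0163 0101 0101 0101 0164 0101   third map:  { c -> { <d> -> 1 } }
    0a                                         trailing newline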

## Notes

1. https://github.com/m5w/apertium
2. http://coltekin.net/cagri/papers/trmorph-tools.pdf
3. Installation#If you want to add language data / do more advanced stuff
4. Minimal installation from SVN#Set up environment
5. Minimal installation from SVN#Configure, build, and install
6. http://en.cppreference.com/w/cpp/types/size_t
7. http://en.cppreference.com/w/cpp/container/map
8. https://github.com/m5w/apertium/blob/master/apertium/analysis.h
9. https://github.com/m5w/apertium/blob/master/apertium/basic_5_3_1_tagger.h#L28
10. http://en.cppreference.com/w/cpp/utility/pair
11. https://github.com/m5w/apertium/blob/master/apertium/analysis.h#L33
12. http://en.cppreference.com/w/cpp/container/vector
13. https://github.com/m5w/apertium/blob/master/apertium/morpheme.h
14. https://github.com/m5w/apertium/blob/master/apertium/morpheme.h#L30
15. http://en.cppreference.com/w/cpp/string/basic_string
16. https://github.com/m5w/apertium/blob/master/apertium/morpheme.h#L31
17. https://github.com/m5w/apertium/blob/master/apertium/tag.h#L27
18. http://linux.die.net/man/1/xxd
19. 
20. https://github.com/m5w/apertium/blob/master/apertium/a.h
21. https://github.com/m5w/apertium/blob/master/apertium/lemma.h
22. https://github.com/m5w/apertium/blob/master/apertium/basic_5_3_2_tagger.h#L29
23. https://github.com/m5w/apertium/blob/master/apertium/a.h#L32
24. https://github.com/m5w/apertium/blob/master/apertium/tag.h
25. https://github.com/m5w/apertium/blob/master/apertium/a.h#L33
26. https://github.com/m5w/apertium/blob/master/apertium/lemma.h#L32
27. https://github.com/m5w/apertium/blob/master/apertium/i.h
28. https://github.com/m5w/apertium/blob/master/apertium/i.h#L34