Unigram tagger

m5w/apertium's apertium-tagger supports all of the unigram models described in "A set of open-source tools for Turkish natural language processing".

Install

First, install all prerequisites. See Installation#If you want to add language data / do more advanced stuff. Then, replace <directory> with the directory you'd like to clone m5w/apertium into and clone the repository.

git clone https://github.com/m5w/apertium.git <directory>

Then, see Minimal installation from SVN#Set up environment. Finally, configure, build, and install m5w/apertium. See Minimal installation from SVN#Configure, build, and install.

Usage

See apertium-tagger -h.

Training a Model on a Hand-Tagged Corpus

First, get a hand-tagged corpus, as you would for any other model.

$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$

Example: a Hand-Tagged Corpus for apertium-tagger

Then, replace MODEL with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace SERIALISED_BASIC_TAGGER with the filename you'd like to write the model to, and train the tagger.

$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt
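To make the corpus format concrete, the following is a minimal Python sketch that counts how often each analysis string occurs in the example corpus above. It is not the tagger's training code: the function name parse_tagged_line is hypothetical, and the parsing assumes the simplified ^surface/analysis$ shape shown in the example, ignoring any escaping the real stream format may use.

```python
from collections import Counter


def parse_tagged_line(line):
    """Split a line of the form '^surface/analysis$' into (surface, analysis).

    Assumption: exactly one analysis per line and no escaped characters,
    as in the example corpus.
    """
    body = line.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError(f"not a lexical unit: {line!r}")
    surface, analysis = body[1:-1].split("/", 1)
    return surface, analysis


# The example hand-tagged corpus, one lexical unit per line.
HANDTAGGED = """\
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
"""

# Frequency of each analysis string across the corpus.
analysis_counts = Counter(
    parse_tagged_line(line)[1] for line in HANDTAGGED.splitlines() if line
)

for analysis, count in sorted(analysis_counts.items()):
    print(analysis, count)
```

These per-analysis frequencies are the raw counts that a frequency-based unigram model would start from.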

Unigram Models

m5w/apertium's apertium-tagger implements the three unigram models described in section 5.3 of "A set of open-source tools for Turkish natural language processing".

Model 1

See section 5.3.1. This model scores each analysis string by its frequency f in the training corpus, with add-one smoothing, so an analysis string seen f times scores f + 1. Consider the following corpus.

^a/a<a>$
^a/a<b>$
^a/a<b>$

Passed the lexical unit ^a/a<a>/a<b>/a<c>$, the tagger assigns the analysis string a<a> a score of

f + 1 =
  (1) + 1 =
  2

and a<b> a score of (2) + 1 = 3. The unknown analysis string a<c> is assigned a score of 1.

If reconfigured with --enable-debug, the tagger prints such calculations to stderr.

score("a<a>") ==
  2 ==
  2.000000000000000000
score("a<b>") ==
  3 ==
  3.000000000000000000
score("a<c>") ==
  1 ==
  1.000000000000000000
^a<b>$
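
The Model 1 calculation above can be reproduced with a short Python sketch. This is an illustration of add-one smoothing over analysis-string counts, not the tagger's actual implementation: the names parse_ambiguous and model1_scores are hypothetical, the parsing ignores escaping, and breaking ties by taking the first highest-scoring analysis is an assumption.

```python
from collections import Counter


def parse_ambiguous(lexical_unit):
    """Split '^surface/analysis1/analysis2/...$' into (surface, [analyses]).

    Assumption: no escaped '/' characters inside the lexical unit.
    """
    body = lexical_unit.strip()[1:-1]
    surface, *analyses = body.split("/")
    return surface, analyses


def model1_scores(counts, analyses):
    """Score each analysis string as its corpus frequency plus one."""
    # Counter returns 0 for unseen keys, so unknown analyses score 1.
    return {analysis: counts[analysis] + 1 for analysis in analyses}


# Frequencies from the three-line corpus: a<a> once, a<b> twice.
counts = Counter({"a<a>": 1, "a<b>": 2})

surface, analyses = parse_ambiguous("^a/a<a>/a<b>/a<c>$")
scores = model1_scores(counts, analyses)

# Pick the highest-scoring analysis; ties go to the first one listed
# (an assumption, since the page does not state the tie-breaking rule).
best = max(analyses, key=scores.__getitem__)

for analysis in analyses:
    print(f'score("{analysis}") == {scores[analysis]}')
print(f"^{best}$")
```

With these counts, the sketch selects a<b>, matching the tagger output shown above.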