Unigram tagger
m5w/apertium's apertium-tagger supports all of the unigram models described in "A set of open-source tools for Turkish natural language processing".
Install
First, install all prerequisites. See the "If you want to add language data / do more advanced stuff" section of the Installation page.
Then clone the repository, replacing <directory> with the directory you'd like to clone m5w/apertium into.
git clone https://github.com/m5w/apertium.git <directory>
Then set up the environment. See the "Set up environment" section of Minimal installation from SVN. Finally, configure, build, and install m5w/apertium. See the "Configure, build, and install" section of the same page.
Usage
See apertium-tagger -h.
Train a Model on a Hand-Tagged Corpus
First, get a hand-tagged corpus as you would for any other model.
$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
Example 1: a Hand-Tagged Corpus for apertium-tagger
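To make the corpus format concrete, the following Python sketch parses hand-tagged lexical units of the form ^surface/analysis$ and counts how often each analysis string occurs. It is an illustration only: the helper name count_analyses and the regular expression are assumptions made here, not part of apertium-tagger, and the sketch ignores escaping and other details of the full Apertium stream format.

```python
import re
from collections import Counter

# Hypothetical pattern for one hand-tagged lexical unit: ^surface/analysis$.
# Assumes no escaped characters and exactly one analysis per unit.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^/$]*)\$")


def count_analyses(text):
    """Return a Counter mapping each analysis string to its frequency."""
    return Counter(analysis for _surface, analysis in LEXICAL_UNIT.findall(text))


# The corpus from Example 1.
corpus = """\
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
"""

counts = count_analyses(corpus)
for analysis, frequency in sorted(counts.items()):
    print(analysis, frequency)
```

Frequencies like these are the raw material a unigram model is trained on. How apertium-tagger stores them in the serialised model file is not covered by this sketch.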
Then, replace MODEL with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace SERIALISED_BASIC_TAGGER with the filename you'd like to write the model to, and train the tagger.
$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt
Disambiguate
Either write input to a file or pipe it.
$ cat raw.txt
^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$
Example 2: Input for apertium-tagger
Replace MODEL with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace SERIALISED_BASIC_TAGGER with the file to which you wrote the unigram model, and disambiguate the input.
$ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
^a/a<b>$
^aa/a<b>+a<b>$
$ echo '^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
^a/a<b>$
^aa/a<b>+a<b>$
Unigram Models
m5w/apertium's apertium-tagger implements the three unigram models in "A set of open-source tools for Turkish natural language processing". See section 5.3.
Model 1
See section 5.3.1. This model scores each analysis string in proportion to its frequency with add-one smoothing. Consider the following corpus.
^a/a<a>$
^a/a<b>$
^a/a<b>$
Passed the lexical unit ^a/a<a>/a<b>/a<c>$, the tagger assigns the analysis string a<a> a score of f + 1 = 1 + 1 = 2, where f is the frequency of the analysis string in the corpus, and a<b> a score of 2 + 1 = 3. The unknown analysis string a<c>, which has a frequency of 0, is assigned a score of 0 + 1 = 1.
If reconfigured with --enable-debug, the tagger prints such calculations to stderr.
score("a<a>") == 2 == 2.000000000000000000
score("a<b>") == 3 == 3.000000000000000000
score("a<c>") == 1 == 1.000000000000000000
^a<b>$
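The Model 1 arithmetic can be reproduced with a short Python sketch. It is not the tagger's implementation: the function name model1_scores is hypothetical, the frequencies are entered by hand from the three-unit corpus in this section, and picking the first highest-scoring candidate on a tie is an assumption, since this page does not say how the tagger breaks ties.

```python
from collections import Counter


def model1_scores(counts, analyses):
    """Score each candidate analysis as its corpus frequency plus one."""
    # Counter returns 0 for unseen analyses, so they score 0 + 1 = 1.
    return {analysis: counts[analysis] + 1 for analysis in analyses}


# Frequencies from the corpus ^a/a<a>$ ^a/a<b>$ ^a/a<b>$.
counts = Counter({"a<a>": 1, "a<b>": 2})

# Candidate analyses of the lexical unit ^a/a<a>/a<b>/a<c>$.
candidates = ["a<a>", "a<b>", "a<c>"]

scores = model1_scores(counts, candidates)
# Assumption: on a tie, max() keeps the first candidate with the top score.
best = max(candidates, key=scores.get)

print(scores)  # {'a<a>': 2, 'a<b>': 3, 'a<c>': 1}
print(best)    # a<b>
```

The scores match the debug output above, and the highest-scoring analysis, a<b>, is the one the tagger emits.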