Unigram tagger

apertium-tagger from m5w/apertium supports all three unigram models from "A set of open-source tools for Turkish natural language processing" (http://coltekin.net/cagri/papers/trmorph-tools.pdf).

Install

First, install all prerequisites. See Installation#If you want to add language data / do more advanced stuff. Then, replace <directory> with the directory you'd like to clone m5w/apertium into and clone the repository.

git clone https://github.com/m5w/apertium.git <directory>

Then, see Minimal installation from SVN#Set up environment. Finally, configure, build, and install m5w/apertium. See Minimal installation from SVN#Configure, build, and install.

Usage

See apertium-tagger -h.

Train a Model on a Hand-Tagged Corpus

First, get a hand-tagged corpus as one would for all other models.

$ cat handtagged.txt
^a/a<a>$
^a/a<b>$
^a/a<b>$
^aa/a<a>+a<a>$
^aa/a<a>+a<b>$
^aa/a<a>+a<b>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<a>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$
^aa/a<b>+a<b>$

Example 1: a Hand-Tagged Corpus for apertium-tagger

Then, replace MODEL with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace SERIALISED_BASIC_TAGGER with the filename to which you'd like to write the model, and train the tagger.

$ apertium-tagger -s 0 -u MODEL SERIALISED_BASIC_TAGGER handtagged.txt

Disambiguate

Either write input to a file or pipe it.

$ cat raw.txt
^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$

Example 2: Input for apertium-tagger

Replace MODEL with the unigram model from "A set of open-source tools for Turkish natural language processing" you'd like to use, replace SERIALISED_BASIC_TAGGER with the file to which you wrote the unigram model, and disambiguate the input.

$ apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER raw.txt
^a/a<b>$
^aa/a<b>+a<b>$
$ echo '^a/a<a>/a<b>/a<c>$
^aa/a<a>+a<a>/a<a>+a<b>/a<b>+a<a>/a<b>+a<b>/a<a>+a<c>/a<c>+a<a>/a<c>+a<c>$' | \
apertium-tagger -gu MODEL SERIALISED_BASIC_TAGGER
^a/a<b>$
^aa/a<b>+a<b>$

Unigram Models

See section 5.3 of "A set of open-source tools for Turkish natural language processing."

Model 1

See section 5.3.1 of "A set of open-source tools for Turkish natural language processing." This model scores each analysis string in proportion to its frequency with add-one smoothing. Consider the following corpus.

^a/a<a>$
^a/a<b>$
^a/a<b>$

Given the lexical unit ^a/a<a>/a<b>/a<c>$, the tagger assigns the analysis string a<a> a score of

f + 1 =
  (1) + 1 =
  2

and a<b> a score of (2) + 1 = 3. The unknown analysis string a<c> is assigned a score of 1.

If apertium is reconfigured with --enable-debug, the tagger prints such calculations to stderr.

score("a<a>") ==
  2 ==
  2.000000000000000000
score("a<b>") ==
  3 ==
  3.000000000000000000
score("a<c>") ==
  1 ==
  1.000000000000000000
^a<b>$
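
The following C++ sketch reproduces this scoring scheme; it is an illustration, not the actual apertium-tagger code, and all names in it are illustrative. The frequency table is built from the three-line corpus above.

#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
  // Analysis-string frequencies from the corpus above:
  // a<a> occurs once and a<b> twice.
  const std::map<std::string, std::size_t> frequency = {{"a<a>", 1},
                                                        {"a<b>", 2}};
  // The analyses of the lexical unit ^a/a<a>/a<b>/a<c>$.
  const std::vector<std::string> analyses = {"a<a>", "a<b>", "a<c>"};
  for (const auto &analysis : analyses) {
    const auto it = frequency.find(analysis);
    const std::size_t f = it != frequency.end() ? it->second : 0;
    // Add-one smoothing: an unseen analysis such as a<c> scores 0 + 1 = 1.
    std::cout << "score(\"" << analysis << "\") == " << (f + 1) << '\n';
  }
}

Run on the example lexical unit, this prints the same scores as the debug output above, and the tagger picks the highest-scoring analysis, a<b>.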

Training on Corpora with Ambiguous Lexical Units

Consider the following corpus.

^a/a<a>$
^a/a<a>/a<b>$
^a/a<b>$
^a/a<b>$

For the second lexical unit, ^a/a<a>/a<b>$, the probabilities of a<a> and a<b> are each one half. However, all unigram models store frequencies as std::size_t, an integral type, so they cannot record fractional counts directly.

To account for this, the tagger stores the LCM of all lexical units' sizes, where a lexical unit's size is the size of its analysis vector. It initializes the LCM to 1, expecting unambiguous lexical units. The LCM, 1, is divisible by the size of this corpus's first lexical unit, ^a/a<a>$, which is 1, so the tagger increments the frequency of its analysis, a<a>, by LCM / size = (1) / (1) = 1.

The LCM, 1, isn't divisible by the size of the next lexical unit, ^a/a<a>/a<b>$, which is 2, so the tagger multiplies both the LCM and all previously counted analysis frequencies by this size. The frequency of a<a> becomes 2, and the LCM becomes 2. Then, the tagger increments the frequency of each of this lexical unit's analyses, a<a> and a<b>, by LCM / size = (2) / (2) = 1. The frequency of a<a> is then 3, and the frequency of a<b> is 1.
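
Processing the remaining two lexical units, each of size 1, adds LCM / size = (2) / (1) = 2 to the frequency of a<b> twice, so training ends with a frequency of 3 for a<a> and 5 for a<b>. This preserves the 3 : 5 ratio of the fractional counts (1.5 and 2.5), scaled by the LCM. The following C++ sketch implements the counting scheme as described above; it is an illustration, not the actual apertium-tagger code, and all names in it are illustrative.

#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
  // The ambiguous corpus above; each lexical unit is its analysis vector.
  const std::vector<std::vector<std::string>> corpus = {
      {"a<a>"}, {"a<a>", "a<b>"}, {"a<b>"}, {"a<b>"}};
  std::map<std::string, std::size_t> frequency;
  std::size_t lcm = 1;  // running common multiple of the sizes seen so far
  for (const auto &lexical_unit : corpus) {
    const std::size_t size = lexical_unit.size();
    if (lcm % size != 0) {
      // Rescale so every analysis of this unit gets an integral share.
      for (auto &entry : frequency)
        entry.second *= size;
      lcm *= size;
    }
    for (const auto &analysis : lexical_unit)
      frequency[analysis] += lcm / size;
  }
  for (const auto &entry : frequency)
    std::cout << entry.first << ": " << entry.second << '\n';
  // Prints: a<a>: 3 and a<b>: 5.
}

Multiplying every stored frequency whenever the running multiple isn't divisible by a unit's size keeps all counts integral while preserving their relative weights.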