Unigram tagger

From Apertium

Revision as of 03:31, 14 January 2016

Install

The code is a clone of apertium, hosted at m5w/apertium. It has the same dependencies as apertium, so install it the same way. See Installation and Minimal installation from SVN for more information.

Unigram Models

This code's apertium-tagger implements the three unigram models described in section 5.3 of "A set of open-source tools for Turkish natural language processing".

Model 1

See section 5.3.1. This model scores each analysis string in proportion to its frequency with add-one smoothing. Consider the following corpus.

^a/a<a>$
^a/a<b>$
^a/a<b>$

Passed the lexical unit ^a/a<a>/a<b>/a<c>$, the tagger assigns the analysis string a<a> a score of

f + 1 =
  (1) + 1 =
  2

and a<b> a score of (2) + 1 = 3. The unknown analysis string a<c> is assigned a score of 1.

If the build is reconfigured with --enable-debug, the tagger prints these calculations to stderr.



score("a<a>") ==
  2 ==
  2.000000000000000000
score("a<b>") ==
  3 ==
  3.000000000000000000
score("a<c>") ==
  1 ==
  1.000000000000000000
^a<b>$
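
The Model 1 scoring walked through above can be reproduced with a short Python sketch. This is an illustration only, not the apertium-tagger implementation: the function names read_corpus, score and tag are hypothetical, and the parsing assumes simple lexical units with no escaped slashes. It counts each analysis string in the example corpus, adds one to the count, and selects the highest-scoring analysis.

```python
from collections import Counter
import re


def read_corpus(lines):
    """Count analysis strings in tagged lexical units such as ^a/a<b>$.

    Hypothetical helper: assumes one unambiguous lexical unit per line
    and no escaped '/' characters.
    """
    counts = Counter()
    for line in lines:
        match = re.fullmatch(r"\^([^/]*)/(.*)\$", line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


def score(counts, analysis):
    """Model 1 score: frequency f plus one (add-one smoothing)."""
    return counts[analysis] + 1


def tag(counts, lexical_unit):
    """Return the highest-scoring analysis of an ambiguous lexical unit."""
    body = lexical_unit.strip()[1:-1]      # drop the leading ^ and trailing $
    _surface, *analyses = body.split("/")  # first field is the surface form
    best = max(analyses, key=lambda a: score(counts, a))
    return "^" + best + "$"


corpus = ["^a/a<a>$", "^a/a<b>$", "^a/a<b>$"]
counts = read_corpus(corpus)

for analysis in ["a<a>", "a<b>", "a<c>"]:
    print(analysis, score(counts, analysis))

print(tag(counts, "^a/a<a>/a<b>/a<c>$"))
```

Run on the corpus above, the sketch gives a<a> a score of 2, a<b> a score of 3 and the unseen a<c> a score of 1, so it selects ^a<b>$, matching the debug output.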