Difference between revisions of "User:Firespeaker/Cleaning up a tail"

From Apertium
(Created page with "== The problem == For Turkic languages, there is a huge tail of unknown words when running coverage. Presumably this is because of morphological complexity—i.e., a small ha...")
 
Line 1: Line 1:
== The problem ==
[[File:Religion.unk.png|thumb=200px|Zipf's law seen in the unknown words from two Turkic corpora]]
For Turkic languages, there is a huge tail of unknown words when running coverage. Presumably this is because of morphological complexity—i.e., a small handful of unknown stems can result in hundreds of unknown forms.

Because word frequencies follow [http://en.wikipedia.org/wiki/Zipf's%20law Zipf's law], running coverage leaves a huge tail of unknown words. This effect is compounded in languages with high levels of morphological complexity, where a small handful of unknown stems can result in hundreds of unknown forms.

If some of these stems could be inferred from the forms they appear in, transducer coverage could be increased much more quickly.


== A proposed solution ==

# Convert the transducer to accept a wildcard in place of the lemma for a given category (especially nouns and verbs)
# Run coverage on the list of unknown words
# Check the top of the resulting hitparade, which should contain the most common unknown stems
# Verify each candidate stem before adding it to the dictionary

Revision as of 18:54, 12 March 2014

The problem

File:Religion.unk.png
Zipf's law seen in the unknown words from two Turkic corpora

Because word frequencies follow Zipf's law, running coverage leaves a huge tail of unknown words. This effect is compounded in languages with high levels of morphological complexity, where a small handful of unknown stems can result in hundreds of unknown forms.

If some of these stems could be inferred from the forms they appear in, transducer coverage could be increased much more quickly.

A proposed solution

  1. Convert the transducer to accept a wildcard in place of the lemma for a given category (especially nouns and verbs)
  2. Run coverage on the list of unknown words
  3. Check the top of the resulting hitparade, which should contain the most common unknown stems
  4. Verify each candidate stem before adding it to the dictionary
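
The ranking idea behind steps 2 and 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the wildcard transducer itself. The suffix list, the example word forms, and the function names `candidate_stems` and `stem_hitparade` are all hypothetical, and plain suffix stripping stands in for the real morphology. Each unknown form proposes the stems it could be built on, and stems proposed by many distinct forms rise to the top.

```python
from collections import Counter

# Hypothetical suffix inventory; a real run would rely on the language's
# own transducer with a wildcard stem rather than on a flat suffix list.
SUFFIXES = ["lar", "ler", "da", "de", "ta", "te"]


def candidate_stems(form, suffixes=SUFFIXES, min_stem=2):
    """Return the form itself plus every stem reachable by stripping suffixes."""
    stems = {form}
    frontier = [form]
    while frontier:
        current = frontier.pop()
        for suffix in suffixes:
            if not current.endswith(suffix):
                continue
            stem = current[:-len(suffix)]
            # Skip implausibly short stems and stems already seen.
            if len(stem) >= min_stem and stem not in stems:
                stems.add(stem)
                frontier.append(stem)
    return stems


def stem_hitparade(unknown_forms, suffixes=SUFFIXES):
    """Rank candidate stems by how many distinct unknown forms propose them."""
    counts = Counter()
    for form in set(unknown_forms):
        for stem in candidate_stems(form, suffixes):
            counts[stem] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Illustrative input only; these forms are not taken from a real corpus run.
    for stem, n_forms in stem_hitparade(["kitaplar", "kitapta", "kitaplarda", "evde"]):
        print(n_forms, stem)
```

Counting distinct forms rather than token frequency is a design choice made for this sketch. Weighting by corpus frequency would be an equally plausible variant, and in either case the top candidates still need manual verification, as in step 4.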