User:Firespeaker/Cleaning up a tail

The problem

Because word frequencies follow Zipf's law, a coverage run leaves a huge tail of unknown words. This effect is compounded in languages with high levels of morphological complexity, where a small handful of unknown stems can result in hundreds of unknown forms.

If some of these stems could be inferred from the forms in which they occur, transducer coverage could be increased much more quickly.

A proposed solution

Convert the transducer to accept a wildcard in place of the lemma for a certain category (especially nouns and verbs)
Run coverage on the list of unknown words using the wildcard transducer
The top of the resulting hitparade should include the most common unknown stems
Verify each candidate stem before adding it to the dictionary
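
The steps above can be roughly sketched in Python. This is only an approximation of the idea, not an actual transducer: the wildcard lemma is simulated by stripping suffixes from a short hand-written list. The suffix list, the function names (`candidate_stems`, `rank_stems`) and the sample forms are all assumptions made up for illustration; a real setup would take the paradigms from the transducer itself.

```python
from collections import Counter, defaultdict

# Hypothetical noun suffixes for a toy agglutinative language. In a real
# setup these would come from the transducer's paradigms, not a flat list.
SUFFIXES = ["", "lar", "da", "larda", "ni", "larni"]


def candidate_stems(form, suffixes=SUFFIXES, min_stem_len=2):
    """Yield every stem obtained by stripping one listed suffix from form."""
    for suffix in suffixes:
        if form.endswith(suffix):
            stem = form[:len(form) - len(suffix)]
            if len(stem) >= min_stem_len:
                yield stem


def rank_stems(unknown_counts, suffixes=SUFFIXES):
    """Rank candidate stems: most distinct forms first, then token count."""
    forms = defaultdict(set)   # stem -> distinct unknown forms it explains
    tokens = Counter()         # stem -> summed frequency of those forms
    for form, count in unknown_counts.items():
        for stem in candidate_stems(form, suffixes):
            forms[stem].add(form)
            tokens[stem] += count
    return sorted(forms, key=lambda s: (-len(forms[s]), -tokens[s], s))


if __name__ == "__main__":
    # Toy unknown-word frequencies, invented for this example.
    unknown = {"kitoblar": 5, "kitobda": 3, "kitobni": 2, "kitob": 4, "xyz": 1}
    for stem in rank_stems(unknown)[:3]:
        print(stem)
```

A stem that explains many distinct unknown forms floats to the top of the ranking, which mirrors the hitparade step. The output is still only a list of candidates, so the manual verification step remains necessary.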
