User:Ilnar.salimzyan/Coverage

From Apertium

While working on a machine translator for closely related languages, we spend a lot of time adding new stems to the dictionary. Here are some ideas that I think could speed up that process and that I want to test on the Tatar-Bashkir pair.

  • Given: a raw text in language X and a prototype morphological analyser for language X
  • run the text through the analyser
  • for known words, disambiguate or correct the output of the morphological analyser, leaving only the one analysis you'd expect in that context
  • learn a function which assigns a lemma+tag(s) label to unknown words
  • assign lemma+tag(s) labels to the unknown words using that function
  • manually correct the output of the function, skipping over the words already corrected at step 3
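
The steps above leave open what the learned function would look like. As a minimal sketch, assuming the analyser output is in the usual Apertium stream format (`^surface/analysis$`, with unknown words marked by `*`), one very simple option is a suffix-based guesser trained on the disambiguated known words. The function names, the `max_suffix` parameter and the toy input below are all hypothetical, chosen for illustration only; the parser ignores escaping and multiword tokens, and a real experiment might well use a different learner.

```python
import re
from collections import Counter, defaultdict

# Simplified pattern for Apertium stream tokens: ^surface/analysis1/analysis2$
# (assumption: no escaped characters or multiword units in the input)
TOKEN_RE = re.compile(r"\^([^/$]+)/([^$]+)\$")


def parse_stream(text):
    """Yield (surface, [analyses]) pairs from analyser output."""
    for match in TOKEN_RE.finditer(text):
        yield match.group(1), match.group(2).split("/")


def is_unknown(analyses):
    """Unknown words come back as a single analysis starting with '*'."""
    return analyses[0].startswith("*")


def split_analysis(analysis):
    """Split 'kitap<n><pl>' into ('kitap', '<n><pl>')."""
    lemma, sep, tags = analysis.partition("<")
    return lemma, sep + tags


def learn_suffix_rules(gold, max_suffix=4):
    """Count, per surface suffix, the rules (strip_n, append, tags) seen in gold.

    gold is a list of (surface, analysis) pairs with exactly one analysis
    each, i.e. the known words after manual disambiguation.
    """
    rules = defaultdict(Counter)
    for surface, analysis in gold:
        lemma, tags = split_analysis(analysis)
        # Length of the prefix shared by the surface form and the lemma.
        i = 0
        while (i < min(len(surface), len(lemma))
               and surface[i].lower() == lemma[i].lower()):
            i += 1
        rule = (len(surface) - i, lemma[i:], tags)
        for k in range(1, min(max_suffix, len(surface)) + 1):
            rules[surface[-k:]][rule] += 1
    return rules


def guess(word, rules, max_suffix=4):
    """Return a lemma+tags guess for an unknown word, or None if no rule fits."""
    for k in range(min(max_suffix, len(word)), 0, -1):  # longest suffix first
        candidates = rules.get(word[-k:])
        if candidates:
            strip_n, append, tags = candidates.most_common(1)[0][0]
            return word[:len(word) - strip_n] + append + tags
    return None


if __name__ == "__main__":
    # Toy input made up for illustration, not real analyser output.
    stream = ("^kitaplar/kitap<n><pl><nom>$ ^almalar/alma<n><pl><nom>$ "
              "^balalar/*balalar$")
    tokens = list(parse_stream(stream))
    gold = [(s, a[0]) for s, a in tokens if not is_unknown(a)]
    rules = learn_suffix_rules(gold)
    for surface, analyses in tokens:
        if is_unknown(analyses):
            print(surface, "->", guess(surface, rules))
```

The guesses would then go to the manual correction step; words for which `guess` returns `None` simply stay unknown.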