User:Ilnar.salimzyan/Coverage
While working on a machine translator for closely related languages, we spend a lot of time adding new stems to the dictionary. Here are some ideas which I think would speed up that process and which I want to test on the Tatar-Bashkir pair.
- Given: a raw text in language X and a prototype morphological analyser for language X
- run the text through the analyser
- for known words, disambiguate or correct the output of the morphological analyser, leaving only the one analysis you'd expect in that context
- learn a function which assigns the lemma+tag(s) label to unknown words
- assign the lemma+tag(s) label to unknown words using that function
- manually correct the output of the function, skipping the known words that were already corrected in the disambiguation step
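
The "learn a function" and "assign the label" steps above could be sketched roughly as follows. This is only a minimal illustration, not the actual method: it assumes the corrected known words are available as (surface form, lemma, tags) triples and learns simple suffix-rewriting rules from them. The names `learn_guesser`, `guess` and `max_suffix`, the tag names, and the toy Tatar training data are all made up for the example.

```python
from collections import Counter, defaultdict


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def learn_guesser(corrected, max_suffix=4):
    """Learn suffix rules from (surface, lemma, tags) triples.

    `corrected` stands for the known words after manual disambiguation.
    `max_suffix` is an arbitrary cap on the suffix length (an assumption).
    """
    rules = defaultdict(Counter)
    for surface, lemma, tags in corrected:
        p = common_prefix_len(surface, lemma)
        # Rule: strip surface[p:] from the word, append lemma[p:], assign tags.
        rule = (surface[p:], lemma[p:], tags)
        for k in range(1, min(max_suffix, len(surface)) + 1):
            rules[surface[-k:]][rule] += 1

    def guess(word):
        """Return (lemma, tags) for an unknown word, or (word, None)."""
        # Prefer the longest matching suffix, then the most frequent rule.
        for k in range(min(max_suffix, len(word)), 0, -1):
            candidates = rules.get(word[-k:], Counter())
            for (strip, add, tags), _count in candidates.most_common():
                if word.endswith(strip):
                    return word[:len(word) - len(strip)] + add, tags
        return word, None  # no rule applies; left for manual correction

    return guess


# Toy training data (illustrative only; tags are hypothetical).
corrected = [
    ("китаплар", "китап", ("n", "pl", "nom")),
    ("балалар", "бала", ("n", "pl", "nom")),
    ("мәктәпләр", "мәктәп", ("n", "pl", "nom")),
]
guess = learn_guesser(corrected)
result = guess("шәһәрләр")
unknown = guess("xyz")
```

Words for which `guess` returns `None` as the tags, and any wrong guesses, are exactly what the final manual correction step would have to handle.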