Improved corpus-based paradigm matching

<spectie> we want to increase the coverage of our analysers
<spectie> one way we can do it is for a human to think about every entry
<spectie> but we already have a lot of knowledge
<spectie> so, if we have a load of paradigms, and a load of lemmas -- we can try and match the lemmas to paradigms by generating 
          the forms for each paradigm and looking in a corpus
<spectie> there are quite a few papers on this.
<spectie> so, e.g. in english let's say we have the paradigms "-s" plural and "-ren" plural
<spectie> and a load of lemmas 'child, cat, dog, ...'
<spectie> we look up {child, childs}, {child, children}, {cat, catren}, {cat, cats} etc. in the corpus
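The lookup described above can be sketched in a few lines of Python. This is only a minimal illustration, not anyone's actual implementation: the paradigm table, the toy corpus and the function names (generate_forms, match_paradigms) are all made up for the example, and a paradigm is accepted here only if every form it generates is attested.

```python
from typing import Dict, List, Set

# Hypothetical paradigms: name -> suffixes appended to the lemma.
PARADIGMS: Dict[str, List[str]] = {
    "-s plural": ["", "s"],
    "-ren plural": ["", "ren"],
}


def generate_forms(lemma: str, suffixes: List[str]) -> Set[str]:
    """Generate every surface form a paradigm predicts for a lemma."""
    return {lemma + suffix for suffix in suffixes}


def match_paradigms(lemma: str,
                    paradigms: Dict[str, List[str]],
                    corpus_vocab: Set[str]) -> List[str]:
    """Return the paradigms whose generated forms are all found in the corpus."""
    matches = []
    for name, suffixes in paradigms.items():
        if generate_forms(lemma, suffixes) <= corpus_vocab:
            matches.append(name)
    return matches


# Toy corpus, invented for this sketch.
corpus = ("the child saw the cats and the children saw "
          "a cat and a dog with other dogs").split()
vocab = set(corpus)

for lemma in ["child", "cat", "dog"]:
    print(lemma, match_paradigms(lemma, PARADIGMS, vocab))
```

Requiring every form to be attested is a strict choice made for brevity; on a real corpus one would more likely count how many of the predicted forms are found and rank paradigms by that.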
<spectie> but then there is the problem of ambiguity
<spectie> in english we have ambiguity between verb forms with -s and noun forms 
<spectie> which usually isn't a big problem
<spectie> so the idea is to preprocess your corpus, using your existing analyser and tagger, to give possible values to unknown words
          based on the surrounding context
<spectie> so let's say you have: 
<spectie> na poziv *fizikalne i *matematične fakulteti u *Odesi
<spectie> you would assign possible values for case/number/gender to the unknown surface forms of *fizikalne and *matematične based 
          on the surrounding _known_ context (na .... fakulteti)
<spectie> then, when you come to check the paradigms against the surface forms in the corpus
<spectie> you're not only checking if the surface form is valid, but also if the morphological information that it predicts is valid
<spectie> as far as i know, this is novel
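A rough Python sketch of the context-constrained check described above, using the English noun/verb -s ambiguity. Everything in it is an assumption made for illustration: the tag names, the CONTEXT_TAGS table (standing in for the output of the preprocessing step, i.e. the readings that the known context allows for each unknown surface form) and the scoring function. A generated form only counts as confirmed if it is attested and the tag the paradigm predicts for it is among the readings the context allows.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical paradigms: name -> (suffix, predicted tag) pairs.
PARADIGMS: Dict[str, List[Tuple[str, str]]] = {
    "noun, -s plural": [("", "n.sg"), ("s", "n.pl")],
    "verb, -s 3sg": [("", "v.inf"), ("s", "v.pres.3sg")],
}

# Assumed output of preprocessing: for each unknown surface form, the
# readings that the surrounding known context allows (toy data).
CONTEXT_TAGS: Dict[str, Set[str]] = {
    "walk": {"n.sg"},
    "walks": {"n.pl"},
}


def score_paradigm(lemma: str,
                   paradigm: List[Tuple[str, str]],
                   context_tags: Dict[str, Set[str]]) -> int:
    """Count generated forms that are attested with a compatible tag."""
    confirmed = 0
    for suffix, tag in paradigm:
        allowed = context_tags.get(lemma + suffix)
        if allowed is not None and tag in allowed:
            confirmed += 1
    return confirmed


def best_paradigms(lemma: str,
                   paradigms: Dict[str, List[Tuple[str, str]]],
                   context_tags: Dict[str, Set[str]]) -> List[str]:
    """Return the paradigms with the highest non-zero score."""
    scores = {name: score_paradigm(lemma, forms, context_tags)
              for name, forms in paradigms.items()}
    top = max(scores.values())
    return [name for name, score in scores.items() if score == top and score > 0]


print(best_paradigms("walk", PARADIGMS, CONTEXT_TAGS))
```

Both paradigms generate exactly the surface forms "walk" and "walks", so a check on surface forms alone could not tell them apart; in this toy data only the noun paradigm predicts tags that the context allows.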