Improved corpus-based paradigm matching
Revision as of 21:39, 1 October 2012
<spectie> we want to increase the coverage of our analysers
<spectie> one way we can do it is for a human to think about every entry
<spectie> but, we already have a lot of knowledge
<spectie> so, if we have a load of paradigms, and a load of lemmas -- we can try and match the lemmas to paradigms by generating the forms for each paradigm and looking in a corpus
<spectie> there are quite a few papers on this.
<spectie> so, e.g. in english let's say we have the paradigms "-s" plural and "-ren" plural
<spectie> and a load of lemmas 'child, cat, dog, ...'
<spectie> we look up {child, childs}, {child, children}, {cat, catren}, {cat, cats} etc. in the corpus
<spectie> but then there is the problem of ambiguity
<spectie> in english we have ambiguity between verb forms with -s and noun forms
<spectie> which usually isn't a big problem
<spectie> so the idea is to preprocess your corpus, using your existing analyser and tagger, to give possible values to unknown words based on the surrounding context
<spectie> so let's say you have:
<spectie> na poziv *fizikalne i *matematične fakulteti u *Odesi
<spectie> you would assign possible values for case/number/gender to the unknown surface forms of *fizikalne and *matematične based on the surrounding _known_ context (na .... fakulteti)
<spectie> then, when you come to check the paradigms against the surface forms in the corpus
<spectie> you're not only checking if the surface form is valid, but also if the morphological information that it predicts is valid
<spectie> as far as i know, this is novel (apart from forsberg's preliminary work)
<spectie> if you need to do some kind of machine learning, you could quite easily use the existing entries in your morphology for learning which features best discriminate between ambiguous paradigms.
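
The idea in the log can be sketched roughly as below. This is only a toy illustration, not an existing tool. The paradigm table, the tag names (`sg`, `pl`, `inf`, `3sg`), the function names and the tiny corpus are all made up for the example. It assumes paradigms are plain suffix lists and that the context-derived tags for each surface form have already been produced by an analyser and tagger.

```python
from collections import Counter

# Hypothetical toy paradigms: each is a list of (tag, suffix) pairs.
# Real paradigms would also need to handle stem changes.
PARADIGMS = {
    "s-plural": [("sg", ""), ("pl", "s")],
    "ren-plural": [("sg", ""), ("pl", "ren")],
    "s-verb": [("inf", ""), ("3sg", "s")],
}


def generate_forms(lemma, paradigm):
    """Return the (surface_form, tag) pairs a paradigm predicts for a lemma."""
    return [(lemma + suffix, tag) for tag, suffix in paradigm]


def score_paradigms(lemma, corpus_counts, context_tags=None):
    """Count, per paradigm, how many predicted forms the corpus supports.

    corpus_counts maps surface form -> frequency in the corpus.
    context_tags (optional, assumed to come from a preprocessing step)
    maps surface form -> set of tags the surrounding context allows.
    When it is given, a predicted form only counts if its tag is allowed.
    """
    scores = {}
    for name, paradigm in PARADIGMS.items():
        score = 0
        for form, tag in generate_forms(lemma, paradigm):
            if corpus_counts.get(form, 0) == 0:
                continue  # predicted form never seen in the corpus
            if context_tags is not None:
                allowed = context_tags.get(form)
                if allowed is not None and tag not in allowed:
                    continue  # form exists, but context contradicts the tag
            score += 1
        scores[name] = score
    return scores


# Made-up corpus, just enough to exercise the example.
TOKENS = "the child saw the children and the cats chased a cat".split()
COUNTS = Counter(TOKENS)

# Surface forms only: "child" clearly prefers the -ren paradigm ...
print(score_paradigms("child", COUNTS))
# ... but "cat" is ambiguous between the noun and verb -s paradigms.
print(score_paradigms("cat", COUNTS))

# Hypothetical context-derived tags for the forms of "cat".
CONTEXT = {"cat": {"sg"}, "cats": {"pl"}}
# With them, the verb paradigm no longer gets any support.
print(score_paradigms("cat", COUNTS, CONTEXT))
```

Scoring by a plain count of attested forms is the simplest choice here; weighting by frequency, or learning from existing dictionary entries which forms best separate ambiguous paradigms, would slot into `score_paradigms` in place of the `score += 1` line.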