Improved corpus-based paradigm matching
* [[Building_dictionaries#Generating monolingual dictionary entries]]
* [[Extract]]

==Further reading==
* http://spraakbanken.gu.se/lb/personal/markus/publications/FinTAL2006.pdf
* http://spraakbanken.gu.se/lb/personal/markus/publications/Extract_Tech_Report.pdf

[[Category:Development]]
Latest revision as of 14:26, 10 February 2015
<spectie> we want to increase the coverage of our analysers
<spectie> one way we can do it is for a human to think about every entry
<spectie> but, we already have a lot of knowledge
<spectie> so, if we have a load of paradigms, and a load of lemmas -- we can try and match the lemmas to paradigms by generating the forms for each paradigm and looking in a corpus
<spectie> there are quite a few papers on this.
<spectie> so, e.g. in english let's say we have the paradigms "-s" plural and "-ren" plural
<spectie> and a load of lemmas 'child, cat, dog, ...'
<spectie> we look up {child, childs}, {child, children}, {cat, catren}, {cat, cats} etc. in the corpus
<spectie> but then there is the problem of ambiguity
<spectie> in english we have ambiguity between verb forms with -s and noun forms
<spectie> which usually isn't a big problem
<spectie> so the idea is to preprocess your corpus, using your existing analyser, and tagger to give possible values to unknown words based on the surrounding context
<spectie> so let's say you have:
<spectie> na poziv *fizikalne i *matematične fakulteti u *Odesi
<spectie> you would assign possible values for case/number/gender to the unknown surface forms of *fizikalne and *matematične based on the surrounding _known_ context (na .... fakulteti)
<spectie> then, when you come to check the paradigms against the surface forms in the corpus
<spectie> you're not only checking if the surface form is valid, but also if the morphological information that it predicts is valid
<spectie> as far as i know, this is novel (apart from forsberg's preliminary work)
<spectie> if you need to do some kind of machine learning, you could quite easily use the existing entries in your morphology for learning which features best discriminate between ambiguous paradigms.
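The basic lookup described in the log (generate every form a paradigm predicts for a lemma, then check the corpus) could be sketched roughly as follows. This is only an illustration: the names `PARADIGMS`, `LEMMAS`, `CORPUS`, `generate_forms` and `match_paradigms` are made up for this sketch, the corpus is a tiny invented word set, and the lemma is assumed to be identical to the stem.

```python
# Hypothetical toy data: each paradigm is a tuple of suffixes (singular, plural).
PARADIGMS = {"-s": ("", "s"), "-ren": ("", "ren")}
LEMMAS = ["child", "cat", "dog"]
# Invented stand-in for the set of surface forms seen in a real corpus.
CORPUS = {"child", "children", "cat", "cats", "dog", "dogs"}


def generate_forms(lemma, suffixes):
    """Return every surface form the paradigm predicts for this lemma."""
    return [lemma + suffix for suffix in suffixes]


def match_paradigms(lemmas, paradigms, corpus):
    """Map each lemma to the paradigms whose predicted forms all occur in the corpus."""
    matches = {}
    for lemma in lemmas:
        matches[lemma] = [
            name
            for name, suffixes in paradigms.items()
            if all(form in corpus for form in generate_forms(lemma, suffixes))
        ]
    return matches


if __name__ == "__main__":
    for lemma, names in match_paradigms(LEMMAS, PARADIGMS, CORPUS).items():
        print(lemma, names)
```

Requiring every predicted form to be attested is a deliberately strict choice for the sketch; with a real corpus you would probably want a softer score, since rare forms are often missing.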
Example
We want to find English nouns.
We have the following paradigms:
- cat, cats
- child, children
- fox, foxes
- wolf, wolves
We can turn these into stem, suffix patterns:
- cat: - / s
- child: - / ren
- fox: - / es
- wol: f / ves
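One simple way to get from the listed forms to stem, suffix patterns is to take the longest common prefix as the stem. The sketch below assumes that this is a good enough definition of "stem" (it would not be for languages with prefixes or stem-internal changes); `split_paradigm` is a name invented here.

```python
import os


def split_paradigm(forms):
    """Split a list of forms into a shared stem and one suffix per form.

    The stem is the longest common prefix of all forms. An empty suffix
    corresponds to the "-" used in the list above.
    """
    stem = os.path.commonprefix(forms)
    return stem, [form[len(stem):] for form in forms]


if __name__ == "__main__":
    for forms in (["cat", "cats"], ["child", "children"],
                  ["fox", "foxes"], ["wolf", "wolves"]):
        stem, suffixes = split_paradigm(forms)
        print(stem, suffixes)
```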
We can then extract possible candidates from our corpus, e.g.
$ cat /tmp/cand | rev | sort | rev | grep "\(f\|ves\) *$" | sort
aardwolf
aardwolves
behalf
behaves
belief
believes
half
halves
leaf
leaves
lives
...
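The grep above only finds words with the right endings; pairing them up by stem still has to be done. A minimal sketch of that pairing step, assuming the candidates are available as a plain word list (`find_candidates` is a hypothetical helper, not part of any existing tool):

```python
def find_candidates(words, suffixes):
    """Return tuples of forms for every stem attested with all suffixes.

    A stem counts as a candidate only if stem + suffix is in the word
    list for each suffix of the paradigm.
    """
    vocabulary = set(words)
    first = suffixes[0]
    candidates = []
    for word in sorted(vocabulary):
        if not word.endswith(first):
            continue
        # Strip the first suffix to recover the stem.
        stem = word[:len(word) - len(first)]
        forms = [stem + suffix for suffix in suffixes]
        if all(form in vocabulary for form in forms):
            candidates.append(tuple(forms))
    return candidates


if __name__ == "__main__":
    # The words shown in the grep output above.
    words = ["aardwolf", "aardwolves", "behalf", "behaves", "belief",
             "believes", "half", "halves", "leaf", "leaves", "lives"]
    for pair in find_candidates(words, ("f", "ves")):
        print(pair)
```

On the words shown above, "behalf", "behaves" and "lives" drop out because the other half of the pair is not in the list.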
Some of these are good matches for the paradigm and others are not.
How can we determine whether something is a good match? We can use context. Our candidates are:
- aardwolf, aardwolves
- belief, believes
- half, halves
- leaf, leaves
These all follow the pattern +f in the singular and +ves in the plural. To start with, let's say we want to make sure that all of them are nouns, because e.g. "believes" is a verb and we don't want that.
We could say e.g. that we want to make sure that every part of the paradigm can be found with an article before it.
$ cat /tmp/sample | grep 'the aardwolf' | wc -l
6
$ cat /tmp/sample | grep 'the aardwolves' | wc -l
1
$ cat /tmp/sample | grep 'the belief' | wc -l
8
$ cat /tmp/sample | grep 'the believes' | wc -l
0
$ cat /tmp/sample | grep 'the half' | wc -l
6
$ cat /tmp/sample | grep 'the halves' | wc -l
0
$ cat /tmp/sample | grep 'the leaf' | wc -l
0
$ cat /tmp/sample | grep 'the leaves' | wc -l
2
So going by this output (with a very small corpus), you would accept the first pair, because it is the only one where both forms are found after the article, and reject the others.
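The article check could be automated along these lines. This is a sketch under assumptions: `count_after_article`, `accept` and the `min_count` threshold are invented for illustration, and `SAMPLE` is a made-up text, not the corpus used above. Unlike the grep pipeline, it counts occurrences rather than lines and requires word boundaries, so "the half" would not match "the halfway".

```python
import re


def count_after_article(text, form, article="the"):
    """Count occurrences of the form directly preceded by the article."""
    pattern = r"\b" + re.escape(article) + r"\s+" + re.escape(form) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def accept(text, forms, min_count=1):
    """Accept a candidate only if every form is seen after the article."""
    return all(count_after_article(text, form) >= min_count for form in forms)


# Invented sample text, for illustration only.
SAMPLE = (
    "The aardwolf hunts at night. We watched the aardwolves near the den. "
    "She believes the belief is common. The half that remained was cold. "
    "He swept the leaves."
)

if __name__ == "__main__":
    for pair in (("aardwolf", "aardwolves"), ("belief", "believes"),
                 ("half", "halves"), ("leaf", "leaves")):
        print(pair, accept(SAMPLE, pair))
```

With a corpus this small, good pairs such as half, halves are rejected too, so in practice the threshold and the set of contexts would need tuning.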