Difference between revisions of "Task ideas for Google Code-in/Lemmatise words from frequency list"
Latest revision as of 16:21, 14 November 2013
Objective
Lemmatise words by frequency. The lemma of a word is its "base form" (the form you might find in a dictionary).
You will receive a frequency list. Work from top to bottom. On each line, replace the asterisk '*' and the surface form that follows it with the lemma.
If you cannot recognise a word then you can skip it. If a word can have more than one lemma then copy the line and paste it below with the other lemma.
Example
Consider this example of a Belarusian frequency list. On the left is the list with part-of-speech annotations, on the right is the list after being lemmatised.
Before | After |
---|---|
v ^4606/4606<num>$ ^былі/*былі$ | v ^4606/4606<num>$ ^былі/быць$ |
v ^4493/4493<num>$ ^была/*была$ | v ^4493/4493<num>$ ^была/быць$ |
t ^4484/4484<num>$ ^Беларусі/*Беларусі$ | t ^4484/4484<num>$ ^Беларусі/Беларусь$ |
n ^3570/3570<num>$ ^горад/*горад$ | n ^3570/3570<num>$ ^горад/горад$ |
v ^3473/3473<num>$ ^было/*было$ | v ^3473/3473<num>$ ^было/быць$ |
n ^2491/2491<num>$ ^тэрыторыі/*тэрыторыі$ | n ^2491/2491<num>$ ^тэрыторыі/тэрыторыя$ |
n ^2409/2409<num>$ ^вайны/*вайны$ | n ^2409/2409<num>$ ^вайны/вайна$ |
n ^2316/2316<num>$ ^цэнтр/*цэнтр$ | n ^2316/2316<num>$ ^цэнтр/цэнтр$ |
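Since you work from top to bottom, it can help to jump straight to the next entries that still carry an asterisk. The following is only a sketch: the temporary file and its partly lemmatised contents are made up for illustration from the example lines above, and you would substitute your own frequency list.

```shell
# Sample data taken from the example, partly lemmatised.
# The temporary file stands in for a real frequency list (an assumption).
list=$(mktemp)
cat > "$list" <<'EOF'
v ^4606/4606<num>$ ^былі/быць$
v ^4493/4493<num>$ ^была/быць$
t ^4484/4484<num>$ ^Беларусі/*Беларусі$
n ^3570/3570<num>$ ^горад/*горад$
v ^3473/3473<num>$ ^было/*было$
EOF

# Print the first two lines that still contain an asterisk,
# each prefixed with its line number in the file.
next_entries=$(grep -n '\*' "$list" | head -n 2)
echo "$next_entries"
```

With this sample the command reports lines 3 and 4, the first two entries not yet lemmatised.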
Useful commands
To find out how many words you have lemmatised for a particular part of speech:
cat <filename> | grep "^<code>" | grep -v '\*' | wc -l
e.g. for nouns:
cat bel.hitparade | grep "^n" | grep -v '\*' | wc -l
4
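To see progress for every part of speech at once, rather than one tag at a time, the same idea can be expressed with awk. This is a minimal sketch, not part of the original task description: the sample file and its mix of finished and unfinished lines are invented for illustration, and it assumes the tag is the first whitespace-separated field and that a remaining asterisk means the line is not yet lemmatised.

```shell
# Hypothetical sample file in the format of the example (partly lemmatised).
list=$(mktemp)
cat > "$list" <<'EOF'
v ^4606/4606<num>$ ^былі/быць$
v ^4493/4493<num>$ ^была/*была$
t ^4484/4484<num>$ ^Беларусі/Беларусь$
n ^3570/3570<num>$ ^горад/горад$
n ^2491/2491<num>$ ^тэрыторыі/*тэрыторыі$
EOF

# For each tag (first field) count lines without an asterisk (lemmatised)
# and lines with one (remaining), then print "<tag> <lemmatised> <remaining>".
summary=$(awk '{
    if (index($0, "*")) remaining[$1]++; else lemmatised[$1]++
    tags[$1] = 1
}
END {
    for (t in tags) print t, lemmatised[t] + 0, remaining[t] + 0
}' "$list" | sort)
echo "$summary"
# n 1 1
# t 1 0
# v 1 1
```

To run it on a real list, replace "$list" with the file name, for example bel.hitparade.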