Difference between revisions of "Task ideas for Google Code-in/Lemmatise words from frequency list"
(Created page with '==Objective== Lemmatise words by frequency. The lemma of a word is it's "base form" (the form you might find in a dictionary) You will receive a frequency list. Work from top t…')
Line 42:
  ==Useful commands==
− To find out how many words you have
+ To find out how many words you have lemmatised for a particular part of speech:
  <pre>
− cat <filename> | grep "^<code>" | wc -l
+ cat <filename> | grep "^<code>" | grep -v '\*' | wc -l
  </pre>
Line 51:
  <pre>
− cat bel.hitparade | grep "^n" | wc -l
+ cat bel.hitparade | grep "^n" | grep -v '\*' | wc -l
  4
  </pre>
Revision as of 15:50, 1 November 2013
Objective
Lemmatise words by frequency. The lemma of a word is its "base form" (the form you might find in a dictionary).
You will receive a frequency list. Work from top to bottom. After each asterisk '*' you should replace the surface form with the lemma and delete the asterisk, as in the example below.
If you cannot recognise a word, you can skip it. If a word can have more than one lemma, copy the line and paste it below with the other code.
Example
Consider this example of a Belarusian frequency list. On the left is the list with part-of-speech annotations; on the right is the same list after lemmatisation.
Before | After |
---|---|
v ^4606/4606<num>$ ^былі/*былі$ | v ^4606/4606<num>$ ^былі/быць$ |
v ^4493/4493<num>$ ^была/*была$ | v ^4493/4493<num>$ ^была/быць$ |
t ^4484/4484<num>$ ^Беларусі/*Беларусі$ | t ^4484/4484<num>$ ^Беларусі/Беларусь$ |
n ^3570/3570<num>$ ^горад/*горад$ | n ^3570/3570<num>$ ^горад/горад$ |
v ^3473/3473<num>$ ^было/*было$ | v ^3473/3473<num>$ ^было/быць$ |
n ^2491/2491<num>$ ^тэрыторыі/*тэрыторыі$ | n ^2491/2491<num>$ ^тэрыторыі/тэрыторыя$ |
n ^2409/2409<num>$ ^вайны/*вайны$ | n ^2409/2409<num>$ ^вайны/вайна$ |
n ^2316/2316<num>$ ^цэнтр/*цэнтр$ | n ^2316/2316<num>$ ^цэнтр/цэнтр$ |
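Since the list is worked from top to bottom, it can help to see which lines still contain an asterisk. The following is only a sketch: the function name `next_pending` and the temporary sample file are assumptions made for illustration, and the sample simply mixes a few Before and After lines from the table to imitate a partially lemmatised list.

```shell
# Hypothetical partially lemmatised file, assembled from the example lines.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
v ^4606/4606<num>$ ^былі/быць$
v ^4493/4493<num>$ ^была/*была$
t ^4484/4484<num>$ ^Беларусі/Беларусь$
n ^3570/3570<num>$ ^горад/*горад$
EOF

# Print the first N lines (default 10) that still contain an asterisk,
# each prefixed with its line number so it is easy to find in an editor.
# next_pending is an illustrative helper, not part of any existing tool.
next_pending() {
  grep -n -F '*' "$1" | head -n "${2:-10}"
}

next_pending "$sample" 1
# prints: 2:v ^4493/4493<num>$ ^была/*была$
```

Calling `next_pending <filename>` without a second argument lists up to ten pending lines.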
Useful commands
To find out how many words you have lemmatised for a particular part of speech:
cat <filename> | grep "^<code>" | grep -v '\*' | wc -l
e.g. for nouns:
cat bel.hitparade | grep "^n" | grep -v '\*' | wc -l
4
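The same count can be produced for every part-of-speech code in one pass. This is a rough sketch rather than part of the task instructions: `count_by_pos` is a made-up helper name, the sample file is a hypothetical partially lemmatised list built from the example lines, and it assumes, as the commands above do, that a line counts as lemmatised when it contains no asterisk.

```shell
# Hypothetical partially lemmatised file, assembled from the example lines.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
v ^4606/4606<num>$ ^былі/быць$
v ^4493/4493<num>$ ^была/*была$
t ^4484/4484<num>$ ^Беларусі/Беларусь$
n ^3570/3570<num>$ ^горад/горад$
n ^2491/2491<num>$ ^тэрыторыі/*тэрыторыі$
EOF

# Group lines by their first field (the part-of-speech code) and count
# lines without '*' as lemmatised and lines with '*' as pending.
# count_by_pos is an illustrative helper, not part of any existing tool.
count_by_pos() {
  awk '{
         seen[$1] = 1
         if (index($0, "*")) pending[$1]++
         else lemmatised[$1]++
       }
       END {
         for (c in seen)
           printf "%s lemmatised=%d pending=%d\n", c, lemmatised[c], pending[c]
       }' "$1" | sort
}

count_by_pos "$sample"
# prints:
# n lemmatised=1 pending=1
# t lemmatised=1 pending=0
# v lemmatised=1 pending=1
```

The `lemmatised` figure for a given code should match what the `grep "^<code>" | grep -v '\*' | wc -l` pipeline reports, provided no other code begins with the same letters.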