Difference between revisions of "Task ideas for Google Code-in/Categorise words from frequency list"
Revision as of 00:38, 1 November 2013
Objective
Categorise the words in a frequency list into the major part-of-speech categories.
You will receive a frequency list. Work from top to bottom. At the beginning of each line, put a letter that categorises the word form by its part of speech: for example, n for noun, v for verb, etc.
If you cannot recognise a word, skip it. If a word can belong to more than one part of speech, copy its line and paste the copy directly below it with the other code.
Example
Consider this example of a Belarusian frequency list. On the left is the raw list, on the right is the list after part-of-speech letters have been added.
Before | After |
---|---|
^4606/4606<num>$ ^былі/*былі$ | v ^4606/4606<num>$ ^былі/*былі$ |
^4493/4493<num>$ ^была/*была$ | v ^4493/4493<num>$ ^была/*была$ |
^4484/4484<num>$ ^Беларусі/*Беларусі$ | t ^4484/4484<num>$ ^Беларусі/*Беларусі$ |
^4394/4394<num>$ ^На/на<pr>$ | ^4394/4394<num>$ ^На/на<pr>$ |
^3570/3570<num>$ ^горад/*горад$ | n ^3570/3570<num>$ ^горад/*горад$ |
^3570/3570<num>$ ^але/*але$ | ^3570/3570<num>$ ^але/*але$ |
^3511/3511<num>$ ^пасля/*пасля$ | ^3511/3511<num>$ ^пасля/*пасля$ |
^3473/3473<num>$ ^было/*было$ | v ^3473/3473<num>$ ^было/*было$ |
^3381/3381<num>$ ^пры/*пры$ | ^3381/3381<num>$ ^пры/*пры$ |
^2491/2491<num>$ ^тэрыторыі/*тэрыторыі$ | n ^2491/2491<num>$ ^тэрыторыі/*тэрыторыі$ |
^2470/2470<num>$ ^Расіі/*Расіі$ | ^2470/2470<num>$ ^Расіі/*Расіі$ |
^2442/2442<num>$ ^дзе/*дзе$ | ^2442/2442<num>$ ^дзе/*дзе$ |
^2409/2409<num>$ ^вайны/*вайны$ | n ^2409/2409<num>$ ^вайны/*вайны$ |
^2316/2316<num>$ ^цэнтр/*цэнтр$ | n ^2316/2316<num>$ ^цэнтр/*цэнтр$ |
Useful commands
To find out how many words you have categorised for a particular part of speech:
cat <filename> | grep "^<code>" | wc -l
e.g. for nouns:
cat bel.hitparade | grep "^n" | wc -l
4
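To see the counts for every code at once, rather than one code per command, something like the following sketch may help. The function name `count_codes` is made up for illustration. It assumes that a categorised line starts with a lowercase code followed by a space, and that a line not yet categorised still starts with `^`, as in the example above. Because it compares the whole first field, a one-letter code is not confused with a longer code that begins with the same letter.

```shell
# Hypothetical helper: count lines per part-of-speech code in a frequency list.
# Assumption: categorised lines look like "n ^3570/3570<num>$ ^горад/*горад$",
# and uncategorised lines still begin with "^".
count_codes() {
    awk '
        /^[a-z]+ / { codes[$1]++; next }   # first field is the code
        /^\^/      { skipped++ }           # no code added yet
        END {
            for (c in codes) print c, codes[c]
            print "uncategorised", skipped + 0
        }
    ' "$1" | sort
}
```

Run as `count_codes bel.hitparade`, this prints one line per code with its count, followed by the number of lines that still have no code. For the "After" list in the example, that would be 4 for n, 1 for t, 3 for v, and 6 uncategorised.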