Difference between revisions of "Task ideas for Google Code-in/Lemmatise words from frequency list"

From Apertium
(Created page with '==Objective== Lemmatise words by frequency. The lemma of a word is it's "base form" (the form you might find in a dictionary) You will receive a frequency list. Work from top t…')
 
Line 42:
  ==Useful commands==
- To find out how many words you have categorised for a particular part of speech:
+ To find out how many words you have lemmatised for a particular part of speech:
  <pre>
- cat <filename> | grep "^<code>" | wc -l
+ cat <filename> | grep "^<code>" | grep -v '\*' | wc -l
  </pre>

Line 51:
  <pre>
- cat bel.hitparade | grep "^n" | wc -l
+ cat bel.hitparade | grep "^n" | grep -v '\*' | wc -l
  4
  </pre>

Revision as of 15:50, 1 November 2013

Objective

Lemmatise words by frequency. The lemma of a word is its "base form" (the form you might find in a dictionary).

You will receive a frequency list. Work from top to bottom. On each line, replace the asterisk '*' and the surface form that follows it with the lemma, so that no asterisk remains on a finished line.

If you cannot recognise a word, you can skip it. If a word can have more than one lemma, copy the line, paste the copy directly below it, and give the copy the other lemma (and the other part-of-speech code, if it differs).

Example

Consider this example of a Belarusian frequency list. The first block is the list with part-of-speech annotations; the second is the same list after being lemmatised.

Before:

v   ^4606/4606<num>$ ^былі/*былі$
v   ^4493/4493<num>$ ^была/*была$
t   ^4484/4484<num>$ ^Беларусі/*Беларусі$
n   ^3570/3570<num>$ ^горад/*горад$
v   ^3473/3473<num>$ ^было/*было$
n   ^2491/2491<num>$ ^тэрыторыі/*тэрыторыі$
n   ^2409/2409<num>$ ^вайны/*вайны$
n   ^2316/2316<num>$ ^цэнтр/*цэнтр$

After:

v   ^4606/4606<num>$ ^былі/быць$
v   ^4493/4493<num>$ ^была/быць$
t   ^4484/4484<num>$ ^Беларусі/Беларусь$
n   ^3570/3570<num>$ ^горад/горад$
v   ^3473/3473<num>$ ^было/быць$
n   ^2491/2491<num>$ ^тэрыторыі/тэрыторыя$
n   ^2409/2409<num>$ ^вайны/вайна$
n   ^2316/2316<num>$ ^цэнтр/цэнтр$
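Because the list is meant to be worked from top to bottom, it can help to see which lines still carry an asterisk. The following is only a sketch: the function name `next_unlemmatised` and the sample file are hypothetical, and it assumes every unfinished line contains `/*` as in the example.

```shell
#!/bin/sh
# Hypothetical helper: show the first unfinished lines, with line numbers.
# $1 = frequency list file, $2 = how many lines to show (default 5)
next_unlemmatised() {
    grep -n '/\*' "$1" | head -n "${2:-5}"
}

# Small sample list (assumed layout, copied from the example):
# lines 1 and 2 are finished, lines 3 and 4 are not.
todo_sample=$(mktemp)
cat > "$todo_sample" <<'EOF'
v   ^4606/4606<num>$ ^былі/быць$
v   ^4493/4493<num>$ ^была/быць$
t   ^4484/4484<num>$ ^Беларусі/*Беларусі$
n   ^3570/3570<num>$ ^горад/*горад$
EOF

# Print the next line to work on; its line number is 3 for this sample.
next_unlemmatised "$todo_sample" 1
```

The line number printed before the colon can be used to jump to that line in a text editor.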

Useful commands

To find out how many words you have lemmatised for a particular part of speech:

cat <filename> | grep "^<code>" | grep -v '\*' | wc -l

e.g. for nouns:

cat bel.hitparade | grep "^n" | grep -v '\*' | wc -l
4
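As a rough sketch, the same count can be wrapped in small shell functions that also report how many lines are still waiting. The names `count_lemmatised` and `count_remaining` and the sample file are hypothetical. Unlike the command above, the pattern here assumes the part-of-speech code is followed by whitespace, so that a code `n` would not also match a longer code beginning with `n`.

```shell
#!/bin/sh
# Hypothetical helpers. $1 = part-of-speech code, $2 = frequency list file.

# Lines for the code that no longer contain an asterisk.
count_lemmatised() {
    grep "^$1[[:space:]]" "$2" | grep -v -c '\*' || true
}

# Lines for the code that still contain an asterisk.
count_remaining() {
    grep "^$1[[:space:]]" "$2" | grep -c '\*' || true
}

# Small sample list (assumed layout, copied from the example).
count_sample=$(mktemp)
cat > "$count_sample" <<'EOF'
v   ^4606/4606<num>$ ^былі/быць$
n   ^3570/3570<num>$ ^горад/горад$
n   ^2491/2491<num>$ ^тэрыторыі/*тэрыторыі$
n   ^2409/2409<num>$ ^вайны/вайна$
EOF

echo "nouns lemmatised: $(count_lemmatised n "$count_sample")"   # 2 for this sample
echo "nouns remaining:  $(count_remaining n "$count_sample")"    # 1 for this sample
```

The `|| true` is there only because `grep -c` exits with a non-zero status when the count is 0, which would otherwise stop a script running under `set -e`.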