Difference between revisions of "Task ideas for Google Code-in/Add words from frequency list"
m (Francis Tyers moved page Task ideas for Google Code-in/Add nouns from frequency list to Task ideas for Google Code-in/Add words from frequency list) |
m (→Examples) |
Revision as of 17:12, 15 November 2013
Objective
To include a word in our morphological dictionaries, we need to know its lemma (which you will have found in the lemmatisation task) and its part of speech (see the categorisation task). The next step is to find its inflectional paradigm.
An inflectional paradigm is a set of morphological affixes (often endings) that can be added to a word to create its inflected forms. For example, "fox", "box" and "watch" belong to the same paradigm because they all form the plural with -es, while "cat", "house" and "dog" belong to another paradigm because they all form the plural with -s.
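The idea of a paradigm as a shared set of endings can be sketched in a few lines of Python. This is only an illustration: the paradigm names (`plural-in-s`, `plural-in-es`), the `sg`/`pl` tags and the `inflect` helper are made up for this example and do not correspond to any real dictionary file.

```python
# Hypothetical paradigm table: each paradigm maps a tag to the ending it adds.
PARADIGMS = {
    "plural-in-s": {"sg": "", "pl": "s"},
    "plural-in-es": {"sg": "", "pl": "es"},
}

# Illustrative assignment of lemmas to paradigms, using the English examples above.
LEXICON = {
    "cat": "plural-in-s",
    "house": "plural-in-s",
    "dog": "plural-in-s",
    "fox": "plural-in-es",
    "box": "plural-in-es",
    "watch": "plural-in-es",
}


def inflect(lemma):
    """Return all forms of a lemma by attaching the endings of its paradigm."""
    endings = PARADIGMS[LEXICON[lemma]]
    return {tag: lemma + ending for tag, ending in endings.items()}


print(inflect("fox"))
print(inflect("cat"))
```

Adding a new word then only means choosing the right paradigm for it; the endings themselves are written once.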
Examples
The paradigms (inflectional classes) will be different depending on the dictionary format and the language in question. When in doubt, ask your mentor for help.
Using .dix
- See also: Starting a new language with lttoolbox
When using lttoolbox you will also need to find:
- the stem of the word, that is the part onto which inflectional endings are added.
- e.g. the stem for "wolf" is "wol" because the singular is "wol + f" and the plural is "wol + ves".
- the paradigm of the word. Paradigms in the .dix file come in pardef elements. Find the one that, given your stem, generates all the valid surface forms of the lemma.
If a paradigm for the word does not exist then you will need to add a new one. Ask your mentor for help with this.
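One rough way to get a first guess at the stem is to take the longest prefix shared by all the surface forms you know. The sketch below assumes a purely suffixing language and a hypothetical helper named `split_stem`; it is not part of lttoolbox, and the result should still be checked against the real paradigms.

```python
import os


def split_stem(forms):
    """Split surface forms into a shared stem and the ending of each form.

    The stem is taken to be the longest prefix common to every form,
    a simplification that only holds when all inflection is suffixal.
    """
    stem = os.path.commonprefix(list(forms.values()))
    endings = {tag: form[len(stem):] for tag, form in forms.items()}
    return stem, endings


stem, endings = split_stem({"sg": "wolf", "pl": "wolves"})
print(stem)
print(endings)
```

For "wolf" this gives the stem "wol" with the endings "f" and "ves", which is the pair of endings a matching paradigm has to provide.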
When adding nouns, depending on the language, you should be careful with the following:
- What gender is the noun?
- Does the noun exist in both singular and plural?
- Is the noun animate or inanimate?
Before | After |
---|---|
n ^3570/3570<num>$ ^горад/горад$ | <e lm="горад"><i>горад</i><par n="..."/></e> |
n ^2491/2491<num>$ ^тэрыторыі/тэрыторыя$ | <e lm="тэрыторыя"><i>тэрыторы</i><par n="..."/></e> |
n ^2409/2409<num>$ ^вайны/вайна$ | <e lm="вайна"><i>вайн</i><par n="..."/></e> |
n ^2316/2316<num>$ ^цэнтр/цэнтр$ | <e lm="цэнтр"><i>цэнтр</i><par n="..."/></e> |
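The conversion from a "Before" line to an "After" entry could be drafted with a small script. The following is a minimal sketch, not an official tool: `make_entry` is a hypothetical helper, the stem is only guessed as the prefix shared by the surface form and the lemma, and the paradigm name is left as the placeholder "..." for you to fill in by hand.

```python
import os
import re
from xml.sax.saxutils import escape, quoteattr

# Matches one analysed token such as ^вайны/вайна$ (surface form / lemma).
TOKEN = re.compile(r"\^([^/$]*)/([^$]*)\$")


def make_entry(line, paradigm="..."):
    """Turn one frequency-list line into a draft .dix <e> element.

    The stem is only a first guess: the longest prefix shared by the
    surface form and the lemma. Check it against the full paradigm.
    """
    tokens = TOKEN.findall(line)
    surface, lemma = tokens[-1]  # the last token on the line is the word itself
    stem = os.path.commonprefix([surface, lemma])
    return "<e lm={}><i>{}</i><par n={}/></e>".format(
        quoteattr(lemma), escape(stem), quoteattr(paradigm)
    )


print(make_entry("n ^2491/2491<num>$ ^тэрыторыі/тэрыторыя$"))
```

The guessed stem can be too long when the frequency list happens to contain only one form of the word, so every generated entry still needs a manual check.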
Using .lexc
- See also: Starting a new language with HFST
Useful commands
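To find the list of nouns with lemmas ending in a certain suffix:
cat <filename> | grep "^n" | grep '<suffix>\$'
e.g. to find the list of nouns with the lemma ending in -ыя:
cat <filename> | grep "^n" | grep 'ыя\$'
The single quotes matter: inside double quotes the shell removes the backslash, and grep then treats $ as an end-of-line anchor instead of the literal $ that closes each analysis.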
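If you prefer to do the same filtering in a script, the check can be made on the lemma itself rather than on the raw line. This is a sketch that assumes the line format shown in the "Before" column of the table above; `nouns_ending_in` is a hypothetical helper name.

```python
import re

# Captures the lemma of the last analysed token on a line, e.g. "вайна" in ^вайны/вайна$.
LEMMA = re.compile(r"/([^/$]*)\$\s*$")


def nouns_ending_in(lines, suffix):
    """Yield the lines tagged 'n' whose lemma ends in the given suffix."""
    for line in lines:
        match = LEMMA.search(line)
        if line.startswith("n") and match and match.group(1).endswith(suffix):
            yield line


sample = [
    "n ^3570/3570<num>$ ^горад/горад$",
    "n ^2491/2491<num>$ ^тэрыторыі/тэрыторыя$",
    "n ^2409/2409<num>$ ^вайны/вайна$",
    "n ^2316/2316<num>$ ^цэнтр/цэнтр$",
]
for line in nouns_ending_in(sample, "ыя"):
    print(line)
```

Grouping the frequency list by lemma ending in this way is a quick route to batches of words that are likely to share a paradigm.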