Difference between revisions of "Task ideas for Google Code-in/Add words from frequency list"

(Created page with '==Examples== The paradigms (inflectional classes) will be different depending on the dictionary format and the language in question. When in doubt, ask your mentor for help. ==…')
 
 
(12 intermediate revisions by 3 users not shown)

Latest revision as of 19:05, 7 November 2016

Objective

To include a word in our morphological dictionaries, we need to know its lemma (which you will have found in the lemmatisation task) and its part of speech (see the categorisation task). The next step is to find its inflectional paradigm.

An inflectional paradigm is a set of morphological affixes (often endings) that can be added to words to create their inflected forms. For example, "fox", "box" and "watch" belong to the same paradigm because they all form the plural with -es, while "cat", "house" and "dog" belong to another paradigm because they all form the plural with -s.
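
As a rough illustration only, a paradigm can be modelled as a mapping from grammatical number to ending. The Python sketch below is not part of any Apertium tool; the names `PARADIGMS` and `inflect` and the paradigm labels are invented for this example.

```python
# Toy model: each paradigm maps a grammatical number to the ending
# attached to the word. The labels "es-plural" and "s-plural" are
# invented for this illustration.
PARADIGMS = {
    "es-plural": {"sg": "", "pl": "es"},
    "s-plural": {"sg": "", "pl": "s"},
}


def inflect(word, paradigm):
    """Return every inflected form of `word` under the named paradigm."""
    endings = PARADIGMS[paradigm]
    return {number: word + ending for number, ending in endings.items()}


for word in ("fox", "box", "watch"):
    print(inflect(word, "es-plural"))
for word in ("cat", "house", "dog"):
    print(inflect(word, "s-plural"))
```

Words that share a paradigm differ only in the part the endings attach to, which is why a new dictionary entry mostly consists of that part plus a paradigm name.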

Examples

The paradigms (inflectional classes) will be different depending on the dictionary format and the language in question. When in doubt, ask your mentor for help.

Using .dix

See also: Starting a new language with lttoolbox

When using lttoolbox you will also need to find:

  • the stem of the word, that is the part onto which inflectional endings are added.
    • e.g. the stem for "wolf" is "wol" because the singular is "wol + f" and the plural is "wol + ves".
  • the paradigm of the word. Paradigms in the .dix file come in pardef elements. Find the one that, given your stem, generates all the valid surface forms of the lemma.

If a paradigm for the word does not exist then you will need to add a new one. Ask your mentor for help with this.
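
One way to check whether a candidate paradigm fits is to attach each of its endings to the stem and compare the result with the forms you expect. The Python sketch below assumes a simple pardef in which every entry has one `<l>` element holding the surface ending; the pardef shown was written for this example rather than copied from a real dictionary, and the helper names are hypothetical. Real paradigms may nest other paradigms or contain extra elements, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Illustrative pardef written for this example: the stem "wol" takes
# "f" in the singular and "ves" in the plural.
PARDEF_XML = """
<pardef n="wol/f__n">
  <e><p><l>f</l><r>f<s n="n"/><s n="sg"/></r></p></e>
  <e><p><l>ves</l><r>f<s n="n"/><s n="pl"/></r></p></e>
</pardef>
"""


def surface_forms(stem, pardef):
    """Attach the text of every <l> element in the pardef to the stem."""
    return {stem + (left.text or "") for left in pardef.iter("l")}


def paradigm_fits(stem, pardef, expected_forms):
    """True if the paradigm generates exactly the expected surface forms."""
    return surface_forms(stem, pardef) == set(expected_forms)


pardef = ET.fromstring(PARDEF_XML)
print(paradigm_fits("wol", pardef, ["wolf", "wolves"]))  # True
print(paradigm_fits("cat", pardef, ["cat", "cats"]))     # False
```

If no existing paradigm passes this kind of check for your word, that is the case where a new paradigm is needed.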

When adding nouns, depending on the language, you should be careful with the following:

  • What gender is the noun?
  • Does the noun exist in both singular and plural?
  • Is the noun animate or inanimate?


Before:

n   ^3570/3570<num>$ ^горад/горад$
n   ^2491/2491<num>$ ^тэрыторыі/тэрыторыя$
n   ^2409/2409<num>$ ^вайны/вайна$
n   ^2316/2316<num>$ ^цэнтр/цэнтр$

After:

<e lm="горад"><i>горад</i><par n="..."/></e>
<e lm="тэрыторыя"><i>тэрыторы</i><par n="..."/></e>
<e lm="вайна"><i>вайн</i><par n="..."/></e>
<e lm="цэнтр"><i>цэнтр</i><par n="..."/></e>
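
The conversion from a frequency-list line to a dictionary entry can be sketched in Python as follows. The line shape is an assumption inferred from the examples above, the function names are hypothetical, and the stem and paradigm still have to be chosen by hand; `PARADIGM` is a placeholder, not a real paradigm name.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed line shape, inferred from the examples:
#   <pos> ^<count>/<count><num>$ ^<surface form>/<lemma>$
LINE_RE = re.compile(r"^(\S+)\s+\^(\d+)/\d+<num>\$\s+\^([^/]+)/([^$]+)\$$")


def parse_line(line):
    """Split one frequency-list line into (pos, count, surface, lemma)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised line: " + line)
    pos, count, surface, lemma = match.groups()
    return pos, int(count), surface, lemma


def dix_entry(lemma, stem, paradigm):
    """Build an <e> element; stem and paradigm are supplied by the contributor."""
    return "<e lm={}><i>{}</i><par n={}/></e>".format(
        quoteattr(lemma), escape(stem), quoteattr(paradigm))


pos, count, surface, lemma = parse_line("n   ^2409/2409<num>$ ^вайны/вайна$")
# "PARADIGM" is a placeholder; the real name must come from the .dix file.
print(dix_entry(lemma, "вайн", "PARADIGM"))
```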


Using .lexc

See also: Starting a new language with HFST

Usually, there are far fewer paradigms in .lexc dictionaries, and knowing the lemma and part of speech of a word is enough to add it to the dictionary. In some cases, however, further distinctions are made: for verbs, we might want to know whether they are transitive, and for adjectives, whether they have comparative forms. What information you have to provide depends on the language(s) in question, so ask your mentor for further details.

Before:

n   ^3570/3570<num>$ ^kitaplar/kitap$
v   ^2491/2491<num>$ ^gördim/gör$

After:

kitap:kitap N ;
gör:gör V-TV ;
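
Formatting such entries is mechanical once the lemma and continuation class are known. The Python sketch below assumes, as in the example above, that both sides of the colon are the lemma; `lexc_entry` is a hypothetical helper, and the continuation classes available depend on the language's .lexc file.

```python
def lexc_entry(lemma, continuation):
    """Format one lexc line: lemma on both sides, then the continuation class."""
    return "{0}:{0} {1} ;".format(lemma, continuation)


print(lexc_entry("kitap", "N"))
print(lexc_entry("gör", "V-TV"))
```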

Useful commands

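To list the nouns whose lemmas end in a certain suffix:

cat <filename> | grep '^n' | grep '<suffix>\$$'

Each entry ends with a literal $, so the pattern matches an escaped \$ followed by the end-of-line anchor $. The single quotes keep the shell from stripping the backslash.

For example, to list the nouns whose lemmas end in -ыя:

cat <filename> | grep '^n' | grep 'ыя\$$'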
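
The same filter can be sketched in Python, which may be handy when the shell is not available. The function name `nouns_with_suffix` is hypothetical, and the sketch assumes the line shape shown in the examples, where the lemma sits between the last "/" and the closing "$".

```python
def nouns_with_suffix(lines, suffix):
    """Keep lines starting with "n" whose lemma ends with `suffix`."""
    selected = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("n") or not line.endswith("$"):
            continue
        # The lemma is the text between the last "/" and the final "$".
        lemma = line[:-1].rsplit("/", 1)[-1]
        if lemma.endswith(suffix):
            selected.append(line)
    return selected


sample = [
    "n   ^3570/3570<num>$ ^горад/горад$",
    "n   ^2491/2491<num>$ ^тэрыторыі/тэрыторыя$",
    "n   ^2409/2409<num>$ ^вайны/вайна$",
]
print(nouns_with_suffix(sample, "ыя"))
```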