Difference between revisions of "Polish and Russian/Project description"

From Apertium

Revision as of 18:08, 23 August 2016

Commitment

The list of all commits: https://apertium.projectjj.com/gsoc2016/maryszmary.html

Monolingual Polish package: https://svn.code.sf.net/p/apertium/svn/languages/apertium-pol/

Monolingual Russian package: https://svn.code.sf.net/p/apertium/svn/languages/apertium-rus/

Bilingual Polish-Russian package: https://svn.code.sf.net/p/apertium/svn/incubator/apertium-pol-rus/

Project description

Description of the main package components

Monolingual and bilingual dictionaries

The dictionaries contain formal descriptions of paradigms and entries for different word categories.

Constraint grammar and transfer rules

Constraint Grammar (CG) rules distinguish between words with different grammatical tags, and between words with different lexical meanings, based on the grammatical and lexical context. CG rules disambiguate both within a single part of speech and between words of different categories.

Transfer rules improve the translation where structural differences between the two languages prevent a direct, word-for-word rendering.


Corpora and language data

The following corpora were used to measure the coverage of the dictionaries and to inform decisions about the grammar of the languages:

  • for Russian: the Russian National Corpus (RNC)
  • for Polish: the Wikinews corpus.


Auxiliary scripts

A number of scripts were written for the project (a link to the directory containing the scripts described in this section is provided).

The following scripts were especially helpful:

  • A number of scripts for adding new words to the bilingual dictionary
  • A script for adding new words to the Polish morphological dictionary
  • A script for extracting verbs from Zalizniak's dictionary
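
To illustrate what adding a word to the bilingual dictionary involves, here is a minimal sketch that builds one entry in the XML layout used by Apertium bilingual dictionaries (an `<e>` element holding a `<p>` pair with left and right sides, each carrying `<s n="..."/>` tag symbols). It is not the project's actual script: the function name `make_bidix_entry`, the example lemmas and the tag list are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def make_bidix_entry(left_lemma, right_lemma, tags):
    """Return one bilingual dictionary entry as an XML string.

    Hypothetical helper: both sides receive the same tag symbols,
    which is a simplification (real entries may differ per side).
    """
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side, lemma in (("l", left_lemma), ("r", right_lemma)):
        node = ET.SubElement(pair, side)
        node.text = lemma  # the lemma comes first, then the tag symbols
        for tag in tags:
            ET.SubElement(node, "s", {"n": tag})
    return ET.tostring(entry, encoding="unicode")


# Illustrative lemma pair and tags, not taken from the project's dictionaries.
print(make_bidix_entry("dom", "дом", ["n", "m"]))
```

A script along these lines would typically read lemma pairs from a word list and append the generated entries to the main section of the dictionary file.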

Statistics

At first, the goal of the project was to achieve 90% coverage of the corpora used. This turned out to be a challenging target for three months of work, partly because of the peculiarities of Slavic morphology and morphophonology, and partly because of the scarcity of available bilingual electronic dictionaries (and the limited scope of those that exist). As a result, from the end of July onwards the main focus shifted to lowering the number of mistakes rather than achieving high coverage.

Coverage            Polish → Russian (%)    Russian → Polish (%)
Trimmed coverage    85.1                    83.8

Coverage            Russian (%)             Polish (%)
Raw coverage        94.2                    87.6
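
Coverage figures of this kind are usually understood as the share of corpus tokens for which the analyser returns at least one analysis. The sketch below shows one naive way to compute that share from text in the Apertium stream format, where each lexical unit is written as `^surface/analysis$` and an unknown word has an analysis starting with `*`. The function name `naive_coverage` and the sample string (including its tags) are assumptions for illustration, and escaped characters in the stream are ignored for simplicity; this is not necessarily how the figures above were obtained.

```python
import re

# One lexical unit: everything between an opening ^ and the next $.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed_text):
    """Return the percentage of lexical units that have a known analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        parts = unit.split("/")
        # An unknown word looks like "word/*word": the first analysis starts with "*".
        is_unknown = len(parts) > 1 and parts[1].startswith("*")
        if not is_unknown:
            known += 1
    return 100.0 * known / len(units)


# Illustrative input: one analysed word and one unknown word.
sample = "^dom/dom<n><m><sg><nom>$ ^xyzzy/*xyzzy$"
print(naive_coverage(sample))
```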

The number of lemmas in the bilingual dictionary: 48,836.

The number of lemmas in the Polish dictionary: 10,023.

Future work

Resources