Polish and Russian/Project description

Commitment

The list of all commits: https://apertium.projectjj.com/gsoc2016/maryszmary.html

Monolingual Polish package: https://svn.code.sf.net/p/apertium/svn/languages/apertium-pol/

Monolingual Russian package: https://svn.code.sf.net/p/apertium/svn/languages/apertium-rus/

Bilingual Polish-Russian package: https://svn.code.sf.net/p/apertium/svn/incubator/apertium-pol-rus/

Project description

Description of the main package components

Monolingual and bilingual dictionaries

The Polish and Russian morphological dictionaries contain a formal description of paradigms and entries for different word categories. A good Russian dictionary already existed, while the Polish dictionary was less complete. Both dictionaries were improved during the project.

The Polish-Russian bilingual dictionary contains tags matching paradigms and a list of entries which match lexemes from the two languages.
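As an illustration of what such entries look like, here is a minimal sketch that reads lemma pairs from a small fragment in the Apertium bilingual dictionary XML layout (`<e>`, `<p>`, `<l>`, `<r>`, `<s n="..."/>`). The fragment, the word pairs, and the helper names `side_to_pair` and `read_entries` are assumptions made for this example and are not taken from apertium-pol-rus. Multiword lemmas (which use `<b/>` for spaces) are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical bidix fragment: the entries and tags are illustrative only,
# not copied from the actual Polish-Russian dictionary.
BIDIX_FRAGMENT = """
<section id="main" type="standard">
  <e><p><l>dom<s n="n"/></l><r>дом<s n="n"/></r></p></e>
  <e><p><l>czytać<s n="vblex"/></l><r>читать<s n="vblex"/></r></p></e>
</section>
"""


def side_to_pair(side):
    """Return (lemma, [tags]) for an <l> or <r> element."""
    lemma = (side.text or "").strip()
    tags = [s.get("n") for s in side.findall("s")]
    return lemma, tags


def read_entries(xml_text):
    """Collect ((left_lemma, left_tags), (right_lemma, right_tags)) pairs."""
    root = ET.fromstring(xml_text)
    entries = []
    for pair in root.iter("p"):
        left = side_to_pair(pair.find("l"))
        right = side_to_pair(pair.find("r"))
        entries.append((left, right))
    return entries


entries = read_entries(BIDIX_FRAGMENT)
for (left_lemma, left_tags), (right_lemma, right_tags) in entries:
    print(left_lemma, left_tags, "->", right_lemma, right_tags)
```

Each pair matches a Polish lexeme on the left with a Russian lexeme on the right, and the `<s>` symbols carry the tags that the two sides have to agree on.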

The most challenging part of improving the morphological dictionaries was dealing with the peculiarities of Slavic morphology. Slavic morphology has many irregularities, which result in a large number of paradigms. For the bilingual dictionary, the main problem was the poverty of the existing resources.

Constraint grammar and transfer rules

Constraint Grammar rules allow us to distinguish words with different grammatical tags and words with different lexical meanings based on the grammatical and lexical context. CG rules work both for disambiguation within one part of speech and between words of different categories. Transfer rules help to make a better translation when there are structural differences between languages that cannot be translated directly.
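The idea behind a disambiguation rule can be sketched in Python. This is a toy illustration only: real rules are written in the Constraint Grammar formalism, not in Python, and the rule below is invented for the example rather than taken from the project. It assumes that each word carries a set of candidate part-of-speech tags, and it removes the verb reading of a word that directly follows an unambiguous preposition. The Russian form "стекло" (a noun meaning "glass", or a past-tense verb form) serves as the example, and the function name `remove_verb_after_preposition` is hypothetical.

```python
def remove_verb_after_preposition(cohorts):
    """Drop the 'vblex' reading of a word that follows an unambiguous preposition.

    `cohorts` is a list of (surface_form, set_of_pos_tags) pairs.
    As in Constraint Grammar, the last remaining reading is never removed.
    """
    result = []
    prev_readings = None
    for surface, readings in cohorts:
        kept = readings
        if prev_readings == {"pr"}:
            filtered = {tag for tag in readings if tag != "vblex"}
            if filtered:  # keep at least one reading
                kept = filtered
        result.append((surface, kept))
        prev_readings = kept
    return result


# "на стекло": after the preposition "на", only the noun reading survives.
sentence = [("на", {"pr"}), ("стекло", {"n", "vblex"})]
disambiguated = remove_verb_after_preposition(sentence)
print(disambiguated)
```

The same mechanism, with lexical rather than grammatical context, is what lets rules choose between different lexical meanings of a word.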

Corpora and language data

The following corpora were used to measure the coverage of the dictionaries and to inform decisions about the grammar of the languages:

for Russian: the Russian National Corpus (RNC)
for Polish: the Wikinews corpus

Auxiliary scripts

A number of scripts were written for the purposes of the project (a link to the directory with the scripts described in this section is provided).

The following scripts were especially helpful:

A script for extracting verbs from Zaliznyak's dictionary (from_z.py)
A script for adding new words to the Polish morphological dictionary (morpheus.py)
A number of scripts for adding new words to the bilingual dictionary

Statistics

At first, the goal of the project was to achieve 90% coverage of the corpora used. This turned out to be a challenging task for three months of work, partly because of the peculiarities of Slavic morphology and morphophonology and partly because few bilingual electronic dictionaries are available (and those that exist are poor). As a result, from the end of July our main focus shifted from achieving high coverage to lowering the number of mistakes.

Coverage type	Polish → Russian	Russian → Polish
Trimmed coverage	85.1%	83.8%

Coverage type	Russian	Polish
Raw coverage	94.2%	87.6%
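For readers unfamiliar with the measure, coverage here means the share of corpus tokens for which the analyser returns at least one analysis. The sketch below is a naive way to compute it from analyser output in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token is marked with `*`. The sample string, its tags, and the function name `naive_coverage` are assumptions for this example. It is not necessarily how the figures above were produced, and it ignores escaped characters in the stream.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/...$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed_text):
    """Return the percentage of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        parts = unit.split("/")
        # An unknown token looks like ^word/*word$
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return 100.0 * known / len(units)


# Illustrative analyser output: three known tokens and one unknown token.
sample = (
    "^dom/dom<n><mi><sg><nom>$ ^xyzzy/*xyzzy$ "
    "^i/i<cnjcoo>$ ^kot/kot<n><ma><sg><nom>$"
)
coverage = naive_coverage(sample)
print(f"{coverage:.1f}%")
```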

The number of lemmas in the bilingual dictionary: 48,836.

The number of lemmas in the Polish dictionary: 10,023.

Future work

There is still a lot of work to be done. The most important aspects are:

Increasing the number of rules
Decreasing the number of mistakes
Increasing the coverage

Resources

For morphological dictionaries

To improve the Polish morphological dictionary, we used the Polish version of Wiktionary and PoliMorf. For the Russian dictionary, we used the Russian version of Wiktionary and Zaliznyak's dictionary (the latter was used for extracting the Russian verbs).

For the bilingual dictionary

The following online dictionaries were useful for the bilingual dictionary:

Wiktionary
Glosbe
Bab.la
