Difference between revisions of "Siciliano y castellano/Informe final"


Revision as of 05:15, 23 August 2016

Commits

The list of all commits: https://apertium.projectjj.com/gsoc2016/uliana-sentsova.html

Monolingual Sicilian dictionary:

Bilingual Sicilian-Spanish dictionary: https://svn.code.sf.net/p/apertium/svn/incubator/apertium-scn-spa/


Project description and challenging issues

The goal of the project is to create a machine translation package for the Sicilian-Spanish language pair on the basis of the Apertium machine translation system. The project uses the existing Apertium Spanish dictionary to build a package with two monolingual dictionaries, one for Spanish and one for Sicilian, and a bilingual component containing Spanish-Sicilian equivalents and translation rules.

The most important package components are the following.


Monolingual and bilingual dictionaries

The Sicilian dictionary contains a formal description of paradigms and entries for different word categories. The most challenging issue in creating the Sicilian dictionary was the abundance of spelling variants in the Sicilian language. For instance, one Sicilian verb meaning 'to join' can have the following forms: cunjùnciri, cognùngiri, conjùngiri, cugnùnciri, cognùncici, coniùngiri, conjùnciri. The stems of all these variants can serve as a basis for verb forms, but only the forms built on one stem should be generated. The Sicilian monolingual dictionary therefore contains paradigms both for main entries, which are used for generation, and for additional forms, which can be analyzed but are not generated when translating from Spanish to Sicilian.
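
The split between analysis-only spellings and the single generated spelling can be sketched in a few lines of Python. This is a toy model and not the Apertium dictionary format; the choice of cunjùnciri as the generated form and the names Entry, analyse and generate are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    surface: str    # spelling as it appears in running text
    lemma: str      # canonical lemma shared by all spelling variants
    generate: bool  # True only for the one spelling used in generation


# Hypothetical entries: which variant counts as the main one is an assumption.
ENTRIES = [
    Entry("cunjùnciri", "cunjùnciri", generate=True),
    Entry("cognùngiri", "cunjùnciri", generate=False),
    Entry("conjùngiri", "cunjùnciri", generate=False),
]


def analyse(surface: str) -> Optional[str]:
    """Map any known spelling to its canonical lemma."""
    for entry in ENTRIES:
        if entry.surface == surface:
            return entry.lemma
    return None


def generate(lemma: str) -> Optional[str]:
    """Return the single spelling that may be generated for a lemma."""
    for entry in ENTRIES:
        if entry.lemma == lemma and entry.generate:
            return entry.surface
    return None
```

Every variant is accepted on the analysis side, while generation always yields the same spelling, so Spanish-to-Sicilian output stays consistent.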

Another issue is the complicated accent system of Sicilian, which has a strong impact on spelling, especially the spelling of verbs. Depending on the grammatical form, the stress in a Sicilian verb can fall on different syllables, so that under certain circumstances unstressed vowels become stressed and vice versa. This is particularly noticeable with enclitic pronouns, which attach to the verb ending and change the number of syllables in the word. As a result, the Sicilian dictionary contains paradigms for stress change, similar to those for root vowel change in irregular verbs.

Finally, the Sicilian language has a very rich pronoun system that is somewhat similar to the pronoun system of Spanish. This similarity was used to develop the pronoun paradigms in the Sicilian dictionary.


Constraint grammar

Constraint Grammar rules allow us to distinguish words with different grammatical tags and words with different lexical meanings based on the grammatical and lexical context. CG rules work both for disambiguation within one part of speech and between words of different categories.

Lexical selection rules are needed when one word has different meanings depending on its context. A good example is the Sicilian noun "cristianu", which not only signifies a person of Christian faith but can also denote a human being in general.

The following cases of grammatical ambiguity were handled with CG rules in the Sicilian package.

  • Disambiguation within one part of speech. Verb forms within a single paradigm coincide fairly often in Sicilian. For instance, all Sicilian verbs have identical forms for the first, second and third person of the Present Subjunctive. Regular verbs of the second conjugation have the same form for the first and second person of the Present Indicative. In verbs of the first conjugation, the third person singular of the Present Indicative usually coincides with the second person plural of the Imperative.
  • Disambiguation between words of different categories. Since "-a", "-i" and "-u" are standard endings for Sicilian nouns, adjectives and verb forms, there are many more ambiguous word forms in Sicilian than one might expect. Many Sicilian masculine nouns coincide with Present Indicative forms of regular verbs (such as "munni", which is both the plural of "munnu" and a present form of "munnari"), and feminine nouns can match the Imperative of first-conjugation verbs. Conversion as a word-formation process is also a frequent source of ambiguous word forms.
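
To make the idea of contextual disambiguation concrete, here is a minimal Python sketch of one constraint-style rule. It is not taken from the project's CG file: the rule itself, the tag names, the helper select_noun_after_determiner and the analysis given for the article "li" are all assumptions for illustration. Only the ambiguity of "munni" comes from the text above.

```python
# Each token is (surface, readings); each reading is (lemma, part-of-speech tag).
# The tags and the rule are illustrative assumptions, not the project's CG rules.

def select_noun_after_determiner(sentence):
    """Keep only noun readings of a token that follows an unambiguous determiner."""
    result = []
    for index, (surface, readings) in enumerate(sentence):
        previous = result[index - 1][1] if index > 0 else []
        previous_is_det = len(previous) == 1 and previous[0][1] == "det"
        noun_readings = [r for r in readings if r[1] == "n"]
        # Apply the constraint only if it leaves at least one reading.
        if previous_is_det and noun_readings:
            readings = noun_readings
        result.append((surface, readings))
    return result


# "munni" is ambiguous between a noun and a verb reading (see the list above).
sentence = [
    ("li", [("li", "det")]),  # hypothetical analysis of the article
    ("munni", [("munnu", "n"), ("munnari", "vblex")]),
]
disambiguated = select_noun_after_determiner(sentence)
```

In this sketch the verb reading of "munni" is discarded because the preceding token can only be a determiner; without that context both readings are kept.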

Here is the list of ambiguous Sicilian and Spanish sentences that can be used to test the set of CG rules.


The total number of CG rules: 61.


Transfer rules

Transfer rules help to make a better translation when there are structural differences between languages that cannot be translated directly.

  • Unlike in Spanish, the synthetic future is no longer in use in Sicilian. It is replaced by periphrastic compound forms with common verbs such as "jiri", "vèniri" or "aviri".
  • The synthetic conditional forms of verbs are normally replaced by indicative or subjunctive forms.
  • Both Sicilian and Spanish have verb constructions with passive and modal meaning. Transfer rules are used to translate them correctly where the structure of these constructions differs between the two languages.
  • The transfer rules allow a non-reflexive verb to be translated with a reflexive verb, which is often necessary when translating from Sicilian to Spanish.
  • Sicilian and Spanish are similar in word order but show some subtle differences, for example in the placement of articles and pronouns. These differences are handled by the transfer rules.
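
The first rule in the list above, replacing a synthetic future with a periphrastic form, can be sketched as follows. This is a simplified Python illustration and not the project's transfer file: the function transfer_future is hypothetical, the tags are used only as labels, the default auxiliary is an arbitrary choice among the verbs named above, and any linking particle between auxiliary and infinitive is deliberately left out.

```python
def transfer_future(lemma, tags, auxiliary="aviri"):
    """Rewrite a synthetic future reading as auxiliary (present) + infinitive.

    Readings without the future tag are returned unchanged.
    """
    if "fti" not in tags:
        return [(lemma, list(tags))]
    # Person and number move from the main verb to the auxiliary.
    agreement = [t for t in tags if t not in ("vblex", "fti")]
    return [
        (auxiliary, ["vblex", "pri"] + agreement),
        (lemma, ["vblex", "inf"]),
    ]


# A first person singular future of a hypothetical target lemma "cantari".
periphrastic = transfer_future("cantari", ["vblex", "fti", "p1", "sg"])
```

The sketch operates on lemmas and tags only; producing the actual surface forms is left to the generation step.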

The total number of transfer rules: 40.


Corpus and pending tests

While working on the Sicilian-Spanish language pair, the language data from Sicilian and Spanish Wikipedia was used. Six articles from Sicilian Wikipedia were translated manually to Spanish in order to test the quality of translation.


Statistics

The initial goal of the project was to achieve 90% coverage of the Sicilian Wikipedia corpus. However, this turned out to be

Coverage            Sicilian-Spanish (%)    Spanish-Sicilian (%)
Trimmed coverage    83.4                    83.8

Coverage            Sicilian (%)            Spanish (%)
Raw coverage        85.5                    91.6
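
For readers who want to reproduce figures of this kind, the sketch below shows one naive way to compute coverage from analyser output. It assumes the output is in the Apertium stream format, where each lexical unit is written as ^surface/analysis$ and an unknown word has a single analysis starting with an asterisk. The function name coverage and the sample line are illustrative, and this is not necessarily the script that produced the numbers above.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the percentage of lexical units that received an analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # Unknown words are marked with a leading '*' in place of an analysis.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


# Illustrative input: two analysed units and one unknown word.
sample = "^lu/lu<det><def><m><sg>$ ^munnu/munnu<n><m><sg>$ ^xyz/*xyz$"
sample_coverage = coverage(sample)
```

Under this definition, raw coverage would be measured with the full monolingual dictionary and trimmed coverage with the dictionary restricted to entries that also appear in the bilingual dictionary.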

The number of lemmas in the bilingual dictionary: 11,253.

The number of lemmas in the Sicilian dictionary: 13,140.


Future work

TODO

Syntactic properties, more rules, automatic forms merge algorithm


Resources

TODO

https://scn.wikipedia.org/wiki/P%C3%A0ggina_principali

https://scn.wiktionary.org/wiki/P%C3%A0ggina_principali

Bonner, Introduction to Sicilian Grammar

El nuovo dizionario siciliano-italiano