Difference between revisions of "User:Rantakaulio/GSoC2021Proposal"

From Apertium

= Finnish, Olonets-Karelian and Karelian lexicon development =

The three languages targeted by this application are closely related Balto-Finnic languages spoken in geographical proximity to one another. Finnish is a large majority language with very advanced NLP infrastructure, whereas Olonets-Karelian and Karelian represent two orthographies within the Eastern Finnic dialect continuum. Both Olonets-Karelian and Karelian are used in writing and have some linguistic resources, such as Universal Dependencies treebanks, but resources remain scarce overall. One current infrastructure problem is imbalance: some languages and language pairs are much better covered than others. The proposed project aims to bring three closely related language pairs to comparable levels.

At present, the olo-fin language pair already exists, so the work on it would mainly involve improving the lexicon, confirming earlier translations and verifying that the translation pairs are correct in both directions. The two other pairs, krl-fin and krl-olo, will be developed entirely from scratch. The plan takes into account that these two pairs demand significantly more work.
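One part of verifying translation pairs in both directions can be sketched as a script that lists lemmas with more than one counterpart on the other side of a bilingual dictionary. The sketch below is only an illustration under stated assumptions: it assumes simple single-word entries of the form <code><e><p><l>…</l><r>…</r></p></e></code>, treats the <code>r="LR"</code> and <code>r="RL"</code> attributes as direction restrictions, and uses placeholder lemmas (<code>x1</code>, <code>y1</code>, …) rather than real dictionary data. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def side_key(side):
    """Return 'lemma<tag.tag>' for an <l> or <r> element (single-word lemmas only)."""
    lemma = (side.text or "").strip()
    tags = ".".join(s.get("n", "") for s in side.findall("s"))
    return f"{lemma}<{tags}>"


def ambiguous_pairs(dix_xml):
    """Find entries that have several counterparts in either direction.

    Returns two dicts: left-to-right and right-to-left ambiguities.
    """
    root = ET.fromstring(dix_xml)
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for e in root.iter("e"):
        p = e.find("p")
        if p is None:
            continue
        left, right = p.find("l"), p.find("r")
        if left is None or right is None:
            continue
        lk, rk = side_key(left), side_key(right)
        restriction = e.get("r")  # "LR", "RL" or None (both directions)
        if restriction != "RL":
            left_to_right[lk].add(rk)
        if restriction != "LR":
            right_to_left[rk].add(lk)
    l2r = {k: sorted(v) for k, v in left_to_right.items() if len(v) > 1}
    r2l = {k: sorted(v) for k, v in right_to_left.items() if len(v) > 1}
    return l2r, r2l


# Placeholder entries, not real dictionary content.
SAMPLE = """<dictionary><section id="main" type="standard">
<e><p><l>x1<s n="n"/></l><r>y1<s n="n"/></r></p></e>
<e r="LR"><p><l>x1<s n="n"/></l><r>y2<s n="n"/></r></p></e>
<e><p><l>x2<s n="n"/></l><r>y3<s n="n"/></r></p></e>
</section></dictionary>"""

if __name__ == "__main__":
    l2r, r2l = ambiguous_pairs(SAMPLE)
    print("left-to-right:", l2r)
    print("right-to-left:", r2l)
```

A report like this only flags candidates for manual review; deciding which translation is correct in which direction remains a linguistic judgement.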

For this task it is very important to know all of the languages involved at an advanced level. I have previously worked with Karelian dictionaries, speak Finnish natively, and have an MA degree in linguistics. I am currently conducting my PhD research at the University of Helsinki. In 2013–2014 I worked in a research project where I added Finnish translations to the Olonets–Finnish dictionary that is now available in the Giellatekno infrastructure. I also compiled extensive paradigm tests for different word types. In addition, I have conducted fieldwork in Karelian speech communities over many years, and I know these language varieties closely at the dialectological and historical level as well. This experience allows me to work with lexical resources effectively and to contribute to dialect research and language facilitation.

The proposed project will test two different approaches to lexicon editing: (1) importing a version-controlled CSV file, and (2) editing the entries directly in the Ve’rdd platform, which was developed as a Google Summer of Code task in 2020. This provides feedback to the Ve’rdd developers and helps to improve the software so that it supports more efficient editing workflows.
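The CSV approach (1) could, for example, feed a small conversion script that turns each row into a bilingual dictionary entry. The following is a minimal sketch, not the project's actual tooling: the column names <code>olo</code>, <code>fin</code> and <code>pos</code> are assumptions, the sample rows are illustrative only, and multi-word lemmas (which need special handling of spaces in Apertium dictionaries) are not covered.

```python
import csv
import io
import xml.etree.ElementTree as ET


def row_to_entry(left_lemma, right_lemma, pos):
    """Build one <e><p><l>..</l><r>..</r></p></e> element with a single POS tag."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for tag, lemma in (("l", left_lemma), ("r", right_lemma)):
        side = ET.SubElement(pair, tag)
        side.text = lemma
        ET.SubElement(side, "s", {"n": pos})
    return entry


def csv_to_entries(csv_text, left_col="olo", right_col="fin", pos_col="pos"):
    """Convert CSV text into a list of serialized entries, skipping incomplete rows."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        left = (row.get(left_col) or "").strip()
        right = (row.get(right_col) or "").strip()
        pos = (row.get(pos_col) or "").strip()
        if not (left and right and pos):
            continue  # leave incomplete rows for manual review
        entries.append(ET.tostring(row_to_entry(left, right, pos), encoding="unicode"))
    return entries


# Illustrative rows only; the third row is deliberately incomplete.
SAMPLE_CSV = "olo,fin,pos\nkala,kala,n\nvezi,vesi,n\n,puuttuu,n\n"

if __name__ == "__main__":
    for line in csv_to_entries(SAMPLE_CSV):
        print(line)
```

Keeping the CSV as the version-controlled source and regenerating the entries from it means that every change to a translation pair shows up as a one-line diff.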

In addition to lexicon pairs, the added materials will also contain information needed for Constraint Grammar development, such as verbal government used in contextual disambiguation.

The Apertium infrastructure presently has approximately 1000 translations for [https://github.com/apertium/apertium-fin-krl/blob/master/apertium-fin-krl.fin-krl.dix ''fin-krl''], 260 for [https://github.com/apertium/apertium-fin-olo/blob/master/apertium-fin-olo.fin-olo.dix ''fin-olo''] and none for [https://github.com/apertium/apertium-krl-olo/blob/master/apertium-krl-olo.krl-olo.dix ''krl-olo'']. On GiellaLT, there is an olo-fin dictionary of approximately 17,754 lemmas, with glosses for over 20,000 translation pairs.

Other possible resources:

* [https://github.com/apertium/apertium-fin-rus/blob/master/apertium-fin-rus.fin-rus.dix ''fin-rus'']: 260 translations

* olo-rus (13,384): gtsvn/words/dicts/olorus/src/*.xml

* rus-olo (17,202): gtsvn/words/dicts/rusolo/src/*.xml

The Karelian dictionary from Kotus contains a large number of words, but it does not distinguish between olo and krl and makes some very specific orthographic choices. However, as this dictionary aims to cover all Karelian dialects, it is a significant resource for the planned work.

The goal of the planned work is to reach 10,000 translation pairs per language pair. This is already a large parallel lexicon, and it has to be created essentially within this project. The task also builds on earlier work with Karelian treebanks (Pirinen 2019) and creates a more solid foundation for the computational infrastructure of both orthographic variants.

The project will also use and further test the Ve’rdd dictionary development platform. This continues the work of an earlier Google Summer of Code project, but from a new angle, as the platform will be tested in large-scale editing work.

'''Name:''' Timo Rantakaulio

'''E-mail address:''' timo.rantakaulio@gmail.com

'''Other information that may be useful to contact you (e.g. IRC):''' timo.rantakaulio@helsinki.fi

'''Why is it that you are interested in Apertium? '''

I have worked for two years with Jack Rueter on the Olonets-Karelian - Finnish dictionary in the Giellatekno infrastructure, so I am familiar with the environment, and I consider online dictionaries a powerful tool in language revitalisation.

'''Which of the published tasks are you interested in? What do you plan to do?'''

As a doctoral student in Fenno-Ugrian languages, I am ready to apply my knowledge of and skills in the languages involved, working in the process as a specialist in translation.

''Include a proposal, including''

''* a title, ''

''* reasons why Google and Apertium should sponsor it, ''

''* a description of how and who it will benefit in society,''

''* and a detailed work plan (including, if possible, a schedule with milestones and deliverables).''

'''Work plan'''

* Week 1: Importing 2500 dictionary entries to the Olonets-Karelian - Finnish dictionary

* Week 2: Importing 2500 dictionary entries to the Olonets-Karelian - Finnish dictionary

* Week 3: Importing 2500 dictionary entries to the Olonets-Karelian - Finnish dictionary

* Week 4: Importing 2500 dictionary entries to the Olonets-Karelian - Finnish dictionary

'''Deliverable #1'''

* Week 5: Importing 2500 dictionary entries to the Karelian - Finnish dictionary

* Week 6: Importing 2500 dictionary entries to the Karelian - Finnish dictionary

* Week 7: Importing 2500 dictionary entries to the Karelian - Finnish dictionary

* Week 8: Importing 2500 dictionary entries to the Karelian - Finnish dictionary

'''Deliverable #2'''

* Week 9: Importing 2500 dictionary entries to the Olonets-Karelian - Karelian dictionary

* Week 10: Importing 2500 dictionary entries to the Olonets-Karelian - Karelian dictionary

* Week 11: Importing 2500 dictionary entries to the Olonets-Karelian - Karelian dictionary

* Week 12: Importing 2500 dictionary entries to the Olonets-Karelian - Karelian dictionary

Alongside the lexicon work, there will also be some work on Constraint Grammar (CG), disambiguation, syntax and transfer.

'''Project completed.'''

'''List any non-Summer-of-Code plans you have for the Summer'''

I can participate in the project during most of the spring and summer. I plan to use two long weekends for agriculture in May and one long weekend for a holiday with my family around Midsummer Day in June. I plan to spend at least 30 hours per week on this dictionary work, given the large number of entries needed. Since the Karelian language is also a personally important topic for me, I will most likely work more than the minimum.
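As a rough sanity check, the figures in the work plan (2500 entries per week, four weeks per language pair) and the 30-hour minimum above can be combined into a throughput estimate. This is only back-of-the-envelope arithmetic on the numbers stated in this proposal; it assumes all working time goes to importing entries, and the variable names are illustrative.

```python
# Figures taken from the work plan and the weekly time commitment.
PAIRS = ["olo-fin", "krl-fin", "krl-olo"]
WEEKS_PER_PAIR = 4
ENTRIES_PER_WEEK = 2500
MIN_HOURS_PER_WEEK = 30

# Totals implied by the schedule.
entries_per_pair = WEEKS_PER_PAIR * ENTRIES_PER_WEEK
total_entries = entries_per_pair * len(PAIRS)

# Pace needed if only the minimum weekly hours are worked.
entries_per_hour = ENTRIES_PER_WEEK / MIN_HOURS_PER_WEEK
seconds_per_entry = 3600 / entries_per_hour

if __name__ == "__main__":
    print(f"entries per pair:  {entries_per_pair}")
    print(f"total entries:     {total_entries}")
    print(f"entries per hour:  {entries_per_hour:.1f}")
    print(f"seconds per entry: {seconds_per_entry:.1f}")
```

The per-entry time is short, which is one reason the plan relies on importing from existing resources rather than writing every entry by hand.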

'''References '''

Pirinen, T. A. (2019). Building minority dependency treebanks, dictionaries and computational grammars at the same time—an experiment in Karelian treebanking. In Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019) (pp. 132-136).

Revision as of 11:51, 13 April 2021
