Difference between revisions of "User:Rantakaulio/GSoC2021Proposal"

Finnish, Olonets-Karelian and Karelian lexicon development

== Contact information ==

'''Name:''' Timo Rantakaulio

'''E-mail address:''' timo.rantakaulio@gmail.com

'''Other information that may be useful to contact you (e.g. IRC):''' timo.rantakaulio@helsinki.fi

== Proposal ==


The three languages that this application targets are closely related Balto-Finnic languages spoken in geographical proximity to one another. Finnish is a large majority language with very advanced NLP infrastructure, whereas Olonets-Karelian and Karelian represent two orthographies in the Eastern Finnic dialect continuum. Both Olonets-Karelian and Karelian are used in writing and have some linguistic resources, such as Universal Dependencies treebanks, but resources remain scarce overall. One of the current infrastructure problems is imbalance: some languages and language pairs are much better covered than others. The proposed work aims to bring three closely related language pairs to comparable levels.

At present, the olo-fin language pair already exists, so the work on it would mainly involve improving the lexicon, confirming earlier translations and verifying that the bidirectionality of translation pairs is correct. The two other pairs, krl-fin and krl-olo, will be developed entirely from scratch. The plan takes into account that these two pairs demand significantly more work.
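One possible reading of the bidirectionality check can be sketched as follows. This is a minimal sketch, not the project's actual tooling: it assumes that entries have already been extracted into hypothetical `(left, right, restriction)` tuples, where `restriction` mirrors the `r="LR"` / `r="RL"` attribute of an Apertium bidix entry, and it uses placeholder strings instead of real lemmas. The function name `ambiguous_lemmas` is likewise an assumption.

```python
from collections import defaultdict


def ambiguous_lemmas(entries):
    """Find lemmas with more than one active translation per direction.

    entries: iterable of (left, right, restriction) tuples, where
    restriction is None (both directions), "LR" or "RL".
    Returns two dicts: clashes left-to-right and clashes right-to-left.
    """
    lr_targets = defaultdict(set)
    rl_targets = defaultdict(set)
    for left, right, restriction in entries:
        # An unrestricted or LR entry is active when translating left-to-right.
        if restriction in (None, "LR"):
            lr_targets[left].add(right)
        # An unrestricted or RL entry is active when translating right-to-left.
        if restriction in (None, "RL"):
            rl_targets[right].add(left)
    lr_clashes = {k: sorted(v) for k, v in lr_targets.items() if len(v) > 1}
    rl_clashes = {k: sorted(v) for k, v in rl_targets.items() if len(v) > 1}
    return lr_clashes, rl_clashes


# Placeholder data only; these are not real dictionary entries.
sample = [
    ("olo_a", "fin_x", None),
    ("olo_a", "fin_y", None),   # second unrestricted translation of olo_a
    ("olo_b", "fin_x", "LR"),   # only active left-to-right
    ("olo_c", "fin_z", None),
]
lr_clashes, rl_clashes = ambiguous_lemmas(sample)
print(lr_clashes)
print(rl_clashes)
```

Entries flagged this way are candidates for manual review, for example by adding a direction restriction to one of the competing translations.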

For this task it is very important to know all of the languages involved at an advanced level. I have previously worked with Karelian dictionaries, speak Finnish natively, and have an MA degree in linguistics. I’m currently conducting my PhD research at the University of Helsinki. In 2013–2014 I worked in a research project where I added Finnish translations to the Olonets–Finnish dictionary that is now available in the Giellatekno infrastructure. I also compiled extensive paradigm tests for different word types. In addition, I have conducted fieldwork in Karelian speech communities over numerous years, and know these language varieties closely at the dialectological and historical level as well. This experience allows me to work with lexical resources effectively and contribute to dialect research and language facilitation.

The proposed project will test two different approaches to lexicon editing: (1) importing a version-controlled CSV file, and (2) editing the entries directly in the Ve’rdd platform, which was developed as a Google Summer of Code task in 2020. This gives feedback to the Ve’rdd developers and helps to improve the software so that it supports more efficient editing workflows.
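The CSV-based approach could look roughly like the sketch below, which turns CSV rows into Apertium bilingual dictionary (`<e><p><l>…</l><r>…</r></p></e>`) entries. The column names `fin`, `olo` and `pos`, the function names and the two sample rows are assumptions made for illustration; the actual CSV layout is not specified in this proposal. Multiword lemmas, which need special handling for spaces, are not covered.

```python
import csv
import io
import xml.etree.ElementTree as ET


def row_to_entry(left, right, pos):
    """Build one bidix <e> element for a lemma pair sharing one POS tag."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for tag, lemma in (("l", left), ("r", right)):
        side = ET.SubElement(pair, tag)
        side.text = lemma
        ET.SubElement(side, "s", {"n": pos})  # symbol tag, e.g. n for noun
    return entry


def csv_to_entries(csv_text):
    """Convert CSV text with (assumed) fin/olo/pos columns to XML strings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        ET.tostring(row_to_entry(row["fin"], row["olo"], row["pos"]),
                    encoding="unicode")
        for row in reader
    ]


# Illustrative sample rows only.
SAMPLE_CSV = "fin,olo,pos\nkoti,kodi,n\nvesi,vezi,n\n"
entries = csv_to_entries(SAMPLE_CSV)
for line in entries:
    print(line)
```

Keeping the CSV under version control means every change to a translation pair is visible as a one-line diff, which is the main attraction of this approach compared with editing XML by hand.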

In addition to lexicon pairs, the added materials will also contain information needed for Constraint Grammar development, such as verbal government used in contextual disambiguation.

The Apertium infrastructure presently has approximately 1000 translations for [https://github.com/apertium/apertium-fin-krl/blob/master/apertium-fin-krl.fin-krl.dix ''fin-krl''], 260 for [https://github.com/apertium/apertium-fin-olo/blob/master/apertium-fin-olo.fin-olo.dix ''fin-olo''], and none for [https://github.com/apertium/apertium-krl-olo/blob/master/apertium-krl-olo.krl-olo.dix ''krl-olo'']. On GiellaLT, there is an olo-fin dictionary of approximately 17,754 lemmas with glosses for over 20,000 translation pairs.
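Counts like these can be re-checked as the dictionaries grow. The following is a minimal sketch that counts `<e>` entries inside the `<section>` elements of a bidix document; the function name `count_entries` and the tiny sample dictionary are assumptions for illustration, not data from the repositories linked above.

```python
import xml.etree.ElementTree as ET


def count_entries(dix_xml):
    """Count <e> elements that are direct children of any <section>."""
    root = ET.fromstring(dix_xml)
    return sum(len(section.findall("e")) for section in root.iter("section"))


# Illustrative two-entry dictionary, not real repository content.
SAMPLE_DIX = """<dictionary>
  <alphabet/>
  <sdefs><sdef n="n"/></sdefs>
  <section id="main" type="standard">
    <e><p><l>koti<s n="n"/></l><r>kodi<s n="n"/></r></p></e>
    <e><p><l>vesi<s n="n"/></l><r>vezi<s n="n"/></r></p></e>
  </section>
</dictionary>"""

count = count_entries(SAMPLE_DIX)
print(count)
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`. Note that a raw entry count is only an approximation of the number of translation pairs, since a real dictionary may also contain entries of other kinds.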

Other possible resources include dictionaries with Russian as one member of the pair. In the current proposal we do not work with the Russian lexicon directly, but acknowledge that these materials can be very relevant. There are 260 lexemes for [https://github.com/apertium/apertium-fin-rus/blob/master/apertium-fin-rus.fin-rus.dix ''fin-rus''] in the Apertium infrastructure, and the XML dictionaries in the GiellaLT infrastructure contain 13,384 [https://gtsvn.uit.no/langtech/trunk/words/dicts/olorus/ ''olo-rus''] pairs and 17,202 [https://gtsvn.uit.no/langtech/trunk/words/dicts/rusolo/ ''rus-olo''] pairs.

Additionally, the [https://kaino.kotus.fi/cgi-bin/kks/karjala.cgi ''Karelian dictionary from Kotus''] contains a large number of words, but it does not distinguish olo from krl and makes some very specific orthographic choices. However, as this dictionary aims to cover all Karelian dialects, it is a significant resource for the planned work. The material is available as a data package under a CC-BY license, which also makes it a good candidate for use in this work.

Additional dictionaries for Karelian are available under a CC-BY license on [http://illhportal.krc.karelia.ru/section.php?plang=r&id=1202 ''the website''] of the Institute of Linguistics, Literature and History of the Karelian Research Centre.

The goal of the planned work is to reach at least 4,500 translation pairs per language pair. This is already a large parallel lexicon, and it has to be created essentially within this project. The task also builds on earlier work with Karelian treebanks (Pirinen 2019), and creates a more solid foundation for the computational infrastructure of both orthographic variants.

The Ve’rdd dictionary development platform will also be used and further tested in this project. This continues the work of an earlier Google Summer of Code project, but from a new angle, as the platform is tested in large-scale editing work. For further documentation of the Ve’rdd project, see Alnajjar et al. 2020.

== Why is it that you are interested in Apertium? ==

I have worked for two years with Jack Rueter on the Olonets-Karelian - Finnish dictionary now available in the Giellatekno infrastructure, so I’m familiar with the environment and consider online dictionaries a powerful tool in language revitalisation. It is also clear that there is a use and a need for machine translation between Karelian varieties and Finnish, which my work will partly enable.

== Which of the published tasks are you interested in? ==

As a doctoral student in Finno-Ugrian languages I am ready to apply my knowledge of and skills in the languages themselves, working as a specialist in translation. Resources for the Karelian language are scarce, and I therefore consider improving dictionary coverage very important for further tasks on Karelian varieties, and I want to engage in this work specifically. There is also an ongoing Karelian language revitalization program at the University of Eastern Finland that has been planning to create Karelian dictionaries, so my proposed work also benefits these wider goals of the Karelian community and revitalization.

== Work plan ==

Before the work starts, the existing resources will be inspected and evaluated, and the initial version that serves as the starting point for the subsequent work will be created in collaboration with other specialists in Uralic lexicography (Jack Rueter, Tommi Pirinen).

<ul>
<li><blockquote><p>Week 1: Importing 1500 dictionary entries to the Olonets-Karelian - Finnish dictionary</p></blockquote></li>
<li><blockquote><p>Week 2: Importing 1500 dictionary entries to the Olonets-Karelian - Finnish dictionary</p></blockquote></li>
<li><blockquote><p>Week 3: Importing 1500 dictionary entries to the Olonets-Karelian - Finnish dictionary</p></blockquote></li>
<li><blockquote><p>Week 4: Documenting the work that has been done, and evaluating the result</p></blockquote></li></ul>

=== Deliverable #1 ===

<ul>
<li><blockquote><p>Week 5: Importing 1500 dictionary entries to the Karelian - Finnish dictionary</p></blockquote></li>
<li><blockquote><p>Week 6: Importing 1500 dictionary entries to the Karelian - Finnish dictionary</p></blockquote></li>
<li><blockquote><p>Week 7: Importing 1500 dictionary entries to the Karelian - Finnish dictionary</p></blockquote></li>
<li><blockquote><p>Week 8: Documenting the work that has been done, and evaluating the result</p></blockquote></li></ul>

=== Deliverable #2 ===

<ul>
<li><blockquote><p>Week 9: Importing 1500 dictionary entries to the Olonets-Karelian - Karelian dictionary</p></blockquote></li>
<li><blockquote><p>Week 10: Importing 1500 dictionary entries to the Olonets-Karelian - Karelian dictionary</p></blockquote></li>
<li><blockquote><p>Week 11: Importing 1500 dictionary entries to the Olonets-Karelian - Karelian dictionary</p></blockquote></li>
<li><blockquote><p>Week 12: Documenting the work that has been done, and evaluating the result</p></blockquote></li></ul>

While the dictionaries are being compiled, I will also tag various grammatical properties of Karelian lexemes according to the plan. This tagging will be documented, and it thereby benefits work on Constraint Grammar, disambiguation and syntactic description. The documentation will be part of the deliverables in the last week of each working phase.

=== Project completed. ===

== List any non-Summer-of-Code plans you have for the Summer ==

I can participate in the project during most of the spring and summer. I plan to use two long weekends for agricultural work in May and one long weekend for a holiday with my family around Midsummer Day in June. I plan to spend at least 30 hours per week on this dictionary work, given the large number of dictionary entries needed. Since the Karelian language is also a personally important topic for me, I will most likely work more than the minimum.

== References ==

Alnajjar, K., Hämäläinen, M., Rueter, J., &amp; Partanen, N. (2020, December). Ve’rdd. Narrowing the Gap between Paper Dictionaries, Low-Resource NLP and Community Involvement. In Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations (pp. 1-6).

Pirinen, T. A. (2019). Building minority dependency treebanks, dictionaries and computational grammars at the same time—an experiment in Karelian treebanking. In Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019) (pp. 132-136).

Latest revision as of 15:28, 13 April 2021
