- 1 Contact Information
- 2 Why is it you are interested in machine translation?
- 3 Why is it that you are interested in Apertium?
- 4 Which of the published tasks are you interested in? What do you plan to do?
- 5 Why Google and Apertium should sponsor it?
- 6 How and who will benefit from this project?
- 7 Workplan
- 8 Skills
- 9 Coding Challenge
- 10 Non-Summer-of-Code plans for the Summer
Contact Information
Name: Alexandra Kellner
Why is it you are interested in machine translation?
I have been working with topics related to linguistics and translation since I began my studies at the University of Toronto in 2004, majoring in Finnish studies and French-to-English translation. After graduating from the University of Toronto in 2008, I began working towards my master’s degree at the University of Helsinki, where I majored in Finnish language and culture but also focused on other Finno-Ugric languages, particularly North Saami, Udmurt and Komi-Zyrian.
I have been working as a translator of Finnish into English since 2009, and studied the constructions used by English-speaking learners of Finnish for my master’s thesis. The topics of computer-assisted and machine translation have come up in both my research and other work, but mostly in relation to larger languages like English, French and Finnish. When studying and working with Finno-Ugric minority languages (and other smaller languages), I have often wanted to use translation tools and resources similar to those I use for my work with English and Finnish, but the resources available are usually quite limited.
Over the past years I have gained sufficient proficiency in Komi and Udmurt to begin working on developing such tools. My familiarity with translation and related resources, as well as with cross-linguistic differences and construction grammar, will help me in my work on this project. I am familiar with the Giellatekno infrastructure, and want to learn the Apertium system in depth as well.
My own experience with minority language tools and resources has been that of a researcher and translator from outside the community, but I feel that the tools developed in this project would be most beneficial to the communities of speakers of the languages themselves. It is my hope that the machine translation resources for Komi and Udmurt will benefit speakers of these languages when writing in their local languages online and in other domains, so that they do not have to resort to Russian just because it is a larger language with a better technical infrastructure. Speakers of Udmurt and Komi could also benefit from each other’s local-language media and online forums and channels without having to use Russian as an intermediate language. This would promote the use of the languages in wider domains and prevent further disruption in their use and transmission.
Why is it that you are interested in Apertium?
The existing implementations of Apertium infrastructure demonstrate that a useful level of translation accuracy can be reached with language pairs comparable to Udmurt and Komi. The machine translation tools are built upon the infrastructure I am already familiar with, which allows me to develop my knowledge within this field.
Which of the published tasks are you interested in? What do you plan to do?
Adding a new language pair for Udmurt and Komi-Zyrian.
Why Google and Apertium should sponsor it?
Komi and Udmurt are relatively large minority languages with a range of existing resources and a demonstrably significant online presence. There are weekly media and various blogs in both languages, but interaction between speakers from the two communities takes place mainly in Russian or through Russian-mediated channels.
How and who will benefit from this project?
Speakers of Komi and Udmurt will benefit, as well as researchers working on these languages.
Komi and Udmurt are closely related Permic languages, but due to divergent historical developments and different contact influences, they are not mutually intelligible. As all speakers of these languages are bilingual in Russian, the dominant language in much of the region, the language of communication when the speakers interact is usually Russian.
Komi and Udmurt both have a large amount of local media. These media largely discuss themes that are relevant to the communities, such as the minority community experience in the Russian context, language maintenance and native literatures. At the moment, Komi and Udmurt speakers have no access to each other’s media due to the language barrier. The types and scope of the local media also differ between the two languages: for example, the Udmurt newspaper Dart is targeted at teens and often discusses topics relevant to this group, but there is no equivalent in Komi media. Many topics would clearly carry over well between these cultural contexts, and machine translation directly between Komi and Udmurt would be one way to increase the availability of existing and continuously growing cultural content.
From the scientific point of view, the differences between Komi and Udmurt are relatively well known, and are probably described most thoroughly in Raija Bartens’s monograph Permiläisten kielten rakenne ja kehitys [The Structure and Development of the Permic Languages]. However, despite its many merits, this book was written from a strongly historical point of view and within the framework of a relatively difficult-to-access Finno-Ugric research tradition. The situation is similar with other available comparative works: they contain very little information, and a different kind of treatment is needed.
Using the existing descriptions and the currently available corpora, both monolingual and parallel, would allow the described differences to be translated into a more formal description in the form of machine translation rules. It also seems likely that new differences and features can be discovered from contemporary materials, but the work done during the summer will focus on implementing the already known morphosyntactic structures.
Workplan
Post Application Period Getting more familiar with the Giellatekno and Apertium infrastructure
Community Bonding Period Working closely with mentor Tommi Pirinen. Converting the current catalogues of Komi-Udmurt parallel titles in the Fenno-Ugrica collection into a machine-readable format. The National Library of Finland has agreed to add this kind of enhanced information to their public metadata, which makes the parallel titles easier to find and makes it possible to query them through public APIs. Creating a larger parallel corpus from these texts is not the main goal of the proposed project, but identifying these titles is the starting point for that work.
Week 1 Converting existing Udmurt-Komi and Komi-Udmurt lexicons into Apertium format with an existing script
Week 2 Testing the lexicon with public-domain parallel text from Fenno-Ugrica, editing entries and removing those that are not suitable for use in translation (e.g. descriptive dictionary entries, and entries with long lists of potential translations, which need rules to select the preferred item from the list based on context).
Week 3 Continuing to test and edit the lexicon, becoming familiar with morphological work.
Week 4 Testing the system developed in weeks 1-3 using a larger text corpus.
Deliverable 1: Existing lexical resources and core vocabulary added.
Week 5 Collaboration with mentors to learn more about the implementation of morphological rules and to review the work from weeks 1-4.
Week 6 Based on learning from week 5, adding and testing rules to account for basic morphosyntactic differences.
Week 7 Building on work from previous week, implementing more complex rules, including selection of preferred lexical item from definitions with multiple translations (see week 2 on dictionaries).
Week 8 Evaluating output from complex rules and interactions between rules, attempts to resolve conflicts.
Deliverable 2: Rules developed to account for major morphological differences.
Week 9 Reviewing the differences between Udmurt and Komi and selecting aspects (together with mentors) that can be dealt with in the remaining weeks, depending on the success of rule development and implementation over weeks 5-8.
Weeks 10-11 Starting documentation of aspects that fall outside the scope of the project (determined based on meeting in week 9) and documenting the development and functionality of the system itself.
Week 12 Final project documentation, working out any remaining issues and documenting those that cannot be resolved. Plans for carrying over material for use in translation from Komi into Udmurt (additional challenges due to word-order constraints, etc.).
Final deliverable: A machine translation system taking into account preferred translations based on the text corpus. The word order in Komi translations will often be unusually focused due to Udmurt structure, but the text is understandable and readable.
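The lexicon clean-up planned for week 2 could be approached along the following lines. This is only a minimal sketch, not the project's actual tooling: the function names, the comma-separated input format, the placeholder entries and the thresholds (`max_alternatives`, `max_words`) are all assumptions made for illustration. It sets aside entries that have too many alternative translations or that look like descriptive glosses, so that they can be reviewed by hand or handled later by selection rules.

```python
def split_translations(raw_field, separator=","):
    """Split a raw translation field into a list of trimmed alternatives."""
    return [part.strip() for part in raw_field.split(separator) if part.strip()]


def is_usable(translations, max_alternatives=3, max_words=2):
    """Return True if an entry looks suitable for direct use in translation.

    Both thresholds are hypothetical and would need tuning on real data.
    """
    if not translations or len(translations) > max_alternatives:
        return False  # empty entry, or too many competing translations
    # Long multi-word translations are treated as descriptive glosses.
    return all(len(t.split()) <= max_words for t in translations)


def partition_entries(entries, max_alternatives=3, max_words=2):
    """Split a {lemma: raw translation field} mapping into kept/review dicts."""
    kept, needs_review = {}, {}
    for lemma, raw_field in entries.items():
        translations = split_translations(raw_field)
        if is_usable(translations, max_alternatives, max_words):
            kept[lemma] = translations
        else:
            needs_review[lemma] = translations
    return kept, needs_review


# Placeholder entries only; these are not real Udmurt or Komi data.
SAMPLE = {
    "lemma_a": "target_a",
    "lemma_b": "target_b1, target_b2, target_b3, target_b4, target_b5",
    "lemma_c": "a long descriptive gloss of the concept",
}

if __name__ == "__main__":
    kept, needs_review = partition_entries(SAMPLE)
    print("kept:", sorted(kept))
    print("needs review:", sorted(needs_review))
```

Keeping the rejected entries in a separate review list, rather than deleting them, fits the plan for week 7, where entries with multiple translations are revisited once selection rules exist.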
Skills
Language skills: English (native), Finnish (excellent, C2), French (excellent), Swedish (very good), Komi (very good), Udmurt (good), North Saami (basics), German (good)
Technical skills: Familiarity with Giellatekno infrastructure, experience with Linux and basic use of Terminal. Interested in programming. Experience working with machine and computer-assisted translation tools. Compiling and analysing a corpus of learner Finnish.
Work experience: Working as a professional translator using computer-assisted translation tools since 2009. Collection of speech data for Google Word of Mouth in summer 2009. Research on construction grammar, cross-linguistic interference in language acquisition and morphosyntax of Komi-Zyrian, among others.
Coding Challenge
Implemented changes in the lexicon of the existing apertium-kpv-udm repository (https://github.com/alexandra-sk/apertium-udm-kpv). Tested with a variety of sentences that exhibit morphological similarities between Udmurt and Komi. The example sentences are documented in the README file of the repository.
At this point extensive addition of new vocabulary was not done, but the script giella-xml2dix.py, included in the dev folder of the repository, was tested and evaluated to work satisfactorily.
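To illustrate what a lexicon conversion of this kind produces, the sketch below builds entries in the Apertium bilingual dictionary (.dix) layout, where each `<e>` element holds a `<p>` pair with a left side `<l>` and a right side `<r>`, each carrying its lemma and a part-of-speech symbol `<s n="..."/>`. It is not the giella-xml2dix.py script and makes no claim about how that script works or what input it reads: the function names and the in-memory list of lemma pairs are assumptions, and the two sample pairs are given for illustration only.

```python
import xml.etree.ElementTree as ET


def make_bidix_entry(source_lemma, target_lemma, pos_tag):
    """Build one <e><p><l>..</l><r>..</r></p></e> bilingual entry."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side, lemma in (("l", source_lemma), ("r", target_lemma)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        # Part-of-speech symbol attached after the lemma, e.g. <s n="n"/>.
        ET.SubElement(node, "s", {"n": pos_tag})
    return entry


def entries_to_section(pairs):
    """Wrap (source lemma, target lemma, POS tag) triples in a <section>."""
    section = ET.Element("section", {"id": "main", "type": "standard"})
    for source_lemma, target_lemma, pos_tag in pairs:
        section.append(make_bidix_entry(source_lemma, target_lemma, pos_tag))
    return section


# Illustrative Udmurt -> Komi noun pairs (assumed sample data).
SAMPLE_PAIRS = [("пу", "пу", "n"), ("ву", "ва", "n")]

if __name__ == "__main__":
    section = entries_to_section(SAMPLE_PAIRS)
    print(ET.tostring(section, encoding="unicode"))
```

A full dictionary file would also need the surrounding `<dictionary>` element with its symbol definitions, which this sketch leaves out.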
Non-Summer-of-Code plans for the Summer
Conference on language contacts in the Volga-Ural region in Cheboksary, travel to Canada in August, possibly some other conferences. These should not affect my work on the project.