User:Hectoralos/GSOC 2020 proposal: French-Arpitan
Revision as of 12:05, 26 March 2020
Contents
- 1 Contact Information
- 2 Why is it you are interested in machine translation?
- 3 Why is it that you are interested in Apertium?
- 4 Which of the published tasks are you interested in? What do you plan to do?
- 5 My proposal
- 6 List your skills and give evidence of your qualifications
- 7 Coding challenge
- 8 List any non-Summer-of-Code plans you have for the Summer
Contact Information
Name: Hèctor Alòs i Font
Location: Shupashkar, Chuvashia, Russia
E-mail: hectoralos@gmail.com
IRC: hector2
GitHub: hectoralos
Telegram: hectoralos
Skype: hectoralos
Why is it you are interested in machine translation?
I’m a sociolinguist working on language maintenance and shift. I'm very interested in creating resources for minoritised languages.
Why is it that you are interested in Apertium?
- Because Apertium is free/open-source software.
- Because its community is strongly committed to under-resourced and minoritised/marginalised languages.
- Because there is a lot of good work done and being done in it.
- Because it is not only machine translation, but also free resources that can be used for other purposes: e.g. dictionaries, morphological analysers, spellcheckers, etc.
Which of the published tasks are you interested in? What do you plan to do?
Adopt an unreleased language pair: I'd like to create the pair French-Arpitan (Franco-Provençal).
Arpitan does not currently have any support in Apertium (and elsewhere its support is very scarce).
My proposal
Title
Adopting the French-Arpitan language pair.
Major goals
- Creating an Arpitan repository in Apertium using the ORB orthography. This should include:
- A morphological dictionary with a naive coverage of at least 92% of non-Wikipedia texts written in ORB Arpitan
- A morphological disambiguation tool using both statistical disambiguation and Constraint Grammar Tools
- A set of post-generation rules
- Developing a translator from French to ORB Arpitan with a WER below 20% and a translator from ORB Arpitan to French with a WER below 25%.
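The WER targets above can be checked with a word-level edit distance between a machine translation and its post-edited reference. What follows is only a minimal sketch under stated assumptions, not the evaluation tool used in Apertium: the function name `wer` and the file names in the usage note are hypothetical, and words are simply whitespace-separated tokens.

```shell
# wer REF HYP: print the word error rate (in %) of the text in file HYP
# measured against the reference text in file REF.
# Hypothetical helper for illustration; tokens are whitespace-separated words.
wer() {
    awk '
        # Collect the words of the first file (reference), then of the second.
        FILENAME == ARGV[1] { for (i = 1; i <= NF; i++) ref[++n] = $i; next }
                            { for (i = 1; i <= NF; i++) hyp[++m] = $i }
        END {
            if (n == 0) exit 1          # WER is undefined for an empty reference
            # Levenshtein distance over words, keeping two rows of the table.
            for (j = 0; j <= m; j++) prev[j] = j
            for (i = 1; i <= n; i++) {
                cur[0] = i
                for (j = 1; j <= m; j++) {
                    # Force string comparison so that "1" and "1.0" differ.
                    cost = ((ref[i] "") == (hyp[j] "")) ? 0 : 1
                    best = prev[j - 1] + cost                       # substitution or match
                    if (prev[j] + 1 < best) best = prev[j] + 1      # reference word missing
                    if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1 # extra word in HYP
                    cur[j] = best
                }
                for (j = 0; j <= m; j++) prev[j] = cur[j]
            }
            printf "%.1f\n", 100 * prev[m] / n
        }
    ' "$1" "$2"
}
```

A call such as `wer postedited.frp.txt output.frp.txt` (both file names are placeholders) would then print a single percentage to compare against the 20% and 25% targets.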
Reasons why Google and Apertium should sponsor it
Arpitan is an endangered and heavily under-resourced language. This would be the first translator for it and, as far as I know, its first morphological analyser. From this project a first spell-checker for the language could be generated.
Arpitan (often called Franco-Provençal) is a Gallo-Romance language spoken in France, Switzerland and Italy. According to the UNESCO Atlas of the World’s Languages in Danger, it has c. 100,000 speakers and is “definitely endangered”. In Switzerland and France, family transmission of the language is almost, if not completely, broken. As a result, more than one third of the speakers are concentrated in the relatively small Aosta Valley, in Italy, and this proportion is increasing. In fact, only in Italy does Arpitan enjoy a certain degree of institutional support, although Switzerland has also recently included it among the languages protected under the European Charter for Regional or Minority Languages. In France, it is the only non-Oïl regional language that is not officially taught in schools. In practice, it is taught in a handful of schools in Savoy on the pretext that Alpine Provençal is being taught, and Occitan teachers examine the students in the French Baccalauréat, allegedly for this variety of Occitan; this is the only possible way to circumvent the French authorities' complete lack of interest in Arpitan.
Arpitan was recognized by the Romanists as a language in its own right as early as the end of the 19th century. However, it has not been standardised. It is highly dialectalised, with numerous variants for each word and also multiple morphological and phonetic variants. In a context of centuries-old marginalisation, its written use has been scarce. Several phonetic orthographies have been developed and are used in small circles, each in a different region. As each one is based on a local variety, they do not really help communication across regions. The adoption of any of them by speakers of other varieties is virtually impossible.
In this context, at the end of the 1990s a supra-dialectal spelling was proposed, which was later followed by a dictionary that also tries to find forms that are as general as possible for the whole linguistic domain. These proposals are also accompanied by others on word inflection. This proposal, called ORB (reference spelling B), is the one I would use in this project.
The relatively small number of texts in Arpitan, its great variety and the different competing spellings make it quite difficult to work on. However, it is a language in dire need of modern linguistic resources to help increase its written use, as well as its status. To this end, Apertium is particularly suitable and ORB seems to be the best bet for the standardization of Arpitan.
Resources
I’m not a speaker of Arpitan, but I know French pretty well, and often read in Occitan too. ORB Arpitan is very easy to read if one knows these two languages. I have a long-standing friendship with two Arpitan activists (one from Savoy, the other from Switzerland), who are already helping me in this project. I have also contacted Dominique Stich, the creator of ORB, who has given me various resources, answered my questions about the language and is willing to continue helping me in the project.
From him I got an electronic version of his French-Arpitan and Arpitan-French dictionary, together with his permission and that of his publisher (one of my friends: the world is small) to publish the content under a GPL licence. The site arpitan.eu is also helpful. Stich also gave me two of his translations into Arpitan (c. 65,000 words), which I have begun to use as a corpus. I’m also collecting other texts in ORB.
The morphology of the (standardised version of the) language is pretty well described in Stich’s works, although there is still often more than one possibility in verb inflection. Clearly, despite Stich's hard work, many decisions will have to be made when generating Arpitan in Apertium, so I’d make maximum use of the corpus and my informants.
It should be added that dictionaries cover general vocabulary well, but in the current state of language standardization there is a clear lack of specialized terms. If we take into consideration the two fields for which Apertium translators are usually designed (encyclopedic and journalistic articles), this lack will be noticed in the coverage results. To give an example of a highly topical matter, not only are the words "coronavirus" and "Covid" (logically) missing from the dictionaries, but also the Arpitan equivalents of "virus", "contagion", "contagieux", "pandémie", "confinement", "confiner", "épidémiologie", "épidémiologiste/logue". Hopefully, with the help of the collaborators, we will be able to help propose some new terms, although this issue takes up a lot of time and exceeds the goals of this three-month project.
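The effect of missing vocabulary on coverage can be measured directly from the analyser output. The sketch below assumes the usual Apertium stream format, in which each lexical unit is written as `^surface/analysis$` and an unknown word is analysed as `*surface`. The function name `naive_coverage` is hypothetical, and escaped `$` characters inside the text are ignored for simplicity.

```shell
# naive_coverage: read morphological-analyser output in Apertium stream
# format on stdin and print the share (in %) of lexical units that received
# at least one analysis. Hypothetical helper for illustration.
naive_coverage() {
    # Put every lexical unit ^...$ on its own line, then count the units
    # whose analysis starts with "*" (the mark for an unknown word).
    grep -o '\^[^$]*\$' | awk '
        { total++ }
        /\/\*/ { unknown++ }
        END {
            if (total == 0) exit 1      # nothing to measure
            printf "%.1f\n", 100 * (total - unknown) / total
        }
    '
}
```

Assuming the Arpitan package provides an analysis mode named `frp-morph` (an assumption, following the naming used in other Apertium language packages), the corpus could be piped through `apertium -d . frp-morph` and then into `naive_coverage`.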
Design decisions
- Dictionaries will not support non-ORB Arpitan since spelling conventions are quite different and so is inflection.
- I plan to use the traditional system of structural transfer in three steps. I consider the new recursive transfer very promising, and I do want to use it in the new version of the French-Catalan pair, on which I regularly work. For such an under-resourced language as Arpitan, however, I prefer to use a proven and safe technology that I am familiar with and that I know will work well with two very closely related languages.
Workplan
Week | Dates | Goals | Bidix (excluding proper names) | WER | Coverage |
---|---|---|---|---|---|
Post-application period | 1 April - 17 May | | | | |
1 | 18 May - 24 May | | | | |
2 | 25 May - 31 May | | ~1,500 | | |
3 | 1 June - 7 June | | ~3,000 | | |
4 | 8 June - 14 June | | ~4,500 | | |
5 | 15 June - 21 June | First evaluation (15-19 June) | ~6,000 | | >80% (fra > frp); >85% (frp > fra) |
6 | 22 June - 28 June | | ~7,500 | | |
7 | 29 June - 5 July | | ~8,500 | | |
8 | 6 July - 12 July | | ~9,500 | | |
9 | 13 July - 19 July | Second evaluation (13-17 July) | ~10,500 | <25% (fra-frp) | ~89% (fra > frp); ~92% (frp > fra) |
10 | 20 July - 26 July | | ~11,500 | | |
11 | 27 July - 2 August | | ~12,500 | | |
12 | 3 August - 9 August | | ~12,750 | | |
13 | 10 August - 16 August | Final evaluation (10-17 August) | ~13,000 | <20% (fra > frp); <25% (frp > fra) | ~90.0% (fra > frp); ~93.0% (frp > fra) |
List your skills and give evidence of your qualifications
I earned a degree in computer engineering (Universitat Politècnica de Catalunya, 1988), but I’ve forgotten almost everything about programming (except vi, regular expressions and writing short Perl scripts). I also earned a BA in linguistics (Universitat de Barcelona, 2008). I’ve been working on Apertium for several years. In 2011, I created the Esperanto-French pair and new releases of the Esperanto-Catalan and Esperanto-Spanish pairs. I’ve also been working on new releases of the French-Catalan pair (2017, 2018, 2019 and 2020).
I mentored and was strongly involved in the GSoC projects on Italian-Sardinian (2016), Catalan-Sardinian (2017) and French-Occitan (2018). In all of them we released new one-way language pairs just after the end of the GSoC. In 2019, I received a GSoC stipend myself and created new releases for the Catalan-Italian and Catalan-Portuguese pairs. These new releases included the direction Catalan > Italian, which was new, and the generation of three “flavours” of Portuguese.
I’m a fluent speaker of French, as well as a fluent reader of most Romance languages, including ORB Arpitan. I’ll have help with the evaluation of the translations into Arpitan: since I don’t speak it and have become familiar with it only in the last few months, I can see only part of the errors when generating texts in it.
Coding challenge
I have created apertium-frp and apertium-fra-frp. You can test the challenge by cloning, compiling and typing in apertium-fra-frp:
cat ../apertium-fra/texts/cuento.fra.txt | apertium -d . fra-frp
cat ../apertium-frp/texts/cuento.frp.txt | apertium -d . frp-fra
The translation into Arpitan is here.
The translation into French is here.
There are still a few unknown words. Almost all are irregular verbs, for which I haven’t yet entered the paradigm.
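The remaining unknown words can be listed from the translator output, where Apertium by default marks each unknown word with a leading `*`. The following is a minimal sketch; the function name `unknown_words` is hypothetical, and trailing punctuation is stripped in a simple-minded way.

```shell
# unknown_words: read translated text on stdin and print each word that the
# translator marked as unknown (leading "*"), with its frequency, most
# frequent first. Hypothetical helper for illustration.
unknown_words() {
    tr -s '[:space:]' '\n' |                       # one token per line
        grep '^\*' |                               # keep only "*"-marked tokens
        sed -e 's/^\*//' -e 's/[[:punct:]]*$//' |  # drop the mark and trailing punctuation
        sort | uniq -c | sort -rn
}
```

For example, appending `| unknown_words` to either of the two `apertium -d .` commands above would print the words still missing from the dictionaries.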
List any non-Summer-of-Code plans you have for the Summer
I can guarantee at least 30 hours of work per week during the whole Summer. As I love this kind of work, I'm sure I'll put in considerably more.
I have to submit the last two short papers for two subjects of the master's degree I am studying on June 1 and 8, but I should be able to finish the last one at the beginning of May. Due to the coronavirus, it is unlikely that I will travel on holiday with my family for a couple of weeks in July, as I did last year (and even then I was able to work 30 hours or more per week on the project). In the unlikely event that I do not reach 30 hours of work in a given week, I would make up these hours.