User:Hectoralos/GSOC 2020 proposal: French-Arpitan

1 Contact Information
2 Why is it you are interested in machine translation?
3 Why is it that you are interested in Apertium?
4 Which of the published tasks are you interested in? What do you plan to do?
5 My proposal
6 List your skills and give evidence of your qualifications
7 Coding challenge
8 List any non-Summer-of-Code plans you have for the Summer

Contact Information

Name: Hèctor Alòs i Font

Location: Shupashkar, Chuvashia, Russia

E-mail: hectoralos@gmail.com

IRC: hector2

GitHub: hectoralos

Telegram: hectoralos

Skype: hectoralos

Why is it you are interested in machine translation?

I’m a sociolinguist working on language maintenance and shift. I'm very interested in creating resources for minoritised languages.

Why is it that you are interested in Apertium?

Because Apertium is free/open-source software.
Because its community is strongly committed to under-resourced and minoritised/marginalised languages.
Because there is a lot of good work done and being done in it.
Because it is not only machine translation, but also free resources that can be used for other purposes: e.g. dictionaries, morphological analysers, spellcheckers, etc.

Which of the published tasks are you interested in? What do you plan to do?

Adopt an unreleased language pair: I'd like to create the pair French-Arpitan (Franco-Provençal). Arpitan does not currently have any support in Apertium (and elsewhere its support is very scarce).

My proposal

Title

Adopting the French-Arpitan language pair.

Major goals

Creating an Arpitan repository in Apertium using the ORB orthography. This should include:
- A morphological dictionary with a naive coverage of at least 92% of non-Wikipedia texts written in ORB Arpitan
- A morphological disambiguation tool using both statistical disambiguation and Constraint Grammar Tools
- A set of post-generation rules
Developing a translator from French to ORB Arpitan with a WER below 20% and a translator from ORB Arpitan to French with a WER below 25%.
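Both targets above can be checked with small helpers. The following is a minimal shell sketch, not Apertium's own evaluation tooling; `naive_coverage` and `wer` are hypothetical helper names. It assumes that analyser output is in the standard Apertium stream format, where each token appears as ^surface/analysis$ and an unknown word as ^surface/*surface$, and that WER is taken as the word-level edit distance divided by the number of words in the reference translation.

```shell
# Naive coverage: percentage of tokens with at least one analysis.
# Reads Apertium analyser output (stream format) on stdin.
naive_coverage() {
    grep -o '\^[^$]*\$' | LC_ALL=C awk '
        { total++ }                 # one lexical unit per line
        /\/\*/ { unknown++ }        # unknown words are analysed as /*surface
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * (total - unknown) / total
        }'
}

# Word error rate (in %) of a hypothesis against a reference translation.
# $1 = reference sentence, $2 = hypothesis sentence (assumed argument order).
wer() {
    LC_ALL=C awk -v ref="$1" -v hyp="$2" 'BEGIN {
        n = split(ref, r, " ")
        m = split(hyp, h, " ")
        if (n == 0) { print "empty reference" > "/dev/stderr"; exit 1 }
        # Word-level Levenshtein distance by dynamic programming.
        for (i = 0; i <= n; i++) d[i, 0] = i
        for (j = 0; j <= m; j++) d[0, j] = j
        for (i = 1; i <= n; i++) {
            for (j = 1; j <= m; j++) {
                # Appending "" forces a string comparison of the two words.
                cost = ((r[i] "") == (h[j] "")) ? 0 : 1
                best = d[i - 1, j - 1] + cost                          # substitution or match
                if (d[i - 1, j] + 1 < best) best = d[i - 1, j] + 1    # deletion
                if (d[i, j - 1] + 1 < best) best = d[i, j - 1] + 1    # insertion
                d[i, j] = best
            }
        }
        printf "%.2f\n", 100 * d[n, m] / n
    }'
}
```

For instance, assuming an analyser mode named frp-morph and a plain-text ORB corpus in corpus.txt (both names are assumptions), `apertium -d . frp-morph < corpus.txt | naive_coverage` would print the share of tokens that received at least one analysis. Punctuation tokens are counted as well, which is one more sense in which the figure is naive.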

Reasons why Google and Apertium should sponsor it

Arpitan is an endangered and heavily under-resourced language. This would be the first translator for it and, as far as I know, the first morphological analyser. From this project a first spell-checker for the language could be generated.

Arpitan (often called Franco-Provençal) is a Gallo-Romance language spoken in France, Switzerland and Italy. According to the UNESCO Atlas of the World’s Languages in Danger, it has c. 100,000 speakers and is “definitely endangered”. In Switzerland and France the family transmission of the language is almost, if not completely, broken. As a result, more than one third of the speakers are concentrated in the relatively small Aosta Valley, in Italy, and this proportion is increasing. In fact, only in Italy does Arpitan enjoy a certain institutional support, although Switzerland has also recently included it among the languages under the protection of the European Charter for Regional or Minority Languages. In France, it is the only non-Oïl regional language that is not officially taught in schools (in practice, it is taught in a handful of them in Savoy, on the pretext that Alpine Provençal is taught, and Occitan teachers also examine the students, allegedly for this variety of Occitan, in the French Baccalauréat; this is the only possible way to circumvent the French authorities' complete lack of interest in Arpitan).

Arpitan was recognized by the Romanists as a language in its own right as early as the end of the 19th century. However, it has not been standardised. It is highly dialectalised, with numerous variants for each word and also multiple morphological and phonetic variants. In a context of centuries-old marginalisation, its written use has been scarce. Several phonetic orthographies have been developed and are used in small circles, each in a different region. As each one is based on a local variety, they do not really help communication across regions. The adoption of any of them by speakers of other varieties is virtually impossible.

In this context, at the end of the 1990s a supra-dialectal spelling was proposed, which was later followed by a dictionary that also tries to find forms that are as general as possible for the whole linguistic domain. These proposals are also accompanied by others on word inflection. This proposal, called ORB (reference spelling B), is the one I would use in this project.

The relatively small number of texts in Arpitan, its great variety and the different competing spellings make it quite difficult to work on. However, it is a language in dire need of modern linguistic resources to help increase its written use, as well as its status. To this end, Apertium is particularly suitable and ORB seems to be the best bet for the standardization of Arpitan.

Resources

I’m not a speaker of Arpitan, but I know French pretty well, and often read in Occitan. ORB Arpitan is very easy to read if one knows these two languages. I have a long-standing friendship with two Arpitan activists (one from Savoy, the other from Switzerland), who are already helping me in this project. I have also contacted Dominique Stich, the creator of ORB, who has given me different resources, answered questions about the language and is willing to continue helping me in the project.

From him I got an electronic version of his French-Arpitan dictionary together with his permission and that of his publisher (one of my friends: the world is small) to publish the content under a GPL licence. The site arpitan.eu is also helpful. Stich also gave me two of his translations into Arpitan (c. 65,000 words), which I began to use as a corpus. I’m also collecting other texts in ORB. The morphology of the (standardised version of the) language is pretty well described in Stich’s works, although there is still often more than one possibility in verb inflection. Clearly, despite Stich's hard work, many decisions will have to be made when generating Arpitan in Apertium, so I’d make maximum use of the corpus and my informants.

Design decisions

Dictionaries will not support non-ORB Arpitan since spelling conventions are quite different and so is inflection.
I plan to use the traditional system of structural transfer in three steps. I consider the new recursive transfer very promising, and I do want to use it in the new version of the French-Catalan pair on which I regularly work, but for such an under-resourced language as Arpitan, I prefer to use a proven and safe technology, which I am familiar with and which I know will work well with two (very) closely related languages.

Workplan

Week	Dates	Goals	Bidix (excluding proper names)	WER	Coverage
Post-application period	10 March - 26 May	Find language resources (Wiktionary et al.); build frequency lists for Italian-Catalan; build frequency lists for Portuguese-Catalan; construct pending tests for the 4 directions; study in more detail "Using weights for ambiguous rules"	Current situation: ~9,000 (cat-ita); ~7,500 (cat-por)	Current situation: ~30% (cat > ita); ~30% (cat > por); ~30% (por > cat)	Current situation: ~88% (cat > ita); ~82% (ita > cat); ~88% (cat > por); ~84% (por > cat)
1	27 May - 2 June	ita > cat: expand bilingual dictionary cat-ita; transfer and lexical selection rules (ita > cat)	~11,000 (cat-ita)		~85.5% (ita > cat)
2	3 June - 9 June	Expand bilingual dictionary cat-ita; transfer and lexical selection rules (ita > cat)	~13,000 (cat-ita)		~87.5% (ita > cat)
3	10 June - 16 June	Expand bilingual dictionary cat-ita; transfer and lexical selection rules (ita > cat)	~14,000 (cat-ita)	<20% (ita > cat)	~89% (ita > cat)
4	17 June - 23 June	cat > ita: expand bilingual dictionary cat-ita; transfer and lexical selection rules (cat > ita); testvoc cat-ita, ita-cat: closed categories	~15,000 (cat-ita)		~90% (cat > ita); ~90% (ita > cat)
5	24 June - 30 June	Expand bilingual dictionary cat-ita; transfer and lexical selection rules (cat > ita); testvoc cat-ita, ita-cat: vblex; first evaluation (28 June)	~16,000 (cat-ita)		~90.5% (cat > ita); ~90.5% (ita > cat)
6	1 July - 7 July	Expand bilingual dictionary cat-ita; transfer and lexical selection rules (cat > ita); testvoc cat-ita, ita-cat: adj, adv, np	~17,000 (cat-ita)		~91% (cat > ita); ~91% (ita > cat)
7	8 July - 14 July	Transfer and lexical selection rules (cat > ita); testvoc cat-ita, ita-cat: n; write documentation	~18,000 (cat-ita)	<15% (cat > ita); <15% (ita > cat)	~91.5% (cat > ita); ~91.5% (ita > cat)
8	15 July - 21 July	por > cat: expand bilingual dictionary cat-por; disambiguation rules (por > cat); work on Portuguese proper names	~9,500 (cat-por)		~87% (por > cat)
9	22 July - 28 July	Expand bilingual dictionary; disambiguation rules (por > cat); work on Portuguese proper names; transfer and lexical selection rules (por > cat); testvoc cat-por, por-cat: np; second evaluation (26 July)	~11,500 (cat-por)		~89% (por > cat)
10	29 July - 4 August	Expand bilingual dictionary; disambiguation rules (por > cat); transfer and lexical selection rules (por > cat)	~13,000 (cat-por)	<20% (por > cat)	~89.5% (por > cat)
11	5 August - 11 August	cat > por: expand bilingual dictionary; transfer and lexical selection rules (cat > por); testvoc cat-por, por-cat: closed categories, vblex	~14,500 (cat-por)		~90% (cat > por); ~90% (por > cat)
12	12 August - 18 August	Expand bilingual dictionary; transfer and lexical selection rules (cat > por); testvoc cat-por, por-cat: adj, adv	~16,000 (cat-por)		~90.5% (cat > por); ~90.5% (por > cat)
13	19 August - 25 August	Expand bilingual dictionary; transfer and lexical selection rules (cat > por); testvoc cat-por, por-cat: n; final evaluation (26 August)	~17,000 (cat-por)	<15% (cat > por); <15% (por > cat)	~91.0% (cat > por); ~91.0% (por > cat)

List your skills and give evidence of your qualifications

I once earned a computer engineering degree (Universitat Politècnica de Catalunya, 1988), but I’ve forgotten almost everything about programming (except vi, regular expressions and writing short Perl scripts). I also earned a BA in linguistics (Universitat de Barcelona, 2008). I’ve been working on Apertium for several years. In 2011, I created the Esperanto-French pair and new releases of the Esperanto-Catalan and Esperanto-Spanish pairs. I’ve also been working on new releases of the French-Catalan pair (2017, 2018, 2019 and 2020).

I mentored and was strongly involved in the GSoC projects on Italian-Sardinian (2016), Catalan-Sardinian (2017) and French-Occitan (2018). In all of them we released new one-way language pairs just after the end of the GSoC. In 2019, I received a GSoC stipend myself and created new releases for the Catalan-Italian and Catalan-Portuguese pairs. These new releases included the direction Catalan > Italian, which was new, and the generation of three “flavours” of Portuguese.

I’m a fluent speaker of French, as well as a fluent reader of most Romance languages, including ORB Arpitan. I’ll have help for the evaluation of the translations into Arpitan, since I don’t speak it and have become familiar with it only in the last few months, so I can see only part of the errors when generating texts in it.

Coding challenge

I have created apertium-frp and apertium-fra-frp. You can test the challenge by cloning and compiling them, then typing the following in apertium-fra-frp:

    cat ../apertium-fra/texts/cuento.fra.txt | apertium -d . fra-frp
    cat ../apertium-frp/texts/cuento.frp.txt | apertium -d . frp-fra

There are still a few unknown words. Almost all are irregular verbs for which I haven’t yet entered the paradigm.
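To see which words are still missing, the translator output can be filtered. This is a minimal sketch, assuming the default behaviour in which the apertium command prefixes each unknown word with an asterisk; `unknown_words` is a hypothetical helper name, not part of Apertium.

```shell
# List unknown words (marked with a leading "*") by frequency.
# Reads translator output on stdin; prints "count<TAB>word", most frequent first.
unknown_words() {
    grep -o '\*[^[:space:]]*' |
        sed -e 's/^\*//' -e 's/[[:punct:]]*$//' -e '/^$/d' |   # drop the marker and trailing punctuation
        sort | uniq -c | sort -rn |
        awk '{ print $1 "\t" $2 }'                              # normalise the uniq -c padding
}
```

With the test text from the challenge, this could be run in apertium-fra-frp as `apertium -d . fra-frp < ../apertium-fra/texts/cuento.fra.txt | unknown_words`.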

List any non-Summer-of-Code plans you have for the Summer

I can guarantee at least 30 hours per week of work during the whole Summer. As I love this kind of work, I'm sure I'll put in considerably more.

I have to submit the last two short assignments for two subjects of the master's degree I am studying on June 1 and 8, but I should be able to finish the last one at the beginning of May. Because of the coronavirus, it is very unlikely that I will travel on holiday with my family for a couple of weeks in July, as I did last year (when I was still able to work 30 hours or more on the project). In the unlikely event that I do not reach 30 hours of work in a given week, I would make up those hours.
