Difference between revisions of "User:Capsot/Proposal oci-fra/fra-oci Translator"

Latest revision as of 20:05, 26 March 2018

Contact info

Name/Nom: Claudi Balaguer

Location: Millars (Northern Catalonia), France

IRC: capsot

E-mail: ratapenada@yahoo.com

Github: http://github.com/capsot

Timezone: UTC+1

Possible Mentor: Hèctor Alòs

Skills and experience

My native languages are Catalan and French but I also master Occitan and Spanish to a high professional level. I have a very good command of Italian and English too. Furthermore I can understand and speak some basic Ukrainian.

Besides my varied teaching experience (high school and university) and translation experience (many translations from Catalan or French into Occitan, for instance), I am a linguist interested in many languages, especially the Romance family, and I specialize in Occitan and Catalan dialectology. I coauthored a Catalan-Occitan/Occitan-Catalan dictionary with Patrici Pojada in 2005.

I expect to complete my thesis in Catalan and Occitan dialectology, which is already close to completion, in 2018. I have been a member of the Acadèmia Aranesa dera Lengua Occitana of the Aran Valley (Val d’Aran, Catalonia) since 2016, and I was previously a member of the Grop de Lingüistica Occitana, which asked me to compile an Occitan lexicon of new-technology terms with TERMCAT.

Moreover, I previously contributed to the Aranese Comission deth Traductor (2008), which helped shape the word stock included in Gema Ramírez and Carme Armentano’s Apertium Occitan translator.

Although I had no real previous coding experience, I have made many contributions to the Occitan and Catalan Wikipedias, so I already had some knowledge of wiki syntax. Over the last few weeks, I have learned the basic commands while working on the oci-fra file with Hèctor Alòs.

Why is it you are interested in machine translation?

I have been a Wikipedia editor (mainly on the Occitan and Catalan versions) for a long time and have witnessed how machine translation helps expand the content of the Catalan Viquipèdia, which has very good translation tools. Automated translation can thus lend a helping hand in bringing in articles from other Wikipedias, and can save small communities like the Occitan one a great deal of time and energy.

Why is it that you are interested in Apertium?

I first came across the Apertium translation project many years ago, while collaborating as a linguist on the first Occitan translation tools in the Val d’Aran, where work was then under way on an Occitan translator using two linguistic varieties (standard Occitan and Aranese). The Apertium community already seems to have many good translation tools; people there share a genuine interest in all languages and treat each of them as equal, with no real hierarchy between dominant and minoritized ones, which I particularly appreciate. The collaborative atmosphere is also really pleasant; many people have helped me with technical issues quickly and kindly.

I hope I can contribute to and enrich the projects of the Apertium community with my knowledge and command of languages.

Which of the published tasks are you interested in? What do you plan to do?

I am interested in adding a new language pair, namely Occitan-French; the two languages are closely related, since both belong to the Romance family.

Translation to and from Occitan has already been implemented for Catalan and Spanish, and the monolingual Occitan dictionary contains a sound and interesting word stock. Over the last few weeks I have explored what is available in the files related to Occitan translation, and I have also worked on the bilingual Occitan-French file.

Since the French-Catalan pair is the closest to the French-Occitan one (because of the great proximity between Catalan and Occitan), Hèctor and I have run some analyses on it. French to Catalan has a position-independent error rate (PER) of c. 9%, while Catalan to French reaches 32%. The fra-cat pair uses single-level transfer, which appears to be insufficient, particularly from Catalan to French (where subject pronouns, many articles, etc. have to be generated). It therefore seems better to use a three-level transfer strategy in the French-Occitan pair, even if this will make it somewhat harder to reuse the rules of the French-Catalan pair.
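To make the error figures above more concrete, here is a minimal Python sketch of how a word error rate (WER) and a position-independent error rate (PER) can be computed for one sentence against a post-edited reference. It uses one common formulation of each metric; the function names and the toy sentences are illustrative assumptions, and this is not the evaluation script actually used for the fra-cat figures.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def position_independent_error_rate(reference: str, hypothesis: str) -> float:
    """Like WER, but word order is ignored (bag-of-words comparison)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Size of the multiset intersection: words present in both sentences.
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - overlap) / len(ref)


if __name__ == "__main__":
    reference = "a b c d e"
    reordered = "a b d e c"  # same words, one of them moved
    print(word_error_rate(reference, reordered))
    print(position_independent_error_rate(reference, reordered))
```

A sentence whose only defect is word order is penalized by WER but not by PER, which is why comparing the two gives a rough idea of how much of the error comes from reordering (transfer rules) rather than from vocabulary.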

My chief priority will be translating from French to Occitan in order to allow the production of texts in Occitan.

Regarding vocabulary size, the bilingual dictionary currently has about 5,700 entries, probably extracted by crossing Apertium dictionaries. The French Wiktionary (Wiktionnaire) contains several thousand translations from French to Occitan. They are generally in standard Occitan; nonetheless, it is essential to review each and every translation. Even so, reviewing them will prove faster than translating every word myself. The next step will be to translate, on my own, words taken from a list of missing items ranked in decreasing order of frequency (first in the French to Occitan direction, and afterwards from Occitan to French with a second list).
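The list of missing items in decreasing order of frequency could be produced along the following lines. This is only a sketch: it assumes that the translator's output marks unknown words with a leading asterisk, and the function name and the sample string are hypothetical.

```python
import re
from collections import Counter

# Assumption: an unknown word appears in the output as "*word".
UNKNOWN_WORD = re.compile(r"\*(\w+)")


def rank_unknown_words(translated_text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs for unknown words, most frequent first."""
    counts = Counter(
        match.group(1).lower() for match in UNKNOWN_WORD.finditer(translated_text)
    )
    # Sort by decreasing count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = "*chèvre mange *herbe et *Chèvre dort"
    for word, count in rank_unknown_words(sample):
        print(count, word)
```

Working through such a list from the top means that each new bilingual entry covers as many running words of the corpus as possible.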

Why Google and Apertium should sponsor it? A description of how and who it will benefit in society

Even though Occitan possesses a rich literature and was one of the most flourishing languages of the Middle Ages, it is now a minoritized language, and even an endangered one. Occitan is the autochthonous language of most of the southern part of France, and is also spoken in Italy, in the Piedmontese area (Valadas Occitanas) and in an enclave in the south (Guardia Piemontese), as well as in a Pyrenean valley of Catalonia (Val d’Aran).

Since the beginning of the 20th century it has suffered a heavy loss of speakers, and nowadays it is hardly understood by the vast majority of the inhabitants of its native area: the estimated number of speakers is currently well under 1 million.

In spite of its brilliant past, Occitan sorely lacks the means to foster its transmission and preservation. Bilingual dictionaries are few and mainly address French-Occitan/Occitan-French translation. So far, there is no real encyclopaedia in Occitan. The closest thing to such a project is the Occitan Wikipedia.

However, the Occitan Wikipedia has so far not been able to attract, or retain, potential editors, and has even met with many technical issues that have hampered its growth. Tools that help translate content from the other Wikipedias would certainly bring much improvement and encourage people to contribute and create more articles.

The development of the Apertium automatic translators between Occitan and Catalan or Spanish can be seen as the first step in this opening of Occitan to other languages and cultures. The second phase of the dissemination, internationalization and, let’s hope, revival of the language of the Troubadours could now be carried forward with Google and Apertium’s help.

Non-Summer-of-Code plans you have for the Summer

I will be fully available during the summer, since my vacation begins on July 8th, and I can work on the project more than 30 hours a week if necessary. Although I will be working in June and early July, my rather light work schedule will allow me to spend between 25 and 30 hours a week on the translation project. I also plan to start working before the official period, to make sure the schedule is met.

Work schedule

  • Note: The French → Occitan part of the project is the main direction.

Community bonding period (and before)

  • Improve my knowledge and understanding of Apertium (chiefly the lexical selection and transfer rules).
  • Think about ways to improve the monolingual dictionary so it can be efficient in its translations to standard Occitan, and at the same time flexible enough to accept dialectal variation.
  • Create a list of pending tests.

Work plan

{|class=wikitable
! Week !! Dates !! Description !! Expected bidix size<br/>(excluding np) !! Expected coverage (%) !! Expected WER (%) !! Testvoc
|-
| 0 || <b>French > Occitan</b> || || ~5,700 || || ||
|-
| 1 || 14–20 May || Improving Occitan monodix<br/>Adding prn, pr, cnj*, basic adv to bidix || ~6,000 || ~84.0% || ||
|-
| 2 || 21–27 May || Adding n, adj, adv to the bidix from the French Wiktionary || ~12,000 || ~86.0% || ||
|-
| 3 || 28 May–3 June || Adding vblex to the bidix from the French Wiktionary<br/>Beginning to add missing words in decreasing order of frequency fra > oci || ~14,000 || ~88.0% || ||
|-
| 4 || 4–10 June || Adding words<br/>Transfer rules fra > oci || ~16,000 || ~89.0% || ||
|-
| 5 || <b>11–17 June<br/>Deliverable #1:<br/>French to Occitan translator</b> || Adding words<br/>Transfer rules fra > oci || <b>~18,000</b> || <b>~89.5%</b> || <b>~25%</b> ||
|-
| 6 || 18–24 June || Adding words<br/>Transfer rules fra > oci<br/>Begin testvoc fra > oci || ~20,000 || ~90.0% || || pr, cnj*, adv, prn, det
|-
| 7 || 25 June–1 July || Adding words<br/>Transfer rules fra > oci<br/>Testvoc fra > oci || ~21,000 || ~90.5% || || vblex
|-
| 8 || 2–8 July || Adding words<br/>Transfer rules fra > oci<br/>Testvoc fra > oci || ~22,000 || ~91.0% || || adj
|-
| 9 || <b>9–15 July<br/>Deliverable #2:<br/>French to Occitan translator</b> || Transfer rules fra > oci<br/>Testvoc fra > oci || <b>~22,000</b> || <b>~91.0%</b> || <b>~15%</b> || n
|-
| 0 || <b>Occitan > French</b> || || ~22,000 || || ||
|-
| 10 || 16–22 July || Adding missing words in decreasing order of frequency oci > fra<br/>Transfer rules oci > fra<br/>Testvoc oci > fra || ~22,500 || ~88.0% || || pr, cnj*, adv, prn, det
|-
| 11 || 23–29 July || Adding words<br/>Transfer rules oci > fra<br/>Testvoc oci > fra || ~23,000 || ~89.0% || || n, adj
|-
| 12 || 30 July–5 August || Adding words<br/>Transfer rules oci > fra<br/>Testvoc oci > fra || ~23,500 || ~90.0% || || vblex
|-
| 13* || 6–9 August || Final improvements || || || ||
|-
| 13** || <b>10–14 August<br/>Deliverable #3:<br/>Occitan to French translator</b> || Final evaluation || <b>~23,500</b> || <b>~90.0%</b> || <b>~30%</b> ||
|}
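The coverage targets in the work plan can be tracked with a simple measurement. The sketch below assumes that "coverage" means naive coverage, that is, the percentage of running words in a corpus for which the dictionaries have at least one entry; the function name, the tokenization by whitespace and the sample data are illustrative assumptions.

```python
def naive_coverage(tokens: list[str], known_forms: set[str]) -> float:
    """Percentage of corpus tokens found among the known surface forms."""
    if not tokens:
        raise ValueError("the corpus must contain at least one token")
    known = sum(1 for token in tokens if token.lower() in known_forms)
    return 100.0 * known / len(tokens)


if __name__ == "__main__":
    # Hypothetical mini-corpus and dictionary contents.
    corpus = "la chèvre mange la herbe".split()
    dictionary_forms = {"la", "mange"}
    print(f"{naive_coverage(corpus, dictionary_forms):.1f}%")
```

Because frequent words dominate a corpus, coverage rises quickly at first and then flattens, which is consistent with the plan's small weekly coverage gains despite thousands of new entries.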

Coding Challenge

As I said before, I have already begun studying the files and how everything works. I have worked on the apertium-oci-fra file and made significant changes to it. My potential mentor Hèctor Alòs has been a great guide and supervisor. I have learnt much of the syntax used and how it works, and we even worked through many technical problems together. I also received much-appreciated help from Shardul Chiplunkar (shardulc; धन्यवाद), Jacob Nordfalk (JacobEo), Tino Didriksen and Ilnar Salimzianov (selimcan; Räxmät!) from the Apertium community. I think I have finally managed to get a decent grasp of many of the commands and much of the syntax, even though I guess much more remains to be learned!

As mentioned previously, I have worked mostly on the oci-fra file, trying to understand how things work, and then added many words to fill the gaps that the translation of the James and Mary text revealed at first. It is not finished yet, but it looks much better, and even though some sentences still cause trouble I am confident it will be completed soon.

Currently, the translator can generate Aranese and standard Occitan translations.

However, most of the problems encountered so far have come from trying to find the best reference forms to use in the bilingual file and to sort all this out; since Occitan has a great deal of dialectal variation, it is difficult to make sure you choose the right word. One of the main objectives will be to improve and expand the monolingual Occitan dictionary so that it can produce texts in standard Occitan, without mixing solutions from different Occitan dialects. At the same time, it will have to be flexible enough to accept other varieties and even, later on, to produce translations in various dialects. A short glimpse of part of the work done so far: https://github.com/Capsot/apertium-oci-fra/commits?author=Capsot