Difference between revisions of "Chinese and Spanish"

From Apertium
==Segmenters==


{|class=wikitable
! Name !! Performance
|-
| LRLM ||
|-
| Optimal coverage ||
|-
| zhseg ||
|-
| Stanford ||
|}


==Work plan==


{|class=wikitable
|-
! Week !! Dates !! Trimmed coverage !! Coverage achieved !! Testvoc !! Evaluation !! Notes !! Results achieved
|-
| 0 || 21/05&mdash;16/06 || 45% || ?|| || 500 words || '''Preliminary evaluation'''. Translate the story with total coverage and without diagnostics. Get a baseline WER. Create <code>zho.dix</code> by: (a) extracting word + POS from Wiktionary. Test and evaluate segmentation strategies and produce a report. || WER:&nbsp;85.55%,<br/>BLEU:&nbsp;0.1184,<br/>Cov:&nbsp;?
|-
| 1 || 17/06&mdash;23/06 || 50% || ?|| {{tag|num}} || - || Numerals should be added and testvoc clean. ||
|-
| 2 || 24/06&mdash;30/06 || 53% || ?|| {{tag|cnjcoo}} {{tag|cnjadv}} {{tag|cnjsub}} || - || ||
|-
| 3 || 01/07&mdash;07/07 || 59% |||| {{tag|adv}} || 200 words || ||
|-
| 4 || 08/07&mdash;14/07 || 63% || || {{tag|prn}} {{tag|det}} || - || ||
|-
| 5 || 15/07&mdash;21/07 || 68% |||| {{tag|adj}} || - || ||
|-
| 6 || 22/07&mdash;28/07 || 70% |||| {{tag|n}} || 500 words || '''Midterm evaluation'''.||
|-
| 7 || 29/07&mdash;04/08 || 73% |||| - || - || ||
|-
| 8 || 05/08&mdash;11/08 || 75% |||| - || - || ||
|-
| 9 || 12/08&mdash;18/08 || 77% || || - || 200 words || ||
|-
| 10 || 19/08&mdash;25/08 || 80% |||| {{tag|vblex}} || - || ||
|-
| 11 || 26/08&mdash;01/09 || 82% |||| - || - || ||
|-
| 12 || 02/09&mdash;08/09 || 83% |||| - || - || ||
|-
| 13 || 09/09&mdash;15/09 || 85% |||| ''all categories clean'' || 500 words || '''Final evaluation'''. Tidying up, releasing ||
|-
|}
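
The WER figures in the plan above refer to word error rate: the word-level edit distance between the system output and a reference translation, divided by the length of the reference. The following is a minimal sketch of that definition, assuming whitespace tokenization; the function name and the example sentences are illustrative and are not taken from the project's evaluation scripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from output
                            curr[j - 1] + 1,      # extra word in output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical reference sentence, for illustration only.
print(word_error_rate("ella tiene un gato", "ella tiene sólo gato"))
```

Because the denominator is the reference length, WER can exceed 100% when the output contains many extra words.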

==See also==

* [[/Pending tests|Pending tests]]

==Zho-Spa Report==


==Description==
The goal of this project was to implement a Chinese-to-Spanish translator using the Apertium framework. The grammatical and morphological differences between the two languages make this pair highly challenging for machine translation; even so, and especially given the limited time available, the translation output is quite understandable.
The output is not very fluent. However, the translator achieves high coverage when tested on Chinese texts from different domains, which is already a great success.
We have developed the analysis dictionary, the bilingual dictionary, and the three levels of transfer rules.

Below is a Chinese story, followed by the translation the system produces for it:
<pre>
小明和小红在花园里面。 今天天气好,很暖。 不过昨天好冷哦!他们不能在外面玩。 小明和小红很喜欢玩耍, 他们常常在大屋子前的花园一起玩耍。
小明是一个小男孩。他今年六岁。 小女孩是他的妹妹。她今年五岁。 小明有一只狗,那只狗现在在花园里面。 小狗很喜欢跟小明和小红玩。狗儿现在很开心。
小红有狗吗? 不,她没有狗,她有只猫。 可是,猫儿在屋子里面睡觉。
他们的妈妈和猫儿在屋子里面。 她从窗口看到小明和小红玩耍。 小明赶快跑到大老树后面,因为他不想小红看到他。你知道为什么吗? 小红坐了下来,手都摆在眼睛前面。 她看不到,正在倒数。 为什么呢? 小明又在树附近做什么?

Jaime y María en jardín interior . hoy clima bien ,#muy cálido . solamente ayer bien frío! ellos no pueden en salida jugar . Jaime y María #muy gusta jugar , ellos #partes en casa gran #ex de jardín juntos juega .
Jaime es un niño .él ahora #seis #año . niño pequeño es hermana de él .ella ahora #cinco #año . Jaime tiene #uno sólo perro ,@那 sólo perro ahora en jardín interior . perro pequeño #muy gusta jugar con Jaime y María .perro ahora #muy #feliz .
María tiene perro? no ,ella no perro ,ella tiene sólo gato . 可es ,gato en casa interior duerme .
#suyo madre y gato en casa interior . ella desde ventana ve Jaime y María jugar . Jaime corre rápidamente detrás a árbol gran y viejo ,respuesta a él no desea María ve él .tú sabes por qué? María sienta abajo ,mano #ambos pone en frente a ojo . ella ve nada ,#marcha cuenta . por qué? Jaime además en árbol cerca hace qué ?
</pre>

===Dictionaries===

The analysis and bilingual dictionaries were developed in two ways. At the beginning we used Chinese and Spanish corpora to obtain a large number of Chinese-Spanish word pairs. Using the Stanford Segmenter and Giza++ we built a first version of the bilingual dictionary. We then extracted only the Chinese words and their tags to obtain the analysis dictionary.
Although this was a good method, its limitations made it necessary to start developing both dictionaries manually. With the help of a very complete website (http://www.yellowbridge.com), we started adding new words and their Spanish translations to the dictionary. In the end, we added 5,000 words manually, and with both methods combined we have nearly 9,000 words.
For the generation dictionary we used Apertium's en-es dictionary.
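
As a rough illustration of how word pairs from the two sources could be combined into bilingual dictionary entries, here is a minimal sketch. The helper names, the example pairs, and the rule that manual entries override automatically aligned ones are assumptions made for this example; they do not describe the scripts actually used in the project. Lemmas are assumed to be single words.

```python
from xml.sax.saxutils import escape


def bidix_entry(source: str, target: str, tag: str) -> str:
    """Format one word pair as an Apertium bilingual dictionary <e> element."""
    return ('<e><p><l>{0}<s n="{2}"/></l><r>{1}<s n="{2}"/></r></p></e>'
            .format(escape(source), escape(target), tag))


def merge_pairs(aligned, manual):
    """Merge (source, target, tag) triples from both sources.

    Assumption: a manual entry replaces an automatically aligned
    entry that has the same source word and tag.
    """
    merged = {}
    for source, target, tag in aligned:
        merged[(source, tag)] = target
    for source, target, tag in manual:
        merged[(source, tag)] = target
    return [(source, target, tag) for (source, tag), target in merged.items()]


# Hypothetical word pairs, for illustration only.
aligned_pairs = [("狗", "can", "n")]
manual_pairs = [("狗", "perro", "n"), ("猫", "gato", "n")]

for pair in merge_pairs(aligned_pairs, manual_pairs):
    print(bidix_entry(*pair))
```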

===Transfer rules===
It is not easy to develop rules for this language pair because of the grammatical differences between the two languages. Despite this, we created 28 rules for the first level, 34 for the t2x level and one for the t3x level.
We have also been experimenting with software that automatically extracts transfer rules from a parallel corpus. However, the automatic rules alone do not achieve better performance than the manual rules. Another option is to combine manual and automatic rules.
This work is ongoing, as is the introduction of lexical rules to resolve lexical ambiguities.
===Segmenter===
One of the main characteristics of written Chinese is that the whole sentence is written together, with no separation between words. For this reason, we considered it desirable to implement a segmenter in the Apertium framework. However, we confirmed that the WER obtained using a Chinese segmenter was quite similar to the WER obtained using only the analysis dictionary (LRLM, left-to-right longest match) as a segmenter.
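
The left-to-right longest-match strategy can be sketched as follows. This is a simplified illustration over a plain set of words, not Apertium's own implementation, which works on the compiled analysis dictionary; the function name and the small lexicon are hypothetical.

```python
def lrlm_segment(text: str, lexicon: set) -> list:
    """Segment text greedily, always taking the longest lexicon match."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                break
        else:
            # No entry matches: emit a single character as an unknown token.
            candidate = text[i]
        tokens.append(candidate)
        i += len(candidate)
    return tokens


# Hypothetical lexicon, for illustration only.
lexicon = {"小明", "有", "一", "只", "一只", "狗"}
print(lrlm_segment("小明有一只狗", lexicon))
# → ['小明', '有', '一只', '狗']
```

Because the match is greedy, the segmentation depends entirely on which words are in the dictionary: with 一只 present it is preferred over 一 followed by 只.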


==Acknowledgements==
First of all, I would like to thank Marta and Gema for their technical and motivational support. I also want to thank Francis because, even though he was not my tutor, without his help and patience this project would not be what it is. Last but not least, I would like to thank Apertium and Google for believing in this project.




[[Category:Chinese and Spanish|*]]

Latest revision as of 21:33, 30 September 2013
