Difference between revisions of "User:JCentelles/GSoCapplication"

Revision as of 05:14, 3 May 2013

Title

Chinese-to-Spanish Apertium System

General Questions

Why is it you are interested in machine translation?

The quantity of information in our society is increasing very quickly, and most of it is multilingual, so translation plays a very important role in dealing with it. Companies, governments and most people face the challenge of communicating in a language other than their own. So the question is: who is not interested in having access to a great machine translation system? I think developing machine translation systems is a challenge that can have a big impact on society, and I am interested in taking part in it.

Why is it that you are interested in the Apertium project?

I am interested in Apertium mainly because it is a popular, well-known and active open source project, and because it uses a rule-based translation engine that I want to become more familiar with.

Which of the published tasks are you interested in? What do you plan to do?

Introducing Chinese-to-Spanish translation. My work plan is detailed in the project description; here I only summarize the main ideas. I will start by familiarizing myself with the list of GPL resources I provide and incorporating them into the Apertium architecture, in order to analyze Chinese and generate Spanish (Spanish is already in Apertium). Then I will use a statistical system plus post-editing to complete the transfer approach.

Project description

Motivation

Chinese and Spanish are two of the most spoken languages in the world, and there are many economic reasons to pursue translation for this language pair (e.g. many Chinese companies are interested in expanding to Latin America).

Few Chinese-to-Spanish translation systems are available on the web; Google translator is one of them. However, its translation quality is not very good. It seems to produce translations by pivoting through English to compensate for the lack of Chinese-Spanish parallel corpora.

Corpus-based approaches are therefore possible for this language pair, but they do not achieve results as good as those for pairs such as English-Spanish, because little parallel data is available. Furthermore, the differences between the two languages increase the difficulty of translation.

Our expectations of the Chinese-Spanish rule-based system in comparison to other machine translation approaches are the following:

1) It will be able to better manage the difference in morphology between Chinese and Spanish. Chinese is an isolating language, which means that there is a nearly one-to-one correspondence between words and morphemes. Spanish, in contrast, is a fusional language, which means that morphemes are fused within words without clear boundaries. Explicit analysis and generation of morphology may therefore help to bridge Chinese and Spanish.

2) I will test whether word reordering between Chinese and Spanish can benefit from reordering transfer rules.

3) The rule-based system can exploit the linguistic tools that are available separately for Chinese and Spanish.

Description

We will introduce Chinese-to-Spanish translation into the Apertium open source project. We will do this integration using the GPL tools available for the two languages. These tools will be used for the analysis and generation steps. The resources we have include parallel corpora (from which we can extract bilingual dictionaries in case we do not find dictionaries), and a statistical POS tagger and segmenter (from Stanford, for Chinese; see the references in the Apertium questions).

For the transfer phase, we propose to generate rules automatically from a parallel corpus using [3]. We will validate them experimentally and post-edit them before introducing them into the transfer-based approach. We already have a baseline phrase-based system. As parallel corpora, we will use the United Nations corpus and the Holy Bible corpus [2], in addition to a parallel corpus created automatically from the EPPS (which contains 250 thousand sentences). Further corpora may include some manuals provided by the TAUS company (we still have to ask whether this corpus can be used for this project). Finally, we will make this new language pair available online.

WHO ARE THE BENEFICIARIES AND HOW WILL THEY BENEFIT?

We are focusing on a first attempt at Chinese-to-Spanish rule-based translation. As the translator will be available on the web, tourists traveling across China and Latin America can use it. Spanish-speaking students of Chinese (or the other way round) can use the translator to test their knowledge.

Work Plan

[Figure: grantt.jpg]

Coding Challenge

For the coding challenge I have built a small rule-based machine translation system (based on Apertium) from Chinese to Spanish. The analysis part consists of a Chinese segmenter and POS tagger, both open source. I have written some transfer rules and a small bilingual dictionary. Finally, for the generation step, I have used the Spanish dictionary from the Apertium English-Spanish pair.

The following command lines translate some Chinese sentences:

$ echo "^我<PN>$ ^买<VV>$ ^一个<CD>$ ^车<NN>$ " | apertium-transfer apertium-zh-es.zh-es.t1x zh-es.t1x.bin zh-es.autobil.bin

^prpers<prn><tn><p1><mf><sg>$ ^comprar<vblex><pri><p1><sg>$ ^un<num><m><sg>$ ^coche<n><m><sg>$

$ echo "^我<PN>$ ^买<VV>$ ^一辆<CD>$ ^车<NN>$ " | apertium-transfer apertium-zh-es.zh-es.t1x zh-es.t1x.bin zh-es.autobil.bin | lt-proc -g zh-es.autogen.bin

yo compro un coche

$ echo "^我<PN>$ ^买<VV>$ ^一个<CD>$ ^苹果<NN>$ " | apertium-transfer apertium-zh-es.zh-es.t1x zh-es.t1x.bin zh-es.autobil.bin | lt-proc -g zh-es.autogen.bin

yo compro una manzana

$ echo "^我们<PN>$ ^买<VV>$ ^一个<CD>$ ^苹果<NN>$ " | apertium-transfer apertium-zh-es.zh-es.t1x zh-es.t1x.bin zh-es.autobil.bin | lt-proc -g zh-es.autogen.bin

nosotros compramos una manzana

The bilingual dictionary and the transfer rules are in: https://github.com/jCentelles/Apertium-zh-es

Note: the Chinese segmenter and POS tagger will be integrated into Apertium during the development of the project.
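
Until that integration exists, the tagger output has to be converted by hand into the stream format used in the commands above. The following is only a minimal sketch: it assumes the tagger prints space-separated `word_TAG` tokens (an assumption about its output format), and `to_stream` is a hypothetical helper name, not part of Apertium or the Stanford tools.

```shell
#!/bin/sh
# Hypothetical helper: turn "word_TAG" tokens (assumed tagger output format)
# into Apertium stream-format lexical units of the form "^word<TAG>$".
to_stream() {
    # Split every token at its underscore; neither half may contain
    # a space or an underscore.
    printf '%s\n' "$1" | sed -E 's/([^ _]+)_([^ _]+)/^\1<\2>$/g'
}

to_stream '我_PN 买_VV 一个_CD 苹果_NN'
# prints: ^我<PN>$ ^买<VV>$ ^一个<CD>$ ^苹果<NN>$
```

The resulting line could then be piped into apertium-transfer and lt-proc -g exactly as in the commands above.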

Resources

We will use the following resources:

(1) Parallel corpora: Bible corpus, UN corpus [2] and an automatically translated EPPS corpus (no reference for this, but it contains 250k parallel sentences).

(2) Stanford segmenter (http://nlp.stanford.edu/software/segmenter.shtml)

(3) Stanford POS tagger (http://nlp.stanford.edu/software/tagger.shtm)

(4) For Spanish, we will use resources from Apertium.

(5) Transfer-based rules extractor [3].

Apertium Questions

Can I do a pair with language x and language y ?

Yes, there are no restrictions. But you should take the following into consideration:

(a) Are there existing machine translation (MT) systems for this pair? Yes, there are. The most popular are the Google translator (statistical), Yahoo Babelfish (rule-based) and Microsoft Bing (hybrid).

(b) If there are existing systems, how good are they? Could you do better in three months? Although the existing machine translation systems are well known, their Chinese-to-Spanish translations are quite poor. I may not be able to build a better translator in three months, but I can establish a consistent Apertium baseline for a (Simplified) Chinese-to-Spanish rule-based translation system.

(c) How closely related is the pair? In terms of morphology and syntax, the two languages are quite distant.

(d) How many resources already exist for the pair? We have many Chinese and Spanish corpora at our disposal, see [2]. Additional corpora may be added if we get authorization, such as the ones provided by TAUS (more than 3 million sentences). Furthermore, we also have Chinese and Spanish segmenters and POS taggers (Stanford for Chinese [1]). We have other resources at http://www.mandarintools.com/. The project will be based on results from [3].

(e) Are there any mentors who can evaluate your work? Francis Tyers and Felipe Sánchez-Martínez. In addition, Marta Ruiz Costa-jussà will support me with the resources and the statistical part of the work.

References

[1] Pi-Chuan Chang, Huihsin Tseng, Dan Jurafsky, and Christopher D. Manning. 2009. Discriminative Reordering with Chinese Grammatical Relations Features. In Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation.

[2] Marta R. Costa-jussà, Carlos Henríquez, and Rafael E. Banchs. 2012. Evaluating Indirect Strategies for Chinese-Spanish Statistical Machine Translation. Journal of Artificial Intelligence Research (JAIR), vol. 45, pages 761-780.

[3] Felipe Sánchez-Martínez. 2008. Using unsupervised corpus-based methods to build rule-based machine translation systems. PhD thesis, Universitat d’Alacant.

About me

Email: Jordi.Centelles.Sabater@gmail.com

IRC nick: jCentelles

Personal skills

I am a Telecommunication engineer from the UPC (Universitat Politècnica de Catalunya) and I am currently studying economics at the UB (Universitat de Barcelona).

Programming skills: Knowledge of C++, Python, Perl, Java, PHP and XML. Also knowledge of statistical machine translation (Moses) and of building apps and websites (HTML and JavaScript).

Language skills: Native Spanish speaker. Very basic knowledge of Chinese, but I will develop most of the project in Singapore, where I have access to Chinese-speaking colleagues.

Open Source projects

I have experience working with Moses. I have been in charge of programming the Spanish-Chinese web translator and its applications for Android and Apple's iOS for the Chinese company Baidu and the Institute for Infocomm Research in Singapore.

List any non-Summer-of-Code plans you have for the Summer

I am working on the final project for my bachelor's degree in Telecommunications at the Institute for Infocomm Research in Singapore. The goal of my project is to work on Chinese-Spanish machine translation. I will be dedicated full time to the Google Summer of Code, and it will be a big part of my final project.

