Difference between revisions of "Ideas for Google Summer of Code/Adopt a language pair"

From Apertium
(Fixed a small typo)
 
(27 intermediate revisions by 6 users not shown)
Line 8: Line 8:
 
# Install Apertium (see [[Minimal installation from SVN]])
 
# Install Apertium (see [[Minimal installation from SVN]])
 
# Go through the [[:Category:HOWTO|HOWTO]]
 
# Go through the [[:Category:HOWTO|HOWTO]]
# Go through the MT course [http://wiki.apertium.eu/index.php/Programme_overview here] (or [[Курсы машинного перевода для языков России/Программа|here]])
+
# Go through the MT course [[Helsinki_Apertium_Workshop/Programme|here]] (or [[Курсы машинного перевода для языков России/Программа|here]])
# Write a translator that translates as much of [http://www.unilang.org/ulrview.php?res=394,387 this story] as possible; minimum one sentence. (Other translations of the story are [https://apertium.svn.sourceforge.net/svnroot/apertium/branches/xupaixkar/rasskaz here].)
+
# Write a translator that translates as much of [https://sourceforge.net/p/apertium/svn/HEAD/tree/branches/xupaixkar/rasskaz/ this story] (or [https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam from here] ) as possible — Minimum one sentence.
 
#* If there is no translation, translate it into the languages of your language pair first.
 
#* If there is no translation, translate it into the languages of your language pair first.
# Upload your work to Apertium [[SVN]].
+
# Upload your work to github (or equivalent).
   
 
If you don't complete it all, don't worry! We take many things into account when assessing your application. However, the URL to any work you do for the coding challenge work should be included in your application.
 
If you don't complete it all, don't worry! We take many things into account when assessing your application. However, the URL to any work you do for the coding challenge work should be included in your application.
Line 19: Line 19:
 
* '''Can I do a pair with language <math>x</math> and language <math>y</math> ? '''
 
* '''Can I do a pair with language <math>x</math> and language <math>y</math> ? '''
 
:&mdash; Yes, there are no restrictions. But you should take the following into consideration: (a) Are there existing machine translation (MT) systems for this pair? (b) If there are existing systems, how good are they? -- Could you do better in three months? (c) How closely related is the pair? (d) How many resources already exist for the pair? (e) Are there any mentors who can evaluate your work?''
 
:&mdash; Yes, there are no restrictions. But you should take the following into consideration: (a) Are there existing machine translation (MT) systems for this pair? (b) If there are existing systems, how good are they? -- Could you do better in three months? (c) How closely related is the pair? (d) How many resources already exist for the pair? (e) Are there any mentors who can evaluate your work?''
  +
:: ''As an example, we're '''very''' unlikely to accept applicants to <code>eng-hin</code>, which (a) has support from existing major systems (b) where the existing systems are much better than you'd get in 3 months and (c) which is not closely related. Too many students have tried this pair in the past, and not gotten anywhere. Fortunately, many students who speak Hindi and English also speak a third language – we're much more likely to support that.''
 
* '''Do I need to have GNU/Linux installed, or can I use another operating system ?'''
 
* '''Do I need to have GNU/Linux installed, or can I use another operating system ?'''
 
:&mdash; In theory you can use any operating system. In practice unless you are using GNU/Linux or Mac/OS you are going to have a hard time as the mentors cannot offer you support with alternative operating systems. You may want to check out [https://www.virtualbox.org/ Virtualbox] if you are using Windows.
 
:&mdash; In theory you can use any operating system. In practice unless you are using GNU/Linux or Mac/OS you are going to have a hard time as the mentors cannot offer you support with alternative operating systems. You may want to check out [https://www.virtualbox.org/ Virtualbox] if you are using Windows.
Line 24: Line 25:
 
:&mdash; For making a language pair, you don't need to know any specific programming language. Knowing a scripting language will be really helpful, but most of the work is done in Apertium's own linguistic formalisms, which are based on XML. To get an idea of what these formalisms look like, you should do the [[HOWTO|new language pair HOWTO]].
 
:&mdash; For making a language pair, you don't need to know any specific programming language. Knowing a scripting language will be really helpful, but most of the work is done in Apertium's own linguistic formalisms, which are based on XML. To get an idea of what these formalisms look like, you should do the [[HOWTO|new language pair HOWTO]].
 
* '''Do I have to know both language <math>x</math> and language <math>y</math> ? '''
 
* '''Do I have to know both language <math>x</math> and language <math>y</math> ? '''
:&mdash; You don't have to be a native speaker of ''both'' languages, but you should be a native speaker of one, and have a good knowledge of the other.
+
:&mdash; You don't have to be a native speaker of the languages, but you should be a fluent speaker of one, and have a good knowledge of the other.
 
* '''Do I have to be able to speak English well?'''
 
* '''Do I have to be able to speak English well?'''
:&mdash; No, it isn't necessary to speak English well &mdash; if you and your mentor can communicate, we are happy. You should have a basic reading level of English and be able to ask questions when you don't understand.
+
:&mdash; No, it isn't necessary to speak English well. If you and your mentor can communicate, then we are happy. However, you should have a basic reading level of English and be able to ask questions when you don't understand.
  +
* '''What are the criteria for a language pair to be considered "finished" for GSOC purposes ?'''
  +
:&mdash; This is something to discuss with your mentor, but a general idea might be that the pair is [[testvoc]] clean, and has a coverage of around 80% or more on a range of free corpora.
  +
* '''If I take on a language pair, do I have to do both directions, e.g. <math>x</math> → <math>y</math> and <math>y</math> → <math>x</math>?'''
  +
:&mdash; This will depend on your language pair, you will need to discuss it with your mentor. As a rough guide, if the pair is for dissemination, e.g. translating between related languages, then both directions will be made. If it is for assimilation, then perhaps only one.
   
 
==Previous GSOC projects==
 
==Previous GSOC projects==
   
And pairs which were adopted in past years:
+
Here is a list of pairs which were adopted in past years. In brackets following the pair is the current status in the repository.
  +
  +
* 2012
  +
** [[Indonesian and Malaysian]] ([[trunk]])
  +
** [[Slovenian and Serbo-Croatian]] ([[trunk]])
  +
** [[Maltese and Arabic]] ([[staging]])
  +
** [[Kazakh and Tatar]] ([[trunk]])
  +
** [[Quechua cuzqueño y castellano]] ([[nursery]])
  +
** [[Turkish and Tatar]] ([[incubator]])
   
 
* 2011
 
* 2011
** [[Serbo-Croatian and Macedonian]]
+
** [[Serbo-Croatian and Macedonian]] ([[trunk]])
** [[Turkish and Azerbaijani]]
+
** [[Turkish and Azerbaijani]]
** [[Turkish and Kyrgyz]]
+
** [[Turkish and Kyrgyz]] ([[staging]])
** [[Maltese and Hebrew]]
+
** [[Maltese and Hebrew]] ([[staging]])
** [[Slovenian and Spanish]]
+
** [[Slovenian and Spanish]] ([[nursery]])
 
* 2010
 
* 2010
** [[Macedonian and Bulgarian]]
+
** [[Macedonian and Bulgarian]] ([[trunk]])
** [[French and Portuguese]]
+
** [[French and Portuguese]] ([[staging]])
** [[North Sámi and Finnish]]
+
** [[North Sámi and Finnish]] ([[nursery]])
 
** [[Afrikaans and Dutch]] (GCI)
 
** [[Afrikaans and Dutch]] (GCI)
 
* 2009
 
* 2009
** [[Swedish and Danish]]
+
** [[Swedish and Danish]] ([[trunk]])
** [[Norwegian Nynorsk and Norwegian Bokmål]]
+
** [[Norwegian Nynorsk and Norwegian Bokmål]] ([[trunk]])
  +
  +
==Some pairs with suggestions==
  +
  +
{|class="wikitable"
  +
|-
  +
|align=center| '''Pair'''||align=center| '''Current state'''||'''What needs to be done'''||align=center| '''Who ?'''<br/><small>(mentors)</small>
  +
|-
  +
|eo-en (trunk) || The Esperanto->English pair was never really done properly. Therefore strange synonyms are selected, like 'novaĵo' (news item) it translated into 'departure' and 'loĝi' (to live) into 'populate'. Moreover the coverage is low because Esperanto has a grammar where you can effortlessly transform any word into any other word class (like making a verb into a noun) and do compound words || It would require a lot of cleanup (coding of scripts), introduce lex-tools scripts, introduce robust compounding, including do some pondering/experimenting on how to handle compounding. The result is expected to be applied on creating eo->es, eo->fr and/or eo->ca directions as well. Experimenting with a better English tagger disambiguation would also be nice if there is time left. || [[User:Jacob Nordfalk|Jacob Nordfalk]]
  +
|-
  +
|}
   
 
==See also==
 
==See also==
   
 
* [[List of language pairs]]
 
* [[List of language pairs]]
  +
* an example work plan for a language pair: http://wiki.apertium.org/wiki/Maltese_and_Arabic/Work_plan
   
 
[[Category:Ideas for Google Summer of Code|Adopt a language pair]]
 
[[Category:Ideas for Google Summer of Code|Adopt a language pair]]

Latest revision as of 15:22, 20 April 2021

This project involves writing linguistic data, including morphological rules and transfer rules, which are specified in a declarative language. A good introduction is the Apertium New Language Pair HOWTO; see also Contributing to an existing pair. If the pair has adequate dictionaries but a poor tagger (disambiguator), a GSoC project might include writing a good Constraint Grammar for the pair.

Coding challenge

The coding challenge for this task is to:

  1. Install Apertium (see Minimal installation from SVN)
  2. Go through the HOWTO
  3. Go through the MT course here (or here, in Russian)
  4. Write a translator that translates as much of this story (or from here) as possible; the minimum is one sentence.
    • If there is no translation, translate it into the languages of your language pair first.
  5. Upload your work to GitHub (or equivalent).

If you don't complete it all, don't worry! We take many things into account when assessing your application. However, you should include in your application the URL of any work you do for the coding challenge.

Frequently asked questions

  • Can I do a pair with language x and language y?
— Yes, there are no restrictions. But you should take the following into consideration: (a) Are there existing machine translation (MT) systems for this pair? (b) If there are existing systems, how good are they, and could you do better in three months? (c) How closely related are the languages in the pair? (d) How many resources already exist for the pair? (e) Are there any mentors who can evaluate your work?
As an example, we're very unlikely to accept applications for eng-hin: (a) the pair is supported by existing major systems, (b) those systems are much better than what you could build in three months, and (c) the languages are not closely related. Too many students have tried this pair in the past without getting anywhere. Fortunately, many students who speak Hindi and English also speak a third language, and we're much more likely to support a pair involving that language.
  • Do I need to have GNU/Linux installed, or can I use another operating system?
— In theory you can use any operating system. In practice, unless you are using GNU/Linux or macOS, you are going to have a hard time, as the mentors cannot offer support for other operating systems. If you are using Windows, you may want to check out VirtualBox.
  • What programming languages do I need to know?
— For making a language pair, you don't need to know any specific programming language. Knowing a scripting language will be really helpful, but most of the work is done in Apertium's own linguistic formalisms, which are based on XML. To get an idea of what these formalisms look like, you should work through the new language pair HOWTO.
  • Do I have to know both language x and language y?
— You don't have to be a native speaker of the languages, but you should be a fluent speaker of one, and have a good knowledge of the other.
  • Do I have to be able to speak English well?
— No, it isn't necessary to speak English well. If you and your mentor can communicate, then we are happy. However, you should have a basic reading level of English and be able to ask questions when you don't understand.
  • What are the criteria for a language pair to be considered "finished" for GSoC purposes?
— This is something to discuss with your mentor, but a general idea might be that the pair is testvoc clean, and has a coverage of around 80% or more on a range of free corpora.
  • If I take on a language pair, do I have to do both directions, e.g. x → y and y → x?
— This depends on your language pair; you will need to discuss it with your mentor. As a rough guide, if the pair is for dissemination, e.g. translating between related languages, then both directions will be made. If it is for assimilation, then perhaps only one.
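
As a rough illustration of the coverage criterion mentioned in the FAQ above, the following sketch computes naive coverage from text that has already been run through a morphological analyser. It assumes the Apertium stream format, in which each token appears as ^surface/analysis$ and the analysis of an unknown token begins with an asterisk. The function name naive_coverage and the sample sentence are made up for illustration, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (assumption: no escaped '^', '/' or '$' characters inside the unit)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An unknown word is output as ^word/*word$, so its analysis starts with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Hypothetical analyser output for a four-token sentence with one unknown word.
sample = "^the/the<det><def><sp>$ ^cat/cat<n><sg>$ ^blorps/*blorps$ ^./.<sent>$"
print(f"{naive_coverage(sample):.0%}")
```

On a real corpus you would feed the analyser's output for the whole text into the function; what counts as an acceptable figure is something to agree with your mentor.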

Previous GSoC projects

Here is a list of pairs which were adopted in past years. In brackets following the pair is the current status in the repository.

  • 2012
    • Indonesian and Malaysian (trunk)
    • Slovenian and Serbo-Croatian (trunk)
    • Maltese and Arabic (staging)
    • Kazakh and Tatar (trunk)
    • Quechua cuzqueño y castellano (nursery)
    • Turkish and Tatar (incubator)
  • 2011
    • Serbo-Croatian and Macedonian (trunk)
    • Turkish and Azerbaijani
    • Turkish and Kyrgyz (staging)
    • Maltese and Hebrew (staging)
    • Slovenian and Spanish (nursery)
  • 2010
    • Macedonian and Bulgarian (trunk)
    • French and Portuguese (staging)
    • North Sámi and Finnish (nursery)
    • Afrikaans and Dutch (GCI)
  • 2009
    • Swedish and Danish (trunk)
    • Norwegian Nynorsk and Norwegian Bokmål (trunk)

Some pairs with suggestions

Pair: eo-en (trunk)
Current state: The Esperanto->English pair was never really done properly. As a result, strange synonyms are selected: for example, 'novaĵo' (news item) is translated as 'departure' and 'loĝi' (to live) as 'populate'. Moreover, coverage is low because Esperanto grammar lets you effortlessly transform any word into any other word class (such as making a verb into a noun) and form compound words.
What needs to be done: This would require a lot of cleanup (writing scripts), introducing lex-tools scripts, and introducing robust compounding, including some experimentation with how to handle compounds. The result is expected to be reused when creating the eo->es, eo->fr and/or eo->ca directions. Experimenting with better English tagger disambiguation would also be worthwhile if time remains.
Who? (mentors): Jacob Nordfalk

See also

  • List of language pairs
  • An example work plan for a language pair: http://wiki.apertium.org/wiki/Maltese_and_Arabic/Work_plan