Difference between revisions of "Task ideas for Google Code-in"


Revision as of 14:41, 24 October 2013

This is the task ideas page for Google Code-in (http://www.google-melange.com/gci/homepage/google/gci2013). Here you can find ideas for interesting tasks that will improve your knowledge of Apertium and help you get into the world of open-source development.

The Mentors column lists the people you should contact to request further information. Every task is estimated to take an experienced developer at most 2 hours; however:

  1. this does not include the time taken to install and set up Apertium.
  2. this is the time an experienced developer is expected to take; you may find that you spend more time on the task because of the learning curve.

Categories:

  • code: Tasks related to writing or refactoring code
  • documentation: Tasks related to creating/editing documents and helping others learn more
  • research: Tasks related to community management, outreach/marketing, or studying problems and recommending solutions
  • quality: Tasks related to testing and ensuring code is of high quality.
  • interface: Tasks related to user experience research or user interface design and interaction

Task list

Category Title Description Mentors
code, quality Add 50 words to the vocabulary of a language pair (1) Select a language pair, ideally such that the source language is a language you know (L₂) and the target language a language you use every day (L₁). (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works, and/or get Apertium VirtualBox and update, check out and compile the language pair. (3) Using a large enough corpus of the source language (e.g. plain text taken from Wikipedia, newspapers, literature, etc.), detect unknown words (source words which are not in the dictionaries of the language pair). (4) Add 50 such words (preferably the most frequent unknown words) to the source dictionary, add the correspondence to the bilingual dictionary, and add the word to the target dictionary if not already there. (5) Compile and test again. (6) Submit a patch to your mentor (or commit it if you have already gained developer access). User:Mlforcada (alternative mentors welcome)
code, quality Add/correct one structural transfer rule to an existing language pair (1) Select a language pair, ideally such that the source language is a language you know (L₂) and the target language a language you use every day (L₁). (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works, and/or get Apertium VirtualBox and update, check out and compile the language pair. (3) Using a large enough corpus of the source language (e.g. plain text taken from Wikipedia), detect one structural transfer rule (.t1x, .t2x, .t3x) that is wrong or missing (local agreement in gender, number, etc. is inadequate, local word order in a phrase is inadequate, there is a word too many or a word missing, etc.). (4) Write a new rule or correct the existing rule. (5) Compile and test again. (6) Submit a patch to your mentor (or commit it if you have already gained developer access). User:Mlforcada (alternative mentors welcome)
code, quality Write 10 lexical selection rules for a language pair already set up with lexical selection (1) Select a language pair that is already set up for lexical selection, ideally such that the source language is a language you know (L₂) and the target language a language you use every day (L₁). (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works, and/or get Apertium VirtualBox and update, check out and compile the language pair. (3) Using a large enough corpus of the source language (e.g. plain text taken from Wikipedia, newspapers, literature, etc.), detect cases of inadequate lexical choice, that is, cases where the translation is grammatical but the translation selected for one word is not correct (because the source word is polysemous, that is, has more than one meaning). (4) Add entries to the bilingual dictionary if needed and write 10 lexical selection rules that select the correct translation in the relevant context. (5) Compile and test again. (6) Submit a patch to your mentor (or commit it if you have already gained developer access). User:Mlforcada, User:Francis Tyers (more mentors welcome)
code Set up a language pair to use lexical selection and write 5 rules (1) Select a language pair that is not yet set up for lexical selection, ideally such that the source language is a language you know (L₂) and the target language a language you use every day (L₁). (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works, and/or get Apertium VirtualBox and update, check out and compile the language pair. (3) Using a large enough corpus of the source language (e.g. plain text taken from Wikipedia, newspapers, literature, etc.), detect cases of inadequate lexical choice, that is, cases where the translation is grammatical but the translation selected for one word is not correct (because the source word is polysemous, that is, has more than one meaning). (4) Set up lexical selection for the language pair, add entries to the bilingual dictionary if needed, and write 5 lexical selection rules that select the correct translation in the relevant context. User:Mlforcada, User:Francis Tyers (more mentors welcome)
code - - -
code, quality Write constraint grammar rules to repair part-of-speech tagging errors (1) Select a language pair that already uses constraint grammar for part-of-speech tagging, ideally such that the source language is a language you know (L₂) and the target language a language you use every day (L₁). (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works, and/or get Apertium VirtualBox and update, check out and compile the language pair. (3) Using a large enough corpus of the source language (e.g. plain text taken from Wikipedia, newspapers, literature, etc.), detect part-of-speech tagging errors (the translation is not adequate because the part-of-speech tagger in Apertium has selected the wrong morphological analysis for a word that had more than one). (4) Write 10 constraint grammar rules that select the desired part of speech in the relevant context(s). (5) Compile and test again, possibly after retraining the statistical part-of-speech tagger. (6) Submit a patch to your mentor (or commit it if you have already gained developer access). User:Mlforcada, User:Francis Tyers (more mentors welcome)
code Set up a language pair such that it uses constraint grammar for part-of-speech tagging (1) Select a language pair that does not yet use constraint grammar for part-of-speech tagging, ideally such that the source language is a language you know (L₂) and the target language a language you use every day (L₁). (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works, and/or get Apertium VirtualBox and update, check out and compile the language pair. (3) Using a large enough corpus of the source language (e.g. plain text taken from Wikipedia, newspapers, literature, etc.), detect part-of-speech tagging errors (the translation is not adequate because the part-of-speech tagger in Apertium has selected the wrong morphological analysis for a word that had more than one). (4) Set the language pair up so that it uses constraint grammar (get inspiration from constraint grammar files from other languages) and write 5 constraint grammar rules that select the desired part of speech in the relevant context(s). (5) Compile and test again, possibly after retraining the statistical part-of-speech tagger. (6) Submit a patch to your mentor (or commit it if you have already gained developer access). User:Mlforcada, User:Francis Tyers (more mentors welcome)
code Localised available languages function in apertium-apy Make a new function for apertium-apy: it takes a language code as input and, as output, gives the list of available pairs and their translations in the language specified by the language code. You will probably need to know JavaScript and Python. User:Firespeaker User:Unhammer User:Francis Tyers
code Fix the highlighting in simple-html language selection boxes. The language selection box in the simple-html interface should highlight the supported target languages when a user clicks on a source language. At the moment this does not work properly. For this task you will need to know JavaScript. User:Firespeaker User:Francis Tyers
code Language detection in apertium-apy Make a new function for apertium-apy that allows the language of some input text to be identified. For this task you will also need to train models for the language identifier. User:Firespeaker User:Unhammer User:Francis Tyers
code SSL in apertium-apy Make apertium-apy optionally use SSL. (If you put simple-html on an SSL domain, new browsers won't let you do plaintext/non-SSL AJAX.) User:Firespeaker User:Unhammer User:Francis Tyers
code How much of a given sentence pair is explained by Apertium? Write (in some scripting language of your choice) a command-line program that takes an Apertium language pair, a source-language sentence S, and a target-language sentence T, and outputs the set of pairs of subsegments (s,t) such that s is a subsegment of S, t a subsegment of T and t is the Apertium translation of s or vice-versa (a subsegment is a sequence of whole words). User:Mlforcada
quality Compare Apertium with another MT system and improve it This task aims at improving an Apertium language pair when a web-accessible system exists for it on the net. It works particularly well if the system is (approximately) rule-based, such as Lucy, Reverso, Systran or SDL Free Translation: (1) Install the Apertium language pair, ideally such that the source language is a language you know (L₂) and the target language a language you use every day (L₁). (2) Collect a corpus of text (newspaper, Wikipedia). (3) Segment it into sentences (using e.g. libsegment-java or a similar processor and an SRX segmentation rule file borrowed from e.g. OmegaT) and put each sentence on its own line. (4) Run the corpus through Apertium and through the other system. (5) Select those sentences where both outputs are very similar (e.g. 90% coincident). (6) Decide which one is better. If the other system is better than Apertium, think of what modification could be made to Apertium to produce the same output, and make 3 such modifications. User:Mlforcada (alternative mentors welcome)
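The selection of sentences where both outputs are "very similar" could be sketched as follows. This is only an illustration, not a prescribed method: it assumes both systems' outputs are already available as parallel lists with one sentence per line, it measures similarity word by word with Python's standard difflib, and the names word_similarity and select_similar are hypothetical.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two sentences, compared word by word."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def select_similar(apertium_lines, other_lines, threshold=0.9):
    """Yield (line_number, apertium, other, score) for outputs that are close but not identical.

    Both arguments are assumed to be parallel sequences: item i of each
    is the translation of the same source sentence.
    """
    for number, (ours, theirs) in enumerate(zip(apertium_lines, other_lines), start=1):
        score = word_similarity(ours, theirs)
        # Identical outputs (score 1.0) are skipped: neither system is better there.
        if threshold <= score < 1.0:
            yield number, ours, theirs, score
```

The 0.9 default mirrors the "90% coincident" figure in the task description; a different measure (for example, word error rate) would serve equally well.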

documentation Check that the Apertium guide for Windows users still works We have an Apertium guide for Windows users, to help them install on Windows. Check that it works, and if not, report any bugs you find. User:Francis Tyers
documentation Installation instructions for missing GNU/Linux distributions or versions Adapt installation instructions for a particular GNU/Linux or Unix-like distribution if the existing instructions in the Apertium wiki do not work or have bugs of some kind. Prepare it in your user space in the Apertium wiki. It may be uploaded to the main wiki when approved. User:Mlforcada (alternative mentors welcome)
documentation Installing Apertium in lightweight GNU/Linux distributions Give instructions on how to install Apertium in one of the small or lightweight GNU/Linux distributions such as Damn Small Linux, so that it may be used on older machines. User:Mlforcada (alternative mentors welcome)
documentation - - -
documentation What's difficult about this language pair? For a language pair that is not in trunk or staging and whose two languages you know well, write a document describing the main problems that Apertium developers would encounter when developing that language pair (for that, you need to know very well how Apertium works). Note that there may be two such documents, one for A→B and the other for B→A. Prepare it in your user space in the Apertium wiki. It may be uploaded to the main wiki when approved. User:Mlforcada (alternative mentors welcome)
documentation Video guide to installation Prepare a screencast or video about installing Apertium; make sure it uses a format that may be viewed with Free software. When approved by your mentor, upload it to YouTube, making sure that you use the HTML5 format, which may be viewed by modern browsers without having to use proprietary plugins such as Adobe Flash. User:Mlforcada (alternative mentors welcome)
documentation Apertium in 5 slides Write a 5-slide HTML presentation (needing only a modern browser to be viewed, and ready to be effectively "karaoked" by someone else in 5 minutes or less; you can prove this with a screencast) in the language in which you write most fluently, which describes Apertium, how it works, and what makes it different from other machine translation systems. User:Mlforcada (alternative mentors welcome)
documentation Improved "Become a language-pair developer" document Read the document Become_a_language_pair_developer_for_Apertium and think of ways to improve it (don't do this if you have not done any of the language pair tasks). Send comments to your mentor and/or prepare it in your user space in the Apertium wiki. There will be a chance to change the document later in the Apertium Wiki. User:Mlforcada
documentation An entry test for Apertium Write 20 multiple-choice questions about Apertium. Each question will give 3 options of which only one is true, so that we can build an "Apertium exam" for future GSoC/GCI/developers. Optionally, add an explanation for the correct answer. User:Mlforcada
research Hand annotate 250 words of running text. Use apertium annotatrix to hand-annotate 250 words of running text from Wikipedia for a language of your choice. User:Francis Tyers
research The most frequent Romance-to-Romance transfer rules Study the .t1x transfer rule files of Romance language pairs and distill 5-10 rules that are common to all of them, perhaps by rewriting them into some equivalent form. User:Mlforcada
research Tag and align Macedonian--Bulgarian corpus Take a Macedonian--Bulgarian corpus, for example SETimes, tag it using the apertium-mk-bg pair, and word-align it using GIZA++. User:Francis Tyers
code Write a program to extract Bulgarian inflections Write a program to extract Bulgarian inflection information for nouns from Wiktionary, see Category:Bulgarian nouns User:Francis Tyers
quality - - -
quality Improve the quality of a language pair by adding entries to it Improve the quality of a language pair by (a) running a large amount of representative text through it, (b) determining the 30 most frequent unknown words and (c) adding them to the dictionaries so that they are not unknown anymore User:Mlforcada
quality Improve the quality of a language pair by allowing for alternative translations Improve the quality of a language pair by (a) detecting 5 cases where the (only) translation provided by the bilingual dictionary is not adequate in a given context, (b) adding the lexical selection module to the language, and (c) writing effective lexical selection rules to exploit that context to select a better translation User:Francis Tyers User:Mlforcada
interface Abstract the formatting for the simple-html interface. The simple-html interface should be easily customisable so that people can make it look how they want. The task is to abstract the formatting and make one or more new stylesheets to change the appearance. This is basically making a way of "skinning" the interface. User:Francis Tyers
interface Update the Apertium guide for Windows users with new language pairs Make sure that the Apertium guide for Windows users and the Apertium Windows installer are up to date with all the new language pairs. User:Francis Tyers
code Write a program to extract Faroese inflections Write a program to extract Faroese inflection information for nouns from Wiktionary, see Category:Faroese nouns User:Francis Tyers
code Light Apertium bootable ISO for small machines Using Damn Small Linux or a similar lightweight GNU/Linux, produce the minimum-possible bootable live ISO or live USB image that contains the OS, minimum editing facilities, Apertium, and a language pair of your choice. Make sure no package that is not strictly necessary for Apertium to run is included. User:Mlforcada (alternative mentors welcome)
code Apertium in XLIFF workflows Write a shell script and (if possible, using the filter definition files found in the documentation) a filter that takes an XLIFF file such as the ones representing a computer-aided translation job and populates it with translations of all segments that are not yet translated, marking them clearly as machine-translated. User:Mlforcada (alternative mentors welcome)
testing Examples of minimum files where an Apertium language pair messes up (X)HTML formatting Sometimes, an Apertium language pair takes a valid HTML/XHTML source file but delivers an invalid HTML/XHTML target file, regardless of translation quality. This can usually be blamed on incorrect handling of superblanks in structural transfer rules. The task: (1) select a language pair (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works (3) download a series of HTML/XHTML files for testing purposes. Make sure they are valid using an HTML/XHTML validator (4) translate the valid files with the language pair (5) check if the translated files are also valid HTML/XHTML files; select those that aren't (6) find the first source of non-validity and study it, and strip the source file until you just have a small (valid!) source file with some text around the minimum possible example of problematic tags; save each such file and describe the error. User:Mlforcada (alternative mentors welcome)
code Make sure an Apertium language pair does not mess up (X)HTML formatting (Depends on someone having performed the task 'Examples of files where an Apertium language pair messes up (X)HTML formatting' above). The task: (1) Run the file through Apertium and try to identify where the tags are broken or lost: this is most likely to happen in a structural transfer step; try to identify the rule where the label is broken or lost. (2) Repair the rule: a conservative strategy is to make sure that all superblanks () are output and are in the same order as in the source file. This may involve introducing new simple blanks () and advancing the output of the superblanks coming from the source. (3) Test again. (4) Submit a patch to your mentor (or commit it if you have already gained developer access). User:Mlforcada (alternative mentors welcome)
testing Examples of minimum files where an Apertium language pair messes up wordprocessor (ODT, RTF) formatting Sometimes, an Apertium language pair takes a valid ODT or RTF source file but delivers an invalid ODT or RTF target file, regardless of translation quality. This can usually be blamed on incorrect handling of superblanks in structural transfer rules. The task: (1) select a language pair (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works (3) download a series of ODT or RTF files for testing purposes. Make sure they can be opened using LibreOffice/OpenOffice.org (4) translate the valid files with the language pair (5) check if the translated files are also valid ODT or RTF files; select those that aren't (6) find the first source of non-validity and study it, and strip the source file until you just have a small (valid!) source file with some text around the minimum possible example of problematic tags; save each such file and describe the error. User:Mlforcada (alternative mentors welcome)
code Make sure an Apertium language pair does not mess up wordprocessor (ODT, RTF) formatting (Depends on someone having performed the task 'Examples of files where an Apertium language pair messes up wordprocessor formatting' above). The task: (1) Run the file through Apertium and try to identify where the tags are broken or lost: this is most likely to happen in a structural transfer step; try to identify the rule where the label is broken or lost. (2) Repair the rule: a conservative strategy is to make sure that all superblanks () are output and are in the same order as in the source file. This may involve introducing new simple blanks () and advancing the output of the superblanks coming from the source. (3) Test again. (4) Submit a patch to your mentor (or commit it if you have already gained developer access). User:Mlforcada (alternative mentors welcome)
code Start a language pair involving Interlingua Start a new language pair involving Interlingua using the Apertium new language HOWTO. (Interlingua is the second most used "artificial" language, after Esperanto.) As Interlingua is basically a Romance language, you can use a Romance language as the other language, and Romance-language dictionaries and rules may be easily adapted. Include at least 50 very frequent words (including some grammatical words) and at least one noun-phrase transfer rule in the ia→X direction. User:Mlforcada (will reach out also to the interlingua community)
code Generating 'machine translation memories' Write a shell script and (using the filter definition files found in the documentation) a filter that takes a plain text file, segments it in sentences using the program segment and an SRX specification (which can be borrowed from OmegaT) and writes a TMX file in which each segment is paired with its Apertium translation, ready to be used with OmegaT as a "machine translation memory" User:Mlforcada (alternative mentors welcome)
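For the last task above, the TMX-writing step might look roughly like the sketch below. It is written in Python rather than as a shell script, and it covers only the final step: segmentation with segment and translation with Apertium are assumed to have been done already, so the function simply receives (source, translation) pairs. The function name build_tmx and the header values (tool name, version, administrative language) are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Return a TMX 1.4 document (as a string) with one translation unit per pair.

    pairs: iterable of (source_segment, machine_translation) strings.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "apertium-tmx-sketch",  # placeholder tool name
        "creationtoolversion": "0.1",           # placeholder version
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            # ElementTree escapes &, < and > in the segment text on output.
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

Building the file through an XML library rather than by string concatenation keeps segments containing markup characters from producing an invalid TMX file.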