Task ideas for Google Code-in/Add words
Latest revision as of 05:34, 17 December 2015
- Select a language pair, ideally one whose source language is a language you know (L₂) and whose target language is a language you use every day (L₁).
- Install Apertium locally from the Subversion repository, install the language pair, and make sure that it works; and/or get the Apertium VirtualBox (http://wiki.apertium.org/wiki/Apertium_VirtualBox), update it, then check out and compile the language pair.
- Using a sufficiently large corpus of representative text in the source language (e.g. plain text taken from Wikipedia, newspapers or literature), find the 50 most frequent unknown words (source words that are not in the dictionaries of the language pair).
- Add these words to the source dictionary (so that they are no longer unknown), add the correspondence to the bilingual dictionary, and add the word to the target dictionary if it is not already there.
- Compile and test again.
- Submit a patch to your mentor (or commit it yourself if you have already gained developer access).
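For the "compile and test again" step, one possible check is to re-translate some text and see which of the words you meant to add are still marked as unknown. The following is only a minimal sketch: `still_unknown` and its word-list argument are hypothetical names, not part of Apertium, and it assumes that unknown words appear in the output with a leading `*`.

```shell
# Hypothetical helper, not part of Apertium.
# $1:    file listing previously unknown words, one per line, each written
#        with its leading * and optionally preceded by a count
# stdin: freshly translated text
# Prints every listed word that is still marked as unknown.
still_unknown() {
  wordlist=$1
  tr -s '[:space:]' '\n' | grep '^\*' | sort -u |
    while IFS= read -r word; do
      # The last field of each word-list line is the word itself.
      if awk '{print $NF}' "$wordlist" | grep -qFx -- "$word"; then
        printf '%s\n' "$word"
      fi
    done
}

# Possible usage, assuming the pair is compiled in the current directory
# and "hitlist" holds the words you worked on:
#   zcat corpus.txt.gz | apertium -d . ron-fra | still_unknown hitlist
```

An empty result suggests that all the listed words are now recognised in the text you translated; it does not show that the translations themselves are correct, so reading some output is still worthwhile.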
How to find the most frequent unknowns
<Unhammer> translate your corpus, make it one word per line, grab only the ones with * at the start, sort, count number of hits per word, sort again
<Unhammer> e.g. [11:25]
<Unhammer> zcat corpus.txt.gz | apertium -d . ron-fra | tr ' ' '\n' | grep '^\*' | sort | uniq -c | sort -rn > hitlist
<asusAndrei> awesome! [11:26]
<Unhammer> hitlist will be unknowns sorted by frequency, but you might have to skip a couple that are "strange" or difficult to add
<Unhammer> and that's ok, as long as you start from the most frequent and work your way down
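The counting steps described in the chat can be wrapped in a small shell function so that they are easy to reuse and to try on a short sample first. This is a sketch rather than an official Apertium tool: `count_unknowns` is a hypothetical name, and the function assumes that unknown words carry a leading `*` in the translated output. It relies on `uniq -c` to produce the per-word counts and on `sort -rn` to put the most frequent words first.

```shell
# Hypothetical helper, not part of Apertium.
# stdin:  translated text in which unknown words start with *
# stdout: one line per unknown word, "count word", most frequent first
count_unknowns() {
  tr -s '[:space:]' '\n' |   # one word per line
    grep '^\*' |             # keep only words marked as unknown
    sort |                   # group identical words together
    uniq -c |                # count occurrences of each word
    sort -rn                 # highest counts first
}

# Possible usage, assuming the ron-fra pair is compiled in the current
# directory and corpus.txt.gz is your corpus:
#   zcat corpus.txt.gz | apertium -d . ron-fra | count_unknowns | head -n 50 > hitlist
```

Note that punctuation attached to a word (for example `*foo,`) is counted separately from the bare word in this sketch, so a few entries in the list may need merging by hand.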