Difference between revisions of "Task ideas for Google Code-in/Grow bilingual"


Latest revision as of 15:01, 19 January 2020

  1. Select a language pair, ideally one where the source language is a language you know (L₂) and the target language is a language you use every day (L₁), and where the pair has fairly good monolingual dictionaries in Apertium but no reasonable bilingual dictionary (such pairs are usually in the incubator), for instance apertium-spa-pol
  2. Install Apertium locally from nightlies (instructions here); clone the relevant language modules and pair from GitHub; make sure that it works. Alternatively, get Apertium VirtualBox and update, check out & compile the language pair.
  3. Using a large enough corpus of representative text in the source language (e.g. plain text taken from Wikipedia, newspapers, literature, etc.), detect the 200 most frequent unknown words (words in the source document which are not in the bilingual dictionaries of the language pair). See below for information about how to do this. Note: the beginner version of this task only requires 50 words.
  4. Add these correspondences to the bilingual dictionary (the appropriate .dix file) in bidix format (so that they are not unknown anymore), as well as the monolingual analysers if needed. Make sure to categorise stems correctly.
  5. Compile and test again
  6. Submit a pull request to the GitHub repositories
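Step 4 asks for entries in bidix format. A minimal sketch of what one such entry might look like; the Spanish-Polish stems and tags here are illustrative, and the actual tag inventory to use comes from the pair's monolingual dictionaries:

```xml
<!-- Hypothetical entry for a Spanish-Polish bidix (apertium-spa-pol):
     maps Spanish "perro" to Polish "pies", both categorised as
     masculine nouns. <l> is the left (source-language) side,
     <r> the right (target-language) side of the correspondence. -->
<e><p><l>perro<s n="n"/><s n="m"/></l><r>pies<s n="n"/><s n="m"/></r></p></e>
```

When the two sides differ in category (for instance, different grammatical gender), the entry may need direction restrictions or extra tags; check how existing entries in the pair handle such cases before inventing a pattern.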

How to find the most frequent unknowns

<Unhammer> translate your corpus, make it one word per line, grab only the
           ones with * at the start, sort, count number of hits per word, sort
           again
<Unhammer> e.g.
<Unhammer> zcat corpus.txt.gz | apertium -d . ron-fra | tr ' ' '\n' | grep
           '^\*' | sort | uniq -c | sort -n >hitlist
<asusAndrei> awesome!
<Unhammer> hitlist will be unknowns sorted by frequency, but you might have to
           skip a couple that are "strange" or difficult to add
<Unhammer> and that's ok, as long as you start from the most frequent and work
           your way down
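The pipeline from the chat log can be sketched end to end. This version substitutes a tiny simulated translation for the real `zcat corpus.txt.gz | apertium -d . ron-fra` stage (the sample words are invented for illustration), but the counting stages are the same, plus a `head` to cap the list at 200 entries as the task requires:

```shell
#!/bin/sh
# Simulate Apertium output, where unknown words carry a leading '*'.
# In a real run, replace the printf with:
#   zcat corpus.txt.gz | apertium -d . ron-fra
printf '%s\n' 'le *chat et le *chat voient un *chien' 'un *chat dort' \
  | tr ' ' '\n' \
  | grep '^\*' \
  | sort \
  | uniq -c \
  | sort -n \
  | tail -200 > hitlist   # keep at most the 200 most frequent unknowns
cat hitlist
```

With `sort -n` the most frequent unknowns end up at the bottom of `hitlist`, so work upward from the end of the file (or use `sort -rn` and `head -200` to put them first instead).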