Task ideas for Google Code-in/Add words to monolingual dictionary

Revision as of 01:26, 9 December 2019

  1. Select a language module, ideally for a language you know.
  2. Install Apertium locally from nightlies (instructions here); clone the relevant language module from GitHub; compile it; and check that it works. Alternatively, get Apertium VirtualBox and update, check out & compile the language pair.
  3. Using a large enough corpus of representative text in the language (e.g. plain text taken from Wikipedia, newspapers, literature, etc.) detect the 250 most frequent unknown words (words in the source document which are not in the dictionary). See below for information about how to do this. Note: the beginner version of this task only requires 100 words.
  4. Add the words to the monolingual dictionary (the appropriate .dix or .lexc file) so that they are not unknown anymore. Make sure to categorise stems correctly (this can be hard, so please check with your mentor if you're unsure about anything!).
  5. Compile and test again.
  6. Submit a pull request to the GitHub repository with your updates.
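
Step 5 can be partly automated. The following is a minimal sketch, not an official Apertium tool: it assumes the analyser output is in the Apertium stream format, where an unknown word appears as ^word/*word$, and it ignores escaped characters. The function name still_unknown and the sample words are hypothetical, invented for illustration.

```python
import re

# One lexical unit whose analysis starts with '*', i.e. a word the analyser does not know.
# Assumption: plain stream format with no escaped '/', '^' or '$' inside the surface form.
UNKNOWN = re.compile(r"\^([^/$^]*)/\*[^$]*\$")


def still_unknown(analysed_text, added_words):
    """Return the words from added_words that are still marked unknown in analysed_text."""
    unknown = {m.group(1).lower() for m in UNKNOWN.finditer(analysed_text)}
    return sorted(w for w in added_words if w.lower() in unknown)


# Invented sample: "casa" now has an analysis, "xyzzy" is still unknown.
sample_output = "^casa/casă<n>$ ^xyzzy/*xyzzy$"
print(still_unknown(sample_output, ["casa", "xyzzy"]))
```

An empty list means every word you added now receives an analysis in the text you tested.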

How to find the most frequent unknowns

Paraphrased from Unhammer:

  • analyse your corpus, put one word per line, keep only the words that start with * (the unknowns), sort, count the hits per word, then sort again by count
  • e.g. zcat corpus.txt.gz | apertium -d . ron-morph | tr ' ' '\n' | grep '^\*' | sort | uniq -c | sort -nr > hitlist
<Unhammer> hitlist will be unknowns sorted by frequency, but you might have to
           skip a couple that are "strange" or difficult to add
<Unhammer> and that's ok, as long as you start from the most frequent and work
           your way down
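
The same counting can be sketched in Python, which may help if the analyser prints the full Apertium stream format: there an unknown word looks like ^word/*word$, so the asterisk follows a slash instead of starting the token. This is only an illustration under that assumption; the function name unknown_hitlist, the lowercasing of words, and the sample lines are choices made for this sketch, not part of Unhammer's recipe.

```python
import re
from collections import Counter

# One lexical unit whose analysis starts with '*', i.e. an unknown word.
# Assumption: no escaped '/', '^' or '$' inside the surface form.
UNKNOWN = re.compile(r"\^([^/$^]*)/\*[^$]*\$")


def unknown_hitlist(analysed_lines, top=250):
    """Count unknown words in analyser output and return the most frequent first."""
    counts = Counter()
    for line in analysed_lines:
        # Lowercasing merges "Blorp" and "blorp"; drop it if case matters for your language.
        counts.update(m.group(1).lower() for m in UNKNOWN.finditer(line))
    return counts.most_common(top)


# Invented sample output; "blorp" and "zzyx" are made-up unknown words.
sample = [
    "^casa/casă<n>$ ^blorp/*blorp$ ^zzyx/*zzyx$",
    "^Blorp/*Blorp$",
]
for word, hits in unknown_hitlist(sample):
    print(hits, word)
```

For a real corpus, one option is to pass sys.stdin instead of sample and pipe the output of apertium -d . ron-morph into the script. The top parameter defaults to 250 to match the task; the beginner version would use 100.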