Difference between revisions of "Building dictionaries"

Revision as of 21:13, 22 December 2011

I want to share a lesson I have learned after building some dictionaries: the importance of frequency estimates. For new language pairs to have the best possible coverage with a minimum of effort, it is very important to add words and rules in order of decreasing frequency, starting with the most frequent words and phenomena.

The reason that words should be added in order of frequency is quite intuitive: the higher the frequency, the more likely the word is to appear in the text you are trying to translate (see below for Zipf's law).

For example, in English you can almost be sure that the words "the" or "a" will appear in all but the most basic sentences; however, how many times have you seen "hypothyroidism" or "obelisk" written? The higher the frequency of the word, the more you "gain" from adding it.

Frequency

A person's intuition on which words are important or frequent can be very deceptive. Therefore, the best one can do is collect a lot of text (millions of words, if possible) which is representative of what one wants to translate, and study the frequencies of words and phenomena. Get it from Wikipedia or from a newspaper archive, or write a robot that harvests it from the Web.

It is quite easy to make a crude "hit parade" of words using a simple Unix command sequence (a single line):

$ cat mybigrepresentative.txt | tr ' ' '\012' | sort -f | uniq -c | sort -nr > hitparade.txt

[I took this from Unix for Poets, I think.]

Of course, this may be improved a lot but serves for illustration purposes.
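
One possible improvement, sketched here in Python, is to split on all whitespace and punctuation and to fold case before counting, so that "The" and "the" end up in the same row. The function name hit_parade and the definition of a "word" used below are assumptions made for illustration, not part of the original recipe.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters/digits, optionally joined by
# internal apostrophes or hyphens (e.g. "don't", "Suid-Afrika").
WORD_RE = re.compile(r"\w+(?:['-]\w+)*")


def hit_parade(lines):
    """Return (word, count) pairs for the given lines, most frequent first."""
    counts = Counter()
    for line in lines:
        # Fold case so that "The" and "the" are counted together.
        counts.update(WORD_RE.findall(line.lower()))
    # Sort by descending count, then alphabetically so that ties are stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Example use on a file (same file name as in the command above):
# with open("mybigrepresentative.txt", encoding="utf-8") as corpus:
#     for word, count in hit_parade(corpus):
#         print(count, word)
```

Any iterable of strings can be passed in, so the same function works on a file object or on a list of test sentences.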

Word frequency vs. Word rank: A plot of word frequency in Wikipedia. The plot is in log-log coordinates. X is the rank of a word in the frequency table; Y is the total number of the word’s occurrences. Zipf's law corresponds to the upper linear portion of the curve, roughly following the green (1/x) line.

You will find interesting properties in this list. One is that if you multiply the rank of a word by its frequency, you get a number which is roughly constant. This is known as Zipf's law.

Another is that roughly half of the entries in the list are hapax legomena (words that appear only once).

Third, the 1,000 or so most frequent words may cover around 75% of the text.

So use lists like these when you are building dictionaries.
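
The three properties above can be checked on your own hit parade. The following is a minimal sketch, assuming you have already read the counts into a list sorted in descending order; the function name frequency_stats and its return layout are illustrative choices.

```python
def frequency_stats(counts, top_n=1000):
    """Summarise a hit parade given its counts in descending order.

    Returns a tuple (hapax_share, coverage, rank_times_freq):
      hapax_share     - fraction of distinct words that occur exactly once
      coverage        - fraction of running text covered by the top_n words
      rank_times_freq - rank * frequency for each word (Zipf's product)
    """
    if not counts:
        raise ValueError("counts must not be empty")
    total = sum(counts)
    hapax_share = sum(1 for c in counts if c == 1) / len(counts)
    coverage = sum(counts[:top_n]) / total
    rank_times_freq = [rank * c for rank, c in enumerate(counts, start=1)]
    return hapax_share, coverage, rank_times_freq


# Toy example: eight distinct words, four of which occur only once.
hapax_share, coverage, products = frequency_stats([12, 6, 4, 3, 1, 1, 1, 1], top_n=2)
print(hapax_share, coverage, products)
```

On a real corpus the rank-times-frequency products will only be roughly constant, and mostly in the upper part of the list.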

If one of your languages is English, there are interesting lists:

Ogden's Basic English (850 words)
Voice of America's Special English

Bear in mind, of course, that these lists are also based on a particular usage model of English, which is not "naturally occurring" English.

The same applies for other linguistic phenomena. Linguists tend to focus on very infrequent phenomena which are key to the identity of a language, or on what is different between languages. But these "jewels" are usually not the "building blocks" you would use to build translation rules. So do not get carried away. Trust only frequencies and lots of real text.

Corpus catcher

http://translate.sourceforge.net/wiki/corpuscatcher/index

Wikipedia dumps

http://download.wikimedia.org/backup-index.html

For help in processing them, see:

http://meta.wikimedia.org/wiki/Help:Export

The dumps need cleaning up (removing Wiki syntax and XML etc.), but can provide a substantial amount of text — both for frequency analysis and as a source of sentences for POS tagger training. It can take some work, and isn't as easy as getting a nice corpus, but on the other hand they're available in some 275 languages with at least 100 articles written in each.

You'll want the one entitled "Articles, templates, image descriptions, and primary meta-pages. -- This contains current versions of article content, and is the archive most mirror sites will probably want."

To clean up a dump, try something like (for Afrikaans):

$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | sed 's/$/\n/g' |
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
sed 's/&[#a-zA-Z0-9]*;/ /g'

This will give you a roughly usable list with about one sentence per line (stripping out most of the extraneous formatting). Note that the grep '^[A-Z]' filter presumes that your language uses the Latin alphabet; if it uses another writing system, you'll need to change that pattern.
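
The same clean-up can be sketched in Python, which makes it easier to adapt to other writing systems. This is only an approximation of the pipeline above. The function names are hypothetical, and the "starts with an uppercase letter" test uses str.isupper(), which is an assumption that suits scripts with letter case (Latin, Cyrillic, Greek) but not scripts without it.

```python
import re


def strip_wiki_markup(line):
    """Remove the most common wiki link syntax and XML entities from a line."""
    # [[target|label]] -> label]]  (drop the link target and the pipe)
    line = re.sub(r"\[\[[^\]|]*\|", "", line)
    # Drop any remaining link brackets.
    line = line.replace("[[", "").replace("]]", "")
    # Replace entities such as &amp; or &#160; with a space.
    line = re.sub(r"&[#\w]+;", " ", line)
    # Collapse the runs of whitespace left behind.
    return " ".join(line.split())


def clean_lines(lines):
    """Yield cleaned lines, keeping only those that start with an uppercase letter."""
    for line in lines:
        if line[:1].isupper():
            yield strip_wiki_markup(line)


sample = ["{{Infobox", "Die [[Suid-Afrika|land]] het [[stede]] &amp; dorpe.", "| naam = x"]
print(list(clean_lines(sample)))
```

Templates, tables and references are not handled here, so the output still needs a manual look before it is used for anything beyond frequency counts.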

To build the hit parade directly from the dump, try something like (for Afrikaans):

$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | sed 's/$/\n/g' |
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' | sed 's/&[#a-zA-Z0-9]*;/ /g' | tr ' ' '\012' |
sort -f | uniq -c | sort -nr > hitparade.txt

Once you have this 'hitparade' of words, it is probably best first to skim off the top 20,000–30,000 entries into a separate file.

$ head -n 20000 hitparade.txt > top.lista.20000.txt

Now, if you have already been working on a dictionary, chances are that this 'top list' contains words you have already added. You can filter out the word forms you are already able to analyse using (for example, for Afrikaans):

$ cat top.lista.20000.txt | apertium-destxt | lt-proc af-en.automorf.bin  | apertium-retxt | grep '\/\*' > words_to_be_added.txt

(here lt-proc af-en.automorf.bin will analyse the input stream of Afrikaans words and put an asterisk * on those it doesn't recognise)

For every 10 words or so you add, it's probably worth going back and repeating this step, especially for highly inflected languages — as one lemma can produce many word forms, and the wordlist is not lemmatised.
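
If you would rather have a plain list of the unknown surface forms than the raw analyser lines, the output can be post-processed. The sketch below assumes the analyser writes each word as ^surface/analysis$ and marks unknown words as ^surface/*surface$, which is what the asterisk in the grep above relies on. The function name unknown_words is hypothetical, and escaped characters inside a unit are ignored for simplicity.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$
UNIT_RE = re.compile(r"\^([^/^$]+)/([^$]*)\$")


def unknown_words(analysed_text):
    """Return the surface forms whose analysis starts with an asterisk."""
    unknown = []
    for surface, analyses in UNIT_RE.findall(analysed_text):
        if analyses.startswith("*"):
            unknown.append(surface)
    return unknown


sample = "^die/die<det><def><sp>$ ^blorp/*blorp$ ^huis/huis<n><sg>$"
print(unknown_words(sample))
```

Because the hit parade is sorted by frequency, the order of the returned list is also the order in which the missing words are worth adding.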

Getting cheap bilingual dictionary entries

A cheap way of getting bilingual dictionary entries between a pair of languages is as follows:

First grab yourself a wordlist of nouns in language x; for example, grab them out of the Apertium dictionary you are using:

$ cat <monolingual dictionary> | grep '<i>' | grep '__n\"' | awk -F'"' '{print $2}'

Next, write a basic script, something like:

#!/bin/sh

#language to translate from
LANGF=$2
#language to translate to
LANGT=$3
#filename of wordlist
LIST=$1

for LWORD in `cat "$LIST"`; do
        # fetch the article and keep only the interwiki link to the target language
        TEXT=`wget -q "http://$LANGF.wikipedia.org/wiki/$LWORD" -O - | grep "interwiki-$LANGT"`
        if [ $? -eq 0 ]; then
                # take the article title from the link URL, URL-decode it (needs Python 3)
                # and strip any parenthesised disambiguation
                RWORD=`echo $TEXT |
                cut -f4 -d'"' | cut -f5 -d'/' |
                python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))' |
                sed 's/(\w*)//g'`
                echo '<e><p><l>'$LWORD'<s n="n"/></l><r>'$RWORD'<s n="n"/></r></p></e>'
        fi
        sleep 8
done

Note: The "sleep 8" is so that we don't put undue strain on the Wikimedia servers.

If you save this as iw-word.sh, then you can use it at the command line:

$ sh iw-word.sh <wordlist> <language code from> <language code to>

For example, to retrieve a bilingual wordlist from English to Afrikaans, use:

$ sh iw-word.sh en-af.wordlist en af

The method is of variable reliability. Reports of between 70% and 80% accuracy are common. It works best for unambiguous terms, but also works reasonably well where a term keeps the same ambiguity across both languages.

Any correspondences produced by this method must be checked by native or fluent speakers of both languages in question.
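
The title extraction and entry formatting steps of the script can also be sketched in Python, which avoids the chain of cut and sed calls. The function names are hypothetical. Turning underscores into spaces and writing spaces as <b/> are additions that the shell script does not make; they follow the way Apertium dictionaries mark a space inside a multiword entry.

```python
import re
from urllib.parse import unquote
from xml.sax.saxutils import escape


def title_from_url(url):
    """Extract a readable article title from an interwiki link URL."""
    # The title is the last path component, percent-encoded.
    title = unquote(url.rstrip("/").rsplit("/", 1)[-1])
    # Wikipedia URLs use underscores where the title has spaces.
    title = title.replace("_", " ")
    # Drop a parenthesised disambiguation such as "(planeet)".
    title = re.sub(r"\s*\([^)]*\)", "", title)
    return title.strip()


def bidix_entry(left, right, tag="n"):
    """Format one bilingual dictionary entry for the given word pair."""
    def side(word):
        # Escape XML special characters, then mark spaces as <b/>.
        return escape(word).replace(" ", "<b/>")
    return (f'<e><p><l>{side(left)}<s n="{tag}"/></l>'
            f'<r>{side(right)}<s n="{tag}"/></r></p></e>')


print(bidix_entry("house", title_from_url("//af.wikipedia.org/wiki/Huis")))
```

As with the shell version, the generated entries are only candidates and still need to be checked by hand.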

Monodix

Main article: Monodix

If the language you're working with is fairly regular, and noun inflection is quite easy (for example English or Afrikaans), then the following script may be useful:

You'll need a large wordlist (of all forms, not just lemmata) and some existing paradigms. It works by first taking all singular forms out of the list, then looking for plural forms, then printing out those which have both singular and plural forms in Apertium format.

Note: These will need to be checked, as no language except Esperanto is that regular.

# set this to the location of your wordlist
WORDLIST=/home/spectre/corpora/afrikaans-meester-utf8.txt

# set the paradigm, and the singular and plural endings.
PARADIGM=sa/ak__n
SINGULAR=aak
PLURAL=ake
# set this to the part of the singular ending that is kept in the stem.
# e.g. [0:-1] means 'cut off one character', [0:-2] means 'cut off two characters' etc.
# (sa/ak__n supplies the ending 'ak', so two characters are cut off here; needs Python 3)
ECHAR=`printf '%s' "$SINGULAR" | python3 -c 'import sys; print(sys.stdin.read()[0:-2])'`

# all word forms that end in the plural ending
PLURALS=`grep "$PLURAL\$" "$WORDLIST"`
CROSSOVER=""

for word in $PLURALS; do
        # derive the singular form this plural would correspond to
        SFORM=`echo "$word" | sed "s/$PLURAL\$/$SINGULAR/"`
        # if the form is found then append it to the list
        if grep -q "^$SFORM\$" "$WORDLIST"; then
                CROSSOVER=$CROSSOVER" "$SFORM
        fi
done

# print out the list
for pair in $CROSSOVER; do
        echo '    <e lm="'$pair'"><i>'`echo "$pair" | sed "s/$SINGULAR\$/$ECHAR/"`'</i><par n="'$PARADIGM'"/></e>'
done
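
The same pairing idea can be sketched in Python, where the wordlist is held in a set so that each lookup does not rescan the whole file. The function name monodix_entries and the cut parameter are illustrative. The value cut=2 in the example assumes that the sa/ak__n paradigm supplies the ending "ak", so that the stem of "saak" is "sa".

```python
def monodix_entries(words, singular, plural, paradigm, cut):
    """Return monodix entries for words attested in both singular and plural.

    words    - iterable of word forms (all forms, not just lemmata)
    singular - singular ending, e.g. "aak"
    plural   - plural ending, e.g. "ake"
    paradigm - name of an existing paradigm, e.g. "sa/ak__n"
    cut      - number of characters to cut off the lemma to get the stem
    """
    wordset = set(words)
    entries = []
    for word in sorted(wordset):
        if not word.endswith(plural):
            continue
        # Swap the plural ending for the singular one.
        lemma = word[: len(word) - len(plural)] + singular
        # Keep the pair only if the singular form is attested too.
        if lemma in wordset:
            stem = lemma[: len(lemma) - cut]
            entries.append(f'    <e lm="{lemma}"><i>{stem}</i><par n="{paradigm}"/></e>')
    return entries


forms = ["saak", "sake", "taak", "take", "maak", "bake"]
for entry in monodix_entries(forms, "aak", "ake", "sa/ak__n", cut=2):
    print(entry)
```

Here "bake" is skipped because no "baak" is attested, and "maak" is skipped because no "make" is attested, which is the filtering the shell script aims for.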

@@ Line 2: / Line 2: @@
 Some of you have been brave enough to start to write new language pairs
 for Apertium. That makes me (and all of the Apertium troop) very happy
-and thankful, but, most importantly, makes Apertium useful to more
+and thankful, but more importantly, it makes Apertium useful to more
 people.
-This time I want to share some lessons I have learned after building
+I want to share some lessons I have learned after building
 some dictionaries: the importance of frequency estimates. For the new
 pairs to have the best possible coverage with a minimum of effort, it is
@@ Line 11: / Line 11: @@
 with the most frequent words and phenomena.
-The reason that words should be added in order of frequency is quite intuitive,
+The reason that words should be added in order of frequency is quite intuitive:
 the higher the frequency, the more likely the word is to appear in the text you are
 trying to translate (see below for Zipf's law).
-For example in English you can almost be sure that the words "the" or
+For example, in English you can almost be sure that the words "the" or
-"a" will appear in all but the most basic sentences, however how many
+"a" will appear in all but the most basic sentences; however, how many
-times have you seen "hypothyroidism" or "obelisk" written? The higher the frequency
+times have you seen "hypothyroidism" or "obelisk" written? The higher the frequency of
 the word, the more you "gain" from adding it.
 ==Frequency==
-A person's intuition on which words are important of frequent can be
+A person's intuition on which words are important or frequent can be
 very deceptive. Therefore, the best one can do is collect a lot of text
-(millions of words if possible) which is representative of what one
+(millions of words, if possible) which is representative of what one
 wants to translate, and study the frequencies of words and phenomena.
-Get it from Wikipedia, or from newspaper, or write a robot that harvests
+Get it from Wikipedia or from a newspaper archive, or write a robot that harvests
-it from the web.
+it from the Web.
 It is quite easy to make a crude "hit parade" of words using a simple
-Unix command sequence (a single line)
+Unix command sequence (a single line):
 <pre>
 $ cat mybigrepresentative.txt | tr ' ' '\012' | sort -f | uniq -c | sort -nr > hitparade.txt
 </pre>
-[I took this from Unix for Poets I think]
+[I took this from ''Unix for Poets'', I think.]
 Of course, this may be improved a lot but serves for illustration
 purposes.
-[[Image:Wikipedia-n-zipf.png|thumb|300px|right|'''Word frequency vs. Word rank''': A plot of word frequency in Wikipedia. The plot is in log-log coordinates. x  is rank of a word in the frequency table; y  is the total number of the word’s occurences. Zipf's law corresponds to the upper linear portion of the curve, roughly following the green (1/x)  line.]]
+[[Image:Wikipedia-n-zipf.png|thumb|320px|right|'''Word frequency vs. Word rank''': A plot of word frequency in Wikipedia. The plot is in log-log coordinates. ''X'' is the rank of a word in the frequency table; ''Y'' is the total number of the word’s occurrences. Zipf's law corresponds to the upper linear portion of the curve, roughly following the green (1/''x'') line.]]
-You will find interesting properties in this list.
-One is that multiplying the rank of a word by its frequency, you get a
+You will find interesting properties in this list. One is that in multiplying the rank of a word by its frequency, you get a number which is pretty constant. That's called [http://en.wikipedia.org/wiki/Zipf%27s_law Zipf's Law].
-number which is pretty constant. That's called [http://en.wikipedia.org/wiki/Zipf%27s_law Zipf's Law].
-The other one is that half of the list are "hapax legomena" (words that
+Another one is that '''half of the list''' are ''hapax legomena'' (words that appear only once).
-appear only once).
-And third, with about 1000 words you may have 75% of the text covered.
+Third, with about 1,000 words you may have 75% of the text covered.
 So use lists like these when you are building dictionaries.
-If one of your language is English, there are interesting lists:
+If one of your languages is English, there are interesting lists:
 * [http://ogden.basic-english.org/words.html Ogden's Basic English] (850 words)
 * [http://www.voanews.com/specialenglish Voice of America's Special English]
-But bear in mind that these lists are also based on a particular usage
+Bear in mind, of course, that these lists are also based on a particular usage model of English, which is not "naturally occurring" English.
-model of English, which is not "natural occurring" English.
 The same applies for other linguistic phenomena. Linguists tend to focus
@@ Line 64: / Line 61: @@
 are usually not the "building blocks" you would use to build translation
 rules. So do not get carried away. Trust only frequencies and lots of
-real text...
+real text.
 ==Corpus catcher==
@@ Line 74: / Line 71: @@
 * http://download.wikimedia.org/backup-index.html
-For help in processing them see:
+For help in processing them, see:
 * http://meta.wikimedia.org/wiki/Help:Export
 The dumps need cleaning up (removing Wiki syntax and XML etc.), but can
-provide a ''substantial'' amount of text for both frequency analysis, and
+provide a ''substantial'' amount of text &mdash; both for frequency analysis and
-sentences for POS [[tagger training]]. It can take some work, and isn't as
+as a source of sentences for POS [[tagger training]]. It can take some work, and isn't as
 easy as getting a nice corpus, but on the other hand they're available
+in some [http://meta.wikimedia.org/wiki/List_of_Wikipedias 275 languages] with at least 100 articles written in each.
-in ~270 languages.
 You'll want the one entitled "Articles, templates, image descriptions,
@@ Line 96: / Line 93: @@
 </pre>
-Will give you approximately useful lists of one sentence per line
+This will give you approximately useful lists of one sentence per line
 (stripping out most of the extraneous formatting). Note, this presumes that your
-language uses the Latin alphabet, if it uses another writing system,
+language uses the Latin alphabet; if it uses another writing system,
 you'll need to change that.
@@ Line 110: / Line 107: @@
 Once you have this 'hitparade' of words, it is first probably best to skim
-off the top 20&mdash;30,000. Into a separate file.
+off the top 20,000&ndash;30,000 into a separate file.
 <pre>
@@ Line 116: / Line 113: @@
 </pre>
-Now, if you already have been working on a dictionary then the chances are that there
+Now, if you already have been working on a dictionary, chances are that there
 will exist in this 'top list' words you have already added. You can remove word forms
 you are already able to analyse using (for example Afrikaans):
@@ Line 124: / Line 121: @@
 </pre>
-(here <code>lt-proc af-en.automorf.bin</code> will analyse input stream of Afrikaans words and put an asterisk * on those it doesn't recognise)
+(here <code>lt-proc af-en.automorf.bin</code> will analyse the input stream of Afrikaans words and put an asterisk * on those it doesn't recognise)
-For every 10 words or so you add, its probably worth going back and repeating this step, especially
+For every 10 words or so you add, it's probably worth going back and repeating this step, especially
-for highly inflected languages &mdash; as one lemma can produce many word forms and the wordlist
+for highly inflected languages &mdash; as one lemma can produce many word forms, and the wordlist
 is not lemmatised.
@@ Line 135: / Line 132: @@
 languages is as follows:
-First grab yourself a wordlist of ''nouns'' in language ''x'', for
+First grab yourself a wordlist of ''nouns'' in language ''x''; for
 example, grab them out of the Apertium dictionary you are using:
@@ Line 167: / Line 164: @@
 </pre>
-''Note: The "sleep 8" is so that we don't put undue strain on the Wikimedia servers''
+''Note: The "sleep 8" is so that we don't put undue strain on the Wikimedia servers.''
-And save it as <code>iw-word.sh</code>, then you can use it at the command line:
+If you save this as <code>iw-word.sh</code>, then you can use it at the command line:
 <pre>
 $ sh iw-word.sh <wordlist> <language code from> <language code to>
 </pre>
+For example, to retrieve a bilingual wordlist from English to Afrikaans, use:
-e.g. to retrieve a bilingual wordlist from English to Afrikaans, use:
 <pre>
@@ Line 182: / Line 178: @@
 The method is of variable reliability. Reports of between 70% and 80%
-accuracy are common. It is best for unambiguous terms, but works ok where
+accuracy are common. It is best for unambiguous terms, but works all right where
 terms retain ambiguity through languages.
@@ Line 191: / Line 187: @@
 {{main|Monodix}}
-If the language you're working with is fairly regular, and noun inflection is quite easy (for example English or Afrikaans) then the following script may be useful:
+If the language you're working with is fairly regular, and noun inflection is quite easy (for example English or Afrikaans), then the following script may be useful:
 You'll need a large wordlist (of all forms, not just lemmata) and some existing paradigms. It works by first taking all singular forms out of the list, then looking for plural forms, then printing out those which have both singular and plural forms in Apertium format.
-''Note: These will need to be checked, as no language is that regular.''
+''Note: These will need to be checked, as no language except Esperanto is that regular.''
 <pre>