Finding errors in dictionaries


Revision as of 19:03, 15 February 2015

Summary

1. Expand the monodix.
2. Exclude a list of correctly spelled words.
3. Spell-check the rest of the words in a word processing program of your choice.
4. Edit the monodix for the misspelled words you find.


Expand the monodix

Move to the folder where the dictionary is kept. The following command expands the dictionary, i.e. creates all forms of every word according to its assigned paradigm. Only the forms that are not marked with an LR or RL restriction are kept, and the erroneous entries caused by a long-known bug (NON_ANALYSIS) are filtered out. The example is for the Swedish monolingual dictionary:

lt-expand apertium-swe.swe.dix | grep -v ':[<>]:' | cut -f1 -d: | grep -F -v 'NON_ANALYSIS' > swe.expanded

To check a different dictionary, replace "apertium-swe.swe.dix" with the name of your dictionary and change the output name "swe.expanded" to something appropriate.
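
To see what each stage of the expansion pipeline does, here is a minimal sketch run on a few hand-written lines in the general shape of lt-expand output (surface form, a colon, then the analysis; direction-restricted entries carry ":>:" or ":<:"). The sample entries, the tags, the line containing NON_ANALYSIS and the file names sample.raw and sample.expanded are all made up for illustration; real output comes from your own dictionary.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical stand-in for the raw output of lt-expand.
cat > sample.raw <<'EOF'
hus:hus<n><nt><sg><ind>
huset:hus<n><nt><sg><def>
hvit:>:vit<adj><sg><ind>
NON_ANALYSIS:vara<vblex><inf>
EOF

# 1. drop direction-restricted entries (":>:" or ":<:")
# 2. keep only the surface form before the first colon
# 3. drop the residue of the NON_ANALYSIS bug
grep -v ':[<>]:' sample.raw | cut -f1 -d: | grep -F -v 'NON_ANALYSIS' > sample.expanded

cat sample.expanded
```

Only the two unrestricted surface forms, hus and huset, remain in sample.expanded.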

Make a list of correctly spelled words

The expanded word list is a very large haystack to look for needles in. To make the task easier, you want to get rid of as much hay as possible without throwing away any needles. An easy way is to drop all words that are spelled correctly. This can be done by filtering the list against a list of correctly spelled words.

You can get a list of correct words from Aspell. The following command gets a list of Swedish words:

aspell -d sv dump master | aspell -l sv expand > aspellwords.sv

Just change the language code to that of the language you are working with. For English, for example, it would be:

aspell -d en dump master | aspell -l en expand > aspellwords.en

You'll find more info in the Aspell manual if needed.

This list is, however, rather short. You might find it useful to filter on more words. One way of getting more correctly spelled words is to use the top of a word frequency list made from a large corpus. Rationale: most people spell correctly most of the time, so highly frequent words are most probably correctly spelled. If they are not, they will probably become the new standard spelling :-)

You can download a corpus from, e.g., OPUS (http://opus.lingfil.uu.se/), hosted at Uppsala University, Sweden. Choose among Europarl, OpenOffice, OpenSubtitles, etc., in many languages.

You can get a frequency list for instance with the following command:

cat my_english_corpus.txt | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -v '[^a-z]' | sort | uniq -c | sort -rn > frequency.en

You can read more about getting a corpus and making a frequency list on the page Building_dictionaries.
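
The frequency list produced by the command above has a count in front of every word, so it cannot be used directly as a list of words to exclude. A minimal sketch of turning its top into a plain word list follows. The cutoff (TOP), the sample counts and the file names frequency.sv and top_frequency.sv are assumptions for illustration; choose a cutoff that suits the size of your corpus.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir"

# Tiny stand-in for the output of the frequency command (count, then word).
printf '%7d %s\n' 120 och 95 att 3 hus 1 huss > frequency.sv

# Hypothetical cutoff: keep only the TOP most frequent words.
TOP=3

# Take the first TOP lines and keep just the word column.
# "NF == 2" skips lines that have a count but no word, so that no
# empty line ends up in the word list.
head -n "$TOP" frequency.sv | awk 'NF == 2 { print $2 }' > top_frequency.sv

cat top_frequency.sv
```

With the sample data, the rare form huss falls below the cutoff and stays out of the exclusion list, which is the point: rare words are the ones that still need to be checked.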

Exclude the correctly spelled words

When you've got a nice long list of correctly spelled words to exclude, use it to filter the expanded word list from Apertium. This is easy to accomplish with grep. The following command would, for instance, filter the expanded Swedish monolingual dictionary (swe.expanded from the first step). Here top_frequency.sv is assumed to hold the most frequent words from your frequency list, one word per line and without the counts:

cat swe.expanded | grep -v -wFf top_frequency.sv | grep -v -wFf aspellwords.sv > apertium-swe.sv.felstavade

Now you've got the suspected errors in the file apertium-swe.sv.felstavade (felstavade = misspelled).
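
One detail of the filter is worth checking on a small sample: with -w, grep drops a line as soon as any listed word occurs in it as a whole word, so a hyphenated entry can be thrown away even if only one of its parts is spelled correctly. The sketch below compares -w with -x, which only drops lines that are identical to a listed word. This is a possible variation, not part of the original recipe, and the word lists are made up for illustration.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical expanded dictionary: two correct forms, two misspelled ones.
printf '%s\n' hus huss och adres-bok > sample.expanded

# Hypothetical list of correctly spelled words.
printf '%s\n' hus och bok > goodwords

# -w: a listed word anywhere in the line (as a whole word) removes the line.
grep -v -wFf goodwords sample.expanded > suspects.w

# -x: only lines that equal a listed word are removed.
grep -v -xFf goodwords sample.expanded > suspects.x

echo "with -w:"; cat suspects.w
echo "with -x:"; cat suspects.x
```

With -w the misspelled adres-bok is lost because "bok" is in the list; with -x it stays among the suspects. The price of -x is a somewhat longer list to check by hand.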

Spell-check the rest of the expanded dictionary

The easiest way to find the errors quickly is to check the remaining words in a word processing program of your choice. When you find a misspelled word, try to work out the base form (lemma) of the word. Look for it in the Apertium monodix and correct the entry. Very often the error is due to one of the following mistakes:

1. Wrong stem.
2. Wrong paradigm.
3. A new paradigm is needed.
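
If the base form is not obvious, the unfiltered lt-expand output can help, since each line pairs a surface form with its lemma and tags. The following is a minimal sketch under that assumption; sample.raw stands in for the raw output of lt-expand on your dictionary, and the entries, tags and the variable names suspect, analysis and lemma are hypothetical.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical stand-in for: lt-expand apertium-swe.swe.dix > sample.raw
cat > sample.raw <<'EOF'
hus:hus<n><nt><sg><ind>
hussen:huss<n><nt><sg><def>
EOF

suspect='hussen'   # a misspelled form found by the spell-checker

# Print the analysis (lemma plus tags) behind the suspect surface form.
# $NF is the last colon-separated field, which also works for ":>:" lines.
analysis=$(awk -F: -v s="$suspect" '$1 == s { print $NF }' sample.raw)
echo "$analysis"

# The lemma is everything before the first tag.
lemma=${analysis%%<*}
echo "$lemma"

# The entry can then be searched for in the monodix, for example:
# grep -n "lm=\"$lemma\"" apertium-swe.swe.dix
```

Here the suspect form hussen leads back to the lemma huss, which points to a wrong stem in the entry rather than a wrong paradigm.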


--Tunedal (talk) 16:24, 11 February 2015 (CET)
