Difference between revisions of "Finding errors in dictionaries"
Revision as of 15:56, 15 February 2015
Summary
1. Expand the monodix.
2. Exclude a list of correctly spelled words.
3. Spell-check the rest of the words in a word processing program of your choice.
4. Edit the monodix for the misspelled words you find.
Expand the monodix
text
Make a list of correctly spelled words
The expanded word list is a very large haystack to look for needles in. To make the task somewhat easier, we will try to get rid of as much hay as possible without throwing away any needles. Let's get rid of all words that are spelled correctly. This can be done by filtering the list against a list of correctly spelled words.
You can get a list of correct words from Aspell. The following command gets a list of English words:
aspell -d en dump master | aspell -l en expand > aspellwords.en
Just change the language code to that of the language you are working with. For Swedish, for example, it would be:
aspell -d sv dump master | aspell -l sv expand > aspellwords.sv
This list is, however, rather short, so you might find it useful to filter on more words. One way of getting more correctly spelled words is simply to use the top of a word frequency list made from a large corpus. Rationale: most people spell correctly most of the time, so highly frequent words are most probably spelled correctly. If they are not, they will probably become the new standard spelling :-)
You can download a corpus from, e.g., OPUS (http://opus.lingfil.uu.se/, Uppsala University). Choose among Europarl, OpenOffice, OpenSubtitles, etc., which are available in many languages.
You can get a frequency list for instance with the following command:
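One possible way to do this is sketched below; it is an illustration, not necessarily the command originally intended here. It assumes the corpus is a plain-text file, that your `grep` supports `-o` (GNU and BSD grep do), and that your locale matches the corpus encoding so that non-ASCII letters count as letters. The function names `freq_list` and `top_words` and all file names are hypothetical.

```shell
# freq_list FILE: print "count word" lines for FILE, most frequent first.
# Case is kept as-is, so "The" and "the" are counted separately.
freq_list() {
  grep -oE '[[:alpha:]]+' "$1" | sort | uniq -c | sort -rn
}

# top_words N FILE: print only the N most frequent words, one per line.
top_words() {
  freq_list "$2" | head -n "$1" | awk '{print $2}'
}

# Example (hypothetical file names; 50000 is an arbitrary cutoff):
#   top_words 50000 corpus.txt > frequentwords.txt
```

Where to cut the list is a judgment call: a longer list removes more hay, but the further down the frequency list you go, the greater the risk of including a misspelling.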
Exclude the correctly spelled words
text
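As a minimal sketch of the filtering step described earlier, assuming both the expanded word list and the list of correct words are plain-text files with one word per line: the function name `exclude_known` and the file names `expanded.txt` and `candidates.txt` are hypothetical, while `aspellwords.en` is the file produced by the Aspell command above.

```shell
# exclude_known WORDLIST KNOWN: print the lines of WORDLIST that do not
# occur as a whole line in KNOWN.
#   -F  treat each line of KNOWN as a fixed string, not a regex
#   -x  match whole lines only
#   -v  invert the match, i.e. keep the words that are NOT known
#   -f  read the patterns from the file KNOWN
exclude_known() {
  grep -vxFf "$2" "$1"
}

# Example (hypothetical file names):
#   exclude_known expanded.txt aspellwords.en > candidates.txt
```

Matching is exact and case-sensitive, so a word is only removed if it appears in the known list in precisely the same form.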
Spell-check the rest of the expanded dictionary
text