Difference between revisions of "Finding errors in dictionaries"

From Apertium
(wikifiying presentation)
 
(21 intermediate revisions by 2 users not shown)
Line 1: Line 1:
[[Trouver_des_erreurs_dans_des_dictionnaires|En français]]

{{TOCD}}

== Introduction ==
It's easy to commit errors when creating or editing dictionaries. This page presents an easy way to find many serious errors.


== Summary ==
== Summary ==
First expand the monodix. This will produce a very large file.
* First expand the monodix. This will produce a very large file.

* Continue by making a list of correctly spelled words and exclude them from the expanded dictionary.


* Finally check the remaining words in a word processing program of your choice to quickly find the errors. Open the original dictionary file in an editor and correct the errors you have found.
Continue by making a list of correctly spelled words and exclude them from the expanded dictionary.


* Option: Check for duplicate entries.
Finally check the remaining words in a word processing program of your choice to quickly find the errors. Open the original dictionary file in an editor and correct the errors you have found.


== Expand the monodix ==
== Expand the monodix ==
Move to the folder where the dictionary is kept. The following command expands the dictionary, i.e. creates all forms of every word according to the assigned paradigm. Only the forms that are not marked by any LR or RL tag are expanded and the erroneous entries causes by a long known bug (NON_ANALYSIS) are filtered away. The example is for the Swedish monolingual dictionary:
Move to the folder where the dictionary is kept. The following command expands the dictionary, i.e. creates all forms of every word according to the assigned paradigm. Only the forms that are not marked by any LR or RL tag are expanded and the erroneous entries causes by a long known bug (NON_ANALYSIS) are filtered away. The example below expands the Swedish monolingual dictionary:


lt-expand apertium-swe.swe.dix | grep -v ':[<>]:' | cut -f1 -d:|
lt-expand apertium-swe.swe.dix | grep -v ':[<>]:' | cut -f1 -d:| fgrep -v 'NON_ANALYSIS' > swe.expanded
fgrep -v 'NON_ANALYSIS' > swe.expanded


Change to the dictionary you would like to correct, i.e. change "*.swe.dix" to the name of your dictionary and change the output name from "swe.expanded" to something appropriate.
Change to the dictionary you would like to correct, i.e. change "<code>apertium-swe.swe.dix</code>" to the name of your dictionary and change the output name from "<code>swe.expanded</code>" to something appropriate.


== Make a list of correctly spelled words ==
== Make a list of correctly spelled words ==
The expanded word list is a very large haystack to look for needles in. To make the task somewhat easier you would like to get rid off as much hay as possible, without throwing away any needles. An easy way is to simply drop all words that are spelled correctly. This can be done by filtering the list against a list of correctly spelled words.
The expanded word list is a very large haystack to look for needles in. To make the task somewhat easier you would like to get rid off as much hay as possible, without throwing away any needles. An easy way is to simply drop all words that are spelled correctly. This can be done by filtering the list, excluding all words in a list of correctly spelled words.


You can get a list of correct words from Aspell. The following command gets a list of Swedish words:
You can get a list of correct words from Aspell. The following command gets a list of Swedish words:


aspell -d sv dump master | aspell -l sv expand > aspellwords.sv
aspell -d sv dump master | aspell -l sv expand > aspellwords.sv


Just change the language code for the language you are working with. For e.g. English it would be:
Just change the language code for the language you are working with. For e.g. English it would be:


aspell -d en dump master | aspell -l en expand > aspellwords.en
aspell -d en dump master | aspell -l en expand > aspellwords.en


You'll find more info in the Aspell manual if needed.
You'll find more info in the Aspell manual if needed.


This list is however rather short. You might find it useful to filter on more words. One way of getting more correctly spelled words is to simply use the top of a word frequency list made on a large corpus. Rational: most people spell correctly most of the time. Thus highly frequent words are most probably correctly spelled. If they are not, they will probably be the new standard for spelling :-)
The list from Aspell is however rather short. You might find it useful to filter on more words. One way of getting more correctly spelled words is to simply use the top of a word frequency list made on a large corpus. Rational: most people spell correctly most of the time. Thus highly frequent words are most probably correctly spelled. If they are not, they will probably be the new standard for spelling :-)


You can download a corpus from eg. OPUS [http://opus.lingfil.uu.se/ OPUS Uppsala University, Sweden]. Choose among Europarl, OpenOffice and OpenSubtitles etc in many languages.
You can download a corpus from eg. OPUS [http://opus.lingfil.uu.se/ OPUS Uppsala University, Sweden]. Choose among Europarl, OpenOffice and OpenSubtitles etc in many languages.
Line 34: Line 41:
You can get a frequency list for instance with the following command:
You can get a frequency list for instance with the following command:


<pre>
cat my_swedish_corpus.txt | tr ' ' '\n' |
cat my_swedish_corpus.txt | tr ' ' '\n' |
tr '[:upper:]' '[:lower:]' |
tr '[:upper:]' '[:lower:]' |
tr -d '[:punct:]' | grep -v '[^a-z]' |
tr -d '[:punct:]' | grep -v '[^a-z]' |
sort | uniq -c | sort -rn > frequency.sv
sort | uniq -c | sort -rn > frequency.sv
</pre>


Open the frequency list in an editor of your choice. Browsing the list you'll find that the frequency drops quickly. Simply delete all words that not are common and save the rest as e.g. top_frequency.sv (or whatever appropriate). Change the command above to what suits the name of your corpus and the language you are working with.
Open the frequency list in an editor of your choice. Browsing the list you'll find that the frequency drops quickly. Simply delete all words that not are common and save the rest as e.g. top_frequency.sv (or whatever appropriate). Change the command above to what suits the name of your corpus and the language you are working with.


You can read more about getting a corpus and making a frequency list at the page [[Building_dictionaries]].
You can read more about getting a corpus and making a frequency list at the page [[Building_dictionaries|Building dictionaries]].


== Exclude the correctly spelled words ==
== Exclude the correctly spelled words ==
When you've got a nice long list of correctly spelled words to exclude, filter the expanded wordlist from Apertium. This is easy to accomplish with grep. The following command would for instance filter the expanded the Swedish monolingual dictionary:
When you've got a nice long list of correctly spelled words to exclude, filter the expanded wordlist from Apertium. This is easy to accomplish with grep. The following command would for instance filter the expanded Swedish monolingual dictionary:


<pre>
cat swe.expanded |
grep -v -wFf top_frequency.sv |
cat swe.expanded | grep -v -wFf top_frequency.sv |
grep -v -wFf aspellwords.sv > swe.expanded.felstavade
grep -v -wFf aspellwords.sv > swe.expanded.felstavade
</pre>


Now you've got the suspected errors in the file swe.expanded.felstavade (felstavade = misspelled).
You would find the suspected errors in the file swe.expanded.felstavade (felstavade = misspelled).


Please change the file names above to what's appropriate in your case.
Please change the file names above to what's appropriate in your case.


== Spell-check the rest of the expanded dictionary ==
== Spell-check the rest of the expanded dictionary ==
The easiest way to quickly find the errors is to check the remaining words in a word processing program of your choice. When you find a misspelled word, try to figure out what's the ground form of the word. Look for it in the Apertium monodix and correct the entry. Very often the error is due to one of the following mistakes:
The easiest way to quickly find the errors is to check the remaining words in a word processing program of your choice. When you find a misspelled word, try to figure out what's the lexical form of the word. Look for it in the Apertium monodix and correct the entry. Very often the error is due to one of the following mistakes:


1. Wrong stem.
1. Wrong stem.

2. Wrong paradigm.
2. Wrong paradigm.

3. A new paradigm is needed.
3. A new paradigm is needed.


Take a word at a time. If you have difficulties to figure out the stem of a very strange misspelled word, try to search for it in the original expanded dictionary. You will find the other forms of the word close to the misspelled word. Sometimes this makes it clearer what word it actually should be. If you cannot figure out anyway: go to the next misspelled word. There will be plenty of errors to correct. Take the easy ones first!
Take a word at a time. If you have difficulties to figure out the lexical form of a very strange misspelled word, try to search for it in the original expanded dictionary. You will find the other forms of the word close to the misspelled word. Sometimes this makes it clearer what word it actually should be. If you cannot figure out anyway: go to the next misspelled word.


There will be plenty of errors to correct. Take the easy ones first!

== Option: Check for duplicate entries ==

It might happen that there are duplicate entries for the same word. You can easily find them if you make a frequency list out of the expanded dictionary. This will make a frequency list for the Swedish expanded dictionary:

<pre>
lt-expand *.swe.dix | grep -v ':[<>]:' | cut -f1 -d:|
fgrep -v 'NON_ANALYSIS' | sort |
uniq -c | sort -rn > swe.expanded.freq
</pre>

Start at the top and check if there might be duplicates. Please note two common false alarms:

1. Some forms of a word might be similar, causing a high frequency.

2. Some similar words are actually forms of two different words that happen to have the same spelling.


--[[User:Tunedal|Tunedal]] ([[User talk:Tunedal|talk]]) 16:24, 11 February 2015 (CET)
--[[User:Tunedal|Tunedal]] ([[User talk:Tunedal|talk]]) 16:24, 11 February 2015 (CET)

==See also==

* [[Lttoolbox#Expansion]]
*[[Contributing_to_an_existing_pair#Detecting_errors|Contributing to an existing pair#Detecting errors]]
*[[Building dictionaries]]
*[[Apertium New Language Pair HOWTO]]
*[[Contributing to an existing pair]]


[[Category:Documentation]]
[[Category:Documentation]]
[[Category:Documentation in English]]
[[Category:HOWTO]]
[[Category:Writing dictionaries]]
[[Category:Quickstart]]

Latest revision as of 13:14, 15 March 2015

En français

Introduction

It's easy to commit errors when creating or editing dictionaries. This page presents an easy way to find many serious errors.

Summary

  • First expand the monodix. This will produce a very large file.
  • Continue by making a list of correctly spelled words and exclude them from the expanded dictionary.
  • Finally check the remaining words in a word processing program of your choice to quickly find the errors. Open the original dictionary file in an editor and correct the errors you have found.
  • Option: Check for duplicate entries.

Expand the monodix

Move to the folder where the dictionary is kept. The following command expands the dictionary, i.e. creates all forms of every word according to its assigned paradigm. Only the forms that are not marked with an LR or RL tag are kept, and the erroneous entries caused by a long-known bug (NON_ANALYSIS) are filtered out. The example below expands the Swedish monolingual dictionary:

lt-expand apertium-swe.swe.dix | grep -v ':[<>]:' | cut -f1 -d:| fgrep -v 'NON_ANALYSIS' > swe.expanded

Adapt the command to the dictionary you would like to correct: replace "apertium-swe.swe.dix" with the name of your dictionary and change the output name "swe.expanded" to something appropriate.
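
To see what each stage of the filter does without running lt-expand, the sketch below feeds a few made-up lines through the same chain. The lines only imitate the surface:analysis shape of lt-expand output; the tags and the position of the NON_ANALYSIS marker are assumptions, and the sample file names are hypothetical. grep -F is used in place of fgrep, which does the same thing.

```shell
# Made-up lines in the surface:analysis shape that lt-expand prints.
# The tags and the look of the NON_ANALYSIS line are assumptions.
cat > sample.expanded.raw <<'EOF'
hus:hus<n><nt><sg><ind>
huset:hus<n><nt><sg><def>
husen:>:hus<n><nt><pl><def>
huus:<:hus<n><nt><sg><ind>
NON_ANALYSIS:hus<n>
EOF

# Same filter chain as above, reading the sample file instead of lt-expand:
# 1. drop lines marked :>: or :<: (direction-restricted entries)
# 2. keep only the surface form before the first colon
# 3. drop anything containing NON_ANALYSIS
grep -v ':[<>]:' sample.expanded.raw | cut -f1 -d: |
grep -vF 'NON_ANALYSIS' > sample.expanded

cat sample.expanded
```

Only the two unrestricted surface forms survive, one per line, which is the shape the later filtering steps expect.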

Make a list of correctly spelled words

The expanded word list is a very large haystack in which to look for needles. To make the task easier, you want to get rid of as much hay as possible without throwing away any needles. An easy way is to drop all words that are spelled correctly. This can be done by filtering the expanded list against a list of correctly spelled words.

You can get a list of correct words from Aspell. The following command gets a list of Swedish words:

aspell -d sv dump master | aspell -l sv expand > aspellwords.sv

Just change the language code to the language you are working with. For English, for example, it would be:

aspell -d en dump master | aspell -l en expand > aspellwords.en

You'll find more info in the Aspell manual if needed.
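
Depending on the Aspell version and options, the expand step may print a root word and its expanded forms on the same line. That is an assumption worth checking against your own aspellwords file, because the filtering step further down works best with one word per line. A minimal sketch for splitting such lines, using made-up sample data and hypothetical file names:

```shell
# Made-up sample standing in for Aspell output with several forms per line.
printf 'hus huset husen\nbil bilen\nhus\n' > sample.aspellwords

# Put every word on its own line, drop empty lines, remove duplicates.
tr ' ' '\n' < sample.aspellwords | grep -v '^$' |
sort -u > sample.aspellwords.oneperline

cat sample.aspellwords.oneperline
```

If your file already has one word per line, this step changes nothing apart from sorting the list.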

The list from Aspell is, however, rather short. You might find it useful to filter on more words. One way of getting more correctly spelled words is to use the top of a word frequency list built from a large corpus. Rationale: most people spell correctly most of the time, so highly frequent words are most probably spelled correctly. If they are not, they will probably be the new standard for spelling :-)

You can download a corpus from e.g. OPUS (http://opus.lingfil.uu.se/, Uppsala University, Sweden). Choose among Europarl, OpenOffice, OpenSubtitles etc. in many languages.

You can get a frequency list for instance with the following command:

cat my_swedish_corpus.txt | tr ' ' '\n' | 
tr '[:upper:]' '[:lower:]' | 
tr -d '[:punct:]' | grep -v '[^a-z]' | 
sort | uniq -c | sort -rn > frequency.sv

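Open the frequency list in an editor of your choice. Browsing the list you'll find that the frequency drops quickly. Simply delete all words that are not common and save the rest as e.g. top_frequency.sv (or whatever is appropriate). The file used for filtering should hold one word per line, without the counts. Change the command above to suit the name of your corpus and the language you are working with. Note that the grep -v '[^a-z]' step drops every word containing a character outside a to z, which in most locales includes Swedish words with å, ä or ö; widen the pattern if your language needs it.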
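
The trimming can also be done from the command line instead of in an editor. The sketch below is one possible way, not part of the original recipe: it keeps every word seen at least MIN_COUNT times and writes only the word, without the count. MIN_COUNT, its value and the sample file names are hypothetical; pick a cut-off after looking at your own list.

```shell
# Made-up frequency list in the "count word" shape that uniq -c produces.
# The third line has a count but no word, as happens with repeated spaces.
printf '   9000 och\n   7000 att\n   1200 \n     40 hus\n      1 huus\n' > sample.frequency

# Hypothetical cut-off: keep words seen at least this many times.
MIN_COUNT=10

# NF == 2 skips lines without a word; $2 is the word itself.
awk -v min="$MIN_COUNT" 'NF == 2 && $1 >= min {print $2}' \
    sample.frequency > sample.top_frequency

cat sample.top_frequency
```

Writing only the word matters because the next step uses each line of the file as a search pattern; a line that still carries its count would never match anything in the expanded dictionary.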

You can read more about getting a corpus and making a frequency list at the page Building dictionaries.

Exclude the correctly spelled words

When you've got a nice long list of correctly spelled words to exclude, filter the expanded wordlist from Apertium. This is easy to accomplish with grep. The following command would for instance filter the expanded Swedish monolingual dictionary:

cat swe.expanded | grep -v -wFf  top_frequency.sv |
grep -v -wFf  aspellwords.sv > swe.expanded.felstavade

You will find the suspected errors in the file swe.expanded.felstavade (felstavade = misspelled).

Please change the file names above to what's appropriate in your case.
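
One detail of the -w option is worth knowing: it matches a listed word anywhere inside a line, as long as it stands as a whole word. For multiword entries this means a misspelled entry is dropped as soon as one of its parts is in the list. The sketch below compares -w with -x, which only matches when the whole line equals a listed word. The data and file names are made up, and using -x is a suggestion rather than part of the original recipe.

```shell
# Made-up expanded forms: one correct, one misspelled, one misspelled multiword.
printf 'hus\nhuus\ntill exemepl\n' > sample.forms
# Made-up list of correctly spelled words.
printf 'hus\ntill\n' > sample.correct

# -w: "till exemepl" is dropped because "till" occurs in it as a whole word.
grep -v -wFf sample.correct sample.forms > sample.suspects.w

# -x: only lines that are exactly a listed word are dropped.
grep -v -xFf sample.correct sample.forms > sample.suspects.x

echo "with -w:"; cat sample.suspects.w
echo "with -x:"; cat sample.suspects.x
```

With -x the list of suspects is longer, but no needle is thrown away together with the hay.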

Spell-check the rest of the expanded dictionary

The easiest way to quickly find the errors is to check the remaining words in a word processing program of your choice. When you find a misspelled word, try to figure out what the lexical form of the word is. Look for it in the Apertium monodix and correct the entry. Very often the error is due to one of the following mistakes:

1. Wrong stem.

2. Wrong paradigm.

3. A new paradigm is needed.

Take one word at a time. If you have difficulty figuring out the lexical form of a very strange misspelled word, try searching for it in the original expanded dictionary. You will find the other forms of the word close to the misspelled word. Sometimes this makes it clearer what the word should actually be. If you still cannot figure it out, go on to the next misspelled word.

There will be plenty of errors to correct. Take the easy ones first!
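
Searching the expanded dictionary for a suspect form can be done with grep and a little context. The sketch below uses made-up lines in the surface:analysis shape, so the tags and the file names are assumptions; on real data you would search the unfiltered lt-expand output, where the analysis after the colon also shows the lexical form.

```shell
# Made-up unfiltered expansion; the entry "huus" has a wrong stem.
cat > sample.lookup.raw <<'EOF'
bil:bil<n><ut><sg><ind>
huus:huus<n><nt><sg><ind>
huuset:huus<n><nt><sg><def>
huusen:huus<n><nt><pl><def>
katt:katt<n><ut><sg><ind>
EOF

# Show the suspect form with one line of context before and after it.
grep -C 1 '^huuset:' sample.lookup.raw > sample.lookup.context

cat sample.lookup.context
```

The neighbouring forms share the same lexical form, which points to the entry to look up in the monodix.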

Option: Check for duplicate entries

It might happen that there are duplicate entries for the same word. You can easily find them if you make a frequency list out of the expanded dictionary. This will make a frequency list for the Swedish expanded dictionary:

lt-expand *.swe.dix | grep -v ':[<>]:' | cut -f1 -d:|
fgrep -v 'NON_ANALYSIS' | sort | 
uniq -c | sort -rn > swe.expanded.freq

Start at the top and check if there might be duplicates. Please note two common false alarms:

1. Several forms of the same word may be spelled identically, which gives a high count.

2. Some identical forms actually belong to two different words that happen to have the same spelling.
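
A variation that avoids both false alarms is to compare whole surface:analysis lines instead of surface forms only: a line that occurs twice with the same analysis is a stronger hint of a real duplicate. This is a sketch of that idea on made-up data, not part of the original recipe; the tags and file names are assumptions.

```shell
# Made-up unfiltered expansion.
# "hus" is both singular and plural (false alarm 1),
# "var" belongs to two different words (false alarm 2),
# and one "hus" line is repeated exactly (a likely duplicate entry).
cat > sample.dup.raw <<'EOF'
hus:hus<n><nt><sg><ind>
hus:hus<n><nt><pl><ind>
hus:hus<n><nt><sg><ind>
var:vara<vblex><past>
var:var<adv>
EOF

# uniq -d prints one copy of each line that occurs more than once.
sort sample.dup.raw | uniq -d > sample.dup.exact

cat sample.dup.exact
```

Only the exactly repeated line is reported; the two kinds of false alarm differ in their analysis and therefore do not show up.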

--Tunedal (talk) 16:24, 11 February 2015 (CET)

See also