Difference between revisions of "Automatically generating compound bidix entries"

Revision as of 13:53, 9 March 2010

Likely compound wordlist

Ideas: remove words that have more than one capital letter, and remove words with fewer than 7 letters.

$ tail unk.is.txt 
Þýðingarstofu
þynnist
þynntur
þyrluflug
þyrluflugmaður
þyrlupall
þyrluslysi
þyrptust
þýskukennari
þýskumælandi
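
The two filtering ideas above could be sketched as a small shell function. This is only an illustration: the name `filter_candidates` is made up here, the thresholds are the ones from the "Ideas" line, and the character classes only treat letters such as Þ correctly when the shell runs in a UTF-8 locale.

```shell
# Hypothetical helper (not part of Apertium): reads one word per line on stdin.
# Assumes a UTF-8 locale so that non-ASCII capitals and lengths are handled per character.
filter_candidates() {
    # Drop words that contain two or more capital letters.
    grep -v '[[:upper:]].*[[:upper:]]' |
    # Keep only words that are at least 7 characters long.
    grep -E '^.{7,}$'
}

# Possible usage (file names are assumptions):
#   filter_candidates < unk.all.txt > unk.is.txt
```

The order of the two filters does not matter; both are plain line filters, so further heuristics can be added as extra `grep` stages.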

Decompound them with lttoolbox

$ cat dev/unk.is.txt | apertium-destxt | lt-proc -e is-en.cmpd.bin | apertium-retxt  > dev/unk.a.txt

Note: If you use regular expressions in the dictionary (e.g. in is-en for proper name recognition), consider disabling them for this step, as they sometimes mess up the formatting.

Check that the input and output files have the same number of lines:

$ wc -l unk.*
  17181 unk.a.txt
  17181 unk.is.txt
  34362 total
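
If this check is run often, it can be scripted instead of read off by eye. The function below is a minimal sketch; the name `same_line_count` is an assumption made for this example.

```shell
# Hypothetical helper: succeeds (exit status 0) only if both files
# have the same number of lines.
same_line_count() {
    # tr strips the padding that some wc implementations add.
    a=$(wc -l < "$1" | tr -d ' ')
    b=$(wc -l < "$2" | tr -d ' ')
    [ "$a" -eq "$b" ]
}

# Possible usage:
#   same_line_count unk.is.txt unk.a.txt || echo "line counts differ" >&2
```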

Remove words which still don't have analyses

$ paste unk.is.txt unk.a.txt  | grep -v '*' > unk.is-a.txt
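
The filter works because lt-proc marks an unknown word by prefixing it with `*` in the analysis. A small sketch of the same step as a reusable function follows; the name `drop_unanalysed` and the sample lines in the comment are invented for illustration, and `-F` is used so that `*` is always treated as a literal character.

```shell
# Hypothetical helper: reads "surface<TAB>analysis" lines on stdin and
# drops every line whose analysis contains the unknown-word marker '*'.
drop_unanalysed() {
    grep -vF '*'
}

# Illustrative input (made up):
#   foo<TAB>^foo/*foo$            -> dropped
#   bar<TAB>^bar/ba<n>+r<n>$      -> kept
# Possible usage:
#   paste unk.is.txt unk.a.txt | drop_unanalysed > unk.is-a.txt
```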

Generate correspondences

Between original surface form and analyses:

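$ cat > correspondence.sh
# Input: one line per word, "surface<TAB>^surface/analysis1/analysis2$".
# Turn tabs into ';' and spaces into '_' so the for loop splits on lines only.
for i in `cat $1 | sed 's/\t/;/g' | sed 's/ /_/g'`; do
        surface=`echo $i | cut -f1 -d';'`;
        # Keep the analyses only: everything after the first '/' of the second column.
        analyses=`echo $i | cut -f2 -d';' | cut -f2- -d'/' | sed 's/\^//g'`;
        for j in `echo $analyses | sed 's/\//\n/g'`; do
                # Only the last analysis still ends in '$'; append it to the others.
                echo $j | grep '\$' > /dev/null
                if [ $? -eq 1 ]; then
                        echo -e "["$surface"].[]\t^"$j"\$";
                else
                        echo -e "["$surface"].[]\t^"$j;
                fi
        done;
done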


Run the script with bash, since it relies on echo -e, which a plain POSIX sh may not support:

$ bash correspondence.sh unk.is-a.txt > unk.1line.txt
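
The same correspondence step can also be sketched as a single awk pass, which avoids the nested loops and the dependence on `echo -e`. This is an alternative illustration, not the script used above: the function name `correspondences` is an assumption, and unlike the loop version it leaves spaces in the input untouched instead of turning them into `_`.

```shell
# Hypothetical alternative: reads "surface<TAB>^surface/analysis1/analysis2$"
# lines from the given files (or stdin) and prints one line per analysis.
correspondences() {
    awk -F '\t' '{
        lu = $2
        sub(/^\^/, "", lu)           # strip the leading ^ of the lexical unit
        sub(/\$$/, "", lu)           # strip the trailing $ of the lexical unit
        n = split(lu, parts, "/")    # parts[1] is the surface form, the rest are analyses
        for (k = 2; k <= n; k++)
            printf "[%s].[]\t^%s$\n", $1, parts[k]
    }' "$@"
}

# Possible usage:
#   correspondences unk.is-a.txt > unk.1line.txt
```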

Because every possible analysis is kept, there will be many bad analyses alongside the good ones:

Good: "bardagaíþróttir" = "combat sports":

[bardagaíþróttir].[]	^bardagi<n><m><sg><gen><def>+íþrótt<n><f><pl><nom><ind>$

Bad: "bikarmeistara" != "tar master":

[bikarmeistara].[]	^bika<vblex><actv><pri><p3><sg>+meistari<n><m><sg><acc><ind>$

So, to sift the good compounds from the bad, we can translate the candidates and rank the translations with a language model.

@@ Line 1: / Line 1: @@
+{{TOCD}}
-==Get a wordlist of unknown words likely to be compounds==
+==Likely compound wordlist==
 Ideas: Remove words that have >1 capital letter, remove words with < 7 letters.
@@ Line 18: / Line 18: @@
 </pre>
-==Run them through lttoolbox decompounding==
+==Decompound them with lttoolbox==
 <pre>
@@ Line 41: / Line 41: @@
 </pre>
-==Generate correspondences between original surface form and analyses==
+==Generate correspondences==
+Between original surface form and analyses:
 <pre>
@@ Line 65: / Line 67: @@
 As we are considering all possible examples there will be many bad analyses along with good ones:
-Good "bardagaíþróttir" = "battle sports":
+Good "bardagaíþróttir" = "combat sports":
 <pre>
 [bardagaíþróttir].[]	^bardagi<n><m><sg><gen><def>+íþrótt<n><f><pl><nom><ind>$
