Automatically generating compound bidix entries

Likely compound wordlist

Ideas: Remove words that have >1 capital letter, remove words with < 7 letters.

$ tail unk.is.txt 
Þýðingarstofu
þynnist
þynntur
þyrluflug
þyrluflugmaður
þyrlupall
þyrluslysi
þyrptust
þýskukennari
þýskumælandi

Decompound them with lttoolbox

$ cat dev/unk.is.txt | apertium-destxt | lt-proc -e is-en.cmpd.bin | apertium-retxt  > dev/unk.a.txt

Note: If you are using regular expressions (e.g. in is-en for proper name recognition) it might be an idea to disable these as sometimes they mess up formatting.

Check to see you get the same number of lines:

$ wc -l unk.*
  17181 unk.a.txt
  17181 unk.is.txt
  34362 total

Remove words which still don't have analyses

$ paste unk.is.txt unk.a.txt  | grep -v '*' > unk.is-a.txt

Generate correspondences

Between original surface form and analyses:

$ cat > correspondence.sh
for i in `cat $1 | sed 's/\t/;/g' | sed 's/ /_/g'`; do 
        surface=`echo $i | cut -f1 -d';'`;
        analyses=`echo $i | sed 's/\$//g'| cut -f2 -d';' | cut -f2- -d'/' | sed 's/\^//g' | sed 's/\$//g' | sed 's/$//g'`;
        for j in `echo $analyses | sed 's/\//\n/g'`; do
                echo $j | grep '\$' > /dev/null
                if [ $? -eq 1 ]; then
                        echo -e "["$surface"].[]\t^"$j"\$";
                else
                        echo -e "["$surface"].[]\t^"$j;
                fi
        done;
done


$ sh correspondence.sh unk.is-a.txt > unk.1line.txt

As we are considering all possible examples there will be many bad analyses along with good ones:

Good "bardagaíþróttir" = "combat sports":

[bardagaíþróttir].[]	^bardagi<n><m><sg><gen><def>+íþrótt<n><f><pl><nom><ind>$

Bad: "bikarmeistara" != "tar master":

[bikarmeistara].[]	^bika<vblex><actv><pri><p3><sg>+meistari<n><m><sg><acc><ind>$

So, in order to try and sift out the good compounds from the bad, we can try and translate them and rank them with a language model.

Automatically generating compound bidix entries

Contents

Likely compound wordlist

Decompound them with lttoolbox

Remove words which still don't have analyses

Generate correspondences

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools