Difference between revisions of "Asturian"

From Apertium

Revision as of 08:04, 26 July 2010

Planning

See also: Building dictionaries and Apertium New Language Pair HOWTO
Overview
  1. Pre-requisites
    1. Frequency ordered wordlist
    http://xixona.dlsi.ua.es/~fran/wordlists/asturian.freqlist.txt
  2. Asturian morphological analyser
    1. All of the high frequency closed categories (freq. >= 50/23000) — pronouns, determiners, prepositions, etc.
    2. 2,000 highest frequency words from each open category (noun, verb, adjective, adverb)
      With ~8,000 high frequency words, we should have >85% coverage on open-domain text.
    3. Adding frequent multiwords
  3. Bilingual dictionary
    1. Translations of each Asturian word into Spanish (taking into account frequency and generality in translating ambiguous pairs)
    2. Translations of Asturian multiwords
  4. POS Tagger
    1. Identify useful restrictions (e.g. a determiner cannot follow a determiner, so "the<det> a<det> cat<noun>" is invalid).
    2. Train a tagger in an unsupervised manner on an Asturian corpus.
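
The coverage target in step 2 can be sanity-checked against a frequency list before any dictionary work starts. The following is a minimal sketch, assuming the list has one "COUNT WORD" pair per line (the layout `uniq -c` produces); the actual layout of asturian.freqlist.txt is not stated here, and the function name `topn_coverage` is made up for illustration.

```shell
#!/bin/sh
# Estimate what share of all corpus tokens the N most frequent words cover.
# Assumption: stdin holds one "COUNT WORD" pair per line (as from `uniq -c`).
# topn_coverage is a hypothetical helper, not part of Apertium.
topn_coverage() {
    n="$1"
    # Sort by descending count, then sum the first N counts and the grand total.
    sort -k1,1nr | awk -v n="$n" '
        { total += $1; if (NR <= n) top += $1 }
        END { if (total > 0) printf "%.2f\n", 100 * top / total }'
}

# Toy frequency list: the 2 most frequent words cover 80 of 100 tokens.
printf '50 la\n30 de\n15 y\n5 sieglu\n' | topn_coverage 2
```

On a real list, `topn_coverage 8000 < freqlist.txt` would give a rough figure for the share of running text that 8,000 entries could cover, assuming every one of them is analysed correctly.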
Tasks
  1. Checking automatically generated lemma-paradigm pairs.
  2. Creating an Asturian-Spanish translation dictionary.
  3. Identifying frequent multiwords which cannot be translated word-for-word between Asturian and Spanish.
  4. Identifying constraint/restriction rules for ambiguous sequences of words.

On top of this, at least one or two people should become familiar with how Apertium works, for example by looking at an existing language pair (apertium-es-ca, apertium-es-gl, apertium-es-pt, etc.) to see how it works and how its contents might apply to, or be adapted for, apertium-es-ast.


Calculating coverage

# Compile the dictionary
$ lt-comp lr apertium-es-ast.ast.dix ast-es.automorf.bin apertium-es-ast.ast.acx
apostrophes@postblank 13 15
apostrophes@preblank 7 6
main@standard 3864 8604

# Calculate the number of tokenised words in the corpus
$ cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin  | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | wc -l 
   954464

# Calculate the number of words that are not unknown
$ cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin  | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | grep -v '\*' | wc -l
   489819

# Calculate the coverage
$ calc 489819/954464*100
   ~51.31875062862507124417
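
The three counting steps above run the corpus through the analyser more than once. As a sketch, the same figures can be computed in a single awk pass over analyser output that has already been split to one lexical unit per line. It assumes unknown words carry the `/*` marker seen in the listings on this page; the function name `coverage` and the sample analyses in the demo are invented for illustration.

```shell
#!/bin/sh
# Compute coverage in one pass over analyser output, one lexical unit per line.
# Assumption: unknown words are marked with "/*", e.g. ^nun/*nun$.
# coverage is a hypothetical helper, not an Apertium tool.
coverage() {
    awk '
        NF == 0 { next }                    # skip empty lines
        { total++ }
        index($0, "/*") { unknown++ }       # unknown-word marker
        END {
            if (total == 0) { print "no tokens"; exit 1 }
            printf "%d tokens, %d known, %.2f%% coverage\n",
                   total, total - unknown, 100 * (total - unknown) / total
        }'
}

# Toy input: two known and two unknown units (the tags are made up).
printf '^nun/*nun$\n^la/el<det><def><f><sg>$\n^casa/casa<n><f><sg>$\n^sieglu/*sieglu$\n' | coverage
```

In practice the function would sit at the end of the `apertium-destxt | lt-proc | apertium-retxt | sed` pipeline in place of `wc -l`.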

# Show the top-ten unknown words.
$ cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin  | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | grep '\*' | sort -f | uniq -c | sort -gr | head -10
   3899 ^nun/*nun$
   2662 ^se/*se$
   2342 ^sos/*sos$
   1529 ^ta/*ta$
   1458 ^tien/*tien$
   1398 ^parte/*parte$
   1371 ^s/*s$
   1298 ^nome/*nome$
   1105 ^primer/*primer$
   1060 ^sieglu/*sieglu$
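
For feeding the unknown words back into dictionary work, a list of bare surface forms may be easier to handle than the `^form/*form$` units. The following sketch does the ranking in awk under the same assumption about the `/*` marker; `top_unknown` is a hypothetical name, and forms are counted exactly as written (no case folding).

```shell
#!/bin/sh
# Rank unknown surface forms by frequency.
# Assumption: input has one lexical unit per line and unknown words look
# like ^form/*form$. top_unknown is a hypothetical helper.
top_unknown() {
    n="${1:-10}"
    awk -F'/' '
        index($0, "/*") {
            sub(/^\^/, "", $1)      # drop the leading ^ from the surface form
            count[$1]++
        }
        END { for (w in count) print count[w], w }' |
    sort -k1,1nr -k2,2 | head -n "$n"
}

# Toy input: "nun" is unknown twice, "se" once, "la" is known.
printf '^nun/*nun$\n^se/*se$\n^nun/*nun$\n^la/el<det>$\n' | top_unknown 2
```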

Resources

  • Asturian Wiktionary: 120 nouns with genders and plural forms.
    All pages were retrieved, converted into speling format, and used to derive paradigms.
  • Asturian Wikipedia:
# Make the file so that each line starts with a determiner
cat ast.crp.txt  | sed 's/el /\nel /g' | sed 's/la /\nla /g' | sed 's/lo /\nlo /g' | sed 's/las /\nlas /g' | sed 's/les /\nles /g' | 
sed 's/los /\nlos /g' > ast.dets.txt
# Grep out the determiners
cat ast.dets.txt  | grep -e '^el' -e '^la' -e '^lo' -e '^les' -e '^las' -e '^los' > dets.txt
# Grep out the lines starting with feminine determiners in plural followed by one word (hopefully a noun)
# (the trailing space in '^les ' keeps out words that merely begin with "les")
cat dets.txt | grep '^les ' | grep -v 'les y' | cut -f1,2 -d' ' | sort -u > det.les.txt
# Grep out the lines starting with feminine determiners in singular followed by one word (hopefully a noun)
# (the trailing space in '^la ' keeps out lines starting with "las")
cat dets.txt | grep '^la ' | grep -v 'la súa' | cut -f1,2 -d' ' | sort -u > det.la.txt
# Combine the two previous files
cat det.la.txt det.les.txt > det.la_les.txt
# Get extract style paradigms from existing dictionary
python /home/fran/scripts/apertium2extract.py /home/fran/svnroot/apertium/trunk/incubator/apertium-es-ast.ast.dix  > EXT.PDMS.FEM.TXT
# Apply extract to the wordlist (hopefully) with only singular+plural feminine nouns
extract -nobad -utf8 -e -u -id EXT.PDMS.FEM.TXT det.la_les.txt  | awk -F' ' '{print $2"; "$1"; "$3}' | sort -u > extract.la_les.out.txt
# Grep out the lines where both singular + plural were found
cat extract.la_les.out.txt | grep ',' > EX.txt 
# Re-organise lines
cat EX.txt | sed 's/;/\t/g' | awk '{print $3"; "$2"; "$1}' | sed 's/;/\t\t\t/g'
Example output
(abdicación,abdicaciones)                        espresi/ón__n                           abdicaci
(abegosa,abegoses)                               páxin/a__n                              abegos
(abeya,abeyes)                                   páxin/a__n                              abey
(Academia,Academias)                             imaxe__n                                Academia
(academia,academies)                             páxin/a__n                              academi
(acción,acciones)                                espresi/ón__n                           acci
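
The determiner-splitting steps above (the sed, grep and cut chain that produces det.la_les.txt) can also be sketched as one awk pass over whitespace-separated tokens, which avoids splitting inside words that happen to end in "el", "la" and so on. This is an alternative under stated assumptions, not the procedure used to produce the output above: `det_pairs` is a hypothetical name, only the feminine determiners "la" and "les" are handled, and punctuation attached to a word is left in place.

```shell
#!/bin/sh
# Print unique "determiner word" pairs for the feminine determiners la/les.
# Assumption: tokens are separated by whitespace; punctuation is not stripped.
# det_pairs is a hypothetical helper.
det_pairs() {
    awk '{
        for (i = 1; i < NF; i++)
            if (($i == "la" || $i == "les") &&
                $(i + 1) != "y" && $(i + 1) != "súa")   # same exclusions as above
                print $i, $(i + 1)
    }' | sort -u
}

# Toy sentence: yields "la casa" and "les cases" once each.
printf 'la casa y les cases de la casa\n' | det_pairs
```

The output has the same "determiner word" shape as det.la_les.txt, so it could be passed to the extract step in the same way.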

...