Asturian
Planning
- See also: Building dictionaries and Apertium New Language Pair HOWTO
- Overview
- Pre-requisites
Frequency ordered wordlist
- Asturian morphological analyser
- All of the high frequency closed categories (freq. >= 50/23000) — pronouns, determiners, prepositions, etc.
- 2,000 highest frequency words from each open category (noun, verb, adjective, adverb)
- With ~8,000 high frequency words, we should have >85% coverage on open-domain text.
- Adding frequent multiwords
- Bilingual dictionary
- Translations of each Asturian word into Spanish (taking into account frequency and generality in translating ambiguous pairs)
- Translations of Asturian multiwords
- POS Tagger
- Identify useful restrictions (e.g. a determiner cannot follow a determiner, so "the<det> a<det> cat<noun>" is invalid).
- Train a tagger in an unsupervised manner on an Asturian corpus.
- Tasks
- Checking automatically generated lemma-paradigm pairs.
- Creating an Asturian-Spanish translation dictionary
- Identifying frequent multiwords which cannot be translated word-for-word between Asturian and Spanish
- Identifying constraint/restriction rules for ambiguous sequences of words.
On top of this, at least one or two people should become familiar with how Apertium works, for example by looking at an existing language pair (apertium-es-ca, apertium-es-gl, apertium-es-pt, etc.) to see how it works and how things in there might apply to, or be adapted for, apertium-es-ast.
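The frequency-ordered wordlist named in the pre-requisites could be produced roughly as follows. This is only a sketch: the function name `freqlist` is made up for illustration, and the tokenisation is deliberately crude (it splits on whitespace and ASCII punctuation, so apostrophised forms are split in two and case is left untouched).

```shell
# Hypothetical helper: read plain text on stdin and print a
# frequency-ordered wordlist as "count<TAB>word", most frequent first.
freqlist() {
    tr -s '[:space:][:punct:]' '\n' |   # one token per line (ASCII punctuation only)
    grep -v '^$' |                      # drop a possible empty first line
    sort | uniq -c |                    # count identical tokens
    sort -k1,1nr -k2,2 |                # highest count first, ties by word
    awk '{ print $1 "\t" $2 }'          # normalise uniq's padded output
}

# Assumed usage, with the corpus file used elsewhere on this page:
#   freqlist < ast-tagger-data/ast.crp.txt | head -8000
```

Cutting the list with `head` gives a candidate set of high-frequency words to check against the coverage targets above.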
Calculating coverage
# Compile the dictionary
$ lt-comp lr apertium-es-ast.ast.dix ast-es.automorf.bin apertium-es-ast.ast.acx
apostrophes@postblank 13 15
apostrophes@preblank 7 6
main@standard 3864 8604

# Calculate the number of tokenised words in the corpus
$ cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | wc -l
954464

# Calculate the number of words that are not unknown
$ cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | grep -v '\*' | wc -l
489819

# Calculate the coverage
$ calc 489819/954464*100
~51.31875062862507124417

# Show the top-ten unknown words.
$ cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | grep '\*' | sort -f | uniq -c | sort -gr | head -10
3899 ^nun/*nun$
2662 ^se/*se$
2342 ^sos/*sos$
1529 ^ta/*ta$
1458 ^tien/*tien$
1398 ^parte/*parte$
1371 ^s/*s$
1298 ^nome/*nome$
1105 ^primer/*primer$
1060 ^sieglu/*sieglu$
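The two counts and the division above can also be done in a single pass over the analysed corpus. The sketch below assumes one analysed token per line, as produced by the `sed` step in the commands above; the function name `coverage` is hypothetical.

```shell
# Hypothetical helper: read one analysed token per line on stdin
# (e.g. "^nun/*nun$") and print "total known coverage%".
coverage() {
    awk '
        /\*/ { unknown++ }      # lt-proc marks unknown words with "*"
             { total++ }
        END {
            if (total == 0) { print "no tokens"; exit 1 }
            known = total - unknown
            printf "%d %d %.2f\n", total, known, known / total * 100
        }'
}

# Assumed usage, reusing the pipeline from this section:
#   cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin \
#     | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | coverage
```

For the counts reported above (954464 tokens, 489819 known), the percentage rounds to 51.32.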
Resources
- Asturian Wiktionary — 120 nouns + genders + plural forms
- Retrieved all pages, converted into speling format, and derived paradigms.
- Asturian Wikipedia:
# Make the file so that each line starts with a determiner
cat ast.crp.txt | sed 's/el /\nel /g' | sed 's/la /\nla /g' | sed 's/lo /\nlo /g' | sed 's/las /\nlas /g' | sed 's/les /\nles /g' | sed 's/los /\nlos /g' > ast.dets.txt

# Grep out the determiners
cat ast.dets.txt | grep -e '^el' -e '^la' -e '^lo' -e '^les' -e '^las' -e '^los' > dets.txt

# Grep out the lines starting with feminine determiners in plural followed by one word (hopefully a noun)
cat dets.txt | grep '^les' | sort | grep -v 'les y' | cut -f1,2 -d' ' | sort -u > det.les.txt

# Grep out the lines starting with feminine determiners in singular followed by one word (hopefully a noun)
cat dets.txt | grep '^la' | sort | grep -v 'la súa' | cut -f1,2 -d' ' | sort -u > det.la.txt

# Combine the two previous files
cat det.la.txt det.les.txt > det.la_les.txt

# Get extract style paradigms from existing dictionary
python /home/fran/scripts/apertium2extract.py /home/fran/svnroot/apertium/trunk/incubator/apertium-es-ast.ast.dix > EXT.PDMS.FEM.TXT

# Apply extract to the wordlist (hopefully) with only singular+plural feminine nouns
extract -nobad -utf8 -e -u -id EXT.PDMS.FEM.TXT det.la_les.txt | awk -F' ' '{print $2"; "$1"; "$3}' | sort -u > extract.la_les.out.txt

# Grep out the lines where both singular + plural were found
cat extract.la_les.out.txt | grep ',' > EX.txt

# Re-organise lines
cat EX.txt | sed 's/;/\t/g' | awk '{print $3"; "$2"; "$1}' | sed 's/;/\t\t\t/g'
- Example output
(abdicación,abdicaciones) espresi/ón__n abdicaci
(abegosa,abegoses) páxin/a__n abegos
(abeya,abeyes) páxin/a__n abey
(Academia,Academias) imaxe__n Academia
(academia,academies) páxin/a__n academi
(acción,acciones) espresi/ón__n acci
...
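To illustrate what a match such as "(abeya,abeyes) páxin/a__n abey" means, the sketch below pairs singular and plural forms for a single hard-coded pattern (singular in -a, plural in -es). It is not what the `extract` tool does internally; it only mimics the idea for one paradigm, and the function name `pair_a_es` is made up for this example.

```shell
# Hypothetical helper: read "la X" / "les Y" lines on stdin (the format of
# det.la_les.txt above) and print "(singular,plural)<TAB>stem" whenever
# both stem+"a" and stem+"es" were seen.
pair_a_es() {
    awk '
        $1 == "la"  && $2 ~ /a$/  { sg[substr($2, 1, length($2) - 1)] = $2 }
        $1 == "les" && $2 ~ /es$/ { pl[substr($2, 1, length($2) - 2)] = $2 }
        END {
            for (stem in sg)
                if (stem in pl)
                    printf "(%s,%s)\t%s\n", sg[stem], pl[stem], stem
        }' | sort
}

# Assumed usage:
#   pair_a_es < det.la_les.txt
```

Requiring both forms to be attested is what keeps the candidate list reasonably clean: a word seen only after "la" or only after "les" is not paired, which corresponds to the step above that keeps only lines where both singular and plural were found.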