Difference between revisions of "Asturian"
m (SVN -> Github links) |
|||
(3 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
{{TOCD}} |
{{TOCD}} |
||
− | ==Planning== |
||
− | {{see-also|Building dictionaries|Apertium New Language Pair HOWTO}} |
||
− | ;Overview |
||
+ | ==External links== |
||
− | # '''Pre-requisites''' |
||
− | ## <s>Frequency ordered wordlist</s> |
||
− | #:http://xixona.dlsi.ua.es/~fran/wordlists/asturian.freqlist.txt |
||
− | # '''Asturian morphological analyser''' |
||
− | ## All of the high frequency closed categories (freq. >= 50/23000) — pronouns, determiners, prepositions, etc. |
||
− | ## 2,000 highest frequency words from each open category (noun, verb, adjective, adverb) |
||
− | ##:''With ~8,000 high frequency words, we should have >85% coverage on open-domain text.'' |
||
− | ## Adding frequent multiwords |
||
− | # '''Bilingual dictionary''' |
||
− | ## Translations of each Asturian word into Spanish (taking into account frequency and generality in translating ambiguous pairs) |
||
− | ## Translations of Asturian multiwords |
||
− | # '''POS Tagger''' |
||
− | ## Identify useful restrictions (e.g. a determiner cannot follow a determiner, so "the<det> a<det> cat<noun>" is invalid). |
||
− | ## Train a tagger in an unsupervised manner on an Asturian corpus. |
||
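The coverage estimate in the overview can be checked against a frequency list. The following is a minimal sketch, assuming the list has one "count word" pair per line, sorted by descending count (the actual layout of asturian.freqlist.txt is not confirmed here). <code>topn_coverage</code> is a hypothetical helper name, not part of Apertium, and it measures coverage over all tokens in the list rather than per category.

```shell
# Hypothetical helper: share of all tokens covered by the N most frequent words.
# Assumed input on stdin: lines of "<count> <word>", sorted by descending count.
topn_coverage() {
    awk -v n="$1" '
        { total += $1; if (NR <= n) top += $1 }   # sum every count; sum the first n separately
        END {
            if (total == 0) exit 1                # empty or malformed input
            printf "%.2f\n", top / total * 100    # percentage of tokens covered
        }
    '
}
```

Under the same assumption about the file layout, it would be called as <code>topn_coverage 8000 < asturian.freqlist.txt</code>.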
− | |||
− | ;Tasks |
||
− | |||
− | # Checking automatically generated lemma-paradigm pairs. |
||
− | # Creating an Asturian-Spanish translation dictionary. |
||
− | # Identifying frequent multiwords which cannot be translated word-for-word between Asturian and Spanish |
||
− | # Identifying constraint/restriction rules for ambiguous sequences of words. |
||
− | |||
− | On top of this, at least one or two people should become familiar with how Apertium works, for example by studying an existing language pair (<code>apertium-es-ca</code>, <code>apertium-es-gl</code>, <code>apertium-es-pt</code>, etc.) and working out which parts could be applied to, or adapted for, <code>apertium-es-ast</code>. |
||
− | |||
− | |||
− | ==Calculating coverage== |
||
− | |||
− | <pre> |
||
− | # Compile the dictionary |
||
− | $ lt-comp lr apertium-es-ast.ast.dix ast.bin |
||
− | apostrophes@postblank 13 15 |
||
− | apostrophes@preblank 7 6 |
||
− | main@standard 3864 8604 |
||
− | |||
− | # Calculate the number of tokenised words in the corpus |
||
− | $ cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | wc -l |
||
− | 954464 |
||
− | |||
− | # Calculate the number of words that are not unknown |
||
− | $ cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | grep -v '\*' | wc -l |
||
− | 489819 |
||
− | |||
− | # Calculate the coverage |
||
− | $ calc 489819/954464*100 |
||
− | ~51.31875062862507124417 |
||
− | |||
− | # Show the top-ten unknown words. |
||
− | $ cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | grep '\*' | sort -f | uniq -c | sort -gr | head -10 |
||
− | 3899 ^nun/*nun$ |
||
− | 2662 ^se/*se$ |
||
− | 2342 ^sos/*sos$ |
||
− | 1529 ^ta/*ta$ |
||
− | 1458 ^tien/*tien$ |
||
− | 1398 ^parte/*parte$ |
||
− | 1371 ^s/*s$ |
||
− | 1298 ^nome/*nome$ |
||
− | 1105 ^primer/*primer$ |
||
− | 1060 ^sieglu/*sieglu$ |
||
− | </pre> |
||
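The two counting passes above can be folded into a single pass. This is a minimal sketch, assuming input with one lexical unit per line, as produced by the <code>sed</code> step above. <code>coverage</code> is a hypothetical helper, not an Apertium tool.

```shell
# Hypothetical helper: percentage of analysed tokens that are not unknown.
# Assumed input on stdin: one lexical unit per line, e.g. ^nun/*nun$
# Unknown words are the lines containing an asterisk, as in the grep steps above.
coverage() {
    awk '
        !/\*/ { known++ }                         # line has no "*": the word was analysed
        END {
            if (NR == 0) exit 1                   # no input, nothing to report
            printf "%.2f\n", known / NR * 100     # known tokens over all tokens
        }
    '
}
```

It would go at the end of the same pipeline in place of <code>wc -l</code>. For the counts shown above (489819 of 954464) the result comes to about 51.32.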
− | |||
− | ==Resources== |
||
− | |||
− | * Asturian Wiktionary — 120 nouns + genders + plural forms |
||
− | :Retrieved all pages, converted into [[speling format]], and derived paradigms. |
||
− | |||
− | * Asturian Wikipedia: |
||
− | <pre> |
||
− | # Make the file so that each line starts with a determiner |
||
− | cat ast.crp.txt | sed 's/el /\nel /g' | sed 's/la /\nla /g' | sed 's/lo /\nlo /g' | sed 's/las /\nlas /g' | sed 's/les /\nles /g' | |
||
− | sed 's/los /\nlos /g' > ast.dets.txt |
||
− | # Grep out the determiners |
||
− | cat ast.dets.txt | grep -e '^el' -e '^la' -e '^lo' -e '^les' -e '^las' -e '^los' > dets.txt |
||
− | # Grep out the lines starting with feminine determiners in plural followed by one word (hopefully a noun) |
||
− | cat dets.txt | grep '^les' | sort | grep -v 'les y' | cut -f1,2 -d' ' | sort -u > det.les.txt |
||
− | # Grep out the lines starting with feminine determiners in singular followed by one word (hopefully a noun) |
||
− | cat dets.txt | grep '^la' | sort | grep -v 'la súa' | cut -f1,2 -d' ' | sort -u > det.la.txt |
||
− | # Combine the two previous files |
||
− | cat det.la.txt det.les.txt > det.la_les.txt |
||
− | # Get extract style paradigms from existing dictionary |
||
− | python /home/fran/scripts/apertium2extract.py /home/fran/svnroot/apertium/trunk/incubator/apertium-es-ast.ast.dix > EXT.PDMS.FEM.TXT |
||
− | # Apply extract to the wordlist (hopefully) with only singular+plural feminine nouns |
||
− | extract -nobad -utf8 -e -u -id EXT.PDMS.FEM.TXT det.la_les.txt | awk -F' ' '{print $2"; "$1"; "$3}' | sort -u > extract.la_les.out.txt |
||
− | # Grep out the lines where both singular + plural were found |
||
− | cat extract.la_les.out.txt | grep ',' > EX.txt |
||
− | # Re-organise lines |
||
− | cat EX.txt | sed 's/;/\t/g' | awk '{print $3"; "$2"; "$1}' | sed 's/;/\t\t\t/g' |
||
− | </pre> |
||
− | |||
− | ;Example output |
||
− | |||
− | <pre> |
||
− | (abdicación,abdicaciones) espresi/ón__n abdicaci |
||
− | (abegosa,abegoses) páxin/a__n abegos |
||
− | (abeya,abeyes) páxin/a__n abey |
||
− | (Academia,Academias) imaxe__n Academia |
||
− | (academia,academies) páxin/a__n academi |
||
− | (acción,acciones) espresi/ón__n acci |
||
− | |||
− | ... |
||
− | </pre> |
||
+ | * [https://github.com/apertium/apertium-ast Asturian Data: apertium-ast] |
||
+ | * [https://github.com/apertium/apertium-spa-ast Spanish-Asturian Pair: apertium-spa-ast] |
||
[[Category:Languages]] |
[[Category:Languages]] |
||
+ | [[Category:Romance languages]] |
Latest revision as of 22:19, 25 October 2018