Difference between revisions of "Asturian"

Latest revision as of 22:19, 25 October 2018

@@ Line 1: / Line 1: @@
 {{TOCD}}
-==Planning==
-{{see-also|Building dictionaries|Apertium New Language Pair HOWTO}}
-;Overview
+==External links==
-# '''Pre-requisites'''
-## <s>Frequency ordered wordlist</s>
-#:http://xixona.dlsi.ua.es/~fran/wordlists/asturian.freqlist.txt
-# '''Asturian morphological analyser'''
-## All of the high frequency closed categories (freq. >= 50/23000) &mdash; pronouns, determiners, prepositions, etc.
-## 2,000 highest frequency words from each open category (noun, verb, adjective, adverb)
-##:''With ~8,000 high frequency words, we should have >85% coverage on open-domain text.''
-## Adding frequent multiwords
-# '''Bilingual dictionary'''
-## Translations of each Asturian word into Spanish (taking into account frequency and generality in translating ambiguous pairs)
-## Translations of Asturian multiwords
-# '''POS Tagger'''
-## Identify useful restrictions (e.g. a determiner cannot follow a determiner, so "the<det> a<det> cat<noun>" is invalid).
-## Train a tagger in an unsupervised manner on an Asturian corpus.
-;Tasks
-# Checking automatically generated lemma-paradigm pairs.
-# Creating an Asturian--Spanish translation dictionary.
-# Identifying frequent multiwords that cannot be translated word-for-word between Asturian and Spanish.
-# Identifying constraint/restriction rules for ambiguous sequences of words.
-On top of this, at least one or two people should become familiar with how Apertium works, for example by studying an existing language pair (<code>apertium-es-ca</code>, <code>apertium-es-gl</code>, <code>apertium-es-pt</code>, etc.) to see how it works and which parts might apply to, or be adapted for, <code>apertium-es-ast</code>.
-==Calculating coverage==
-<pre>
-# Compile the dictionary
-$ lt-comp lr apertium-es-ast.ast.dix ast.bin apertium-es-ast.ast.acx
-apostrophes@postblank 13 15
-apostrophes@preblank 7 6
-main@standard 3864 8604
-# Calculate the number of tokenised words in the corpus
-$ cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin  | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | wc -l
-# Calculate the number of words that are not unknown
-$ cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin  | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | grep -v '\*' | wc -l
-# Calculate the coverage
-$ calc 489819/954464*100
-   ~51.31875062862507124417
-# Show the ten most frequent unknown words.
-$ cat ast-tagger-data/ast.crp.txt | apertium-destxt | lt-proc ast.bin  | apertium-retxt | sed 's/\$\W*\^/$\n^/g' | grep '\*' | sort -f | uniq -c | sort -gr | head -10
-^nun/*nun$
-^se/*se$
-^sos/*sos$
-^ta/*ta$
-^tien/*tien$
-^parte/*parte$
-^s/*s$
-^nome/*nome$
-^primer/*primer$
-^sieglu/*sieglu$
-</pre>
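The coverage figure in the removed section is the number of lexical units whose analysis does not start with <code>*</code>, divided by the total number of lexical units in the <code>lt-proc</code> output. A minimal Python sketch of that arithmetic follows; the function name <code>coverage</code> and the sample stream are hypothetical, and escaped characters inside the stream are not handled.

```python
import re

# Each lexical unit in the analyser output looks like ^surface/analysis1/...$
# Unknown words carry a single analysis that starts with "*".
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(stream):
    """Return (known, total, percentage) for an analysed text stream."""
    total = 0
    known = 0
    for _surface, analyses in LEXICAL_UNIT.findall(stream):
        total += 1
        if not analyses.startswith("*"):
            known += 1
    percentage = 100.0 * known / total if total else 0.0
    return known, total, percentage


if __name__ == "__main__":
    # Hypothetical sample; real input would be the lt-proc output for the corpus.
    sample = (
        "^el/el<det><def><m><sg>$ ^sieglu/*sieglu$ "
        "^ta/*ta$ ^casa/casa<n><f><sg>$"
    )
    known, total, percentage = coverage(sample)
    print(f"{known}/{total} = {percentage:.2f}%")
```

With the counts quoted above (489819 known out of 954464 tokens) the same formula gives roughly 51.32%.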
-==Resources==
-* Asturian Wiktionary &mdash; 120 nouns + genders + plural forms
-:Retrieved all pages, converted into [[speling format]], and derived paradigms.
-* Asturian Wikipedia:
-<pre>
-# Split the corpus so that each line starts with a determiner
-cat ast.crp.txt  | sed 's/el /\nel /g' | sed 's/la /\nla /g' | sed 's/lo /\nlo /g' | sed 's/las /\nlas /g' | sed 's/les /\nles /g' |
-sed 's/los /\nlos /g' > ast.dets.txt
-# Grep out the determiners
-cat ast.dets.txt  | grep -e '^el' -e '^la' -e '^lo' -e '^les' -e '^las' -e '^los' > dets.txt
-# Grep out the lines starting with feminine determiners in plural followed by one word (hopefully a noun)
-cat dets.txt | grep '^les' | sort  | grep -v 'les y' | cut -f1,2 -d' ' | sort -u > det.les.txt
-# Grep out the lines starting with feminine determiners in singular followed by one word (hopefully a noun)
-cat dets.txt | grep '^la' | sort | grep -v 'la súa' | cut -f1,2 -d' ' | sort -u > det.la.txt
-# Combine the two previous files
-cat det.la.txt det.les.txt > det.la_les.txt
-# Generate extract-style paradigms from the existing dictionary
-python /home/fran/scripts/apertium2extract.py /home/fran/svnroot/apertium/trunk/incubator/apertium-es-ast.ast.dix  > EXT.PDMS.FEM.TXT
-# Apply extract to the wordlist (hopefully) with only singular+plural feminine nouns
-extract -nobad -utf8 -e -u -id EXT.PDMS.FEM.TXT det.la_les.txt  | awk -F' ' '{print $2"; "$1"; "$3}' | sort -u > extract.la_les.out.txt
-# Grep out the lines where both singular + plural were found
-cat extract.la_les.out.txt | grep ',' > EX.txt
-# Re-organise lines
-cat EX.txt | sed 's/;/\t/g' | awk '{print $3"; "$2"; "$1}' | sed 's/;/\t\t\t/g'
-</pre>
-;Example output
-<pre>
-(abdicación,abdicaciones)                        espresi/ón__n                           abdicaci
-(abegosa,abegoses)                               páxin/a__n                              abegos
-(abeya,abeyes)                                   páxin/a__n                              abey
-(Academia,Academias)                             imaxe__n                                Academia
-(academia,academies)                             páxin/a__n                              academi
-(acción,acciones)                                espresi/ón__n                           acci
-...
-</pre>
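Each row of the example output above holds a (singular,plural) pair, a paradigm name and a stem. As a rough sketch, such a row could be turned into a monolingual dictionary entry as follows. The helper name <code>to_dix_entry</code> is hypothetical, and the entry layout (<code>lm</code> attribute, <code>&lt;i&gt;</code> stem, <code>&lt;par&gt;</code> reference) is assumed to match what <code>apertium-es-ast.ast.dix</code> expects; this is not the script that was actually used.

```python
import re
from xml.sax.saxutils import escape

# One output row: "(singular,plural)   paradigm   stem", separated by whitespace.
ROW = re.compile(r"\((?P<sg>[^,]+),(?P<pl>[^)]+)\)\s+(?P<par>\S+)\s+(?P<stem>\S+)")


def to_dix_entry(row):
    """Convert one row of extract output into a dictionary <e> element.

    Returns None when the row does not have the expected three columns.
    """
    match = ROW.fullmatch(row.strip())
    if match is None:
        return None
    lemma = escape(match.group("sg"), {'"': "&quot;"})
    stem = escape(match.group("stem"))
    paradigm = escape(match.group("par"), {'"': "&quot;"})
    return f'<e lm="{lemma}"><i>{stem}</i><par n="{paradigm}"/></e>'


if __name__ == "__main__":
    rows = [
        "(abeya,abeyes)      páxin/a__n      abey",
        "(acción,acciones)   espresi/ón__n   acci",
    ]
    for row in rows:
        print(to_dix_entry(row))
```

The generated entries would still need manual checking, since the pairing of singular and plural forms is only heuristic.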
+* [https://github.com/apertium/apertium-ast Asturian Data: apertium-ast]
+* [https://github.com/apertium/apertium-spa-ast Spanish-Asturian Pair: apertium-spa-ast]
 [[Category:Languages]]
+[[Category:Romance languages]]
