Difference between revisions of "Asturian"

From Apertium
m (SVN -> Github links)
 
{{TOCD}}
==Planning==
 
   
==External links==
;Milestones
# '''Prerequisites'''
## <s>Frequency-ordered wordlist</s>
#:http://xixona.dlsi.ua.es/~fran/asturian.freqlist.txt
# '''Asturian morphological analyser'''
## High-frequency closed categories (freq. >= 50/23000): pronouns, determiners, prepositions, etc.
## The 2,000 highest-frequency words from each open category (noun, verb, adjective, adverb)
#::''With ~8,000 high-frequency words, we should have >85% coverage on open-domain text.''
## Frequent multiwords
# '''Bilingual dictionary'''
## Translations of each Asturian word into Spanish (taking frequency and generality into account when translating ambiguous pairs)
## Translations of Asturian multiwords
# '''Tagger'''
## Identify useful restrictions
## Train a tagger in an unsupervised manner on an Asturian corpus
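The coverage target in the milestones can be checked against a frequency list. The following is only a sketch: it assumes a list with one <code>count word</code> pair per line, sorted by descending count (the layout of the list linked above is not stated here), and <code>coverage_at</code> is a hypothetical helper name.

```shell
# Percentage of corpus tokens covered by the N most frequent words.
# Assumption: each line of the list is "count word", most frequent first.
coverage_at() {
    # $1 = number of top-ranked words, $2 = frequency list file
    LC_ALL=C awk -v n="$1" '
        { total += $1; if (NR <= n) covered += $1 }
        END { if (total > 0) printf "%.1f\n", 100 * covered / total }
    ' "$2"
}
```

For example, <code>coverage_at 8000 asturian.freqlist.txt</code> would print the share of tokens covered by the top 8,000 entries. This measures coverage on the corpus the list was built from, which may be optimistic for open-domain text.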
 
 
;Tasks
# Checking automatically generated lemma-paradigm pairs
# Identifying frequent multiwords that cannot be translated word for word between Asturian and Spanish
# Identifying constraint/restriction rules for ambiguous sequences of words
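As a starting point for the multiword task, candidate multiwords can be ranked by how often two adjacent words co-occur. This is a rough sketch, not an established procedure for this pair: <code>top_bigrams</code> is a hypothetical helper, the corpus is assumed to be plain text, and punctuation and sentence boundaries are not handled.

```shell
# List the most frequent adjacent word pairs in a plain-text corpus.
# $1 = corpus file, $2 = how many pairs to show (default 20)
top_bigrams() {
    tr -s '[:space:]' '\n' < "$1" |
        awk '$0 == "" { next }
             have_prev { print prev " " $0 }
             { prev = $0; have_prev = 1 }' |
        sort | uniq -c | sort -k1,1nr -k2 | head -n "${2:-20}"
}
```

The output only suggests candidates. Whether a pair can be translated word for word into Spanish still has to be judged by hand.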
 
 
==Resources==
 
 
* Asturian Wiktionary: 120 nouns with genders and plural forms
:All pages were retrieved, converted into [[speling format]], and used to derive paradigms.
* Asturian Wikipedia:
 
<pre>
# Break the corpus so that each line starts with a determiner
# (GNU sed: \n in the replacement inserts a newline)
sed -e 's/el /\nel /g' -e 's/la /\nla /g' -e 's/lo /\nlo /g' \
    -e 's/las /\nlas /g' -e 's/les /\nles /g' -e 's/los /\nlos /g' ast.crp.txt > ast.dets.txt

# Keep only the lines that start with a determiner followed by a space
grep -E '^(el|la|lo|las|les|los) ' ast.dets.txt > dets.txt

# Feminine plural determiner followed by one word (hopefully a noun)
grep '^les ' dets.txt | grep -v 'les y' | cut -d' ' -f1,2 | sort -u > det.les.txt

# Feminine singular determiner followed by one word (hopefully a noun);
# the trailing space keeps "las ..." lines out of this file
grep '^la ' dets.txt | grep -v 'la súa' | cut -d' ' -f1,2 | sort -u > det.la.txt

# Combine the two previous files
cat det.la.txt det.les.txt > det.la_les.txt

# Get extract-style paradigms from the existing dictionary
python /home/fran/scripts/apertium2extract.py /home/fran/svnroot/apertium/trunk/incubator/apertium-es-ast.ast.dix > EXT.PDMS.FEM.TXT

# Apply extract to the wordlist, which (hopefully) holds only singular and plural feminine nouns
extract -nobad -utf8 -e -u -id EXT.PDMS.FEM.TXT det.la_les.txt | awk -F' ' '{print $2"; "$1"; "$3}' | sort -u > extract.la_les.out.txt

# Keep the lines where both singular and plural were found
grep ',' extract.la_les.out.txt > EX.txt

# Reorder the fields: forms, paradigm, stem
sed 's/;/\t/g' EX.txt | awk '{print $3"; "$2"; "$1}' | sed 's/;/\t\t\t/g'
</pre>
 
 
;Example output
 
 
<pre>
(abdicación,abdicaciones) espresi/ón__n abdicaci
(abegosa,abegoses) páxin/a__n abegos
(abeya,abeyes) páxin/a__n abey
(Academia,Academias) imaxe__n Academia
(academia,academies) páxin/a__n academi
(acción,acciones) espresi/ón__n acci
...
</pre>
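Once checked, lines like these could be turned into monolingual dictionary entries. The sketch below assumes three whitespace-separated fields (forms, paradigm, stem) and takes the first form in the parentheses as the lemma. <code>to_dix_entries</code> is a hypothetical helper, not part of any Apertium tool.

```shell
# Convert "(sg,pl) paradigm stem" lines on stdin into dix <e> entries.
# Assumption: the first form inside the parentheses is the lemma.
to_dix_entries() {
    awk '
        NF == 3 {
            forms = $1
            gsub(/[()]/, "", forms)      # "(sg,pl)" -> "sg,pl"
            split(forms, f, ",")
            printf "<e lm=\"%s\"><i>%s</i><par n=\"%s\"/></e>\n", f[1], $3, $2
        }
    '
}
```

Lines that do not have exactly three fields, such as the trailing <code>...</code>, are skipped. The generated entries still need the manual check listed under Tasks before they go into the dictionary.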
 
   
* [https://github.com/apertium/apertium-ast Asturian Data: apertium-ast]
* [https://github.com/apertium/apertium-spa-ast Spanish-Asturian Pair: apertium-spa-ast]
   
 
[[Category:Languages]]
 
[[Category:Romance languages]]

Latest revision as of 22:19, 25 October 2018