Difference between revisions of "Asturian"

Revision as of 15:48, 12 June 2008

Tasks

Pre-requisites
1. ~~Frequency-ordered wordlist~~
http://xixona.dlsi.ua.es/~fran/asturian.freqlist.txt
Asturian morphological analyser
1. High frequency closed categories (freq. >= 50/23000) — pronouns, determiners, prepositions, etc.
2. 2,000 highest frequency words from each open category (noun, verb, adjective, adverb)
3. Adding frequent multiwords
Bilingual dictionary
1. Translations of each Asturian word into Spanish (taking into account frequency and generality in translating ambiguous pairs)
2. Translations of Asturian multiwords
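
The frequency cut-off for closed categories could be applied with a filter along these lines. This is only a sketch: it assumes the frequency list has one `<count> <word>` pair per line (the layout `uniq -c` produces) and that the threshold of 50 is a raw occurrence count. `high_freq` is a hypothetical helper name, and the sample counts are made up.

```shell
# Hypothetical helper: print the words whose count is at least $1.
# Assumes each input line looks like "<count> <word>" (as from `uniq -c`).
high_freq() {
    awk -v min="$1" '$1 + 0 >= min { print $2 }'
}

# Made-up sample counts, not taken from the real frequency list.
printf '%s\n' '  120 de' '   75 la' '   12 abeya' | high_freq 50
```

With the sample above, only `de` and `la` pass the threshold. The surviving words would still need to be sorted into categories by hand.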

Sub-tasks

Checking automatically generated lemma-paradigm pairs
1. Identifying frequent multiwords which cannot be translated word-for-word

Resources

Asturian Wiktionary — 120 nouns + genders + plural forms

Retrieved all pages, converted into speling format, and derived paradigms.
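
As an illustration of the conversion step, the sketch below turns noun, gender and plural triples into speling-style lines. The four-field layout (`lemma; surface form; number; part of speech`) is an assumption about the speling format, and the tab-separated input is a hypothetical intermediate file, not the actual Wiktionary data. `to_speling` is an illustrative name.

```shell
# Hypothetical helper: read "<noun>\t<gender>\t<plural>" lines and print
# one speling-style line for the singular and one for the plural.
# The "lemma; surface; number; pos" layout is assumed, not confirmed.
to_speling() {
    awk -F'\t' '{
        printf "%s; %s; sg; n.%s\n", $1, $1, $2
        printf "%s; %s; pl; n.%s\n", $1, $3, $2
    }'
}

# One sample noun with its gender and plural form.
printf 'abeya\tf\tabeyes\n' | to_speling
```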

Asturian Wikipedia:

# Split the corpus so that every determiner starts a new line (GNU sed; \b avoids matching inside words such as "papel ")
sed -E 's/\b(el|la|lo|las|les|los) /\n\1 /g' ast.crp.txt > ast.dets.txt
# Keep only the lines that start with a determiner followed by a space
grep -E '^(el|la|lo|les|las|los) ' ast.dets.txt > dets.txt
# Keep the lines starting with the feminine plural determiner, cut to the determiner plus one word (hopefully a noun)
grep '^les ' dets.txt | grep -Ev '^les y( |$)' | cut -f1,2 -d' ' | sort -u > det.les.txt
# Keep the lines starting with the feminine singular determiner, cut to the determiner plus one word (hopefully a noun)
grep '^la ' dets.txt | grep -Ev '^la súa( |$)' | cut -f1,2 -d' ' | sort -u > det.la.txt
# Combine the two previous files
cat det.la.txt det.les.txt > det.la_les.txt
# Get extract style paradigms from existing dictionary
python /home/fran/scripts/apertium2extract.py /home/fran/svnroot/apertium/trunk/incubator/apertium-es-ast.ast.dix  > EXT.PDMS.FEM.TXT
# Apply extract to the wordlist (hopefully) with only singular+plural feminine nouns
extract -nobad -utf8 -e -u -id EXT.PDMS.FEM.TXT det.la_les.txt  | awk -F' ' '{print $2"; "$1"; "$3}' | sort -u > extract.la_les.out.txt
# Grep out the lines where both singular + plural were found
grep ',' extract.la_les.out.txt > EX.txt
# Re-organise lines
cat EX.txt | sed 's/;/\t/g' | awk '{print $3"; "$2"; "$1}' | sed 's/;/\t\t\t/g'
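
The determiner-splitting steps above can also be sketched token by token, which sidesteps matches inside longer words. This is an alternative illustration, not the commands that were actually run: `det_pairs` is a hypothetical helper, it takes the determiners of interest as a space-separated argument, and it does not strip punctuation or filter out pairs such as `les y`.

```shell
# Hypothetical helper: print every "<determiner> <next word>" pair found
# in the input, where the determiners are given as one space-separated
# argument. Tokens are split on whitespace only.
det_pairs() {
    tr -s '[:space:]' '\n' | awk -v dets="$1" '
        BEGIN { n = split(dets, d, " "); for (i = 1; i <= n; i++) want[d[i]] = 1 }
        prev in want { print prev, $0 }   # previous token was a determiner
        { prev = $0 }
    ' | sort -u
}

# Sample sentence; prints the pairs for "la" and "les".
printf 'les abeyes y la abeya\n' | det_pairs 'la les'
```

For the sample sentence this yields `la abeya` and `les abeyes`, which is the shape of input the `extract` step expects in `det.la_les.txt`.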

Example output

(abdicación,abdicaciones)                        espresi/ón__n                           abdicaci
(abegosa,abegoses)                               páxin/a__n                              abegos
(abeya,abeyes)                                   páxin/a__n                              abey
(Academia,Academias)                             imaxe__n                                Academia
(academia,academies)                             páxin/a__n                              academi
(acción,acciones)                                espresi/ón__n                           acci

...
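
Once checked, lines like the ones above could be turned into dictionary entries. The sketch below assumes three whitespace-separated columns (form pair, paradigm, stem), takes the first form of each pair as the lemma, and emits the usual Apertium `<e lm="..."><i>...</i><par n="..."/></e>` entry shape. `to_dix` is a hypothetical helper name.

```shell
# Hypothetical helper: convert "(sg,pl) <paradigm> <stem>" lines into
# monolingual dictionary entries. The lemma is assumed to be the first
# form inside the parentheses.
to_dix() {
    awk '{
        lemma = $1
        sub(/^\(/, "", lemma)    # drop the opening parenthesis
        sub(/,.*$/, "", lemma)   # drop everything from the comma onwards
        printf "<e lm=\"%s\"><i>%s</i><par n=\"%s\"/></e>\n", lemma, $3, $2
    }'
}

# One line from the example output above.
printf '(abeya,abeyes)\tpáxin/a__n\tabey\n' | to_dix
```

For the sample line this prints `<e lm="abeya"><i>abey</i><par n="páxin/a__n"/></e>`. Entries generated this way still need the manual check listed under Sub-tasks, since a wrong paradigm guess passes through unchanged.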

@@ Line 7: / Line 7: @@
 ## High frequency closed categories (freq. >= 50/23000) &mdash; pronouns, determiners, prepositions, etc.
 ## 2,000 highest frequency words from each open category (noun, verb, adjective, adverb)
+## Adding frequent multiwords
-## Frequent multi-words
 # '''Bilingual dictionary'''
-## Translations of each Asturian word into Spanish
+## Translations of each Asturian word into Spanish (taking into account frequency and generality in translating ambiguous pairs)
-## Translations of Asturian multi-words
+## Translations of Asturian multiwords
 ;Sub-tasks
 # Checking automatically generated lemma-paradigm pairs
+## Identifying frequent multiwords which cannot be translated word-for-word
-# Producing
 ==Resources==
