Difference between revisions of "Asturian"

Revision as of 15:46, 12 June 2008

Tasks

Pre-requisites
1. ~~Frequency ordered wordlist~~

http://xixona.dlsi.ua.es/~fran/asturian.freqlist.txt

Asturian morphological analyser
1. High frequency closed categories (freq. >= 50/23000) — pronouns, determiners, prepositions, etc.
2. 2,000 highest frequency words from each open category (noun, verb, adjective, adverb)
3. Frequent multi-words
Bilingual dictionary
1. Translations of each Asturian word into Spanish
2. Translations of Asturian multi-words
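
The closed-category cut-off in the analyser task (freq. >= 50/23000) could be applied to the frequency list roughly as follows. This is only a sketch: it assumes each line of the list has the form `count word` (as produced by `sort | uniq -c`), which may not match the actual layout of asturian.freqlist.txt, it reads the threshold as a raw count of 50, and `min_freq_words` is a hypothetical helper name.

```shell
# Hypothetical helper: print the words whose count is at least $1 (default 50).
# Assumption: input lines look like "<count> <word>", leading spaces allowed.
min_freq_words() {
  awk -v min="${1:-50}" '$1 + 0 >= min { print $2 }'
}
```

Usage would be along the lines of `min_freq_words 50 < asturian.freqlist.txt`; the resulting words still have to be sorted into closed and open categories by hand.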

Sub-tasks

Checking automatically generated lemma-paradigm pairs
Producing

Resources

Asturian Wiktionary — 120 nouns + genders + plural forms

Retrieved all pages, converted into speling format, and derived paradigms.

Asturian Wikipedia:

# Break the corpus so that each line starts with a determiner (\b avoids matches inside words such as "papel"; needs GNU sed)
sed -E 's/\b(el|la|lo|las|les|los) /\n\1 /g' ast.crp.txt > ast.dets.txt
# Keep only the lines that start with a determiner
grep -E '^(el|la|lo|las|les|los) ' ast.dets.txt > dets.txt
# Take the lines starting with the feminine plural determiner and keep the determiner plus the next word (hopefully a noun), skipping "les y"
grep '^les ' dets.txt | grep -v -E '^les y( |$)' | cut -d' ' -f1,2 | sort -u > det.les.txt
# Same for the feminine singular determiner, skipping "la súa"
grep '^la ' dets.txt | grep -v 'la súa' | cut -d' ' -f1,2 | sort -u > det.la.txt
# Combine the two previous files
cat det.la.txt det.les.txt > det.la_les.txt
# Generate extract-style paradigms from the existing dictionary
python /home/fran/scripts/apertium2extract.py /home/fran/svnroot/apertium/trunk/incubator/apertium-es-ast.ast.dix  > EXT.PDMS.FEM.TXT
# Apply extract to the wordlist (hopefully) with only singular+plural feminine nouns
extract -nobad -utf8 -e -u -id EXT.PDMS.FEM.TXT det.la_les.txt  | awk -F' ' '{print $2"; "$1"; "$3}' | sort -u > extract.la_les.out.txt
# Keep only the lines where both singular and plural were found
grep ',' extract.la_les.out.txt > EX.txt
# Reverse the column order and separate the columns with tabs
sed 's/;/\t/g' EX.txt | awk '{print $3"; "$2"; "$1}' | sed 's/;/\t\t\t/g'

Example output

(abdicación,abdicaciones)                        espresi/ón__n                           abdicaci
(abegosa,abegoses)                               páxin/a__n                              abegos
(abeya,abeyes)                                   páxin/a__n                              abey
(Academia,Academias)                             imaxe__n                                Academia
(academia,academies)                             páxin/a__n                              academi
(acción,acciones)                                espresi/ón__n                           acci

...
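
For the sub-task of checking the automatically generated lemma-paradigm pairs, a first sanity check could look like the sketch below. It assumes that paradigm names follow the pattern visible in the rows above, where the text between `/` and `__` is the lemma ending (so `páxin/a__n` with stem `abegos` should give the lemma `abegosa`, and `imaxe__n` adds nothing to the stem). The function name `check_pairs` is hypothetical, and the plural form is not verified here.

```shell
# Hypothetical helper: read rows of "(singular,plural)  paradigm  stem" on stdin
# and flag rows where stem + lemma ending does not equal the singular form.
check_pairs() {
  awk '
  NF >= 3 {
    forms = $1
    gsub(/[()]/, "", forms)          # strip the surrounding parentheses
    split(forms, f, ",")             # f[1] = singular, f[2] = plural
    par = $2
    sub(/__.*/, "", par)             # drop the "__n" category suffix
    ending = ""
    slash = index(par, "/")
    if (slash > 0) ending = substr(par, slash + 1)
    expected = $3 ending             # stem plus lemma ending
    verdict = (f[1] == expected) ? "OK" : "CHECK"
    print verdict "\t" f[1] "\t" $2
  }'
}
```

Rows marked `CHECK` are the ones worth looking at by hand first; a row marked `OK` only means the lemma is consistent with the paradigm name, not that the paradigm is the right one for the word.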

@@ Line 1: / Line 1: @@
 ==Tasks==
-* '''Pre-requisites'''
+# '''Pre-requisites'''
-** Frequency ordered wordlist
+## <s>Frequency ordered wordlist</s>
+:::http://xixona.dlsi.ua.es/~fran/asturian.freqlist.txt
-* '''Asturian morphological analyser'''
+# '''Asturian morphological analyser'''
-** High frequency closed categories (freq. >= 50/23000) &mdash; pronouns, determiners, prepositions, etc.
+## High frequency closed categories (freq. >= 50/23000) &mdash; pronouns, determiners, prepositions, etc.
-** 2,000 highest frequency words from each open category (noun, verb, adjective, adverb)
+## 2,000 highest frequency words from each open category (noun, verb, adjective, adverb)
-** Frequent multi-words
+## Frequent multi-words
-* '''Bilingual dictionary'''
+# '''Bilingual dictionary'''
-** Translations of each Asturian word into Spanish
-** Translations of Asturian multi-words
+## Translations of each Asturian word into Spanish
+## Translations of Asturian multi-words
+;Sub-tasks
+# Checking automatically generated lemma-paradigm pairs
+# Producing
 ==Resources==
