Difference between revisions of "Sardo e italiano/Work plan"

Revision as of 03:19, 28 July 2016

People:

Gianfranco, Adrià, Hèctor, Fran, Mikel

Tasks:

Convert the spell checker into a .dix analyser (approx. 40k entries)

Can AGPL code be included in a GPL language pair? --Mlforcada (talk) 11:53, 28 June 2016 (CEST)

Import the words from the Region's glossary (6,425 entries): https://svn.code.sf.net/p/apertium/svn/incubator/apertium-srd-ita/dev/glossariu.ita-srd.nospaces.txt
~~Build a corpus of LSC Sardinian from Limbas e natziones~~ download here
Import the remaining words from Morph-it!
Fix verb enclitics.
Add ~15,000 words to the bilingual dictionary (to reach at least 20k LSC-Italian correspondences).
Fix proper nouns in the Italian dictionary (this probably has to be redone from scratch).
Review the entries in the bilingual dictionary and the Sardinian monolingual dictionary so that they follow LSC (category by category).
Work on transfer rules.
Work on disambiguation rules.
Work on lexical selection rules.
Compile a small LSC-Italian parallel corpus.
Run testvoc.
Carry out an evaluation (Wikipedia articles).
Write a paper (Linguamática? SEPLN?)

Throughput: approx. 1,000 bidix words/day.

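Weekly plan

| Week | Dates | Coverage | Testvoc | Eval. | Raw cov. srd (%) | Raw cov. ita (%) | Trimmed cov. srd→ita (%) | Trimmed cov. ita→srd (%) | WER srd→ita (%) | WER ita→srd (%) | Bidix | Err. srd→ita | Err. ita→srd | Done? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 April - 17 April | 74% | | 350 | 80.6 | 85.9 | 74.5 | 76.5 | 24.00 | 11.72 | 2,919 | | | ✓ |
| 1 | 18 April - 24 April | 76% | | | 80.6 | 85.9 | 77.9 | 77.8 | | | 7,106 | 109,489 | 60,296 | ✓ |
| 2 | 25 April - 1 May | 78% | | | 82.9 | 87.1 | 80.3 | 78.6 | | | 10,606 | 380,825 | 49,697 | ✓ |
| 3 | 2 May - 8 May | 80% | pr, cnj*, adv | 500 | 84.0 | 87.2 | 82.2 | 79.8 | 24.79 | 16.73 | 11,627 | 444,291 | 49,221 | ✓ |
| 4 | 9 May - 15 May | 80% | | | 85.8 | 88.2 | 82.3 | 81.1 | | | 11,778 | 467,068 | 149,773 | ✓ |
| 5 | 16 May - 22 May | 80.5% | | | 85.8 | 88.5 | 82.5 | 81.5 | | | 11,821 | 429,598 | 44,666 | ✓ |
| 6 | 23 May - 29 May | 81% | prn, det | | 85.8 | 88.5 | 82.5 | 81.5 | | | 11,725 | 376,283 | 7,936 | |
| 7 | 30 May - 5 June | 81.5% | | | 86.4 | 89.3 | 84.4 | 82.7 | | | 12,703 | 421,065 | | |
| 8 | 6 June - 12 June | 82% | | | 86.8 | 91.0 | 85.0 | 84.1 | | | 13,556 | 43,780 | | |
| 9 | 13 June - 19 June | 83% | | | 86.9 | 91.3 | 85.2 | 84.5 | | | 14,568 | 215,595 | 55,763 | |
| 10 | 20 June - 26 June | 84% | | | 86.9 | 91.3 | 85.2 | 84.5 | | | 16,471 | 147,160 | 11,039 | |
| 11 | 27 June - 3 July | 85% | n | 500 | 88.3 | 91.3 | 86.5 | 84.9 | 39.43 | 18.99 | 16,837 | | 10,524 | |
| 12 | 4 July - 10 July | 86% | | | 88.3 | 91.4 | 86.5 | 85.0 | | | 17,034 | 326,972 | 9,963 | |
| 13 | 11 July - 17 July | 87% | vblex | | | 91.5 | | 85.2 | | | 17,348 | 204,266 | 377 | |
| 14 | ~~18 July - 24 July~~ | 87% | | | 88.5 | 91.5 | 86.9 | 85.5 | | | 17,887 | 28,658 | 0 | |
| 15 | 25 July - 31 July | 88% | adj | | | | | 86.9 | | | | | | |
| 16 | 1 August - 7 August | 89% | | | | | | | | | | | | |
| 17 | 8 August - 14 August | 90% | | 2000 | | | | | | | | | | |
| 18 | 15 August - 21 August | 90% | | | | | | | | | | | | |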

How to calculate the numbers

Errors (calculate in apertium-srd-ita)

$ bash dev/testvoc/generation.sh srd-ita | wc -l 
$ bash dev/testvoc/generation.sh ita-srd | wc -l

Bidix (calculate in apertium-srd-ita)

$ cat apertium-srd-ita.srd-ita.dix | grep '<l' | wc -l
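
The command above counts lines containing `<l`, so it undercounts if several entries share a line. As a hedged alternative, the sketch below counts every left-side element instead. The function name `count_left_sides` is hypothetical, and it assumes a `grep` that supports `-o` (GNU and BSD grep do).

```shell
# count_left_sides FILE: count <l> (or empty <l/>) elements in a bidix,
# regardless of how many entries are written on each line.
# Hypothetical helper; assumes grep supports -o.
count_left_sides() {
    # -o prints each match on its own line, so wc -l counts matches, not lines.
    # The bracket expression keeps other tags starting with "<l" from matching.
    grep -o '<l[>/]' "$1" | wc -l | tr -d ' '
}
```

Usage, from the apertium-srd-ita directory: `count_left_sides apertium-srd-ita.srd-ita.dix`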

Trimmed coverage (calculate in apertium-srd-ita)

$ cat srd.crp.txt | apertium -d . srd-ita-morph | sed 's/\$\W*\^/$\n^/g' > /tmp/srd.trim.coverage.txt
$ calc `cat /tmp/srd.trim.coverage.txt | grep -v '\*' | wc -l `/`cat /tmp/srd.trim.coverage.txt | wc -l`

$ cat ita.crp.txt | apertium -d . ita-srd-morph | sed 's/\$\W*\^/$\n^/g' > /tmp/ita.trim.coverage.txt
$ calc `cat /tmp/ita.trim.coverage.txt | grep -v '\*' | wc -l `/`cat /tmp/ita.trim.coverage.txt | wc -l`

Raw coverage (calculate in apertium-srd, apertium-ita)

$ cat srd.crp.txt | apertium -d . srd-morph | sed 's/\$\W*\^/$\n^/g' > /tmp/srd.raw.coverage.txt
$ calc `cat /tmp/srd.raw.coverage.txt | grep -v '\*' | wc -l `/`cat /tmp/srd.raw.coverage.txt | wc -l`

$ cat ita.crp.txt | apertium -d . ita-morph | sed 's/\$\W*\^/$\n^/g' > /tmp/ita.raw.coverage.txt
$ calc `cat /tmp/ita.raw.coverage.txt | grep -v '\*' | wc -l `/`cat /tmp/ita.raw.coverage.txt | wc -l`
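
The coverage commands above all follow the same pattern: put one lexical unit per line, then divide the number of lines without `*` (the unknown-word mark) by the total. If `calc` is not installed, the same ratio can be computed with `awk`. This is a minimal sketch, not part of the original workflow. The function name `coverage_pct` is hypothetical, and it assumes the input file was already produced by the `sed` step above.

```shell
# coverage_pct FILE: print the percentage of lines that contain no '*'.
# Assumes FILE holds one lexical unit per line (output of the sed step);
# a '*' on a line marks an unknown word.
coverage_pct() {
    awk '
        { total++ }                 # every line is one lexical unit
        !/\*/ { known++ }           # no asterisk means the word was analysed
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * known / total
        }
    ' "$1"
}
```

Usage: `coverage_pct /tmp/srd.trim.coverage.txt` prints the trimmed srd→ita coverage as a percentage with two decimals.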

To build a reduced Italian corpus

wget https://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
wget https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
python3 WikiExtractor.py --infn itwiki-latest-pages-articles.xml.bz2 2>log 
cat -n wiki.txt | grep -P '^\s*\d*7\t' | cut -f2- > wiki.10pc.txt
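
The last step keeps only the lines whose line number ends in 7, which is one line in ten. The sketch below does the same selection with `awk`, without numbering the lines first. The function name `sample_tenth` is hypothetical, and `wiki.txt` is assumed to be the extractor's output, as in the commands above.

```shell
# sample_tenth FILE: print every line whose 1-based line number ends in 7,
# i.e. lines 7, 17, 27, ... (a 10% sample of the file).
sample_tenth() {
    awk 'NR % 10 == 7' "$1"
}
```

Usage: `sample_tenth wiki.txt > wiki.10pc.txt`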

@@ Line 62: / Line 62: @@
 | <s>14</s> || <s>18 julio&mdash;24 julio</s>   || 87% ||           ||        ||  || 88.5 || 91.5 || 86.9 ||  85.5 ||       ||   || 17,887 || 28,658 || 0 ||
 |-
-| 15      || 25 julio&mdash;31 julio   || 88%       ||  adj         ||        ||    ||     ||       ||     ||         ||       ||   || || || ||
+| 15      || 25 julio&mdash;31 julio   || 88%       ||  adj         ||        ||    ||     ||       ||     || 86.9 ||       ||   || || || ||
 |-
 | 16      || 1 agosto&mdash;7 agosto   || 89%       ||               ||       ||    ||    ||        ||      ||        ||       ||   || ||  || ||
