Search results

  • The ultimate goal is to have multi-purpose transducers and annotated corpora (i.e. treebanks) for a variety of Turkic languages. These can then be pair ... Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    35 KB (3,577 words) - 15:24, 1 October 2021
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    32 KB (3,684 words) - 06:16, 28 December 2018
  • ...(grammatical descriptions, wordlists, dictionaries, spellcheckers, papers, corpora, etc.) for Aromanian, along with the licences they are under. || || [[User: ...=center| {{sc|research}} || 3. Easy || Create manually tagged corpora: Occitan || Fix tagging errors in a piece of analysed text, for use in tag
    187 KB (21,006 words) - 22:14, 12 November 2012
  • ...however, a language package should have over 60% coverage on a variety of corpora and should probably have at least 2500 stems to be considered minimally use * The coverage of the transducer on a variety of corpora
    15 KB (1,783 words) - 22:33, 1 February 2019
  • == Estimating rules using parallel corpora == ...see [[Running the monolingual rule learning]] if you only have monolingual corpora).
    14 KB (2,181 words) - 19:01, 17 August 2018
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc ... === Corpora and corpora projects ===
    9 KB (987 words) - 23:25, 22 December 2014
  • * directory es-tagger-data : Contains data needed for the Spanish tagger (corpora, etc.) * directory ca-tagger-data : Contains data needed for the Catalan tagger (corpora, etc.)
    50 KB (7,915 words) - 00:04, 10 March 2019
  • | name = Dictionary induction from parallel corpora / Revive ReTraTos | description = Extract dictionaries from parallel corpora
    23 KB (3,198 words) - 09:15, 4 March 2024
  • ...quirements for corpora, and a number of different formats for storing such corpora have sprung up. Some examples include: ...ng on). The following is an idea Jonathan has for implementing a standard corpora format for use by apertium.
    5 KB (813 words) - 00:08, 28 December 2011
  • ...millions of words as in the statistical methods: it takes only two smaller corpora and a dictionary containing rules to conjugate verbs and to match nouns and ...anguage pairs of the same linguistic family without the need of linguistic corpora. The experience of Apertium with several minoritised languages such as Occi
    15 KB (2,339 words) - 00:41, 4 June 2018
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    22 KB (2,520 words) - 23:09, 22 December 2014
  • === Annotated corpora ===
    18 KB (2,312 words) - 18:25, 18 September 2016
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    12 KB (1,308 words) - 19:27, 27 August 2017
  • * Collect parallel kaz-eng corpora! By new plan, we focused on adding vocabulary from 4 corpora.
    20 KB (2,856 words) - 06:26, 27 May 2021
  • * [http://corpus.leeds.ac.uk/query-zh.html A Collection of Chinese Corpora and Frequency Lists.] ===Corpora===
    16 KB (2,148 words) - 03:28, 16 December 2015
  • ....za/Faculties/ART/Xhosa/Pages/Research-.aspx "Cross linguistics upon Xhosa Corpora Research"] == Monolingual/Parallel Corpora ==
    4 KB (566 words) - 05:57, 18 April 2020
  • {{deprecated2|Learning rules from parallel and non-parallel corpora}} * a parallel corpus (see [[Corpora]])
    15 KB (2,206 words) - 13:58, 7 October 2014
  • ==Getting corpora== WORDLIST=/home/spectre/corpora/afrikaans-meester-utf8.txt
    16 KB (2,566 words) - 21:36, 15 March 2020
  • ...ert the data between the formats. It also allows you to either upload or paste corpora in plain text and then convert them into CoNLL-U. ...des support for saving user corpora on the server and then accessing the saved corpora via a unique URL.
    6 KB (930 words) - 15:59, 29 August 2017
  • ...is an open source tool included in the Apertium project that lets you train corpora and manage related files with a friendly user interface, letting you foc ...rom this view you are able to see corpora and training details, insert new corpora and train them easily
    8 KB (1,376 words) - 11:14, 29 October 2014
  • * [[Learning rules from parallel and non-parallel corpora]] – this is the current documentation on training/inferring rules ** preprocess corpora
    4 KB (541 words) - 13:46, 29 March 2021
  • ===Corpora=== * [http://corpora.uni-leipzig.de/en?corpusId=hin_news_2011 Hindi News Corpus] Creative Common
    6 KB (806 words) - 00:45, 7 December 2018
  • Apertium-regtest is a program for managing regression tests and [[Corpus test|corpora]]. # in the browser, select one or all of the corpora to rerun tests for
    11 KB (1,823 words) - 12:17, 6 June 2023
  • ===Corpora=== * [http://childes.talkbank.org/access/French/ CHILDES Corpora]. [http://talkbank.org/share/rules.html ''Requires reference'']
    15 KB (2,081 words) - 07:14, 12 August 2020
  • ..., Tommi Pirinen, Jonathan Washington. "Finite-state morphologies and text corpora as resources for improving morphological descriptions". [https://sites.goog ...f Inferring shallow-transfer machine translation rules from small parallel corpora]". In Journal of Artificial Intelligence Research. volume 34, p. 605-635.
    33 KB (4,418 words) - 11:52, 29 December 2021
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    12 KB (1,017 words) - 09:06, 18 January 2022
  • ====Training on Corpora with Ambiguous Lexical Units==== ...m-tagger</code> '', the tagger prints warnings about ambiguous analyses in corpora to stderr.''
    20 KB (3,229 words) - 20:06, 12 March 2018
  • * Tufiş, D., A. M. Barbu, V. Pătraşcu, G. Rotariu, and C. Popescu. "Corpora and Corpus-Based Morpho-Lexical Processing."&nbsp;''Recent Advances in Roma ===Corpora===
    7 KB (889 words) - 09:53, 28 November 2018
  • ...why the errors are so high. Re-evaluation will be done as soon as the two corpora are manually realigned. ...nslated corpora, and the label will be extracted from the manually written corpora. This method might provide better results since the model will be trained o
    6 KB (838 words) - 17:47, 25 July 2012
  • ==Corpora== The tagged corpora used in the experiments are found in the monolingual packages in [[language
    16 KB (1,448 words) - 16:50, 22 August 2017
  • ...n be found [https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora here]. ...m/corpora/jam/uzb.txt | apertium -d . uzb-kaa) -ref ../../../data4apertium/corpora/jam/kaa.txt
    5 KB (515 words) - 14:34, 1 September 2019
  • ===Corpora=== * [http://opus.nlpl.eu/ Hindi-English Parallel Corpora]
    8 KB (1,079 words) - 11:17, 3 December 2018
  • ===Corpora === * [http://www.elra.info/en/catalogues/free-resources/nepali-corpora/ ''"Nepali"'']
    8 KB (948 words) - 19:59, 30 December 2017
  • === Obtaining corpora (and getAlignmentWithText.pl) === The corpora need to be untarred, and inserted into a new, common directory.
    7 KB (973 words) - 02:52, 20 May 2021
  • Automatic shallow-transfer rules generation from parallel corpora ...in statistical machine translation, that have been extracted from parallel corpora and extended with a set of restrictions controlling their application.
    4 KB (525 words) - 19:21, 17 September 2009
  • ~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm
    12 KB (1,634 words) - 18:26, 26 September 2016
  • ...e project was a success. All the goals have been achieved: the creation of corpora in LSC; Italian monodix: apertium-srd-srd.dix: 51,743 words; apertium-ita-i ...rned to use markup languages (XML and HTML) for the creation of linguistic corpora. At present, I attend a Master’s Degree in Translation of specialized tex
    21 KB (3,171 words) - 14:34, 3 April 2017
  • == Corpora and Coverage == ...he help of mentors on Kipchak languages. Most frequent unknown tokens from corpora of each language (mostly consisting of Wikipedia entries, Bible and Quran)
    7 KB (798 words) - 18:30, 26 August 2019
  • ===Corpora=== * [https://korpora.zim.uni-duisburg-essen.de/Limas/ Corpora from Limas z.Hd. Prof. Dr. Bernhard Schröder Universität Duisburg-Essen,
    8 KB (900 words) - 10:15, 4 December 2018
  • == Corpora and Coverage == Our main corpora consisted of [https://www.rfa.org/uyghur/ RFA], [http://uy.ts.cn/ Tanritor]
    5 KB (607 words) - 13:25, 12 August 2018
  • == Corpora == ...8ba2a9c0e50bc885bfad3bfbff3b4afbd.pdf Building Open Javanese and Sundanese Corpora for Multilingual Text-to-Speech]
    7 KB (881 words) - 13:11, 12 December 2018
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    10 KB (1,263 words) - 06:04, 23 December 2014
  • == Corpora, sets and alignment == The parallel corpora for the Macedonian - English pair, a total of 207.778 parallel sentences, c
    5 KB (620 words) - 12:21, 27 July 2012
  • == Corpora == * wikipage: <section begin=azadliq2012-wikipage />RFERL corpora<section end=azadliq2012-wikipage />
    1,013 bytes (115 words) - 22:42, 12 August 2014
  • * Evaluate system on corpora === Various Potential Corpora ===
    10 KB (1,483 words) - 07:00, 14 August 2018
  • === Corpora === * [https://corpora.uni-leipzig.de/en?corpusId=ind_mixed_2013 Leipzig Corpora Collection - Indonesian]
    5 KB (629 words) - 13:08, 21 December 2019
  • ===Corpora===
    2 KB (172 words) - 17:09, 27 March 2017
  • ...ine. Rules can be manually written, or learnt from monolingual or parallel corpora. {{main|Learning rules from parallel and non-parallel corpora}}
    19 KB (2,820 words) - 15:26, 11 April 2023
  • * a parallel corpus (see [[Corpora]]) We will then work through the example with [[Corpora|EuroParl]] and Apertium's English-to-Spanish pair.
    9 KB (1,445 words) - 14:05, 7 October 2014
  • ...language model for the target language in order to create pseudo-parallel corpora, and use them in the same way as parallel ones. IRSTLM is a tool for building n-gram language models from corpora. It supports different smoothing and interpolation methods, including Writt
    3 KB (364 words) - 23:25, 23 August 2012
  • $ xzcat ~/corpora/nob/*ntb*.xz | head -100000 | apertium -d . nob-nno_e > 2019-09-30.before $ xzcat ~/corpora/nob/*ntb*.xz | head -100000 | apertium -d . nob-nno_e > 2019-09-30.after
    2 KB (327 words) - 08:02, 1 October 2019
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    22 KB (2,532 words) - 11:36, 30 July 2018
  • ...like the page on [[Aromanian]]—i.e., all available dictionaries, grammars, corpora, machine translators, etc., print or digital, where available, whether Free
    68 KB (10,323 words) - 15:37, 25 October 2014
  • ...for the lexical analysis and the contrastive analysis, the creation of corpora made up of texts written in the LSC variant, extracted from online magazines, was providential ...Mario Casu's Logudorese-Italian vocabulary and the in-depth analysis of the parallel corpora, which allowed us to understand what, case by case, the greatest ...
    13 KB (1,910 words) - 11:34, 23 August 2016
  • * Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository. |colspan="2" rowspan="4"| Corpus testvoc clean on all of the available corpora ||rowspan="4"| ||rowspan="4" colspan="2" style="text-align: center"| ✗||r
    8 KB (1,006 words) - 12:48, 9 March 2018
  • or adapting the software to fit your needs. Existing free (GPL) data and corpora easily reusable to feed Apertium's dictionaries are also welcome. ...particular needs. Making free (GPL) data and corpora available that can be reused to improve Apertium's dictionaries is also appreciated.
    26 KB (3,122 words) - 06:25, 27 May 2021
  • === Annotated corpora ===
    3 KB (241 words) - 20:44, 9 September 2020
  • ===Application for "Interface for creating tagged corpora" GSOC 2013===
    2 KB (200 words) - 08:21, 13 January 2015
  • ...pertium-kaz/stats|~{{:apertium-kaz/stats/average}}%]] coverage over random corpora ...pertium-tat/stats|~{{:apertium-tat/stats/average}}%]] coverage over random corpora
    4 KB (586 words) - 01:53, 10 March 2018
  • ===Corpora=== * [http://corpora.uni-leipzig.de/en?corpusId=hin_news_2011 Hindi News Corpus]
    4 KB (557 words) - 05:45, 25 August 2021
  • ...aled and combined. The formulas for combining them can be learnt from gold corpora in an unsupervised way. * gold-standard tagged corpora and
    5 KB (816 words) - 02:32, 13 February 2018
  • === Corpora ===
    7 KB (943 words) - 20:51, 6 September 2018
  • == Corpora ==
    4 KB (479 words) - 02:06, 28 February 2020
  • *Trainer on_fly working and training corpora '''Done''' ***Now the trainer is able to train corpora using just the keyboard, with a friendly user interface, cleaned corpus to
    12 KB (1,602 words) - 15:47, 10 October 2013
  • == Corpora ==
    2 KB (286 words) - 10:51, 4 June 2017
  • == Corpora ==
    2 KB (242 words) - 19:49, 3 January 2018
  • ...do some of the tests like generation testing or coverage testing, we need corpora, right? Have no fear, for `aq-wikicrp` is here! Let us get a Maltese wikipe ...t you'd expect, tests the dictionary for coverage. Using our newly created corpora, we can test the coverage! Feel free to use either one, but be consistent;
    12 KB (1,931 words) - 17:06, 24 October 2018
  • ...tion quality. This will involve improving coverage to 95-98% on a range of corpora and decreasing word error rate by 30-50%. For example if the current word e ...cial languages of EU? : Then this task will be hard. Pairs which have huge corpora of parallel texts, like the 24 official EU languages or the 3 EU working la
    2 KB (383 words) - 19:46, 2 March 2023
  • === Corpora ===
    773 bytes (75 words) - 19:17, 8 June 2014
  • === Corpora ===
    2 KB (302 words) - 16:23, 26 December 2017
  • == Corpora ==
    2 KB (278 words) - 00:24, 15 June 2021
  • == Corpora ==
    2 KB (290 words) - 02:07, 24 July 2019
  • == Corpora ==
    1 KB (158 words) - 03:34, 13 July 2021
  • == Corpora ==
    1 KB (146 words) - 20:16, 24 March 2020
  • ===Corpora===
    8 KB (1,143 words) - 18:45, 11 August 2015
  • === Corpora ===
    4 KB (538 words) - 02:40, 27 December 2016
  • === Corpora ===
    3 KB (390 words) - 09:39, 27 December 2017
  • == Corpora ==
    867 bytes (90 words) - 20:14, 24 March 2020
  • ...r the lexical analysis and contrastive selection it was providential to create corpora consisting of texts written in the LSC variant, taken from online magazines ...io Casu's Logudorese-Italian vocabulary and the in-depth analysis of parallel corpora, which have allowed us to understand what, case by case, the greatest number
    7 KB (1,110 words) - 11:34, 23 August 2016
  • == Corpora ==
    3 KB (367 words) - 06:16, 1 October 2021
  • * http://permalink.gmane.org/gmane.science.linguistics.corpora/22281 Arabic names from dbpedia ===Corpora===
    3 KB (437 words) - 10:23, 21 November 2021
  • === Corpora ===
    2 KB (194 words) - 04:52, 31 December 2017
  • == Corpora ==
    4 KB (440 words) - 21:41, 15 December 2019
  • == Corpora ==
    2 KB (272 words) - 21:51, 15 December 2019
  • == Corpora ==
    2 KB (213 words) - 17:55, 16 December 2017
  • == Corpora ==
    2 KB (250 words) - 16:26, 11 April 2015
  • == Corpora ==
    3 KB (342 words) - 21:33, 15 December 2019
  • ...Pirinen, Jonathan Washington (2015). "Finite-state morphologies and text corpora as resources for improving morphological descriptions". [https://sites.goog
    13 KB (1,710 words) - 20:32, 30 August 2018
  • == Corpora ==
    328 bytes (35 words) - 05:49, 16 December 2014
  • * Tagger training preparation (tagged corpora unification) * Tagger training preparation (tagged corpora unification)
    5 KB (506 words) - 14:56, 28 August 2017
  • == Corpora ==
    2 KB (235 words) - 05:58, 16 December 2014
  • == Corpora ==
    4 KB (524 words) - 18:29, 29 December 2013
  • ===Corpora===
    9 KB (1,145 words) - 06:19, 28 December 2018
  • == Corpora ==
    2 KB (231 words) - 15:43, 1 October 2021
  • == Corpora ==
    943 bytes (97 words) - 23:58, 7 September 2014
  • == Corpora ==
    7 KB (1,098 words) - 10:56, 4 May 2016
  • == Corpora ==
    1 KB (135 words) - 06:02, 16 December 2014
  • == Corpora ==
    2 KB (231 words) - 15:46, 1 October 2021
  • == Corpora ==
    1 KB (157 words) - 05:29, 22 August 2017
  • == Corpora ==
    6 KB (786 words) - 10:56, 4 May 2016
  • == Corpora ==
    1 KB (154 words) - 06:03, 16 December 2014
  • == Corpora ==
    2 KB (255 words) - 19:20, 9 October 2021
  • == Corpora ==
    1 KB (173 words) - 06:04, 16 December 2014
  • == Corpora ==
    1 KB (135 words) - 06:03, 16 December 2014
  • == Corpora ==
    2 KB (199 words) - 06:51, 6 July 2018
  • == Corpora ==
    1 KB (173 words) - 06:03, 16 December 2014
  • == Corpora ==
    1 KB (177 words) - 06:01, 16 December 2014
  • == Corpora ==
    2 KB (211 words) - 06:02, 16 December 2014
  • == Corpora ==
    1 KB (173 words) - 06:06, 16 December 2014
  • ...is a program for aligning words and sequences of words in sentence-aligned corpora. If you have a parallel corpus you can use GIZA++ to make bilingual dictionar *[[Corpora]]
    4 KB (589 words) - 11:51, 29 April 2015
  • == Corpora ==
    417 bytes (42 words) - 20:17, 24 March 2020
  • == Corpora ==
    2 KB (246 words) - 21:49, 15 December 2019
  • == Corpora ==
    1 KB (154 words) - 06:02, 16 December 2014
  • ==Parallel corpora==
    766 bytes (87 words) - 08:07, 20 January 2009
  • == Electronic Corpora ==
    335 bytes (29 words) - 16:39, 24 May 2017
  • == Corpora ==
    3 KB (414 words) - 21:40, 15 December 2019
  • == Corpora ==
    1 KB (158 words) - 05:28, 22 August 2017
  • '''CORPORA''' ...sketchengine.eu/user-guide/user-manual/corpora/by-language/hausa-boko-text-corpora/
    1 KB (179 words) - 18:40, 26 October 2018
  • == Corpora ==
    511 bytes (54 words) - 20:57, 15 February 2015
  • == Corpora ==
    1 KB (135 words) - 06:05, 16 December 2014
  • == Corpora ==
    1 KB (135 words) - 06:59, 28 June 2016
  • == Corpora ==
    2 KB (213 words) - 06:51, 6 July 2018
  • ==Corpora==
    2 KB (212 words) - 04:26, 2 January 2019
  • == Corpora ==
    1 KB (176 words) - 06:05, 16 December 2014
  • == Corpora ==
    1 KB (158 words) - 05:29, 22 August 2017
  • ===Corpora===
    8 KB (1,048 words) - 05:32, 1 December 2017
  • == Other Corpora ==
    6 KB (811 words) - 10:42, 2 July 2018
  • == Corpora ==
    3 KB (324 words) - 21:41, 15 December 2019
  • == Corpora ==
    1 KB (154 words) - 06:16, 16 December 2014
  • # the best taggers use hand-tagged corpora to train with (we use untagged corpora -- for English)
    7 KB (1,177 words) - 08:34, 8 October 2014
  • * [[Corpora]]
    13 KB (1,601 words) - 23:31, 23 July 2021
  • ...toscore.txt | ~/source/apertium/trunk/apertium-lex-learner/irstlm-ranker ~/corpora/català/en.blm > unk.trans.scored.txt
    9 KB (1,470 words) - 11:28, 24 March 2012
  • to be related. Both corpora must be pre-processed before the training. This pre-processing, consisting in analysing the corpora and
    11 KB (1,814 words) - 03:22, 9 March 2019
  • The final coverage of the system was around 90%, i.e. over a set of corpora 10 unknown words out of 100 on average. The word-error rate was around 17%, When linguistic resources, for example corpora, dictionaries, grammars, morphological analysers, lists of lemmata etc. are
    12 KB (1,683 words) - 08:42, 10 May 2013
  • The final coverage of the system was around 90%, i.e. over a set of corpora 10 unknown words out of 100 on average. The word-error rate was around 17%, When linguistic resources, for example corpora, dictionaries, grammars, morphological analysers, lists of lemmata etc. are
    12 KB (1,683 words) - 11:00, 30 October 2015
  • The final coverage of the system was around 90%, i.e. over a set of corpora 10 unknown words out of 100 on average. The word-error rate was around 17%, When linguistic resources, for example corpora, dictionaries, grammars, morphological analysers, lists of lemmata etc. are
    12 KB (1,679 words) - 12:00, 31 January 2012
  • * '''Coverage''' is the naïve coverage over one or more free corpora.
    6 KB (591 words) - 22:50, 30 October 2017
  • ...steps to create each one. But you'll want to build it up, testing against corpora, etc. You want to be able to have as many correct analyses as possible, an ...if your pair will support translation in both directions). The corpus or corpora ideally should represent a range of content—i.e., it shouldn't be just sp
    10 KB (1,615 words) - 07:43, 20 December 2015
  • ...dinian to Italian. We started a manual morphological disambiguation of the corpora that will help the translator to recognize the correct morphology of each w ... We have treated two corpora: one journalistic and more dialectal, and the other taken directly from literar
    9 KB (1,306 words) - 15:56, 2 September 2017
  • Word Sense Disambiguation for WordNet corpora ...oblem is that SMT requires a lot of data in the form of parallel corpora, since it is very data-hungry, and many languages cannot afford it. Whi
    8 KB (1,094 words) - 13:10, 14 April 2019
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    20 KB (2,336 words) - 18:10, 14 April 2015
  • ...t consisting of a set of linguistic resources (dictionaries, cross models, corpora, links to other LRDs, etc.). ...uistics resources: morphological and bilingual dictionaries, cross models, corpora, etc.</description>
    8 KB (902 words) - 09:19, 6 October 2014
  • * [[Learning rules from parallel and non-parallel corpora]] ...sing statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. '''The module is turned off in most cases as it does n
    4 KB (625 words) - 08:36, 29 April 2015
  • -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 & ...e> script to generate relative frequency lists of in-domain and out-domain corpora.
    5 KB (680 words) - 11:53, 26 September 2016
  • xzcat corpora/nno.xz | tr -d '#@/' | apertium -d . nno-nob-dgen | grep '.\{0,6\}[#@/].\{0
    9 KB (1,400 words) - 22:30, 18 January 2021
  • ** Overall accuracy (over parallel corpora): WER/PER/BLEU ...generalised script that supports hfst and lttoolbox binaries and arbitrary corpora would be good. It should also (optionally) output hitparades (e.g., freque
    2 KB (246 words) - 02:32, 1 June 2019
  • ...an give unsatisfactory results (WER ≈ 30%, coverage below 85% in Wikipedia corpora). Both were published in 2009 and, apparently, no one has worked on them si
    16 KB (2,285 words) - 06:46, 12 April 2019
  • Corpora in { gap } are large collections of texts enhanced with special markup. The Corpora in { gap } are large collections of { gap } with { gap }. They allow lingui
    9 KB (1,368 words) - 09:04, 23 April 2015
  • #* <code>$ cat corpus.dep | ./corpora/add_morph.py > corpus_with_annotation.dep</code> #* <code>$ cat corpus_with_annotation.dep | ./corpora/dep_to_seg.py > corpus_with_annotation.seg</code>
    1 KB (218 words) - 14:51, 24 April 2024
  • corpora. ...the process, creating an easy-to-use framework for using constraints with corpora in order to obtain information about words of interest; and later to provid
    6 KB (928 words) - 13:57, 3 April 2009
  • * large monolingual corpora of the language * parallel corpora of the language and some other language
    1 KB (202 words) - 19:55, 12 April 2021
  • ...ment specifying a set of linguistic resources (dictionaries, cross models, corpora, other LRD files, etc).
    5 KB (633 words) - 13:29, 6 October 2017
  • ...ger and an initial set of translation rules from monolingual and bilingual corpora.
    8 KB (1,255 words) - 19:50, 12 April 2021
  • ...ing with dictionaries, lexical selection rules, transfer rules, scripting, corpora. The objective is to facilitate the generation of varieties for languages w ...cial languages of EU? : Then this task will be hard. Pairs which have huge corpora of parallel texts, like the 24 official EU languages or the 3 EU working la
    2 KB (377 words) - 19:18, 25 January 2023
  • ...are creating invaluable linguistic resources such as disambiguated tagged corpora. [[Category:Ideas for Google Summer of Code|Interface for creating tagged corpora]]
    2 KB (269 words) - 21:26, 5 April 2013
  • Had an idea of fixing out-of-sync corpora automatically and started an "MVP" here: https://github.com/frankier/aperti ...opt (rather than the quality of its output). It could be used to help keep corpora, tagger models and morphologies in sync (though poking and possible automat
    3 KB (456 words) - 18:17, 29 August 2016
  • # all pronouns from Crimean Tatar corpora are translated without debug symbols * analyse corpora with crh-morph mode
    4 KB (496 words) - 18:27, 19 June 2017
  • ...an vocabulary were used to good effect to reach a high coverage on all the corpora.
    4 KB (551 words) - 23:52, 28 August 2017
  • ...edia dumps are useful for quickly getting a corpus. They are also the best corpora for making your language pair useful for Wikipedia's [[Content Translat $ zcat ~/Nedlastingar/cx-corpora.nb2nn.text.tmx.gz \
    3 KB (436 words) - 05:40, 10 April 2019
  • ...(grammatical descriptions, wordlists, dictionaries, spellcheckers, papers, corpora, etc.), along with the licences they are under. See for example the page [[
    14 KB (2,007 words) - 03:06, 27 October 2013
  • ...ich can be inserted into them. This data might consist of wordlists, word-corpora derived from web-crawlers such as [http://borel.slu.edu/crubadan/ Crubadán
    13 KB (2,112 words) - 12:11, 26 May 2023
  • ...ents immensely good at helping us out with these: for instance, annotating corpora that are needed to train Apertium modules, or finding bugs in the handling
    7 KB (1,111 words) - 10:10, 15 November 2015
  • ...maximum usage of available resources for marginalised languages. Parallel corpora, user-feedback, other translation systems.
    5 KB (802 words) - 07:04, 10 May 2012
  • ...ger and an initial set of translation rules from monolingual and bilingual corpora.
    10 KB (1,543 words) - 19:50, 12 April 2021
  • ...uistics resources: morphological and bilingual dictionaries, cross models, corpora, etc.</description>
    6 KB (689 words) - 22:58, 25 October 2018
  • $ bzcat ~/corpora/nno.txt.bz2 |./make-freqlist.sh > nno.freqlist
    4 KB (583 words) - 15:18, 10 January 2022
  • For the parallel corpus, we are going to use Europarl; the [[corpora]] page (English only) lists some others:
    5 KB (699 words) - 07:52, 8 October 2014
  • TRAIN=/home/philip/Apertium/corpora/raw/setimes-hr-mk-nikola
    3 KB (520 words) - 21:25, 14 February 2014
  • ...ents immensely good at helping us out with these: for instance, annotating corpora that are needed to train Apertium modules, or finding bugs in the handling
    6 KB (987 words) - 10:21, 7 November 2014
  • For the parallel corpus we're going to use Europarl, the page [[corpora]] lists some others:
    4 KB (647 words) - 07:45, 8 October 2014
  • Some corpora are formatted in XML and put e.g. the real text contents inside a particula
    5 KB (863 words) - 09:04, 10 October 2017
  • MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm
    4 KB (503 words) - 19:01, 17 August 2018
  • ...morphological disambiguator which will be useful for disambiguating the corpora and which will be indispensable for developing other language pairs
    13 KB (2,173 words) - 19:17, 24 June 2018
  • ...nually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus,
    7 KB (1,033 words) - 15:27, 15 August 2018
  • ...UD Annotatrix] - in-browser software for annotating Universal Dependencies corpora.
    2 KB (251 words) - 10:07, 27 June 2022
  • ...lexicon database with part of speech. We achieved coverage of % on SETimes corpora. And I am really happy with kymorph. Special thanks to firspeaker.
    5 KB (680 words) - 07:14, 26 August 2011
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    19 KB (2,201 words) - 09:21, 9 December 2019
  • * corpora.
    9 KB (1,494 words) - 05:58, 18 March 2015
  • * consider including the web concordancer on the site (and consider what corpora to provide search access to...)
    4 KB (514 words) - 21:24, 19 August 2015
  • Some pre-processed corpora can be found [http://corpora.informatik.uni-leipzig.de/download.html here] and [http://wt.jrc.it/lt/Acqu
    7 KB (1,058 words) - 07:37, 4 July 2016
  • ...t that they could increase the coverage significantly, because the testing corpora are either news or WP).
    8 KB (1,205 words) - 21:50, 19 July 2012
  • ...rds is a semi-standard convention (it occurs at least to some extent in all the corpora). We should figure out where this is happening and see if it's something w
    28 KB (769 words) - 11:34, 13 April 2013
  • {{see-also|Corpora}}
    11 KB (1,750 words) - 13:24, 10 December 2010
  • .... (2008) "Automatic induction of bilingual resources from aligned parallel corpora: application to shallow-transfer machine translation". ''Machine Translatio
    8 KB (1,301 words) - 09:43, 6 October 2014
  • WORDLIST=/home/spectre/corpora/afrikaans-meester-utf8.txt
    11 KB (1,852 words) - 07:04, 8 October 2014
  • ...ns: apertium-kaz-tat has at least 15000 top stems, 95% coverage on all the corpora we have, and no more than 15% Word-Error-Rate on any randomly selected text
    4 KB (603 words) - 21:20, 31 August 2015
  • * Optimised for small corpora (under 100k parallel sentences)
    869 bytes (111 words) - 15:06, 29 June 2020
  • ...li08j.pdf Automatic induction of bilingual resources from aligned parallel corpora: application to shallow-transfer machine translation]". ''Machine Translati
    8 KB (1,273 words) - 09:32, 3 May 2024
  • -x, --xml Output corpora in XML format
    9 KB (1,003 words) - 11:02, 30 August 2011
  • The corpora used for this task can be found here: http://www.statmt.org/europarl/v7/sl-
    6 KB (625 words) - 16:54, 1 July 2013
  • Before you start you first need a [[Corpora|corpus]]. Look in apertium-eo-en/corpa/enwiki.crp.txt.bz2 (run bunzip2 -c e
    6 KB (966 words) - 20:16, 23 July 2021
  • ...learning to construct such n-level transducers, working with some learning corpora, and mostly using the OSTIA state-merging algorithm.
    6 KB (842 words) - 06:41, 20 October 2014
  • Before you start, you first need a [[Corpora|corpus]]. Look in apertium-eo-en/corpa/enwiki.crp.txt.bz2. Run
    7 KB (1,057 words) - 11:52, 7 October 2014
  • * Efficiency: Make it scale up to corpora of millions of words. This might involve doing (a) pre-analysis of the corp
    3 KB (549 words) - 02:11, 10 March 2018
  • ...to make a translation guesser using the existing bidix and two monolingual corpora in a similar way.
    4 KB (558 words) - 13:07, 26 June 2020
  • ...ingual dictionaries: At the beginning we started using Chinese and Spanish corpora in order to obtain lots of Chinese-Spanish word pairs. Using the Stanford S
    7 KB (830 words) - 21:33, 30 September 2013
  • ...story] (or [https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam from here] ) as possible &mdash; Minimum one sentence. ...stvoc]] clean, and has a coverage of around 80% or more on a range of free corpora.
    6 KB (1,024 words) - 15:22, 20 April 2021
  • * [[Corpora]]
    1 KB (164 words) - 05:20, 4 December 2019
  • ...l trained on prepared datasets which were made from parsed syntax-labelled corpora (mostly UD-treebanks). The classifier analyzes the given sequence of morpho
    5 KB (764 words) - 01:40, 8 March 2018
  • |0|| || collecting Tatar and Bashkir corpora, scraping a parallel corpus, making a frequency dictionary
    2 KB (228 words) - 10:55, 9 May 2018
  • * [[Corpora]]
    1 KB (185 words) - 13:45, 7 October 2014
  • {{LangStats2 | lang = kaz | corpora = Әуезов,bible,azattyq2010,wp2011,quran | corpus1 = Әуезов | co
    612 bytes (76 words) - 12:36, 9 January 2013
  • Corpora
    3 KB (482 words) - 19:28, 3 November 2017
  • ...the weighted rules on overall quality and speed of translation using large corpora for training and evaluation.
    9 KB (1,387 words) - 13:37, 23 August 2016
  • * Discovering these kind of things should be possible using only monolingual corpora, or resources such as Wikipedia.
    2 KB (356 words) - 21:15, 21 June 2020
  • Write a script that reads two parallel corpora, applies the appropriate monolingual taggers and some word-aligner ([https:
    511 bytes (84 words) - 18:17, 21 March 2024
  • wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/SFST/SFST-1.4.6g.tar.gz
    3 KB (426 words) - 17:48, 30 March 2012
  • *[[Corpora]]
    4 KB (553 words) - 08:43, 8 October 2014
  • '''Corpora:
    2 KB (231 words) - 14:38, 5 October 2019
  • * [[/Corpora]]
    1 KB (144 words) - 20:07, 15 July 2021
  • ...helper when training (see [[Learning rules from parallel and non-parallel corpora]]).
    3 KB (392 words) - 05:40, 22 August 2021