Search results

Turkic languages
The ultimate goal is to have multi-purpose transducers and annotated corpora (i.e. treebanks) for a variety of Turkic languages. These can then be pair Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc

35 KB (3,577 words) - 15:24, 1 October 2021
Germanic languages
Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc

32 KB (3,684 words) - 06:16, 28 December 2018
Ideas for Google Code-In (2011)
...(grammatical descriptions, wordlists, dictionaries, spellcheckers, papers, corpora, etc.) for Aromanian, along with the licences they are under. || || [[User: ...=center| {{sc|research}} || 3. Easy || Create manually tagged corpora: Occitan || Fix tagging errors in a piece of analysed text, for use in tag

187 KB (21,006 words) - 22:14, 12 November 2012
Languages
...however, a language package should have over 60% coverage on a variety of corpora and should probably have at least 2500 stems to be considered minimally use * The coverage of the transducer on a variety of corpora

15 KB (1,783 words) - 22:33, 1 February 2019
Learning rules from parallel and non-parallel corpora
== Estimating rules using parallel corpora == ...see [[Running the monolingual rule learning]] if you only have monolingual corpora).

14 KB (2,181 words) - 19:01, 17 August 2018
Languages of the Volga-Kama region
Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc === Corpora and corpora projects ===

9 KB (987 words) - 23:25, 22 December 2014
Contributing to an existing pair
* directory es-tagger-data : Contains data needed for the Spanish tagger (corpora, etc.) * directory ca-tagger-data : Contains data needed for the Catalan tagger (corpora, etc.)

50 KB (7,915 words) - 00:04, 10 March 2019
Ideas for Google Summer of Code
| name = Dictionary induction from parallel corpora / Revive ReTraTos | description = Extract dictionaries from parallel corpora

23 KB (3,198 words) - 09:15, 4 March 2024
Corpora formats
...quirements for corpora, and a number of different formats for storing such corpora have sprung up. Some examples include: ...ng on). The following is an idea Jonathan has for implementing a standard corpora format for use by apertium.

5 KB (813 words) - 00:08, 28 December 2011
Sardu abbarra bivu!
...millions of words as in the statistical methods: it takes only two smaller corpora and a dictionary containing rules to conjugate verbs and to match nouns and ...anguage pairs of the same linguistic family without the need of linguistic corpora. The experience of Apertium with several minoritised languages such as Occi

15 KB (2,339 words) - 00:41, 4 June 2018
Uralic languages
Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc

22 KB (2,520 words) - 23:09, 22 December 2014
Romance languages
=== Annotated corpora ===

18 KB (2,312 words) - 18:25, 18 September 2016
Balkan languages
Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc

12 KB (1,308 words) - 19:27, 27 August 2017
English and Kazakh
* Collect parallel kaz-eng corpora! By new plan, we focused on adding vocabulary from 4 corpora.

20 KB (2,856 words) - 06:26, 27 May 2021
Mandarin Chinese
* [http://corpus.leeds.ac.uk/query-zh.html A Collection of Chinese Corpora and Frequency Lists.] ===Corpora===

16 KB (2,148 words) - 03:28, 16 December 2015
Xhosa
....za/Faculties/ART/Xhosa/Pages/Research-.aspx "Cross linguistics upon Xhosa Corpora Research"] == Monolingual/Parallel Corpora ==

4 KB (566 words) - 05:57, 18 April 2020
Generating lexical-selection rules from a parallel corpus
{{deprecated2|Learning rules from parallel and non-parallel corpora}} * a parallel corpus (see [[Corpora]])

15 KB (2,206 words) - 13:58, 7 October 2014
Building dictionaries
==Getting corpora== WORDLIST=/home/spectre/corpora/afrikaans-meester-utf8.txt

16 KB (2,566 words) - 21:36, 15 March 2020
UD annotatrix/UD annotatrix at GSoC 2017
...ert the data between the formats. It also allows to either upload or paste corpora in plain text and then convert them into CoNLL-U. ...des support for saving user corpora on server and then accessing the saved corpora via unique URL.

6 KB (930 words) - 15:59, 29 August 2017
Getting started with Annotatrix
...is an open source tool included in the Apertium project that lets you train corpora and manage related files with a friendly user interface, letting you foc ...rom this view you are able to see corpora and training details, insert new corpora and train them easily

8 KB (1,376 words) - 11:14, 29 October 2014
Ideas for Google Summer of Code/User-friendly lexical selection training
* [[Learning rules from parallel and non-parallel corpora]] – this is the current documentation on training/inferring rules ** preprocess corpora

4 KB (541 words) - 13:46, 29 March 2021
Apertium-regtest
Apertium-regtest is a program for managing regression tests and [[Corpus test|corpora]]. # in the browser, select one or all of the corpora to rerun tests for

11 KB (1,823 words) - 12:17, 6 June 2023
Hindi
===Corpora=== * [http://corpora.uni-leipzig.de/en?corpusId=hin_news_2011 Hindi News Corpus] Creative Common

6 KB (806 words) - 00:45, 7 December 2018
French
===Corpora=== * [http://childes.talkbank.org/access/French/ CHILDES Corpora]. [http://talkbank.org/share/rules.html ''Requires reference'']

15 KB (2,081 words) - 07:14, 12 August 2020
Publications
..., Tommi Pirinen, Jonathan Washington. "Finite-state morphologies and text corpora as resources for improving morphological descriptions". [https://sites.goog ...f Inferring shallow-transfer machine translation rules from small parallel corpora]". In Journal of Artificial Intelligence Research. volume 34, p. 605-635.

33 KB (4,418 words) - 11:52, 29 December 2021
Indic languages
Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc

12 KB (1,017 words) - 09:06, 18 January 2022
Unigram tagger
====Training on Corpora with Ambiguous Lexical Units==== ...m-tagger</code> '', the tagger prints warnings about ambiguous analyses in corpora to stderr.''

20 KB (3,229 words) - 20:06, 12 March 2018
Romanian
* Tufiş, D., A. M. Barbu, V. Pătraşcu, G. Rotariu, and C. Popescu. "Corpora and Corpus-Based Morpho-Lexical Processing." ''Recent Advances in Roma ===Corpora===

7 KB (889 words) - 09:53, 28 November 2018
Hindi and English
===Corpora=== * [http://opus.nlpl.eu/ Hindi-English Parallel Corpora]

8 KB (1,079 words) - 11:17, 3 December 2018
Lexical feature transfer - First report
...why the errors are so high. Re-evaluation will be done as soon as the two corpora are manually realigned. ...nslated corpora, and the label will be extracted from the manually written corpora. This method might provide better results since the model will be trained o

6 KB (838 words) - 17:47, 25 July 2012
Comparison of part-of-speech tagging systems
==Corpora== The tagged corpora used in the experiments are found in the monolingual packages in [[language

16 KB (1,448 words) - 16:50, 22 August 2017
Apertium-uzb-kaa
...n be found [https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora here]. ...m/corpora/jam/uzb.txt | apertium -d . uzb-kaa) -ref ../../../data4apertium/corpora/jam/kaa.txt

5 KB (515 words) - 14:34, 1 September 2019
Getting started with induction tools
=== Obtaining corpora (and getAlignmentWithText.pl) === The corpora need to be untarred, and inserted into a new, common directory.

7 KB (973 words) - 02:52, 20 May 2021
Nepali
===Corpora === * [http://www.elra.info/en/catalogues/free-resources/nepali-corpora/ ''"Nepali"'']

8 KB (948 words) - 19:59, 30 December 2017
Related software
Automatic shallow-transfer rules generation from parallel corpora ...in statistical machine translation, that have been extracted from parallel corpora and extended with a set of restrictions controlling their application.

4 KB (525 words) - 19:21, 17 September 2009
Generating lexical-selection rules from monolingual corpora
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm

12 KB (1,634 words) - 18:26, 26 September 2016
Grfro3d/proposal apertium cat-srd and ita-srd
...e project was a success. All the goals have been achieved: the creation of corpora in LSC; italian monodix: apertium-srd-srd.dix: 51,743 words; apertium-ita-i ...rned to use markup languages (XML and HTML) for the creation of linguistic corpora. At present, I attend a Master’s Degree in Translation of specialized tex

21 KB (3,171 words) - 14:34, 3 April 2017
German
===Corpora=== * [https://korpora.zim.uni-duisburg-essen.de/Limas/ Corpora from Limas z.Hd. Prof. Dr. Bernhard Schröder Universität Duisburg-Essen,

8 KB (900 words) - 10:15, 4 December 2018
Uighur and Turkish/GSoC2018 report
== Corpora and Coverage == Our main corpora consisted of [https://www.rfa.org/uyghur/ RFA], [http://uy.ts.cn/ Tanritor]

5 KB (607 words) - 13:25, 12 August 2018
Javanese
== Corpora == ...8ba2a9c0e50bc885bfad3bfbff3b4afbd.pdf Building Open Javanese and Sundanese Corpora for Multilingual Text-to-Speech]

7 KB (881 words) - 13:11, 12 December 2018
Turkic MT Improvements GSoC2019 report
== Corpora and Coverage == ...he help of mentors on Kipchak languages. Most frequent unknown tokens from corpora of each language (mostly consisting of Wikipedia entries, Bible and Quran)

7 KB (798 words) - 18:30, 26 August 2019
Celtic languages
Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc

10 KB (1,263 words) - 06:04, 23 December 2014
Apertium-aze/stats
== Corpora == * wikipage: <section begin=azadliq2012-wikipage />RFERL corpora<section end=azadliq2012-wikipage />

1,013 bytes (115 words) - 22:42, 12 August 2014
Uighur and Turkish/Paper
* Evaluate system on corpora === Various Potential Corpora ===

10 KB (1,483 words) - 07:00, 14 August 2018
Indonesian
=== Corpora === * [https://corpora.uni-leipzig.de/en?corpusId=ind_mixed_2013 Leipzig Corpora Collection - Indonesian]

5 KB (629 words) - 13:08, 21 December 2019
Lexical feature transfer - Second report
== Corpora, sets and alignment == The parallel corpora for the Macedonian - English pair, a total of 207.778 parallel sentences, c

5 KB (620 words) - 12:21, 27 July 2012
Tungusic languages
===Corpora===

2 KB (172 words) - 17:09, 27 March 2017
Constraint-based lexical selection module
...ine. Rules can be manually written, or learnt from monolingual or parallel corpora. {{main|Learning rules from parallel and non-parallel corpora}}

19 KB (2,820 words) - 15:26, 11 April 2023
Génération de règles de sélection lexicale depuis un corpus parallèle
* a parallel corpus (see [[Corpora]]) We will work through the example using [[Corpora|EuroParl]] and Apertium's English-to-Spanish pair.

9 KB (1,445 words) - 14:05, 7 October 2014
Building a pseudo-parallel corpus
...language model for the target language in order to create pseudo-parallel corpora, and use them in the same way as parallel ones. IRSTLM is a tool for building n-gram language models from corpora. It supports different smoothing and interpolation methods, including Writt

3 KB (364 words) - 23:25, 23 August 2012
Apertium-nno-nob/kjektåkunne
$ xzcat ~/corpora/nob/*ntb*.xz | head -100000 | apertium -d . nob-nno_e > 2019-09-30.before $ xzcat ~/corpora/nob/*ntb*.xz | head -100000 | apertium -d . nob-nno_e > 2019-09-30.after

2 KB (327 words) - 08:02, 1 October 2019
Iranian languages
Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc

22 KB (2,532 words) - 11:36, 30 July 2018
Task ideas for Google Code-in (2013)
...like the page on [[Aromanian]]—i.e., all available dictionaries, grammars, corpora, machine translators, etc., print or digital, where available, whether Free

68 KB (10,323 words) - 15:37, 25 October 2014
Sardo e italiano/Rapporto finale
...essicale e dell’analisi contrastiva è stata provvidenziale la creazione di corpora costituiti da testi redatti nella variante LSC, estrapolati da riviste on-l ...vocabolario Logudorese-italiano di Mario Casu e l’analisi approfondita dei corpora paralleli che ci hanno permesso di capire quale fosse, caso per caso, il ma

13 KB (1,910 words) - 11:34, 23 August 2016
Tatar and Russian
* Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository. |colspan="2" rowspan="4"| Corpus testvoc clean on all of the available corpora ||rowspan="4"| ||rowspan="4" colspan="2" style="text-align: center"| ✗||r

8 KB (1,006 words) - 12:48, 9 March 2018
Flyer
or adapting the software to fit your needs. Existing free (GPL) data and corpora easily reusable to feed Apertium's dictionaries are also welcome. ...esidades particulares. También se agradece la disponibilización de datos y corpora libres (GPL) que sean reutilizables para mejorar los diccionarios de Aperti

26 KB (3,122 words) - 06:25, 27 May 2021
Mayan languages
=== Annotated corpora ===

3 KB (241 words) - 20:44, 9 September 2020
Google Summer of Code/Report 2013
===Application for "Interface for creating tagged corpora" GSOC 2013===

2 KB (200 words) - 08:21, 13 January 2015
Kazakh and Tatar
...pertium-kaz/stats|~{{:apertium-kaz/stats/average}}%]] coverage over random corpora ...pertium-tat/stats|~{{:apertium-tat/stats/average}}%]] coverage over random corpora

4 KB (586 words) - 01:53, 10 March 2018
Hindi and Bengali
===Corpora=== * [http://corpora.uni-leipzig.de/en?corpusId=hin_news_2011 Hindi News Corpus]

4 KB (557 words) - 05:45, 25 August 2021
Ideas for Google Summer of Code/Add weights to lttoolbox
...aled and combined. The formulas for combining them can be learnt from gold corpora unsupervisedly. * gold-standard tagged corpora and

5 KB (816 words) - 02:32, 13 February 2018
Apertium-test/teststats/
== Corpora ==

2 KB (242 words) - 19:49, 3 January 2018
Kazakh
=== Corpora ===

7 KB (943 words) - 20:51, 6 September 2018
Apertium-kaz/stats
== Corpora ==

4 KB (479 words) - 02:06, 28 February 2020
Annotatrix/Work plan
*Trainer on_fly working and training corpora '''Done''' ***Now the trainer is able to train corpora using just the keyboard, with a friendly user interface, cleaned corpus to

12 KB (1,602 words) - 15:47, 10 October 2013
Apertium-crh/stats
== Corpora ==

2 KB (286 words) - 10:51, 4 June 2017
Apertium-quality/Quickstart
...do some of the tests like generation testing or coverage testing, we need corpora, right? Have no fear, for `aq-wikicrp` is here! Let us get a Maltese wikipe ...t you'd expect, tests the dictionary for coverage. Using our newly created corpora, we can test the coverage! Feel free to use either one, but be consistent;

12 KB (1,931 words) - 17:06, 24 October 2018
Uyghur
=== Corpora ===

773 bytes (75 words) - 19:17, 8 June 2014
Portuguese
=== Corpora ===

2 KB (302 words) - 16:23, 26 December 2017
Apertium-lin/stats
== Corpora ==

2 KB (278 words) - 00:24, 15 June 2021
Apertium-byv/stats
== Corpora ==

2 KB (290 words) - 02:07, 24 July 2019
Apertium-ibo/stats
== Corpora ==

1 KB (158 words) - 03:34, 13 July 2021
Ideas for Google Summer of Code/Make a language pair state-of-the-art
...tion quality. This will involve improving coverage to 95-98% on a range of corpora and decreasing word error rate by 30-50%. For example if the current word e ...cial languages of EU? : Then this task will be hard. Pairs which have huge corpora of parallel texts, like the 24 official EU languages or the 3 EU working la

2 KB (383 words) - 19:46, 2 March 2023
Wolof
=== Corpora ===

4 KB (538 words) - 02:40, 27 December 2016
Yue Chinese
=== Corpora ===

3 KB (390 words) - 09:39, 27 December 2017
Apertium-khk/stats
== Corpora ==

1 KB (146 words) - 20:16, 24 March 2020
Farsi
===Corpora===

8 KB (1,143 words) - 18:45, 11 August 2015
Arabic
* http://permalink.gmane.org/gmane.science.linguistics.corpora/22281 Arabic names from dbpedia ===Corpora===

3 KB (437 words) - 10:23, 21 November 2021
Tatar
=== Corpora ===

2 KB (194 words) - 04:52, 31 December 2017
Apertium-kir/stats
== Corpora ==

4 KB (440 words) - 21:41, 15 December 2019
Apertium-tyv/stats
== Corpora ==

2 KB (272 words) - 21:51, 15 December 2019
Apertium-ces/stats
== Corpora ==

2 KB (213 words) - 17:55, 16 December 2017
Apertium-bua/stats
== Corpora ==

2 KB (250 words) - 16:26, 11 April 2015
Apertium-sah/stats
== Corpora ==

3 KB (342 words) - 21:33, 15 December 2019
Apertium-gle/stats
== Corpora ==

867 bytes (90 words) - 20:14, 24 March 2020
Sardinian and Italian/Final Report
...r the lexical analysis and selection contrastive was providential creating corpora consist of texts written in the LSC variant, taken from magazines on -line ...io Casu's Logudorese-Italian vocabulary and in-depth analysis of parallel corpora that have allowed us to understand what, case by case, the greatest number

7 KB (1,110 words) - 11:34, 23 August 2016
Apertium-kaa/stats
== Corpora ==

3 KB (367 words) - 06:16, 1 October 2021
Apertium Turkic
...Pirinen, Jonathan Washington (2015). "Finite-state morphologies and text corpora as resources for improving morphological descriptions". [https://sites.goog

13 KB (1,710 words) - 20:32, 30 August 2018
Apertium-mkd/stats
== Corpora ==

1 KB (173 words) - 06:04, 16 December 2014
Apertium-gla/stats
== Corpora ==

1 KB (135 words) - 06:03, 16 December 2014
Apertium-tur/stats
== Corpora ==

2 KB (199 words) - 06:51, 6 July 2018
Apertium-hbs/stats
== Corpora ==

1 KB (173 words) - 06:03, 16 December 2014
Apertium-ava/stats
== Corpora ==

1 KB (177 words) - 06:01, 16 December 2014
Apertium-chv/stats
== Corpora ==

2 KB (211 words) - 06:02, 16 December 2014
Apertium-slv/stats
== Corpora ==

1 KB (173 words) - 06:06, 16 December 2014
Using GIZA++
...is a program for aligning words and sequences of words in sentence aligned corpora. If you have parallel corpus you can use GIZA++ to make bilingual dictionar *[[Corpora]]

4 KB (589 words) - 11:51, 29 April 2015
Apertium-oss/stats
== Corpora ==

417 bytes (42 words) - 20:17, 24 March 2020
Apertium-bak/stats
== Corpora ==

2 KB (246 words) - 21:49, 15 December 2019
Apertium-bul/stats
== Corpora ==

1 KB (154 words) - 06:02, 16 December 2014
Manx
==Parallel corpora==

766 bytes (87 words) - 08:07, 20 January 2009
