Search results

  • The ultimate goal is to have multi-purpose transducers and annotated corpora (i.e. treebanks) for a variety of Turkic languages. These can then be pair ... Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    35 KB (3,577 words) - 15:24, 1 October 2021
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    32 KB (3,684 words) - 06:16, 28 December 2018
  • ...(grammatical descriptions, wordlists, dictionaries, spellcheckers, papers, corpora, etc.) for Aromanian, along with the licences they are under. || || [[User: ...=center| {{sc|research}} || 3. Easy || Create manually tagged corpora: Occitan || Fix tagging errors in a piece of analysed text, for use in tag
    187 KB (21,006 words) - 22:14, 12 November 2012
  • ...however, a language package should have over 60% coverage on a variety of corpora and should probably have at least 2500 stems to be considered minimally use * The coverage of the transducer on a variety of corpora
    15 KB (1,783 words) - 22:33, 1 February 2019
  • == Estimating rules using parallel corpora == ...see [[Running the monolingual rule learning]] if you only have monolingual corpora).
    14 KB (2,181 words) - 19:01, 17 August 2018
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc ... === Corpora and corpora projects ===
    9 KB (987 words) - 23:25, 22 December 2014
  • * directory es-tagger-data : Contains data needed for the Spanish tagger (corpora, etc.) * directory ca-tagger-data : Contains data needed for the Catalan tagger (corpora, etc.)
    50 KB (7,915 words) - 00:04, 10 March 2019
  • | name = Dictionary induction from parallel corpora / Revive ReTraTos | description = Extract dictionaries from parallel corpora
    23 KB (3,198 words) - 09:15, 4 March 2024
  • ...quirements for corpora, and a number of different formats for storing such corpora have sprung up. Some examples include: ...ng on). The following is an idea Jonathan has for implementing a standard corpora format for use by apertium.
    5 KB (813 words) - 00:08, 28 December 2011
  • ...millions of words as in the statistical methods: it takes only two smaller corpora and a dictionary containing rules to conjugate verbs and to match nouns and ...anguage pairs of the same linguistic family without the need of linguistic corpora. The experience of Apertium with several minoritised languages such as Occi
    15 KB (2,339 words) - 00:41, 4 June 2018
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    22 KB (2,520 words) - 23:09, 22 December 2014
  • === Annotated corpora ===
    18 KB (2,312 words) - 18:25, 18 September 2016
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    12 KB (1,308 words) - 19:27, 27 August 2017
  • * Collect parallel kaz-eng corpora! By new plan, we focused on adding vocabulary from 4 corpora.
    20 KB (2,856 words) - 06:26, 27 May 2021
  • * [http://corpus.leeds.ac.uk/query-zh.html A Collection of Chinese Corpora and Frequency Lists.] ===Corpora===
    16 KB (2,148 words) - 03:28, 16 December 2015
  • ....za/Faculties/ART/Xhosa/Pages/Research-.aspx "Cross linguistics upon Xhosa Corpora Research"] == Monolingual/Parallel Corpora ==
    4 KB (566 words) - 05:57, 18 April 2020
  • {{deprecated2|Learning rules from parallel and non-parallel corpora}} * a parallel corpus (see [[Corpora]])
    15 KB (2,206 words) - 13:58, 7 October 2014
  • ==Getting corpora== WORDLIST=/home/spectre/corpora/afrikaans-meester-utf8.txt
    16 KB (2,566 words) - 21:36, 15 March 2020
  • ...ert the data between the formats. It also allows you to either upload or paste corpora in plain text and then convert them into CoNLL-U. ...des support for saving user corpora on the server and then accessing the saved corpora via a unique URL.
    6 KB (930 words) - 15:59, 29 August 2017
  • ...is an open source tool included in the Apertium project that lets you train corpora and manage related files with a friendly user interface, letting you foc ...rom this view you are able to see corpora and training details, insert new corpora and train them easily
    8 KB (1,376 words) - 11:14, 29 October 2014
  • * [[Learning rules from parallel and non-parallel corpora]] – this is the current documentation on training/inferring rules ** preprocess corpora
    4 KB (541 words) - 13:46, 29 March 2021
  • ===Corpora=== * [http://corpora.uni-leipzig.de/en?corpusId=hin_news_2011 Hindi News Corpus] Creative Common
    6 KB (806 words) - 00:45, 7 December 2018
  • Apertium-regtest is a program for managing regression tests and [[Corpus test|corpora]]. # in the browser, select one or all of the corpora to rerun tests for
    11 KB (1,823 words) - 12:17, 6 June 2023
  • ===Corpora=== * [http://childes.talkbank.org/access/French/ CHILDES Corpora]. [http://talkbank.org/share/rules.html ''Requires reference'']
    15 KB (2,081 words) - 07:14, 12 August 2020
  • ..., Tommi Pirinen, Jonathan Washington. "Finite-state morphologies and text corpora as resources for improving morphological descriptions". [https://sites.goog ...f Inferring shallow-transfer machine translation rules from small parallel corpora]". In Journal of Artificial Intelligence Research. volume 34, p. 605-635.
    33 KB (4,418 words) - 11:52, 29 December 2021
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    12 KB (1,017 words) - 09:06, 18 January 2022
  • ====Training on Corpora with Ambiguous Lexical Units==== ...m-tagger</code> '', the tagger prints warnings about ambiguous analyses in corpora to stderr.''
    20 KB (3,229 words) - 20:06, 12 March 2018
  • * Tufiş, D., A. M. Barbu, V. Pătraşcu, G. Rotariu, and C. Popescu. "Corpora and Corpus-Based Morpho-Lexical Processing."&nbsp;''Recent Advances in Roma ===Corpora===
    7 KB (889 words) - 09:53, 28 November 2018
  • ...why the errors are so high. Re-evaluation will be done as soon as the two corpora are manually realigned. ...nslated corpora, and the label will be extracted from the manually written corpora. This method might provide better results since the model will be trained o
    6 KB (838 words) - 17:47, 25 July 2012
  • ==Corpora== The tagged corpora used in the experiments are found in the monolingual packages in [[language
    16 KB (1,448 words) - 16:50, 22 August 2017
  • ...n be found [https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora here]. ...m/corpora/jam/uzb.txt | apertium -d . uzb-kaa) -ref ../../../data4apertium/corpora/jam/kaa.txt
    5 KB (515 words) - 14:34, 1 September 2019
  • ===Corpora=== * [http://opus.nlpl.eu/ Hindi-English Parallel Corpora]
    8 KB (1,079 words) - 11:17, 3 December 2018
  • ===Corpora === * [http://www.elra.info/en/catalogues/free-resources/nepali-corpora/ ''"Nepali"'']
    8 KB (948 words) - 19:59, 30 December 2017
  • === Obtaining corpora (and getAlignmentWithText.pl) === The corpora need to be untarred, and inserted into a new, common directory.
    7 KB (973 words) - 02:52, 20 May 2021
  • Automatic shallow-transfer rules generation from parallel corpora ...in statistical machine translation, that have been extracted from parallel corpora and extended with a set of restrictions controlling their application.
    4 KB (525 words) - 19:21, 17 September 2009
  • ~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm
    12 KB (1,634 words) - 18:26, 26 September 2016
  • ...e project was a success. All the goals have been achieved: the creation of corpora in LSC; Italian monodix: apertium-srd-srd.dix: 51,743 words; apertium-ita-i ...rned to use markup languages (XML and HTML) for the creation of linguistic corpora. At present, I attend a Master’s Degree in Translation of specialized tex
    21 KB (3,171 words) - 14:34, 3 April 2017
  • == Corpora and Coverage == ...he help of mentors on Kipchak languages. Most frequent unknown tokens from corpora of each language (mostly consisting of Wikipedia entries, Bible and Quran)
    7 KB (798 words) - 18:30, 26 August 2019
  • ===Corpora=== * [https://korpora.zim.uni-duisburg-essen.de/Limas/ Corpora from Limas z.Hd. Prof. Dr. Bernhard Schröder Universität Duisburg-Essen,
    8 KB (900 words) - 10:15, 4 December 2018
  • == Corpora and Coverage == Our main corpora consisted of [https://www.rfa.org/uyghur/ RFA], [http://uy.ts.cn/ Tanritor]
    5 KB (607 words) - 13:25, 12 August 2018
  • == Corpora == ...8ba2a9c0e50bc885bfad3bfbff3b4afbd.pdf Building Open Javanese and Sundanese Corpora for Multilingual Text-to-Speech]
    7 KB (881 words) - 13:11, 12 December 2018
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    10 KB (1,263 words) - 06:04, 23 December 2014
  • == Corpora, sets and alignment == The parallel corpora for the Macedonian - English pair, a total of 207.778 parallel sentences, c
    5 KB (620 words) - 12:21, 27 July 2012
  • == Corpora == * wikipage: <section begin=azadliq2012-wikipage />RFERL corpora<section end=azadliq2012-wikipage />
    1,013 bytes (115 words) - 22:42, 12 August 2014
  • * Evaluate system on corpora === Various Potential Corpora ===
    10 KB (1,483 words) - 07:00, 14 August 2018
  • === Corpora === * [https://corpora.uni-leipzig.de/en?corpusId=ind_mixed_2013 Leipzig Corpora Collection - Indonesian]
    5 KB (629 words) - 13:08, 21 December 2019
  • ===Corpora===
    2 KB (172 words) - 17:09, 27 March 2017
  • ...ine. Rules can be manually written, or learnt from monolingual or parallel corpora. {{main|Learning rules from parallel and non-parallel corpora}}
    19 KB (2,820 words) - 15:26, 11 April 2023
  • * a parallel corpus (see [[Corpora]]) We will then work through the example with [[Corpora|EuroParl]] and Apertium's English-to-Spanish pair.
    9 KB (1,445 words) - 14:05, 7 October 2014
  • ...language model for the target language in order to create pseudo-parallel corpora, and use them in the same way as parallel ones. IRSTLM is a tool for building n-gram language models from corpora. It supports different smoothing and interpolation methods, including Writt
    3 KB (364 words) - 23:25, 23 August 2012
  • $ xzcat ~/corpora/nob/*ntb*.xz | head -100000 | apertium -d . nob-nno_e > 2019-09-30.before $ xzcat ~/corpora/nob/*ntb*.xz | head -100000 | apertium -d . nob-nno_e > 2019-09-30.after
    2 KB (327 words) - 08:02, 1 October 2019
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    22 KB (2,532 words) - 11:36, 30 July 2018
  • ...like the page on [[Aromanian]]—i.e., all available dictionaries, grammars, corpora, machine translators, etc., print or digital, where available, whether Free
    68 KB (10,323 words) - 15:37, 25 October 2014
  • ...for the lexical analysis and the contrastive analysis, the creation of corpora made up of texts written in the LSC variant, extracted from online magazines, was providential ...Mario Casu's Logudorese-Italian vocabulary and the in-depth analysis of the parallel corpora, which allowed us to understand what, case by case, the greatest ...
    13 KB (1,910 words) - 11:34, 23 August 2016
  • * Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository. |colspan="2" rowspan="4"| Corpus testvoc clean on all of the available corpora ||rowspan="4"| ||rowspan="4" colspan="2" style="text-align: center"| ✗||r
    8 KB (1,006 words) - 12:48, 9 March 2018
  • or adapting the software to fit your needs. Existing free (GPL) data and corpora easily reusable to feed Apertium's dictionaries are also welcome. ...particular needs. Making free (GPL) data and corpora available that can be reused to improve Apertium's dictionaries is also appreciated.
    26 KB (3,122 words) - 06:25, 27 May 2021
  • === Annotated corpora ===
    3 KB (241 words) - 20:44, 9 September 2020
  • ===Application for "Interface for creating tagged corpora" GSOC 2013===
    2 KB (200 words) - 08:21, 13 January 2015
  • ...pertium-kaz/stats|~{{:apertium-kaz/stats/average}}%]] coverage over random corpora ...pertium-tat/stats|~{{:apertium-tat/stats/average}}%]] coverage over random corpora
    4 KB (586 words) - 01:53, 10 March 2018
  • ===Corpora=== * [http://corpora.uni-leipzig.de/en?corpusId=hin_news_2011 Hindi News Corpus]
    4 KB (557 words) - 05:45, 25 August 2021
  • ...aled and combined. The formulas for combining them can be learnt from gold corpora in an unsupervised way. * gold-standard tagged corpora and
    5 KB (816 words) - 02:32, 13 February 2018
  • === Corpora ===
    7 KB (943 words) - 20:51, 6 September 2018
  • == Corpora ==
    4 KB (479 words) - 02:06, 28 February 2020
  • *Trainer on_fly working and training corpora '''Done''' ***Now the trainer is able to train corpora using just the keyboard, with a friendly user interface, cleaned corpus to
    12 KB (1,602 words) - 15:47, 10 October 2013
  • == Corpora ==
    2 KB (286 words) - 10:51, 4 June 2017
  • == Corpora ==
    2 KB (242 words) - 19:49, 3 January 2018
  • ...do some of the tests like generation testing or coverage testing, we need corpora, right? Have no fear, for `aq-wikicrp` is here! Let us get a Maltese wikipe ...t you'd expect, tests the dictionary for coverage. Using our newly created corpora, we can test the coverage! Feel free to use either one, but be consistent;
    12 KB (1,931 words) - 17:06, 24 October 2018
  • ...tion quality. This will involve improving coverage to 95-98% on a range of corpora and decreasing word error rate by 30-50%. For example if the current word e ...cial languages of EU? : Then this task will be hard. Pairs which have huge corpora of parallel texts, like the 24 official EU languages or the 3 EU working la
    2 KB (383 words) - 19:46, 2 March 2023
  • === Corpora ===
    773 bytes (75 words) - 19:17, 8 June 2014
  • === Corpora ===
    2 KB (302 words) - 16:23, 26 December 2017
  • == Corpora ==
    2 KB (278 words) - 00:24, 15 June 2021
  • == Corpora ==
    2 KB (290 words) - 02:07, 24 July 2019
  • == Corpora ==
    1 KB (158 words) - 03:34, 13 July 2021
  • == Corpora ==
    1 KB (146 words) - 20:16, 24 March 2020
  • ===Corpora===
    8 KB (1,143 words) - 18:45, 11 August 2015
  • === Corpora ===
    4 KB (538 words) - 02:40, 27 December 2016
  • === Corpora ===
    3 KB (390 words) - 09:39, 27 December 2017
  • == Corpora ==
    867 bytes (90 words) - 20:14, 24 March 2020
  • ...r the lexical analysis and contrastive selection it was providential to create corpora consisting of texts written in the LSC variant, taken from online magazines ...io Casu's Logudorese-Italian vocabulary and the in-depth analysis of parallel corpora, which have allowed us to understand what, case by case, the greatest number
    7 KB (1,110 words) - 11:34, 23 August 2016
  • == Corpora ==
    3 KB (367 words) - 06:16, 1 October 2021
  • * http://permalink.gmane.org/gmane.science.linguistics.corpora/22281 Arabic names from dbpedia ===Corpora===
    3 KB (437 words) - 10:23, 21 November 2021
  • === Corpora ===
    2 KB (194 words) - 04:52, 31 December 2017
  • == Corpora ==
    4 KB (440 words) - 21:41, 15 December 2019
  • == Corpora ==
    2 KB (272 words) - 21:51, 15 December 2019
  • == Corpora ==
    2 KB (213 words) - 17:55, 16 December 2017
  • == Corpora ==
    2 KB (250 words) - 16:26, 11 April 2015
  • == Corpora ==
    3 KB (342 words) - 21:33, 15 December 2019
  • ...Pirinen, Jonathan Washington (2015). "Finite-state morphologies and text corpora as resources for improving morphological descriptions". [https://sites.goog
    13 KB (1,710 words) - 20:32, 30 August 2018
  • == Corpora ==
    328 bytes (35 words) - 05:49, 16 December 2014
  • * Tagger training preparation (tagged corpora unification) * Tagger training preparation (tagged corpora unification)
    5 KB (506 words) - 14:56, 28 August 2017
  • == Corpora ==
    2 KB (235 words) - 05:58, 16 December 2014
  • == Corpora ==
    4 KB (524 words) - 18:29, 29 December 2013
  • ===Corpora===
    9 KB (1,145 words) - 06:19, 28 December 2018
  • == Corpora ==
    2 KB (231 words) - 15:43, 1 October 2021
  • == Corpora ==
    943 bytes (97 words) - 23:58, 7 September 2014
  • == Corpora ==
    7 KB (1,098 words) - 10:56, 4 May 2016
  • == Corpora ==
    1 KB (135 words) - 06:02, 16 December 2014
  • == Corpora ==
    2 KB (231 words) - 15:46, 1 October 2021
  • == Corpora ==
    1 KB (157 words) - 05:29, 22 August 2017
  • == Corpora ==
    6 KB (786 words) - 10:56, 4 May 2016
  • == Corpora ==
    1 KB (154 words) - 06:03, 16 December 2014
  • == Corpora ==
    2 KB (255 words) - 19:20, 9 October 2021
  • == Corpora ==
    1 KB (173 words) - 06:04, 16 December 2014
  • == Corpora ==
    1 KB (135 words) - 06:03, 16 December 2014
  • == Corpora ==
    2 KB (199 words) - 06:51, 6 July 2018
  • == Corpora ==
    1 KB (173 words) - 06:03, 16 December 2014
  • == Corpora ==
    1 KB (177 words) - 06:01, 16 December 2014
  • == Corpora ==
    2 KB (211 words) - 06:02, 16 December 2014
  • == Corpora ==
    1 KB (173 words) - 06:06, 16 December 2014
  • ...is a program for aligning words and sequences of words in sentence-aligned corpora. If you have a parallel corpus you can use GIZA++ to make bilingual dictionar *[[Corpora]]
    4 KB (589 words) - 11:51, 29 April 2015
  • == Corpora ==
    417 bytes (42 words) - 20:17, 24 March 2020
  • == Corpora ==
    2 KB (246 words) - 21:49, 15 December 2019
  • == Corpora ==
    1 KB (154 words) - 06:02, 16 December 2014
  • ==Parallel corpora==
    766 bytes (87 words) - 08:07, 20 January 2009
  • == Electronic Corpora ==
    335 bytes (29 words) - 16:39, 24 May 2017
  • == Corpora ==
    3 KB (414 words) - 21:40, 15 December 2019
  • == Corpora ==
    1 KB (158 words) - 05:28, 22 August 2017
  • '''CORPORA''' ...sketchengine.eu/user-guide/user-manual/corpora/by-language/hausa-boko-text-corpora/
    1 KB (179 words) - 18:40, 26 October 2018
  • == Corpora ==
    511 bytes (54 words) - 20:57, 15 February 2015
  • == Corpora ==
    1 KB (135 words) - 06:05, 16 December 2014
  • == Corpora ==
    1 KB (135 words) - 06:59, 28 June 2016
  • == Corpora ==
    2 KB (213 words) - 06:51, 6 July 2018
  • ==Corpora==
    2 KB (212 words) - 04:26, 2 January 2019
  • == Corpora ==
    1 KB (176 words) - 06:05, 16 December 2014
  • == Corpora ==
    1 KB (158 words) - 05:29, 22 August 2017
  • ===Corpora===
    8 KB (1,048 words) - 05:32, 1 December 2017
  • == Other Corpora ==
    6 KB (811 words) - 10:42, 2 July 2018
  • == Corpora ==
    3 KB (324 words) - 21:41, 15 December 2019
  • == Corpora ==
    1 KB (154 words) - 06:16, 16 December 2014
  • # the best taggers use hand-tagged corpora to train with (we use untagged corpora -- for English)
    7 KB (1,177 words) - 08:34, 8 October 2014
  • * [[Corpora]]
    13 KB (1,601 words) - 23:31, 23 July 2021
  • ...toscore.txt | ~/source/apertium/trunk/apertium-lex-learner/irstlm-ranker ~/corpora/català/en.blm > unk.trans.scored.txt
    9 KB (1,470 words) - 11:28, 24 March 2012
  • to be related. Both corpora must be pre-processed before the training. This pre-processing, consisting in analysing the corpora and
    11 KB (1,814 words) - 03:22, 9 March 2019
  • The final coverage of the system was around 90%, i.e. over a set of corpora 10 unknown words out of 100 on average. The word-error rate was around 17%, When linguistic resources, for example corpora, dictionaries, grammars, morphological analysers, lists of lemmata etc. are
    12 KB (1,683 words) - 08:42, 10 May 2013
  • The final coverage of the system was around 90%, i.e. over a set of corpora 10 unknown words out of 100 on average. The word-error rate was around 17%, When linguistic resources, for example corpora, dictionaries, grammars, morphological analysers, lists of lemmata etc. are
    12 KB (1,683 words) - 11:00, 30 October 2015
  • The final coverage of the system was around 90%, i.e. over a set of corpora 10 unknown words out of 100 on average. The word-error rate was around 17%, When linguistic resources, for example corpora, dictionaries, grammars, morphological analysers, lists of lemmata etc. are
    12 KB (1,679 words) - 12:00, 31 January 2012
  • * '''Coverage''' is the naïve coverage over one or more free corpora.
    6 KB (591 words) - 22:50, 30 October 2017
  • ...steps to create each one. But you'll want to build it up, testing against corpora, etc. You want to be able to have as many correct analyses as possible, an ...if your pair will support translation in both directions). The corpus or corpora ideally should represent a range of content—i.e., it shouldn't be just sp
    10 KB (1,615 words) - 07:43, 20 December 2015
  • ...dinian to Italian. We started a manual morphological disambiguation of the corpora that will help the translator to recognize the correct morphology of each w ... We have treated two corpora: one journalistic and more dialectal, and the other taken directly from literar
    9 KB (1,306 words) - 15:56, 2 September 2017
  • Word Sense Disambiguation for WordNet corpora ...oblem is that SMT requires a lot of data in the form of parallel corpora, since it is very data-hungry, and many languages cannot afford it. Whi
    8 KB (1,094 words) - 13:10, 14 April 2019
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    20 KB (2,336 words) - 18:10, 14 April 2015
  • ...t consisting of a set of linguistic resources (dictionaries, cross models, corpora, links to other LRDs, etc.). ...uistics resources: morphological and bilingual dictionaries, cross models, corpora, etc.</description>
    8 KB (902 words) - 09:19, 6 October 2014
  • * [[Learning rules from parallel and non-parallel corpora]] ...sing statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. '''The module is turned off in most cases as it does n
    4 KB (625 words) - 08:36, 29 April 2015
  • -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 & ...e> script to generate relative frequency lists of in-domain and out-domain corpora.
    5 KB (680 words) - 11:53, 26 September 2016
  • xzcat corpora/nno.xz | tr -d '#@/' | apertium -d . nno-nob-dgen | grep '.\{0,6\}[#@/].\{0
    9 KB (1,400 words) - 22:30, 18 January 2021
  • ** Overall accuracy (over parallel corpora): WER/PER/BLEU ...generalised script that supports hfst and lttoolbox binaries and arbitrary corpora would be good. It should also (optionally) output hitparades (e.g., freque
    2 KB (246 words) - 02:32, 1 June 2019
  • ...an give unsatisfactory results (WER ≈ 30%, coverage below 85% in Wikipedia corpora). Both were published in 2009 and, apparently, no one has worked on them si
    16 KB (2,285 words) - 06:46, 12 April 2019
  • Corpora in { gap } are large collections of texts enhanced with special markup. The Corpora in { gap } are large collections of { gap } with { gap }. They allow lingui
    9 KB (1,368 words) - 09:04, 23 April 2015
  • #* <code>$ cat corpus.dep | ./corpora/add_morph.py > corpus_with_annotation.dep</code> #* <code>$ cat corpus_with_annotation.dep | ./corpora/dep_to_seg.py > corpus_with_annotation.seg</code>
    1 KB (218 words) - 14:51, 24 April 2024
  • corpora. ...the process, creating an easy-to-use framework for using constraints with corpora in order to obtain information about words of interest; and later to provid
    6 KB (928 words) - 13:57, 3 April 2009
  • * large monolingual corpora of the language * parallel corpora of the language and some other language
    1 KB (202 words) - 19:55, 12 April 2021
  • ...ment specifying a set of linguistic resources (dictionaries, cross models, corpora, other LRD files, etc).
    5 KB (633 words) - 13:29, 6 October 2017
  • ...ger and an initial set of translation rules from monolingual and bilingual corpora.
    8 KB (1,255 words) - 19:50, 12 April 2021
  • ...ing with dictionaries, lexical selection rules, transfer rules, scripting, corpora. The objective is to facilitate the generation of varieties for languages w ...cial languages of EU? : Then this task will be hard. Pairs which have huge corpora of parallel texts, like the 24 official EU languages or the 3 EU working la
    2 KB (377 words) - 19:18, 25 January 2023
  • ...are creating invaluable linguistic resources such as disambiguated tagged corpora. [[Category:Ideas for Google Summer of Code|Interface for creating tagged corpora]]
    2 KB (269 words) - 21:26, 5 April 2013
  • Had an idea of fixing out-of-sync corpora automatically and started an "MVP" here: https://github.com/frankier/aperti ...opt (rather than the quality of its output). It could be used to help keep corpora, tagger models and morphologies in sync (though poking and possible automat
    3 KB (456 words) - 18:17, 29 August 2016
  • # all pronouns from Crimean Tatar corpora are translated without debug symbols * analyse corpora with crh-morph mode
    4 KB (496 words) - 18:27, 19 June 2017
  • ...an vocabulary were used to good effect to reach a high coverage on all the corpora.
    4 KB (551 words) - 23:52, 28 August 2017
  • ...edia dumps are useful for quickly getting a corpus. They are also the best corpora for making your language pair useful for Wikipedia's [[Content Translat $ zcat ~/Nedlastingar/cx-corpora.nb2nn.text.tmx.gz \
    3 KB (436 words) - 05:40, 10 April 2019
  • ...(grammatical descriptions, wordlists, dictionaries, spellcheckers, papers, corpora, etc.), along with the licences they are under. See for example the page [[
    14 KB (2,007 words) - 03:06, 27 October 2013
  • ...ich can be inserted into them. This data might consist of wordlists, word-corpora derived from web-crawlers such as [http://borel.slu.edu/crubadan/ Crubadán
    13 KB (2,112 words) - 12:11, 26 May 2023
  • ...ents immensely good at helping us out with these: for instance, annotating corpora that are needed to train Apertium modules, or finding bugs in the handling
    7 KB (1,111 words) - 10:10, 15 November 2015
  • ...maximum usage of available resources for marginalised languages. Parallel corpora, user-feedback, other translation systems.
    5 KB (802 words) - 07:04, 10 May 2012
  • ...ger and an initial set of translation rules from monolingual and bilingual corpora.
    10 KB (1,543 words) - 19:50, 12 April 2021
  • ...uistics resources: morphological and bilingual dictionaries, cross models, corpora, etc.</description>
    6 KB (689 words) - 22:58, 25 October 2018
  • $ bzcat ~/corpora/nno.txt.bz2 |./make-freqlist.sh > nno.freqlist
    4 KB (583 words) - 15:18, 10 January 2022
  • For the parallel corpus, we are going to use Europarl; the [[corpora]] page (English only) lists some others:
    5 KB (699 words) - 07:52, 8 October 2014
  • TRAIN=/home/philip/Apertium/corpora/raw/setimes-hr-mk-nikola
    3 KB (520 words) - 21:25, 14 February 2014
  • ...ents immensely good at helping us out with these: for instance, annotating corpora that are needed to train Apertium modules, or finding bugs in the handling
    6 KB (987 words) - 10:21, 7 November 2014
  • For the parallel corpus we're going to use Europarl, the page [[corpora]] lists some others:
    4 KB (647 words) - 07:45, 8 October 2014
  • Some corpora are formatted in XML and put e.g. the real text contents inside a particula
    5 KB (863 words) - 09:04, 10 October 2017
  • MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm
    4 KB (503 words) - 19:01, 17 August 2018
  • ...morphological disambiguator which will be useful for disambiguating the corpora and which will be indispensable for developing other language pairs
    13 KB (2,173 words) - 19:17, 24 June 2018
  • ...nually filtered and corrected source - target pairs from OpenSubtitles2018 corpora preprocessed with bicleaner + both ways Apertium translations (ukr -> rus,
    7 KB (1,033 words) - 15:27, 15 August 2018
  • ...UD Annotatrix] - in-browser software for annotating Universal Dependencies corpora.
    2 KB (251 words) - 10:07, 27 June 2022
  • ...lexicon database with part of speech. We achieved coverage of % on SETimes corpora. And I am really happy with kymorph. Special thanks to firspeaker.
    5 KB (680 words) - 07:14, 26 August 2011
  • Once a transducer has ~80% coverage on a range of medium-large corpora we can say it is "working". Over 90% and it can be considered to be "produc
    19 KB (2,201 words) - 09:21, 9 December 2019
  • * corpora.
    9 KB (1,494 words) - 05:58, 18 March 2015
  • * consider including the web concordancer on the site (and consider what corpora to provide search access to...)
    4 KB (514 words) - 21:24, 19 August 2015
  • Some pre-processed corpora can be found [http://corpora.informatik.uni-leipzig.de/download.html here] and [http://wt.jrc.it/lt/Acqu
    7 KB (1,058 words) - 07:37, 4 July 2016
  • ...t that they could increase the coverage significantly, because the testing corpora are either news or WP).
    8 KB (1,205 words) - 21:50, 19 July 2012
  • ...rds is a semi-standard convention (it occurs at least to some extent in all the corpora). We should figure out where this is happening and see if it's something w
    28 KB (769 words) - 11:34, 13 April 2013
  • {{see-also|Corpora}}
    11 KB (1,750 words) - 13:24, 10 December 2010
  • .... (2008) "Automatic induction of bilingual resources from aligned parallel corpora: application to shallow-transfer machine translation". ''Machine Translatio
    8 KB (1,301 words) - 09:43, 6 October 2014
  • WORDLIST=/home/spectre/corpora/afrikaans-meester-utf8.txt
    11 KB (1,852 words) - 07:04, 8 October 2014
  • ...ns: apertium-kaz-tat has at least 15000 top stems, 95% coverage on all the corpora we have, and no more than 15% Word-Error-Rate on any randomly selected text
    4 KB (603 words) - 21:20, 31 August 2015
  • * Optimised for small corpora (under 100k parallel sentences)
    869 bytes (111 words) - 15:06, 29 June 2020
  • ...li08j.pdf Automatic induction of bilingual resources from aligned parallel corpora: application to shallow-transfer machine translation]". ''Machine Translati
    8 KB (1,273 words) - 09:32, 3 May 2024
  • -x, --xml Output corpora in XML format
    9 KB (1,003 words) - 11:02, 30 August 2011
  • The corpora used for this task can be found here: http://www.statmt.org/europarl/v7/sl-
    6 KB (625 words) - 16:54, 1 July 2013
  • Before you start you first need a [[Corpora|corpus]]. Look in apertium-eo-en/corpa/enwiki.crp.txt.bz2 (run bunzip2 -c e
    6 KB (966 words) - 20:16, 23 July 2021
  • ...learning to construct such n-level transducers, working with some learning corpora, and mostly using the OSTIA state-merging algorithm.
    6 KB (842 words) - 06:41, 20 October 2014
  • Before you start, you first need a [[Corpora|corpus]]. Look in apertium-eo-en/corpa/enwiki.crp.txt.bz2. Run
    7 KB (1,057 words) - 11:52, 7 October 2014
  • * Efficiency: Make it scale up to corpora of millions of words. This might involve doing (a) pre-analysis of the corp
    3 KB (549 words) - 02:11, 10 March 2018
  • ...to make a translation guesser using the existing bidix and two monolingual corpora in a similar way.
    4 KB (558 words) - 13:07, 26 June 2020
  • ...ingual dictionaries: At the beginning we started using Chinese and Spanish corpora in order to obtain lots of Chinese-Spanish word pairs. Using the Stanford S
    7 KB (830 words) - 21:33, 30 September 2013
  • ...story] (or [https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam from here] ) as possible &mdash; Minimum one sentence. ...stvoc]] clean, and has a coverage of around 80% or more on a range of free corpora.
    6 KB (1,024 words) - 15:22, 20 April 2021
  • * [[Corpora]]
    1 KB (164 words) - 05:20, 4 December 2019
  • ...l trained on prepared datasets which were made from parsed syntax-labelled corpora (mostly UD-treebanks). The classifier analyzes the given sequence of morpho
    5 KB (764 words) - 01:40, 8 March 2018
  • |0|| || collecting Tatar and Bashkir corpora, scraping a parallel corpus, making a frequency dictionary
    2 KB (228 words) - 10:55, 9 May 2018
  • * [[Corpora]]
    1 KB (185 words) - 13:45, 7 October 2014
  • {{LangStats2 | lang = kaz | corpora = Әуезов,bible,azattyq2010,wp2011,quran | corpus1 = Әуезов | co
    612 bytes (76 words) - 12:36, 9 January 2013
  • Corpora
    3 KB (482 words) - 19:28, 3 November 2017
  • ...the weighted rules on overall quality and speed of translation using large corpora for training and evaluation.
    9 KB (1,387 words) - 13:37, 23 August 2016
  • * Discovering these kind of things should be possible using only monolingual corpora, or resources such as Wikipedia.
    2 KB (356 words) - 21:15, 21 June 2020
  • Write a script that reads two parallel corpora, applies the appropriate monolingual taggers and some word-aligner ([https:
    511 bytes (84 words) - 18:17, 21 March 2024
  • wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/SFST/SFST-1.4.6g.tar.gz
    3 KB (426 words) - 17:48, 30 March 2012
  • *[[Corpora]]
    4 KB (553 words) - 08:43, 8 October 2014
  • '''Corpora:
    2 KB (231 words) - 14:38, 5 October 2019
  • * [[/Corpora]]
    1 KB (144 words) - 20:07, 15 July 2021
  • ...helper when training (see [[Learning rules from parallel and non-parallel corpora]]).
    3 KB (392 words) - 05:40, 22 August 2021