Latest revision as of 22:00, 1 May 2018
English (Wikipedia:English language) is a West Germanic language. It is available in Apertium as a standalone analyser/generator (apertium-eng) and as a component of several pairs which translate to/from English.
Language pairs
See also: List of language pairs
In trunk:
Pair name | Languages | Last update |
---|---|---|
apertium-cy-en | Welsh <-> English | 13 Dec 2015 |
apertium-en-ca | English <-> Catalan | 28 Mar 2016 |
apertium-en-es | English <-> Spanish | 11 Apr 2017 |
apertium-en-gl | English <-> Galician | 15 Jul 2016 |
apertium-eo-en | Esperanto <-> English | 13 Dec 2015 |
apertium-eu-en | Basque --> English | 13 Dec 2015 |
apertium-hbs-eng | Serbo-Croatian <-> English | 15 Oct 2014 |
apertium-isl-eng | Icelandic --> English | 02 Mar 2016 |
apertium-mk-en | Macedonian --> English | 12 Oct 2014 |
In staging:
Pair name | Languages | Last update |
---|---|---|
apertium-eng-kaz | English <-> Kazakh | 14 Apr 2017 |
In nursery:
Pair name | Languages | Last update |
---|---|---|
apertium-bg-en | Bulgarian <-> English | 09 Jun 2014 |
apertium-en-pt | English <-> Portuguese | 09 Jun 2014 |
apertium-eng-afr | English <-> Afrikaans | 18 Nov 2016 |
apertium-eng-deu | English <-> German | 13 Apr 2017 |
apertium-eng-hin | English <-> Hindi | 12 Jan 2017 |
apertium-fin-eng | Finnish <-> English | 07 Jun 2015 |
apertium-hye-eng | Armenian --> English | 22 Jan 2013 |
apertium-nor-eng | Norwegian <-> English | 25 Apr 2016 |
In incubator:
Pair name | Languages | Last update |
---|---|---|
apertium-asm-eng | Assamese <-> English | 04 Jan 2016 |
apertium-bn-en | Bengali <-> English | 04 Jan 2016 |
apertium-ckb-eng | Central Kurdish <-> English | 13 Oct 2016 |
apertium-ell-eng | Modern Greek <-> English | 29 Sep 2015 |
apertium-en-ga | English -?- Irish | 13 Dec 2016 |
apertium-en-it | English <-> Italian | 26 Jun 2015 |
apertium-en-lt | English -?- Lithuanian | 07 Dec 2010 |
apertium-en-lv | English -?- Latvian | 26 Jun 2015 |
apertium-en-mt | English -?- Maltese | 19 Jun 2011 |
apertium-en-nl | English <-> Dutch | 29 Apr 2011 |
apertium-en-pl | English <-> Polish | 26 Jun 2015 |
apertium-en-sq | English --> Albanian | 31 Aug 2010 |
apertium-eng-cat | English <-> Catalan | 24 Jan 2016 |
apertium-eng-ina | English <-> Interlingua | 13 Jan 2017 |
apertium-eng-ita | English <-> Italian | 08 Jan 2017 |
apertium-eng-lvs | English <-> Standard Latvian | 09 Jun 2014 |
apertium-eng-pes | English <-> Iranian Persian | 11 Aug 2015 |
apertium-eng-sco | English <-> Scots | 01 Apr 2017 |
apertium-eng-tel | English <-> Telugu | 26 Jun 2015 |
apertium-fra-eng | French <-> English | 24 Mar 2017 |
apertium-gle-eng | Irish <-> English | 01 Feb 2016 |
apertium-ht-en | Haitian Creole --> English | 04 Jan 2016 |
apertium-hun-eng | Hungarian --> English | 19 Jan 2016 |
apertium-kmr-eng | Northern Kurdish <-> English | 06 Mar 2017 |
apertium-ky-en | Kyrgyz -?- English | 29 Jun 2011 |
apertium-la-en | Latin -?- English | 01 Dec 2011 |
apertium-lat-eng | Latin <-> English | 11 Jan 2017 |
apertium-mal-eng | Malayalam <-> English | 04 Jan 2016 |
apertium-mar-eng | Marathi --> English | 12 May 2013 |
apertium-mfe-en | Morisyen --> English | 19 Jun 2010 |
apertium-ne-en | Nepali <-> English | 26 Jun 2015 |
apertium-pes-eng | Iranian Persian <-> English | 11 Mar 2017 |
apertium-rus-eng | Russian -?- English | 18 May 2014 |
apertium-sah-eng | Yakut -?- English | 18 Mar 2015 |
apertium-si-en | Sinhala <-> English | 26 Jun 2015 |
apertium-sjo-eng | Xibe --> English | 09 Nov 2014 |
apertium-swa-eng | Swahili -?- English | 17 Dec 2016 |
apertium-swe-eng | Swedish <-> English | 17 Dec 2016 |
apertium-tha-eng | Thai <-> English | 03 Dec 2016 |
apertium-tr-en | Turkish <-- English | 04 Aug 2011 |
apertium-vi-en | Vietnamese --> English | 24 Oct 2010 |
Apertium-eng
Current status
Last update: 28 Aug 2017
Dix entries: 54,453
Dix paradigms: 377
Coverage: 93.55% (Wikipedia)
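The page does not say how the coverage figure was measured. As a rough illustration only, a naive coverage estimate can be computed from analyser output in Apertium stream format, where an unknown word is returned with an asterisk in front of it (for example `^zzyzx/*zzyzx$`). The function name and the sample sentence below are made up for this sketch, and escaped characters in the stream are ignored.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \/ or \^ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return the share of lexical units that have at least one known analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # The analyser marks an unknown word by prefixing its analysis with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Illustrative analyser output; the tags are examples, not taken from apertium-eng.
sample = (
    "^the/the<det><def><sp>$ ^cat/cat<n><sg>$ "
    "^zzyzx/*zzyzx$ ^sleeps/sleep<vblex><pres><p3><sg>$"
)
print(f"{naive_coverage(sample):.2%}")  # 3 of 4 units are known: 75.00%
```

Running the same function over the analysed form of a large corpus gives a figure comparable in spirit to the one above, although the exact tokenisation used for the published number is unknown.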
Dictionary guidelines
The current English dictionary is large (nearly 55,000 entries), so keeping it tidy is essential for future development:
- Keep entries sorted alphabetically.
- Keep entries grouped by type and tags (do not mix different types of proper nouns together).
- Check the file with apertium-dixtools (to update the number of entries and remove duplicates).
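The first and third guidelines can be spot-checked with a few lines of Python before running apertium-dixtools. This is a minimal sketch, not a replacement for that tool: it assumes entries are `<e>` elements with an `lm` attribute and a direct `<par n="..."/>` child, it compares lemmas case-insensitively, and it treats each `<section>` as one sorted list, whereas the real dictionary is sorted within groups of entries. The sample fragment is invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_section(section):
    """Return (out_of_order, duplicates) for the <e> entries of one <section>."""
    keys = []
    for entry in section.iter("e"):
        par = entry.find("par")
        keys.append((
            entry.get("lm", ""),                       # lemma
            par.get("n", "") if par is not None else "",  # paradigm name
            entry.get("v", ""),                        # spelling variant, if any
        ))
    lemmas = [lm.casefold() for lm, _par, _variant in keys]
    # Adjacent pairs whose lemmas are in descending order.
    out_of_order = [
        (keys[i][0], keys[i + 1][0])
        for i in range(len(keys) - 1)
        if lemmas[i] > lemmas[i + 1]
    ]
    # Entries with the same lemma, paradigm and variant seen more than once.
    duplicates = [key for key, count in Counter(keys).items() if count > 1]
    return out_of_order, duplicates


# Invented fragment for illustration; "house__n" stands in for any noun paradigm.
SAMPLE = """<dictionary>
  <section id="main" type="standard">
    <e lm="apple"><i>apple</i><par n="house__n"/></e>
    <e lm="cat"><i>cat</i><par n="house__n"/></e>
    <e lm="book"><i>book</i><par n="house__n"/></e>
    <e lm="cat"><i>cat</i><par n="house__n"/></e>
  </section>
</dictionary>"""

for section in ET.fromstring(SAMPLE).iter("section"):
    out_of_order, duplicates = check_section(section)
    print("out of order:", out_of_order)
    print("duplicates:", duplicates)
```

For the real file, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`; the path is left to the reader.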
Spelling variants
The standard spelling variant in Apertium is British English. American English spellings are officially supported by marking the original British entry with v="eng" and the American entry with v="eng_US"; the American entry should be a subform of the British one.
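To make the attribute usage concrete, the sketch below embeds a hypothetical dictionary fragment and groups its lemmas by the value of `v`. Only the attribute name and the two values `eng` and `eng_US` come from the text above; the lemmas, the paradigm name and the overall entry layout are assumptions, and the sketch does not try to model how the "subform" relation is encoded.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: lemmas, paradigm name and layout are illustrative only.
SAMPLE = """<section id="main" type="standard">
  <e lm="colour" v="eng"><i>colour</i><par n="house__n"/></e>
  <e lm="color" v="eng_US"><i>color</i><par n="house__n"/></e>
  <e lm="house"><i>house</i><par n="house__n"/></e>
</section>"""


def lemmas_by_variant(xml_text):
    """Map each value of the v attribute to its lemmas; entries without v go under None."""
    grouped = {}
    for entry in ET.fromstring(xml_text).iter("e"):
        grouped.setdefault(entry.get("v"), []).append(entry.get("lm"))
    return grouped


for variant, lemmas in lemmas_by_variant(SAMPLE).items():
    print(variant, lemmas)
```

A listing like this makes it easy to see which American entries exist and whether each has a British counterpart nearby.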
Tagger
Apertium-eng currently uses a perceptron tagger trained in a supervised manner on a hand-tagged corpus. The corpus can be found at https://svn.code.sf.net/p/apertium/svn/languages/apertium-eng/texts/consensus and should be updated after any tag change in the monolingual dictionary, so that it matches the current state of the language module. The MTX file for English currently defines no rules; adding restrictions to it should improve disambiguation.
The tagger can be trained with the perceptron using the following command (see Perceptron tagger):
$ apertium-tagger -xs 10 eng.prob eng.tagged eng.untagged apertium-eng.eng.mtx
Note: in the example above, "10" is the number of iterations, "eng.tagged" is the hand-tagged corpus and "eng.untagged" is the corpus tagged by apertium-tagger.
Before making changes to the hand-tagged corpus, make sure you have read the tagging guidelines. A good corpus is the key to a good tagger!
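One way to notice that the hand-tagged corpus has drifted from the dictionary is to compare it, token by token, with fresh analyser output for the same text. The following is a minimal sketch under several assumptions: both inputs are in Apertium stream format (`^surface/analysis$`), they contain the same tokens in the same order, the hand-tagged side has exactly one analysis per token, and escaped characters are ignored. The function name and the sample tokens are invented for illustration.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \/ or \^ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def stale_tokens(tagged_text, analysed_text):
    """List hand-tagged units whose chosen analysis the analyser no longer offers."""
    tagged = LEXICAL_UNIT.findall(tagged_text)
    analysed = LEXICAL_UNIT.findall(analysed_text)
    if len(tagged) != len(analysed):
        raise ValueError("corpora are not aligned token by token")
    stale = []
    for (surface, chosen), (_surface, options) in zip(tagged, analysed):
        # The ambiguous side separates alternative analyses with '/'.
        if chosen not in options.split("/"):
            stale.append((surface, chosen))
    return stale


# Illustrative data; the tags are examples, not taken from apertium-eng.
tagged = "^the/the<det><def><sp>$ ^books/book<n><pl>$"
analysed = "^the/the<det><def><sp>$ ^books/book<n><pl>/book<vblex><pres><p3><sg>$"
print(stale_tokens(tagged, analysed))  # → []
```

Any token reported by the function is a candidate for re-tagging after a tag change in the dictionary.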
Constraint Grammar
Apertium-eng currently has 38 CG rules. There is a lot of room for disambiguation improvement using CG, but ideally some rules should be moved to the tagger MTX.
Future work
- Improve support for British/American orthography variants.
For further documentation about English in Apertium, check: Category:English