Difference between revisions of "English"

From Apertium
 
(12 intermediate revisions by the same user not shown)
Line 1: Line 1:
'''English''' ([[Wikipedia:English language]]) is a West Germanic language. It is available in [[Apertium]] as a standalone analyser/generator (apertium-eng) and as a component of several pairs which translate to/from English.
'''English''' ([[Wikipedia:English language]]) is a West Germanic language. It is available in Apertium as a standalone analyser/generator ([[English#Apertium-eng|apertium-eng]]) and as a component of several pairs which translate to/from English.


== Language pairs ==
Line 149: Line 149:
|-
|-
|}
|}

== Apertium-eng ==

===Current status===

''Last update: 28 Aug 2017''

'''Dix entries:''' 54,453

'''Dix paradigms:''' 377

'''Coverage:''' 93.55% (Wikipedia)

===Dictionary guidelines===

The English dictionary is large (nearly 55,000 entries), so keeping it tidy is essential for future development:

* Keep entries sorted alphabetically.
* Keep entries grouped by type and tags (do not mix different types of proper nouns together).
* Check the file with apertium-dixtools (to update the number of entries and remove duplicates).

====Spelling variants====

The standard spelling variant in Apertium is British English. American English spellings are officially supported: mark the original British entry with <code>v="eng"</code> and the American entry with <code>v="eng_US"</code>. The American entry should be a subform of the British one.

=== Tagger ===

Apertium-eng currently uses a perceptron tagger trained in a supervised manner on a hand-tagged corpus. The corpus can be found [https://svn.code.sf.net/p/apertium/svn/languages/apertium-eng/texts/consensus here], and should be updated after any tag change in the monolingual dictionary so that it matches the current state of the language module. The MTX file for English currently defines no rules; adding some restrictions should improve disambiguation.

The tagger can be trained with the perceptron using the following command (see [[Perceptron tagger]]):

<pre>
$ apertium-tagger -xs 10 eng.prob eng.tagged eng.untagged apertium-eng.eng.mtx
</pre>

''Note: in the example above, "10" is the number of iterations, "eng.tagged" is the hand-tagged corpus and "eng.untagged" is the corpus tagged by apertium-tagger.''

Before making changes to the hand-tagged corpus, make sure you have read the [[Tagging_guidelines_for_English|tagging guidelines]]. A good corpus is the key to a good tagger!

=== Constraint Grammar ===

Apertium-eng currently has 38 CG rules. CG offers plenty of room to improve disambiguation, although ideally some rules should be moved to the tagger's MTX file.

===Future work===

* Improve support for British/American orthographic variants.



''For further documentation about English in Apertium, check: [[:Category:English]]''

Latest revision as of 22:00, 1 May 2018

English (Wikipedia:English language) is a West Germanic language. It is available in Apertium as a standalone analyser/generator (apertium-eng) and as a component of several pairs which translate to/from English.

Language pairs

See also: List of language pairs

In trunk:

Pair name Languages Last update
apertium-cy-en Welsh <-> English 13 Dec 2015
apertium-en-ca English <-> Catalan 28 Mar 2016
apertium-en-es English <-> Spanish 11 Apr 2017
apertium-en-gl English <-> Galician 15 Jul 2016
apertium-eo-en Esperanto <-> English 13 Dec 2015
apertium-eu-en Basque --> English 13 Dec 2015
apertium-hbs-eng Serbo-Croatian <-> English 15 Oct 2014
apertium-isl-eng Icelandic --> English 02 Mar 2016
apertium-mk-en Macedonian --> English 12 Oct 2014

In staging:

Pair name Languages Last update
apertium-eng-kaz English <-> Kazakh 14 Apr 2017

In nursery:

Pair name Languages Last update
apertium-bg-en Bulgarian <-> English 09 Jun 2014
apertium-en-pt English <-> Portuguese 09 Jun 2014
apertium-eng-afr English <-> Afrikaans 18 Nov 2016
apertium-eng-deu English <-> German 13 Apr 2017
apertium-eng-hin English <-> Hindi 12 Jan 2017
apertium-fin-eng Finnish <-> English 07 Jun 2015
apertium-hye-eng Armenian --> English 22 Jan 2013
apertium-nor-eng Norwegian <-> English 25 Apr 2016

In incubator:

Pair name Languages Last update
apertium-asm-eng Assamese <-> English 04 Jan 2016
apertium-bn-en Bengali <-> English 04 Jan 2016
apertium-ckb-eng Central Kurdish <-> English 13 Oct 2016
apertium-ell-eng Modern Greek <-> English 29 Sep 2015
apertium-en-ga English -?- Irish 13 Dec 2016
apertium-en-it English <-> Italian 26 Jun 2015
apertium-en-lt English -?- Lithuanian 07 Dec 2010
apertium-en-lv English -?- Latvian 26 Jun 2015
apertium-en-mt English -?- Maltese 19 Jun 2011
apertium-en-nl English <-> Dutch 29 Apr 2011
apertium-en-pl English <-> Polish 26 Jun 2015
apertium-en-sq English --> Albanian 31 Aug 2010
apertium-eng-cat English <-> Catalan 24 Jan 2016
apertium-eng-ina English <-> Interlingua 13 Jan 2017
apertium-eng-ita English <-> Italian 08 Jan 2017
apertium-eng-lvs English <-> Standard Latvian 09 Jun 2014
apertium-eng-pes English <-> Iranian Persian 11 Aug 2015
apertium-eng-sco English <-> Scots 01 Apr 2017
apertium-eng-tel English <-> Telugu 26 Jun 2015
apertium-fra-eng French <-> English 24 Mar 2017
apertium-gle-eng Irish <-> English 01 Feb 2016
apertium-ht-en Haitian Creole --> English 04 Jan 2016
apertium-hun-eng Hungarian --> English 19 Jan 2016
apertium-kmr-eng Northern Kurdish (Kurmanji) <-> English 06 Mar 2017
apertium-ky-en Kyrgyz -?- English 29 Jun 2011
apertium-la-en Latin -?- English 01 Dec 2011
apertium-lat-eng Latin <-> English 11 Jan 2017
apertium-mal-eng Malayalam <-> English 04 Jan 2016
apertium-mar-eng Marathi --> English 12 May 2013
apertium-mfe-en Morisyen --> English 19 Jun 2010
apertium-ne-en Nepali <-> English 26 Jun 2015
apertium-pes-eng Iranian Persian <-> English 11 Mar 2017
apertium-rus-eng Russian -?- English 18 May 2014
apertium-sah-eng Yakut -?- English 18 Mar 2015
apertium-si-en Sinhala <-> English 26 Jun 2015
apertium-sjo-eng Xibe --> English 09 Nov 2014
apertium-swa-eng Swahili -?- English 17 Dec 2016
apertium-swe-eng Swedish <-> English 17 Dec 2016
apertium-tha-eng Thai <-> English 03 Dec 2016
apertium-tr-en Turkish <-- English 04 Aug 2011
apertium-vi-en Vietnamese --> English 24 Oct 2010

Apertium-eng

Current status

Last update: 28 Aug 2017

Dix entries: 54,453

Dix paradigms: 377

Coverage: 93.55% (Wikipedia)

Dictionary guidelines

The English dictionary is large (nearly 55,000 entries), so keeping it tidy is essential for future development:

  • Keep entries sorted alphabetically.
  • Keep entries grouped by type and tags (do not mix different types of proper nouns together).
  • Check the file with apertium-dixtools (to update the number of entries and remove duplicates).
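
The first and third guidelines can be checked mechanically. The following is a minimal Python sketch, not a replacement for apertium-dixtools: it assumes entries are `<e lm="...">` elements, treats two entries with the same lemma and the same paradigm references as duplicates, and runs on a made-up three-entry sample. The sample lemmas, the paradigm name and the function names are illustrative assumptions. Since entries are grouped by type, a check like this would be applied to one group at a time rather than to the whole file.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-entry sample; real entries live in the monolingual .dix file.
SAMPLE_DIX = """<dictionary>
  <section id="main" type="standard">
    <e lm="apple"><i>apple</i><par n="house__n"/></e>
    <e lm="banana"><i>banana</i><par n="house__n"/></e>
    <e lm="apple"><i>apple</i><par n="house__n"/></e>
  </section>
</dictionary>"""


def entry_key(entry):
    """Identify an entry by its lemma and the paradigms it references."""
    paradigms = tuple(par.get("n") for par in entry.iter("par"))
    return (entry.get("lm"), paradigms)


def check_section(xml_text):
    """Return (out_of_order_lemmas, duplicate_keys) for entries that have a lemma."""
    root = ET.fromstring(xml_text)
    entries = [e for e in root.iter("e") if e.get("lm") is not None]
    lemmas = [e.get("lm") for e in entries]
    # A lemma is out of order if it sorts before the lemma preceding it.
    out_of_order = [
        cur for prev, cur in zip(lemmas, lemmas[1:])
        if cur.casefold() < prev.casefold()
    ]
    seen, duplicates = set(), []
    for e in entries:
        key = entry_key(e)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return out_of_order, duplicates


out_of_order, duplicates = check_section(SAMPLE_DIX)
print("out of order:", out_of_order)
print("duplicates:", duplicates)
```

On the sample, the second "apple" entry is reported both as out of order and as a duplicate.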

Spelling variants

The standard spelling variant in Apertium is British English. American English spellings are officially supported: mark the original British entry with v="eng" and the American entry with v="eng_US". The American entry should be a subform of the British one.
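
To make the convention concrete, here is a small Python sketch that reads a hypothetical dictionary fragment and lists the lemmas that apply to a given variant. Only the v="eng" and v="eng_US" attribute values come from the guideline above; the sample entries, the paradigm name and the selection rule (unmarked entries apply to every variant) are assumptions for illustration, and the sketch does not reproduce how the Apertium tools themselves handle the attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical entries: only the v="eng" / v="eng_US" values come from the
# guideline; the lemmas and the paradigm name are made up for illustration.
SAMPLE_ENTRIES = """<section id="main" type="standard">
  <e lm="colour" v="eng"><i>colour</i><par n="house__n"/></e>
  <e lm="color" v="eng_US"><i>color</i><par n="house__n"/></e>
  <e lm="house"><i>house</i><par n="house__n"/></e>
</section>"""


def lemmas_for_variant(xml_text, variant):
    """Keep entries with no v attribute plus those marked with `variant`."""
    root = ET.fromstring(xml_text)
    return [
        e.get("lm") for e in root.iter("e")
        if e.get("v") in (None, variant)
    ]


print(lemmas_for_variant(SAMPLE_ENTRIES, "eng"))
print(lemmas_for_variant(SAMPLE_ENTRIES, "eng_US"))
```

With the sample above, the British selection keeps "colour" and "house", while the American selection keeps "color" and "house".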

Tagger

Apertium-eng currently uses a perceptron tagger trained in a supervised manner on a hand-tagged corpus. The corpus can be found at https://svn.code.sf.net/p/apertium/svn/languages/apertium-eng/texts/consensus and should be updated after any tag change in the monolingual dictionary so that it matches the current state of the language module. The MTX file for English currently defines no rules; adding some restrictions should improve disambiguation.

The tagger can be trained with the perceptron using the following command (see Perceptron tagger):

$ apertium-tagger -xs 10 eng.prob eng.tagged eng.untagged apertium-eng.eng.mtx

Note: in the example above, "10" is the number of iterations, "eng.tagged" is the hand-tagged corpus and "eng.untagged" is the corpus tagged by apertium-tagger.
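
If retraining is scripted, the argument order is easy to get wrong. The sketch below is a hypothetical Python helper (the function name and the validation are assumptions, not part of Apertium) that only assembles the same command line shown above; it does not run the tagger.

```python
import shlex


def build_training_command(iterations, model, tagged, untagged, mtx):
    """Assemble the argument list in the same order as the command above."""
    if iterations < 1:
        raise ValueError("iterations must be a positive integer")
    return ["apertium-tagger", "-xs", str(iterations), model, tagged, untagged, mtx]


command = build_training_command(
    10, "eng.prob", "eng.tagged", "eng.untagged", "apertium-eng.eng.mtx"
)
print(shlex.join(command))
# To run it for real (requires apertium-tagger and the corpus files):
#   import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a file path contains spaces.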

Before making changes to the hand-tagged corpus, make sure you have read the tagging guidelines. A good corpus is the key to a good tagger!

Constraint Grammar

Apertium-eng currently has 38 CG rules. CG offers plenty of room to improve disambiguation, although ideally some rules should be moved to the tagger's MTX file.

Future work

  • Improve support for British/American orthographic variants.


For further documentation about English in Apertium, check: Category:English