Tagging guidelines for English

From Apertium
Revision as of 14:13, 26 September 2016 by Rcrowther (Talk | contribs)

About tagging

You can think of part-of-speech tagging as a bit like answering a series of multiple-choice questions. The word is the question, and the possible analyses are the answers. Unknown words can be thought of as questions whose possible answers we don't know yet. To "tag" the text, you need to answer all of the questions by deleting the "incorrect" answers.
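
In Apertium's stream format, each word is written as a lexical unit of the form ^surface/analysis/analysis$, so tagging by hand amounts to deleting all but one analysis. The Python sketch below illustrates the idea. The analyses shown for "book" are made up for the example and may differ from what a real analyser outputs, the function names are hypothetical, and escaped characters are ignored for brevity.

```python
def parse_lexical_unit(unit):
    """Split '^surface/analysis1/analysis2$' into (surface, [analyses])."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("not a lexical unit: %r" % unit)
    surface, *analyses = unit[1:-1].split("/")
    return surface, analyses


def choose(unit, correct):
    """Keep only the analysis the annotator judged correct."""
    surface, analyses = parse_lexical_unit(unit)
    if correct not in analyses:
        raise ValueError("%r is not one of the offered analyses" % correct)
    return "^%s/%s$" % (surface, correct)


# Hypothetical analyses for "book"; a real analyser may offer other tags.
ambiguous = "^book/book<n><sg>/book<vblex><inf>/book<vblex><pres>$"

# In "Ann bought a book about gardening" the noun reading is the right answer.
tagged = choose(ambiguous, "book<n><sg>")
print(tagged)
```

Raising an error when the chosen analysis is not among those offered mirrors the multiple-choice picture: the annotator picks from the given answers rather than inventing new ones.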

Why is this important?

Hand-tagged texts are needed in large quantities (tens, or better hundreds, of thousands of words) to 'train' the automatic taggers found in some Apertium language pairs. Getting the right tag for a word is important because the translation depends on it. For instance, the word book can be a verb or a noun, and the two readings have different translations in Spanish:

  • [noun] Ann bought a book about gardening → Ann compró un libro sobre jardinería.
  • [verb] Ann wants to book a room at the Ritz → Ann quiere reservar una habitación en el Ritz.

This is why we have many hand-tagging tasks in the Google Code-In.

Guidelines

"both"

The word "both" can be a conjunction, joining two noun phrases; a determiner, modifying a noun phrase; or a pronoun, substituting for a noun phrase.

  • cnjcoo
    • I like both cats and dogs.
  • det
    • Both children like playing in the garden.
  • prn
    • Both thought it was a good idea.
    • They both like playing in the garden.
    • Both of them like playing in the garden.

"Which"

  • det.itg.sp [determiner, modifies a noun or an adjective]
    • Which car have you bought?
  • prn.itg.m.sp [pronoun, does not modify a noun or adjective, meaning "which one?"]
    • Which is better?
  • rel.an.mf.sp [relative pronoun, introduces an explanatory sentence that describes a noun]
    • I saw the car, which stayed there all night.

"this", "several", "each"

The word "this" (along with its plural "these") can be either a determiner, modifying a noun phrase, or a pronoun, replacing a noun phrase.

  • det.dem
    • I don't like this cat.
    • I don't like these cats.
  • prn
    • This is the reason.
    • These are the ones.

The word "several" follows a similar pattern:

  • det.dem
    • He ate several cakes.
  • prn
    • They like cakes, they always buy several when they go to the shops.
    • Several of them thought it was a good idea.

"that"

The word "that" can be a determiner, which modifies a noun phrase; a demonstrative pronoun, which substitutes for a noun phrase; a subordinating conjunction; or a relative pronoun.

  • det.dem
    • I don't like that cat.
    • I don't like those cats.
  • prn
    • That is the reason.
    • Those are the ones.
  • rel
    • These are the ones that I like.
  • cnjsub
    • I think that you like cats.

Here is a tip for distinguishing rel and cnjsub. Try replacing the word "that" with the word "which" and see how it sounds. If it sounds ok, then your "that" is probably a relative pronoun; if it sounds bad, it is probably a conjunction.

  • ok: These are the ones which I like.
  • not ok: I think which you like cats.
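
The substitution itself is mechanical, so it can be scripted to produce the test sentence; judging whether the result sounds ok is still up to the annotator. A minimal Python sketch, assuming a hypothetical helper named substitute that replaces only the first whole-word, case-sensitive match:

```python
import re


def substitute(sentence, old, new):
    """Replace the first whole-word occurrence of `old` with `new`."""
    pattern = r"\b%s\b" % re.escape(old)
    return re.sub(pattern, new, sentence, count=1)


for sentence in ["These are the ones that I like.",
                 "I think that you like cats."]:
    # Print the rewritten sentence so a human can judge how it sounds.
    print(substitute(sentence, "that", "which"))
```

The same helper would serve the other substitution tests in these guidelines, such as swapping "in" for "out" or "his" for "my".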

"no"

The word "no" in English can be a determiner, modifying a noun phrase, or an adverb (or interjection).

  • det.ind
    • There are no cats in my attic.
  • adv
    • No! Don't do that!

Verbs with "-ing"

The ending -ing in English can be a gerund (adverbial), a substantive (like a noun) or a present participle (like an adjective).

  • vblex.subs
    • Roughly, when you can substitute it with a noun: "Flying is hard" → "Flight is hard"
  • vblex.pprs
    • Roughly, when you can substitute it with a relative clause: "The flying circus" → "The circus that flies"
  • vblex.ger
    • When it follows to be in continuous tenses, or when it can be replaced by a prepositional phrase or a different verbal phrase:
      • "He came singing" → "He came with a song"
      • "He is singing" → "He sings"

When you have the pattern "the X-ing noun", you will almost always be choosing between subs and pprs. Try reformulating the phrase as "the noun of X-ing" and as "the noun which X(-s, -ed, -)". If the former sounds better, go for subs; if the latter sounds better, go with pprs.

  • For some transmissions, sometimes half the teeth are removed to speed the shifting process, at the expense of greater wear.
    • subs: ... to speed the process of shifting ...
    • pprs: ... to speed the process which shifts ...
  • Activity also increased in the remaining months of 1971, in contrast to the rest of the year.
    • subs: ... in the months of remaining ...
    • pprs: ... in the months which remained ...

Adverb or adjective

A word like "first" can be either an adverb, or an ordinal adjective. An adverb modifies a verb phrase (or an adjective phrase, or another adverb), an ordinal adjective modifies a noun phrase.

  • adj
    • This is my first computer.
  • adv
    • First I'm going to buy a computer. [modifying a verb]

Infinitive or present

In English, the present tense (pres) for all persons except the 3rd singular and the "short" infinitive (inf) of most verbs have the same word form. For example:

  • inf
    • I like to play football.
  • pres
    • I play football on Wednesdays.

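A tip for distinguishing them is to put the subject into the third person singular and see whether the verb can take -s. If it can, it is the present tense; if it cannot, it is the infinitive. For example: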

  • not ok: He likes to plays football.
  • ok: He plays football.

Another tip: the infinitive is the form found after the preposition to, or after modal verbs such as can, cannot or can't, could, could not or couldn't, may, might, do not or don't, did not or didn't, will, will not or won't, would, would not or wouldn't, must, and must not or mustn't.
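
This second tip depends only on the preceding word or two, so it can be sketched as a simple lookup. The Python below is a rough illustration and not part of any Apertium tool: the function name and word lists are assumptions built from the words listed above, tokens are assumed to be split on whitespace, and cases such as an adverb between the modal and the verb ("can always play") are not handled.

```python
# Words after which the next verb form is an infinitive, per the tip above.
TRIGGERS = {
    "to", "can", "cannot", "can't", "could", "couldn't", "may", "might",
    "don't", "didn't", "will", "won't", "would", "wouldn't", "must", "mustn't",
}
# Words that trigger an infinitive when followed by "not".
NEGATABLE = {"do", "did", "could", "will", "would", "must"}


def guess_inf_or_pres(tokens, index):
    """Guess 'inf' or 'pres' for tokens[index] from the words before it."""
    before = [token.lower() for token in tokens[:index]]
    if before and before[-1] in TRIGGERS:
        return "inf"
    if len(before) >= 2 and before[-1] == "not" and before[-2] in NEGATABLE:
        return "inf"
    # No trigger found: fall back to the present tense reading.
    return "pres"


print(guess_inf_or_pres("I like to play football".split(), 3))
print(guess_inf_or_pres("I play football on Wednesdays".split(), 1))
print(guess_inf_or_pres("They do not play football".split(), 3))
```

A guess like this is only a starting point; the annotator should still confirm it with the third-person test.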

Adverb or preposition

In English, many words can be adverbs or prepositions. Both often modify a verb phrase, but a preposition is typically followed by a noun phrase, whereas an adverb stands on its own.

  • pr
    • He plays by the river.
  • adv
    • He walks by.

-ed (and other forms): Past tense or past participle

Many verb forms ending in -ed (worked) may be either past tense or past participle. This also happens with irregular verb forms such as found, made, thought, brought, etc.

Past participles often work as adjectives, modifying a noun: well-defined task; they also appear in perfect verb forms with have: "I have defined a system", or in passive forms with be: "It was defined as a series of actions".

The past tense is a simple verb form that expresses what happens in the sentence: "I defined a procedure", "The procedure he defined".

Here is a trick: change the verb to a form of go, or drink, or take and see if the sentence is syntactically OK (even if it does not make much sense).

  • If you would use went, drank or took, then it is past tense (past);
  • if you would use gone, drunk or taken, then it is a past participle (pp).
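
The trick can be mechanised to the extent of producing both test sentences side by side; deciding which one is syntactically acceptable remains the annotator's job. A minimal Python sketch, with a hypothetical function name and whitespace tokenisation assumed:

```python
def past_or_pp_candidates(tokens, index):
    """Return the sentence with tokens[index] swapped for 'went' and 'gone'."""
    def swap(word):
        return " ".join(tokens[:index] + [word] + tokens[index + 1:])

    # The key is the tag to choose if that variant is the acceptable one.
    return {"past": swap("went"), "pp": swap("gone")}


candidates = past_or_pp_candidates("I have defined a system".split(), 2)
for tag, sentence in candidates.items():
    print(tag, "->", sentence)
```

Here "I have gone a system" is syntactically fine (if meaningless) while "I have went a system" is not, so "defined" is tagged pp.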

"in", "on", "under"

These words can be adverbs (adv) or prepositions (pr). As an adverb, the word stands alone and modifies the verb. As a preposition, it connects a noun phrase (a phrase built around a noun) to a preceding element in the sentence.

Trick: If you can replace it with out, then it is an adverb.

  • The technician is in: adverb ("The technician is out" is OK)
  • The technician is in the restroom: preposition ("The technician is out the restroom" is not OK)
  • The technician is under the table: preposition ("The technician is out the table" is not OK)
  • The light is on: adverb ("The light is out" is OK)

"'s"

The word 's is usually appended to nouns and can be three different things:

  • If it can be detached as is, it is a form of the verb to be (vbser): "Mike's an athlete" → "Mike is an athlete"
  • If it can be detached as has, it is a form of the verb to have (vbhaver): "Mike's become an athlete" → "Mike has become an athlete"
  • If it marks the possessor or owner of something, it is just the genitive (gen) ending: "Mike's car" → "The car that Mike owns".

"his"

This can be a pronoun (prn) or a determiner (det). As a determiner, it specifies a noun or noun phrase. As a pronoun, it stands on its own as a noun phrase.

Trick: If you can replace it with my, your, our, etc., then it is a determiner.

  • His father came to pick him up: determiner ("My father came to pick him up" is OK)
  • I drove my car and he drove his: pronoun ("I drove my car and he drove my" is not OK)

"it"

It can be a subject pronoun (subj) or an object or oblique pronoun (obj). Try making it plural. If you would write they it would be a subject pronoun. If you would write them it is an object/oblique pronoun.

  • "It was difficult" → "They were difficult" OK, "Them were difficult" not OK. Therefore, subj.
  • "He wrote it down" → "He wrote they down" not OK, "He wrote them down" OK. Therefore, obj.
  • "She prepared a place for it" → "She prepared a place for they" not OK, "She prepared a place for them" OK. Therefore, obj.

"put"

This verb is the same in many of its forms:

  • inf
    • They did say they were quite willing to put the document before the Pope. (... to throw the document ...)
  • pres
    • They always put their coats on before leaving the house. (They always throw their coats on before ...)
  • past
    • He put forth a form of "radical empiricism" (He threw forth a form ...)
  • pp
    • They have put their coats on the table. (They have thrown their coats on the table)
    • The water to be purified is placed in a chamber and put under great pressure.

Tip: Try replacing "put" with "throw".

Toponym or anthroponym

Often a word can be both a person's given name (anthroponym), and a place name (toponym).

  • top
    • I live in Victoria.
  • ant
    • I hang out with Victoria.

But in some cases it can be really ambiguous:

  • I like Victoria.

In this case, look at more of the surrounding context to determine which of the two it is.

"do"

The word "do" can be an auxiliary verb in the present tense, or a lexical verb in the infinitive or present. If it is used in a negative construction (I do not like that), or an emphatic construction (I do like that) followed by an infinitive, then it is most likely the auxiliary.

  • vbdo.pres
    • They do not have the luxury of viewing the original film.
  • vblex.inf
    • Modern psychology can do much to explain thought processes.
    • In her words she is ready to do anything.
  • vblex.pres
    • They do so by means of a magic potion.
    • They do the same when they come back.

"have"

The word "have" can be either an auxiliary verb or a lexical verb, in each case in the present tense or the infinitive.

  • vbhaver.pres
    • He has gone out.
  • vbhaver.inf
    • He will have gone out.
  • vblex.pres
    • He has a cat.
  • vblex.inf
    • He will have a cat.

A tip for distinguishing vbhaver from vblex is to try and replace "have" with "own" or "possess". If you can replace it and it makes sense, then it is vblex, e.g.

  • He owns a cat = He has a cat.
  • He will own a cat = He will have a cat.

"then"

The word "then" can be a conjunction or an adverb.

  • adv
    • I saw him then.
    • I knew him back then.
  • cnjadv
    • Then I went out.
    • I ate dinner, then I washed up the pots.

A tip is to try replacing it with "and then". If the sentence still makes sense, it is probably a conjunction.
