Tagging guidelines for Catalan
These tagging guidelines will never be complete. Problematic words will be added as they are encountered.
You can think of part-of-speech tagging as answering a series of multiple-choice questions. Each word is a question, and its possible analyses are the answers. Unknown words are questions whose possible answers are not yet known. To "tag" the text, you answer every question by deleting the "incorrect" answers.
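The multiple-choice picture can be sketched in a few lines of Python. This is only an illustration: the list layout and the `choose` helper are assumptions made for this example and are not part of any Apertium tool. The two tags are the ones these guidelines give for the word la.

```python
def choose(analyses, correct):
    """Tag a word by deleting every analysis except the correct one.

    `analyses` is the list of possible answers for one word;
    `correct` is the answer the human tagger picks.
    """
    if correct not in analyses:
        # The tagger may only pick one of the answers on offer.
        raise ValueError(f"{correct!r} is not one of the possible analyses")
    return [a for a in analyses if a == correct]


# Hypothetical representation of the ambiguous word "la".
analyses_for_la = ["det.def.f.sg", "prn.pro.p3.f.sg"]

# Answering the question: keep the determiner reading, delete the other.
tagged = choose(analyses_for_la, "det.def.f.sg")
print(tagged)
```

In this picture an unknown word would simply be a word whose list of analyses is still empty, so there is nothing yet to choose from.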
Why is this important?
Hand-tagged texts are needed in large quantities (tens or, better, hundreds of thousands of words) to 'train' the automatic taggers used in some Apertium language pairs. Choosing the right tag for a word matters because the translation depends on it. For instance, the Catalan word duc can be a verb or a noun, and the two readings have different English translations:
- [verb] Cada dia duc l'ordinador a l'oficina → Every day I bring my computer to the office.
- [noun] Enric d'Aragó va ser el primer duc de Villena → Henry of Aragon was the first duke of Villena.
This is why we have many hand-tagging tasks in Google Code-in.
These guidelines cover some difficult words when hand-tagging Catalan Apertium output.
La is a very common ambiguous word in Catalan. It can be a determiner or a pronoun.
- It is a determiner (det.def.f.sg) when it precedes a noun phrase and can be replaced by another determiner such as una or aquella.
- La història ens ensenya el camí del futur
- It is a proclitic pronoun (prn.pro.p3.f.sg) when it precedes a verb and means a ella, a aquesta, etc.
- La va saludar quan se la va trobar pel carrer.
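The two rules for la can be written down as a rough decision sketch. The function name `tag_la` and the coarse labels `"n"`, `"adj"` and `"v"` for the following word are assumptions made for this illustration. The function is a memory aid for the rules above, not a replacement for reading the sentence.

```python
def tag_la(next_pos):
    """Suggest a tag for "la" from the coarse category of the next word.

    `next_pos` is a hypothetical coarse label: "n" or "adj" when a noun
    phrase follows, "v" when a verb follows. Any other value returns None,
    meaning the case is left for the human tagger to decide.
    """
    if next_pos in {"n", "adj"}:
        return "det.def.f.sg"      # determiner before a noun phrase
    if next_pos == "v":
        return "prn.pro.p3.f.sg"   # proclitic pronoun before a verb
    return None


# "La història ens ensenya ...": a noun follows, so the determiner reading.
print(tag_la("n"))
# "La va saludar ...": a verb follows, so the pronoun reading.
print(tag_la("v"))
```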