Difference between revisions of "Named entity recognition"

From Apertium
(Category:Documentation in English)
(apertium-pn-recogniser doesn't deal with the problem most explained here …)
Line 1: Line 1:
Named entity recognition is about recognising named entities, for example proper nouns, etc. in text.
{{TOCD}}

Named entity recognition is about recognising named entities, for example proper nouns, etc. in text. When working with long rules, one of the problems in having them applied can be proper nouns. For example, names, companies, places etc. that aren't in the dictionaries and thus are not analysed. So for example in a sentence like:
There are several translation issues that can show up when there are ''unknown'' proper nouns in the input. One is that transfer rules that work on <np>-tagged words do not apply when the word is unknown. Another is that proper nouns can be ambiguous with other, known words, and thus be translated when they should stay untranslated.

{{tocd}}

==Unknown proper nouns in transfer==
When working with long rules, one of the problems in having them applied can be proper nouns. For example, names, companies, places etc. that aren't in the dictionaries and thus are not analysed. So for example in a sentence like:


* Die man het John gesien.
* Die man het John gesien.
Line 22: Line 28:
Which is less than ideal. What we need is something that can tag "John" as a proper noun (<code><np></code>), so that the rules may be applied in the appropriate fashion.
Which is less than ideal. What we need is something that can tag "John" as a proper noun (<code><np></code>), so that the rules may be applied in the appropriate fashion.


==Examples==
===Case handling===


The problem becomes more acute in other language groups where proper nouns have cases. For example in Serbo-Croatian or Polish:
The problem becomes more acute in other language groups where proper nouns have cases. For example in Serbo-Croatian or Polish:
Line 32: Line 38:
:Marijom → Marija{{fade|<np><ant><f><sg>&lt;ins&gt;}}
:Marijom → Marija{{fade|<np><ant><f><sg>&lt;ins&gt;}}


==Pipeline==
===Pipeline===
It should probably go in between tagging and transfer, and work only on unknown words.
It should probably go in between tagging and transfer, and work only on unknown words.


==The Apertium proper noun recogniser==
==Disambiguating unknown proper nouns from known words==
This module detects proper nouns in the input and marks them as unknown words so that the rest of the modules in the pipeline do not process them. This avoids the common case of wrong translations of source-language proper nouns which are also common nouns according to the dictionaries. The proper noun recogniser is mainly based on the one already included in the [[Freeling]] project.
The module [[apertium-pn-recogniser]] detects proper nouns in the input and marks them as unknown words so that the rest of the modules in the pipeline do not process them. This avoids the common case of wrong translations of source-language proper nouns which are also common nouns according to the dictionaries. The proper noun recogniser is mainly based on the one already included in the [[Freeling]] project.


The proper noun recogniser must be invoked between the tagger and the transfer modules. Option -p is needed in the tagger, so a version from Apertium greater or equal to 3.1.0 is needed.
The proper noun recogniser must be invoked between the tagger and the transfer modules. Option -p is needed in the tagger, so a version from Apertium greater or equal to 3.1.0 is needed.


Check out from SVN with:
Check out from SVN with:
svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-pn-recogniser
svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-pn-recogniser





Revision as of 07:35, 20 July 2014

Named entity recognition is about recognising named entities, for example proper nouns, etc. in text.

There are several translation issues that can show up when there are unknown proper nouns in the input. One is that transfer rules that work on <np>-tagged words do not apply when the word is unknown. Another is that proper nouns can be ambiguous with other, known words, and thus be translated when they should stay untranslated.


Unknown proper nouns in transfer

Transfer rules with long patterns often fail to apply because of proper nouns: names of people, companies, places and so on that are not in the dictionaries and are therefore left unanalysed. For example, a sentence like:

  • Die man het John gesien.

would be analysed something like (simplifying slightly):

  • Die<det> man<n> het<vbhaver> *John gesien<vblex><past>

If we have a rule that says something like:

  • <vbhaver> <noun phrase> <vblex><past> → <vbhaver> <vblex><past> <noun phrase>

This rule will not apply, because "John" is unknown and so carries no tags for the pattern to match. The word reordering does not take place and the translation suffers. Instead of getting:

  • The man had seen John.

We would get:

  • The man had John seen.

This is less than ideal. What we need is a component that tags "John" as a proper noun (<np>), so that the rules apply as intended.
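The effect of the missing tag can be reproduced with a toy model of the rule above. This is only a sketch: the token representation (a word paired with a list of tags) and the function names are invented for illustration, the noun phrase is reduced to a single <np> word, and none of it is Apertium's actual transfer machinery.

```python
# Toy model of the reordering rule; the data structures are hypothetical.
def reorder(tokens):
    """Apply <vbhaver> <np> <vblex><past> -> <vbhaver> <vblex><past> <np>."""
    out = list(tokens)
    for i in range(len(out) - 2):
        aux, noun, verb = out[i:i + 3]
        if ("vbhaver" in aux[1] and "np" in noun[1]
                and {"vblex", "past"} <= set(verb[1])):
            # Swap the proper noun and the participle.
            out[i + 1], out[i + 2] = verb, noun
    return out


def words(tokens):
    """Join the surface words of a token list into a string."""
    return " ".join(word for word, _ in tokens)


# "John" is unknown, so it carries no tags at all.
unknown = [("Die", ["det"]), ("man", ["n"]), ("het", ["vbhaver"]),
           ("John", []), ("gesien", ["vblex", "past"])]

# The same sentence once "John" has been tagged as a proper noun.
tagged = [(w, ["np"]) if w == "John" else (w, t) for w, t in unknown]

print(words(reorder(unknown)))  # order unchanged: the pattern never matches
print(words(reorder(tagged)))   # "John" now follows the participle
```

With the untagged input the pattern never matches and the source word order survives into the output; once "John" carries <np>, the rule fires.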

Case handling

The problem becomes more acute in languages where proper nouns inflect for case, because the recogniser then also has to recover the lemma and the case. For example in Serbo-Croatian or Polish:

Władysława → Władysław<np><ant><m><sg><gen>

and

Marijom → Marija<np><ant><f><sg><ins>

Pipeline

A recogniser of this kind should probably sit between tagging and transfer, and work only on unknown words.
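A minimal sketch of such a filter follows. It assumes the tagger output uses the usual Apertium stream conventions (lexical units delimited by ^ and $, unknown words marked with a leading *), treats its input as a single sentence, and uses capitalisation as the only cue. The function name and the heuristic are illustrative assumptions, not an existing Apertium module.

```python
import re

# An unknown lexical unit in the tagger output, e.g. ^*John$ (assumed format).
UNKNOWN_UNIT = re.compile(r"\^\*([^$^/<>]+)\$")


def tag_unknown_proper_nouns(stream):
    """Tag capitalised unknown words as <np>; leave everything else alone."""
    def repl(match):
        word = match.group(1)
        # Naive: the first unit in the input is capitalised anyway, so skip it.
        sentence_initial = stream[:match.start()].strip() == ""
        if word[:1].isupper() and not sentence_initial:
            return "^" + word + "<np>$"
        return match.group(0)
    return UNKNOWN_UNIT.sub(repl, stream)


line = "^Die<det>$ ^man<n>$ ^het<vbhaver>$ ^*John$ ^gesien<vblex><past>$"
print(tag_unknown_proper_nouns(line))
```

Known words pass through untouched, so the filter can be dropped into the pipeline without affecting analyses that the dictionaries already provide.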

Disambiguating unknown proper nouns from known words

The module apertium-pn-recogniser detects proper nouns in the input and marks them as unknown words so that the rest of the modules in the pipeline do not process them. This avoids the common case of wrong translations of source-language proper nouns which are also common nouns according to the dictionaries. The proper noun recogniser is mainly based on the one already included in the Freeling project.
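The idea of marking a known word as unknown can be illustrated with a deliberately naive sketch. It is not the algorithm used by apertium-pn-recogniser or Freeling: the token representation, the function name and the capitalisation heuristic are all assumptions made for the example.

```python
def protect_proper_nouns(tokens):
    """Drop the analysis of capitalised words that are not sentence-initial.

    tokens is a list of (surface, analysis) pairs for one sentence;
    an analysis of None stands for an unknown word.
    """
    out = []
    for index, (surface, analysis) in enumerate(tokens):
        looks_like_name = index > 0 and surface[:1].isupper()
        out.append((surface, None if looks_like_name else analysis))
    return out


# "Baker" is in the dictionary as a common noun, but here it is a surname.
sentence = [("He", "prpers<prn>"), ("met", "meet<vblex><past>"),
            ("Baker", "baker<n><sg>")]
print(protect_proper_nouns(sentence))
```

Because "Baker" ends up without an analysis, later modules would pass it through untranslated instead of rendering it as the common noun.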

The proper noun recogniser must be invoked between the tagger and the transfer modules. It requires the tagger to be run with the -p option, so Apertium 3.1.0 or later is needed.

Check out from SVN with:

svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-pn-recogniser


Further reading

External links