Difference between revisions of "Incorporating guessing into Apertium"

Revision as of 11:45, 25 June 2020

Apertium has a coverage problem.

The greater the coverage of the real dictionaries, the more accurate guessing will be. So we wouldn't want to try with a pair that has 80% coverage, but we would with 95% coverage.

Neural machine translation systems get around this by doing sub-word segmentation. Apertium can't use this effectively, because its linguistic model operates on lemmas and morphological tags rather than on sub-word units.

However, we could incorporate guessing into the platform. Here are some ideas.

In an RBMT system, guessing needs to happen in three places:

  • Morphological analysis
  • Bilingual transfer
  • Morphological generation

For morphological analysis, guessers can be implemented or trained fairly effectively. They could be based on regular expressions, as some pairs already do.
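A regex-based guesser could be sketched roughly as follows. This is only an illustration: the suffix patterns are invented for English, the function name `guess_analyses` is hypothetical, and the output strings merely imitate the usual Apertium `<tag>` style rather than reproducing what any existing pair does.

```python
import re

# Hypothetical suffix patterns for English. The rules are invented for this
# sketch; only the <tag> notation is borrowed from Apertium.
GUESS_RULES = [
    (re.compile(r"^(?P<stem>\w+)ing$"), "{stem}<vblex><ger>"),
    (re.compile(r"^(?P<stem>\w+)ly$"), "{stem}ly<adv>"),
    (re.compile(r"^(?P<stem>\w*[^\Ws])s$"), "{stem}<n><pl>"),
    (re.compile(r"^(?P<stem>[A-Z]\w+)$"), "{stem}<np>"),
]


def guess_analyses(surface):
    """Return one guessed analysis per pattern that matches the surface form."""
    analyses = []
    for pattern, template in GUESS_RULES:
        m = pattern.match(surface)
        if m:
            analyses.append(template.format(stem=m.group("stem")))
    return analyses


if __name__ == "__main__":
    for word in ["walking", "Kings", "xyz"]:
        print(word, guess_analyses(word))
```

A form such as "Kings" receives two competing guesses (plural noun and proper noun), so a real guesser would still need later disambiguation to choose between them.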

Alternatively, one could use an existing analyser plus a corpus to train the guesser. For example, you start by partitioning the corpus into two, and then train the guesser iteratively: first with only 10% of the vocabulary in the existing analyser, then 20%, then 30%, and so on. By the time you finish you should have a reasonable model of unknown words.

For bilingual transfer things are more difficult, but one could imagine using techniques such as those used by Artetxe et al. to build a translation guesser from the existing bidix and two monolingual corpora in a similar way.
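The iterative idea can be sketched with a very simple suffix-frequency guesser. Everything here is an assumption made for illustration: the analyser is reduced to a dictionary from surface form to a single tag string, the vocabulary is grown in order of frequency in the training half, and the names `train_suffix_guesser`, `guess_tag` and `learning_curve` are hypothetical.

```python
from collections import Counter, defaultdict


def train_suffix_guesser(lexicon, max_suffix=3):
    """Count which tag each word-final substring co-occurs with."""
    counts = defaultdict(Counter)
    for surface, tag in lexicon.items():
        for n in range(1, min(max_suffix, len(surface)) + 1):
            counts[surface[-n:]][tag] += 1
    return counts


def guess_tag(counts, surface, max_suffix=3):
    """Back off from the longest suffix seen in training to the shortest."""
    for n in range(min(max_suffix, len(surface)), 0, -1):
        seen = counts.get(surface[-n:])
        if seen:
            return seen.most_common(1)[0][0]
    return None


def learning_curve(analyser, train_tokens, heldout_tokens, steps=(0.1, 0.2, 0.3)):
    """Train on a growing slice of the vocabulary and score on held-out words.

    Returns (fraction, correct, total) for the held-out words that the
    reduced vocabulary does not cover, using the full analyser as the gold
    standard.
    """
    freq = Counter(t for t in train_tokens if t in analyser)
    vocab = [w for w, _ in freq.most_common()]
    targets = [t for t in heldout_tokens if t in analyser]
    curve = []
    for fraction in steps:
        known = set(vocab[: max(1, int(len(vocab) * fraction))])
        counts = train_suffix_guesser({w: analyser[w] for w in known})
        unknown = [t for t in targets if t not in known]
        correct = sum(guess_tag(counts, t) == analyser[t] for t in unknown)
        curve.append((fraction, correct, len(unknown)))
    return curve
```

The held-out half of the corpus is only used for scoring, so the curve gives a rough idea of how guessing accuracy changes as the analyser's vocabulary grows.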

Morphological generation for the regular part of the paradigm is largely a solved problem and could be implemented fairly easily.
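As a minimal sketch of what generating the regular part of a paradigm amounts to, the following looks up a suffix for a guessed lemma. The paradigm table is a made-up English fragment and `generate` is a hypothetical helper, not part of any Apertium tool; the `#` prefix on failure imitates the mark Apertium puts on forms it could not generate.

```python
# Invented fragment of regular English paradigms: (part of speech, features)
# mapped to the suffix that is appended to the lemma.
REGULAR_PARADIGMS = {
    ("n", "sg"): "",
    ("n", "pl"): "s",
    ("vblex", "past"): "ed",
    ("vblex", "ger"): "ing",
}


def generate(lemma, pos, features):
    """Return the regular surface form, or '#lemma' if no suffix is known."""
    suffix = REGULAR_PARADIGMS.get((pos, features))
    if suffix is None:
        return "#" + lemma
    return lemma + suffix


if __name__ == "__main__":
    print(generate("widget", "n", "pl"))
    print(generate("widget", "n", "dat"))
```

Irregular forms and spelling changes are deliberately ignored here, since the text only claims that the regular part is easy.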

Rule component

Morphological rules might look something like this:

  <rule>
     <match tags="np.ant"/> 
     <match case="Aa" unknown="true"><add-reading tags="np.ant"/></match>
     <match tags="np.cog"/>
  </rule> 

  <rule>
     <match tags="np.al"/> 
     <match tags="pr"><add-reading tags="np.al"/></match>
     <match tags="np.al"/>
  </rule> 

  <rule>
     <match tags="quot"/> 
     <match case="Aa"><add-reading tags="np.al"/></match>
     <match tags="quot"/>
  </rule>
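One possible reading of these rules is that each `<match>` corresponds to one token in a window of adjacent tokens, and that a token whose `<match>` contains `<add-reading>` receives the new reading when the whole window matches. The sketch below implements that reading only. Several things are assumptions: the `<rules>` wrapper element, the token representation (a dict with `surface` and a list of tag strings in `readings`), the interpretation of `case="Aa"` as "initial capital, rest lower case", and of `unknown="true"` as "no readings yet".

```python
import xml.etree.ElementTree as ET

# The three rules from above, wrapped in a hypothetical <rules> root element
# so that they form one well-formed XML document.
RULES_XML = """
<rules>
  <rule>
    <match tags="np.ant"/>
    <match case="Aa" unknown="true"><add-reading tags="np.ant"/></match>
    <match tags="np.cog"/>
  </rule>
  <rule>
    <match tags="np.al"/>
    <match tags="pr"><add-reading tags="np.al"/></match>
    <match tags="np.al"/>
  </rule>
  <rule>
    <match tags="quot"/>
    <match case="Aa"><add-reading tags="np.al"/></match>
    <match tags="quot"/>
  </rule>
</rules>
"""


def token_matches(match, token):
    """Check one <match> element against one token (assumed semantics)."""
    tags = match.get("tags")
    if tags is not None and tags not in token["readings"]:
        return False
    surface = token["surface"]
    if match.get("case") == "Aa" and not (
        surface[:1].isupper() and surface[1:].islower()
    ):
        return False
    if match.get("unknown") == "true" and token["readings"]:
        return False
    return True


def apply_rules(rules_xml, tokens):
    """Slide each rule over the tokens and add readings where it matches."""
    for rule in ET.fromstring(rules_xml).iter("rule"):
        matches = [child for child in rule if child.tag == "match"]
        additions = []
        for start in range(len(tokens) - len(matches) + 1):
            window = tokens[start:start + len(matches)]
            if all(token_matches(m, t) for m, t in zip(matches, window)):
                for m, t in zip(matches, window):
                    for child in m:
                        if child.tag == "add-reading":
                            additions.append((t, child.get("tags")))
        # Apply after scanning, so one rule never feeds on its own output.
        for token, tags in additions:
            if tags not in token["readings"]:
                token["readings"].append(tags)
    return tokens


if __name__ == "__main__":
    sentence = [
        {"surface": "Juan", "readings": ["np.ant"]},
        {"surface": "Xabier", "readings": []},
        {"surface": "Pérez", "readings": ["np.cog"]},
    ]
    print(apply_rules(RULES_XML, sentence))
```

Under this reading, the first rule guesses that an unknown capitalised word between a given name and a surname is itself a given name. Readings are added rather than replaced, so existing analyses survive for later disambiguation.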