Incorporating guessing into Apertium
Apertium has a coverage problem.
The greater the coverage of the real dictionaries, the more accurate guessing will be, so we wouldn't want to try it with a pair that has 80% coverage, but we would with one that has 95% coverage.
Neural machine translation systems get around this by doing sub-word segmentation, but Apertium can't use this effectively because of its linguistic model.
However, we could incorporate guessing into the platform; here are some ideas.
In an RBMT system, guessing needs to take place in three places:
- Morphological analysis
- Bilingual transfer
- Morphological generation
For morphological analysis, guessers can be implemented or trained fairly effectively. They could be based on regular expressions, and some pairs already do this.
One could also envisage using an existing analyser plus a corpus to train a guesser: for example, partition the corpus in two (say, one part for training and one for evaluation) and train the guesser iteratively, first with only 10% of the vocabulary of the existing analyser, then 20%, then 30%, and so on. By the time you finish, you should have a reasonable model of unknown words.
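As a sketch of how the analyser-plus-corpus idea could work, here is a minimal Python example. Everything in it (the simple suffix model, the data format and the toy pairs) is an assumption, not existing Apertium code; it trains a suffix-based guesser on part of the known vocabulary and tests it on held-out words, and the 10%/20%/30% curve would come from repeating this with larger and larger training portions:

 # Minimal sketch: train a suffix-based analysis guesser from (surface, analysis)
 # pairs and test it on held-out vocabulary that simulates unknown words.
 # The data and the model are toy assumptions, not existing Apertium code.
 import random
 from collections import Counter, defaultdict

 def train_guesser(pairs, max_suffix=5):
     """Map word-final suffixes to the tag strings seen with them."""
     model = defaultdict(Counter)
     for surface, analysis in pairs:
         tags = analysis[analysis.index('<'):] if '<' in analysis else analysis
         for i in range(1, min(max_suffix, len(surface)) + 1):
             model[surface[-i:]][tags] += 1
     return model

 def guess(model, word, max_suffix=5):
     """Return the most frequent tag string for the longest known suffix."""
     for i in range(min(max_suffix, len(word)), 0, -1):
         if word[-i:] in model:
             return model[word[-i:]].most_common(1)[0][0]
     return None

 # Toy stand-in for (surface form, analysis) pairs from an analysed corpus.
 pairs = [('houses', 'house<n><pl>'), ('cats', 'cat<n><pl>'),
          ('walked', 'walk<vblex><past>'), ('talked', 'talk<vblex><past>')]
 random.seed(0)
 random.shuffle(pairs)
 half = len(pairs) // 2
 model = train_guesser(pairs[:half])          # the "known" vocabulary
 for surface, gold in pairs[half:]:           # held-out "unknown" words
     print(surface, guess(model, surface), '(gold:', gold + ')')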
For bilingual transfer, things are more difficult, but one could imagine using techniques such as those of Artetxe et al. to build a translation guesser from the existing bidix and two monolingual corpora in a similar way.
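A rough sketch of the lookup step, assuming the two sets of monolingual embeddings have already been mapped into a shared space (e.g. with a VecMap-style method seeded by the bidix); the vectors and words here are toy stand-ins:

 # Sketch: guess bilingual dictionary entries by nearest-neighbour search in a
 # shared cross-lingual embedding space (in the spirit of Artetxe et al.).
 # The embeddings below are random stand-ins; in practice they would come from
 # the two monolingual corpora, mapped using the existing bidix as a seed.
 import numpy as np

 def guess_translations(src_vecs, trg_vecs, trg_words, k=3):
     """For each source vector, return the k nearest target words by cosine."""
     src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
     trg = trg_vecs / np.linalg.norm(trg_vecs, axis=1, keepdims=True)
     sims = src @ trg.T
     best = np.argsort(-sims, axis=1)[:, :k]
     return [[trg_words[j] for j in row] for row in best]

 rng = np.random.default_rng(0)
 trg_words = ['casa', 'jardinu', 'libbru']
 src_vecs = rng.normal(size=(2, 50))    # embeddings of two source words missing from the bidix
 trg_vecs = rng.normal(size=(3, 50))    # embeddings of known target-language words
 print(guess_translations(src_vecs, trg_vecs, trg_words, k=2))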
Morphological generation for the regular part of the paradigm is largely a solved problem and could be implemented fairly easily.
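For example, one could simply reuse the paradigm of the known lemma that shares the longest suffix with the unknown one; a minimal sketch, where the paradigm table is a toy assumption standing in for a real monolingual dictionary:

 # Sketch: generate forms of an unknown lemma by borrowing the paradigm of the
 # known lemma with the longest matching suffix. The paradigm table is a toy
 # stand-in for what a monolingual dictionary would provide.
 PARADIGMS = {
     'house': {'<n><sg>': 'house', '<n><pl>': 'houses'},
     'city':  {'<n><sg>': 'city',  '<n><pl>': 'cities'},
 }

 def common_suffix(a, b):
     i = 0
     while i < min(len(a), len(b)) and a[-1 - i] == b[-1 - i]:
         i += 1
     return a[len(a) - i:]

 def generate(lemma, tags):
     # Pick the known lemma sharing the longest suffix with the unknown one.
     model = max(PARADIGMS, key=lambda m: len(common_suffix(m, lemma)))
     shared = len(common_suffix(model, lemma))
     # Keep the unknown lemma's stem, reuse the model lemma's ending for these tags.
     return lemma[:len(lemma) - shared] + PARADIGMS[model][tags][len(model) - shared:]

 print(generate('pony', '<n><pl>'))   # borrows the "city" paradigm -> "ponies"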
Rule component
Morphological rules might look something like,
 <rule>
   <match tags="np.ant"/>
   <match case="Aa" unknown="true"><add-reading tags="np.ant"/></match>
   <match tags="np.cog"/>
 </rule>
 <rule>
   <match tags="np.al"/>
   <match tags="pr"><add-reading tags="np.al"/></match>
   <match tags="np.al"/>
 </rule>
 <rule>
   <match tags="quot"/>
   <match case="Aa"><add-reading tags="np.al"/></match>
   <match tags="quot"/>
 </rule>
 <rule>
   <match ends-with="ista" tags="n.mf.*"><add-reading tags="adj.mf.sp"/></match>
 </rule>
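To make the intended semantics concrete, here is a hypothetical Python sketch (not an existing Apertium module) of the first rule above applied to a simplified token stream: an unknown, capitalised word between an np.ant reading and an np.cog reading gets an np.ant reading added:

 # Sketch of the intended semantics of the first rule above: an unknown,
 # capitalised token between an np.ant and an np.cog reading gets an np.ant
 # reading added. Hypothetical code, not an existing Apertium module.
 import re

 def has_tags(lu, tags):
     """True if any reading of the lexical unit carries the given tag sequence."""
     return any(tags in r for r in lu['readings'])

 def apply_rule(lus):
     for i in range(1, len(lus) - 1):
         cur = lus[i]
         unknown = not cur['readings']
         capitalised = bool(re.match(r'[A-Z][a-z]+$', cur['surface']))  # case="Aa"
         if (has_tags(lus[i - 1], '<np><ant>') and unknown and capitalised
                 and has_tags(lus[i + 1], '<np><cog>')):
             cur['readings'].append(cur['surface'] + '<np><ant>')
     return lus

 lus = [
     {'surface': 'John',   'readings': ['John<np><ant>']},
     {'surface': 'Jacob',  'readings': []},                 # unknown word
     {'surface': 'Astley', 'readings': ['Astley<np><cog>']},
 ]
 print(apply_rule(lus))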
Guesser for orthographic variation
Here is an idea for dealing with unknown words caused by spelling mistakes or orthographic variation (a rough model sketch follows the lists below):
Input:
- Word and character embeddings
- +1, -1 context
Output:
- Analyses for an unknown word (based on an existing analysis string)
Training:
- Take a corpus that has variation in it, and try and ...
Pitfalls:
- Sometimes we'll want to leave a word unknown
Questions:
- Will we ever want to add an analysis to an existing word?
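What such a model might look like as a classifier over known analysis strings, sketched with PyTorch; all the sizes, vocabularies and the classification framing are assumptions:

 # Sketch of a guesser that scores candidate analysis strings for an unknown
 # word from its characters plus the +1/-1 context words. All sizes and the
 # framing as classification over known tag strings are assumptions.
 import torch
 import torch.nn as nn

 class VariationGuesser(nn.Module):
     def __init__(self, n_chars, n_words, n_tag_strings, dim=64):
         super().__init__()
         self.char_emb = nn.Embedding(n_chars, dim)
         self.word_emb = nn.Embedding(n_words, dim)
         self.char_rnn = nn.LSTM(dim, dim, batch_first=True)
         self.out = nn.Linear(3 * dim, n_tag_strings)  # chars + left word + right word

     def forward(self, chars, left, right):
         _, (h, _) = self.char_rnn(self.char_emb(chars))   # encode the unknown word
         ctx = torch.cat([h[-1], self.word_emb(left), self.word_emb(right)], dim=-1)
         return self.out(ctx)                              # scores over tag strings

 model = VariationGuesser(n_chars=100, n_words=5000, n_tag_strings=300)
 chars = torch.randint(0, 100, (1, 12))   # character ids of the unknown word
 left = torch.randint(0, 5000, (1,))      # id of the -1 context word
 right = torch.randint(0, 5000, (1,))     # id of the +1 context word
 print(model(chars, left, right).shape)   # torch.Size([1, 300])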
Another guesser for orthographic variation
Let's say that we already have some orthographic variation in the dictionary; then we can make a training set, e.g.
 $ lt-expand apertium-scn.scn.dix | tee /tmp/analyses | cut -f1 -d':' > /tmp/surface
 cat /tmp/analyses | sed 's/:[<>]:/:/g' | cut -f2 -d':' | sed 's/.*/^&$/g' | lt-proc -d scn.autogen.bin > /tmp/surface.2
 $ paste /tmp/surface /tmp/surface.2 | grep -v '[~#]'
 splicitazzioni   splicitazzioni
 splicitazzioni   splicitazzioni
 splicitazioni    splicitazzioni
 splicitazione    splicitazzioni
 splicitazziuni   splicitazzioni
 splicitaziuni    splicitazzioni
 splicitazzioni   splicitazzioni
 papulazzioni     papulazzioni
 papulazzioni     papulazzioni
 papulazioni      papulazzioni
 papulazione      papulazzioni
 papulazziuni     papulazzioni
 papulaziuni      papulazzioni
 ...
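One simple way to use these pairs is to extract the character-level substitutions they exhibit and apply them to unseen forms to propose candidates. A rough sketch using difflib follows; insertions are skipped for simplicity, and a real model would weight and rank the patterns:

 # Sketch: learn character-level substitutions from (variant, canonical) pairs
 # and use them to propose candidate forms for an unseen word. Insertions
 # (patterns with an empty left-hand side) are skipped for simplicity.
 from difflib import SequenceMatcher

 pairs = [('splicitazioni', 'splicitazzioni'), ('splicitazione', 'splicitazzioni'),
          ('papulaziuni', 'papulazzioni'), ('papulazione', 'papulazzioni')]

 def learn_substitutions(pairs):
     subs = set()
     for variant, canonical in pairs:
         for op, i1, i2, j1, j2 in SequenceMatcher(None, variant, canonical).get_opcodes():
             if op != 'equal':
                 subs.add((variant[i1:i2], canonical[j1:j2]))
     return subs

 def candidates(word, subs):
     out = {word}
     for src, dst in subs:
         if src and src in word:
             out.add(word.replace(src, dst))
     return out

 subs = learn_substitutions(pairs)
 print(candidates('splicitaziuni', subs))   # includes the in-dictionary form 'splicitazioni'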
We could run something like this in the pipe before lt-proc, and then allow lt-proc to look up the analyses of the various forms and take the union,
echo "papulazione" | apertium-variation -b 3 variation.bin ^papulazione/papulazione/papulazzioni/papulazioni$ Individually these might get: papulazione - *papulazione papulazzioni - papulazzioni<n><f><sp> papulazioni - papulazzioni<n><f><sp> So the output would be: echo "papulazzioni" | apertium-variation -b 3 variation.bin | lt-proc scn.automorf.bin ^papulazione/papulazzioni<n><f><sp>$