Alphabet

From Apertium
Latest revision as of 08:52, 28 April 2014

The <alphabet> part of an lttoolbox dictionary specifies which characters can be part of a word, as opposed to "blank" characters. Its main effect is on the tokenisation of unknown words, since a known word may still contain non-alphabet characters.

Example

Say your deu.dix looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
	<alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
	<sdefs>
		<sdef n="prn" 	c="Pronoun"/>
		<sdef n="pr" 	c="Preposition"/>
	</sdefs>
	<section id="main" type="standard">
		<e> <p> <l>es</l> <r>es<s n="prn"/></r> </p> </e>
		<e> <p> <l>über</l> <r>über<s n="pr"/></r> </p> </e>
	</section>
</dictionary>

We compile it:

$ lt-comp lr deu.dix deu.automorf.bin
main@standard 4 3

And it works as expected for "es":

$ echo es | lt-proc deu.automorf.bin
^es/es<prn>$

Given a word not in the dictionary, but composed of alphabetic chars, we get an unknown-word analysis:

$ echo wei | lt-proc deu.automorf.bin
^wei/*wei$

But if that unknown word contains a non-alphabetic char, the unknown-word analysis is split on that char (which is treated as a blank):

$ echo große | lt-proc deu.automorf.bin
^gro/*gro$ß^e/*e$

This is as if you'd input "gro e".

If a non-alphabetic character is part of a known word, the word is analysed just fine:

$ echo über grün | lt-proc deu.automorf.bin
^über/über<pr>$ ^gr/*gr$ü^n/*n$

(über is in the dictionary, grün is not)
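The behaviour shown so far can be imitated with a small Python toy model. This is only a sketch under stated assumptions, not how lttoolbox is implemented: `toy_analyse`, `ALPHABET` and `LEXICON` are hypothetical names made up for this illustration, and the model assumes that a known word is accepted only when it is followed by a non-alphabetic character or the end of the input, while everything else is cut into runs of alphabetic characters.

```python
# Toy model of the tokenisation shown above. NOT the lttoolbox algorithm;
# all names here are hypothetical and exist only for this illustration.
ALPHABET = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
LEXICON = {"es": "es<prn>", "über": "über<pr>"}  # surface form -> analysis


def toy_analyse(text, alphabet=ALPHABET, lexicon=LEXICON):
    out = []
    i = 0
    while i < len(text):
        # Known words starting at position i that end at a word boundary
        # (assumption: end of input or a non-alphabetic character).
        known = [
            word for word in lexicon
            if text.startswith(word, i)
            and (i + len(word) == len(text) or text[i + len(word)] not in alphabet)
        ]
        if known:
            word = max(known, key=len)  # prefer the longest match
            out.append(f"^{word}/{lexicon[word]}$")
            i += len(word)
        elif text[i] in alphabet:
            # Unknown word: the longest run of alphabetic characters.
            j = i
            while j < len(text) and text[j] in alphabet:
                j += 1
            out.append(f"^{text[i:j]}/*{text[i:j]}$")
            i = j
        else:
            # Non-alphabetic characters are passed through as blanks.
            out.append(text[i])
            i += 1
    return "".join(out)


print(toy_analyse("große"))      # ^gro/*gro$ß^e/*e$
print(toy_analyse("über grün"))  # ^über/über<pr>$ ^gr/*gr$ü^n/*n$
```

In this model "über" is matched as a whole because the lexicon lookup happens before any splitting, whereas "grün" has no lexicon entry and is therefore cut at "ü".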


Most of the time, missing alphabetic characters are not very problematic, but they become a problem when the splitting leads to wrong analyses. Here the split produces an analysis of the pronoun "es", which is wrong and possibly very confusing in the translated output:

$ echo weißes | lt-proc deu.automorf.bin
^wei/*wei$ß^es/es<prn>$
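One way to spot likely gaps is to look for letters that occur in dictionary entries but not in the alphabet. The Python sketch below is a heuristic, not an lttoolbox tool: `letters_missing_from_alphabet` is a hypothetical helper, and it assumes that the surface forms of interest are found in `<l>` and `<i>` elements. It can only report letters that appear somewhere in the dictionary, so in the example it would flag "ü" (from "über") but not "ß".

```python
import xml.etree.ElementTree as ET


def letters_missing_from_alphabet(dix_xml):
    """Return letters used in <l>/<i> entries but absent from <alphabet>.

    Hypothetical helper for illustration; a heuristic check only.
    """
    root = ET.fromstring(dix_xml)
    alphabet = set(root.findtext("alphabet") or "")
    used = set()
    for tag in ("l", "i"):
        for element in root.iter(tag):
            # itertext() also collects text that follows child elements.
            used.update("".join(element.itertext()))
    return sorted(ch for ch in used if ch.isalpha() and ch not in alphabet)


# Shortened copy of the example dictionary from this page.
SAMPLE = """<dictionary>
  <alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
  <section id="main" type="standard">
    <e><p><l>es</l><r>es<s n="prn"/></r></p></e>
    <e><p><l>über</l><r>über<s n="pr"/></r></p></e>
  </section>
</dictionary>"""

print(letters_missing_from_alphabet(SAMPLE))  # ['ü']
```

A letter reported here is a hint that unknown words containing it will be split, as happened with "grün" above.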

Can't we just put everything into the alphabet then?

Too many alphabetic characters can cause problems too. For example, if you put space characters in your alphabet, you won't get any tokenisation on spaces! If we put a space into the alphabet of our example, even known words get no analysis when they have spaces around them:

$ echo es es | lt-proc deu.automorf.bin
^es es/*es es$

(But see also Inconditional section.)
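A quick sanity check for this mistake can be sketched in Python. `whitespace_in_alphabet` is a hypothetical helper, and flagging every kind of whitespace (not just the plain space shown above) is an assumption made for this sketch.

```python
def whitespace_in_alphabet(alphabet):
    """List the code points of whitespace characters found in an alphabet string.

    Hypothetical helper for illustration; flagging all whitespace, not only
    U+0020, is an assumption.
    """
    return [f"U+{ord(ch):04X}" for ch in alphabet if ch.isspace()]


print(whitespace_in_alphabet("abc xyz"))  # ['U+0020']
print(whitespace_in_alphabet("abcxyz"))   # []
```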

Wishlist

  • Ability to specify Unicode ranges
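Where range syntax is not available, the alphabet string can be generated outside the dictionary and pasted in. The Python sketch below shows one way to do that. `expand_ranges` is a hypothetical helper, not part of lttoolbox, and keeping only characters for which `str.isalpha()` is true is a design choice of this sketch.

```python
def expand_ranges(ranges):
    """Expand (first, last) character pairs into one alphabet string.

    Hypothetical helper for illustration. Non-letters inside a range
    (for example the multiplication sign between "Ö" and "Ø") are skipped.
    """
    chars = []
    for first, last in ranges:
        for code_point in range(ord(first), ord(last) + 1):
            ch = chr(code_point)
            if ch.isalpha():
                chars.append(ch)
    return "".join(chars)


print(expand_ranges([("a", "e"), ("Ö", "Ø")]))  # abcdeÖØ
```

The result can be placed between the `<alphabet>` tags of the dictionary.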