Alphabet

From Apertium
Revision as of 08:28, 28 April 2014 by Unhammer

The <alphabet> element of an lttoolbox dictionary specifies which characters can be part of a word, as opposed to "blank" characters. Its main effect is on the tokenisation of unknown words: a known word (one with a dictionary entry) may still contain non-alphabet characters.

Say your deu.dix looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
	<alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
	<sdefs>
		<sdef n="prn" 	c="Pronoun"/>
		<sdef n="pr" 	c="Preposition"/>
	</sdefs>
	<section id="main" type="standard">
		<e> <p> <l>es</l> <r>es<s n="prn"/></r> </p> </e>
		<e> <p> <l>über</l> <r>über<s n="pr"/></r> </p> </e>
	</section>
</dictionary>

We compile it:

$ lt-comp lr deu.dix deu.automorf.bin
main@standard 4 3

And it works as expected for "es":

$ echo es | lt-proc deu.automorf.bin
^es/es<prn>$

Given a word that is not in the dictionary but consists only of alphabet characters, we get an unknown-word analysis:

$ echo wei | lt-proc deu.automorf.bin
^wei/*wei$

But if the unknown word contains a character that is not in the alphabet, the unknown-word analysis is split at that character, which is treated as a blank:

$ echo große | lt-proc deu.automorf.bin
^gro/*gro$ß^e/*e$

This is as if the input had been "gro e": ß is not in the alphabet, so it separates two unknown words.

However, a non-alphabet character that is part of a dictionary entry is analysed correctly:

$ echo über grün | lt-proc deu.automorf.bin
^über/über<pr>$ ^gr/*gr$ü^n/*n$

(über is in the dictionary, so its ü causes no problem; grün is not, so it is split at the ü.)
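
The behaviour shown in the examples above can be imitated with a small Python sketch. This is a toy model written for illustration, not lttoolbox's actual algorithm: the function `analyse`, the `KNOWN` table and the rule that a dictionary match must not be followed by an alphabet character are all assumptions made here. It only aims to reproduce the outputs above and to show what changes once ß and the umlauts are added to the alphabet.

```python
# Toy model of the tokenisation described above.
# NOT lttoolbox's real implementation; all names here are hypothetical.

ASCII_ALPHABET = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
GERMAN_ALPHABET = ASCII_ALPHABET | set("ÄÖÜäöüß")

# Surface form -> analysis, mirroring the two entries in deu.dix.
KNOWN = {"es": "es<prn>", "über": "über<pr>"}


def analyse(text, alphabet, known=KNOWN):
    """Return an lt-proc-style analysis string for `text`."""
    out = []
    i = 0
    while i < len(text):
        # 1. Prefer the longest dictionary entry starting at position i.
        #    Assumption: it only counts if no alphabet character follows it.
        match = None
        for form in sorted(known, key=len, reverse=True):
            end = i + len(form)
            if text.startswith(form, i) and (
                end == len(text) or text[end] not in alphabet
            ):
                match = form
                break
        if match is not None:
            out.append(f"^{match}/{known[match]}$")
            i += len(match)
        elif text[i] in alphabet:
            # 2. Unknown word: consume the maximal run of alphabet characters.
            j = i
            while j < len(text) and text[j] in alphabet:
                j += 1
            word = text[i:j]
            out.append(f"^{word}/*{word}$")
            i = j
        else:
            # 3. Anything else is passed through as a blank.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for alphabet in (ASCII_ALPHABET, GERMAN_ALPHABET):
        print(analyse("große", alphabet))
        print(analyse("über grün", alphabet))
```

With `ASCII_ALPHABET` the sketch splits große and grün exactly as in the examples above; with `GERMAN_ALPHABET` each of them comes out as a single unknown word. In the dictionary itself, the corresponding change would be to add ÄÖÜäöüß to the contents of <alphabet>.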


Wishlist

  • Ability to specify Unicode ranges
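
Until ranges are supported, one possible workaround is to generate the literal character list outside the dictionary and paste it in. The following is a minimal sketch under that assumption; the helper names `alphabet_from_ranges` and `alphabet_element` are hypothetical and not part of lttoolbox.

```python
# Sketch: expand code-point ranges into a literal <alphabet> element.
# Helper names are hypothetical; lttoolbox provides nothing like this.
from xml.sax.saxutils import escape


def alphabet_from_ranges(ranges):
    """Expand (first, last) character pairs, keeping only letters."""
    chars = []
    for first, last in ranges:
        for code_point in range(ord(first), ord(last) + 1):
            ch = chr(code_point)
            # Skip non-letters inside a range, e.g. × and ÷ in À..ÿ.
            if ch.isalpha():
                chars.append(ch)
    return "".join(chars)


def alphabet_element(ranges):
    """Wrap the expanded characters in an <alphabet> element."""
    return "<alphabet>" + escape(alphabet_from_ranges(ranges)) + "</alphabet>"


if __name__ == "__main__":
    # Basic Latin letters plus the letters of the Latin-1 Supplement block.
    print(alphabet_element([("A", "Z"), ("a", "z"), ("À", "ÿ")]))
```

The printed element can then replace the <alphabet> line in the .dix file before running lt-comp.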