Frequently Asked Questions

From Apertium

How can I contribute to this project?

See Contributing.

Why so many copies of monolingual dictionaries?

In the beginning, there weren't many language pairs, so in keeping with the "getting something up and running" spirit of Apertium, monolingual dictionaries were simply copied from pair to pair instead of trying to design the dictionary to end all dictionaries. Now that there are a lot of pairs, however, the copies are becoming a hindrance.

We have started using shared monolingual dictionaries (and other monolingual data), but most old and stable language pairs still contain redundant data: Things Take Time, and we haven't gotten around to merging them all.

Why do you use XML and not a database?

Isn't XML a really inefficient format for storing dictionaries, with all that whitespace and all those tags? Aren't the files complicated to read? Wouldn't it be better to keep all the information in a database such as Postgres or MySQL, or even in flat text files?

We use XML for the following reasons:

  • Each data item is explicitly labelled with a descriptive, named tag with a clear meaning attached.
  • The structure of documents may easily be validated against DTDs or schemas.
  • Many technologies exist for working with XML (converting to and from XML, interoperability).
  • XML is quite easy to process with text-processing tools like sed, cut and awk.
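
As a rough illustration of the first and third points, the sketch below reads a small dictionary fragment with Python's standard xml.etree.ElementTree module. The fragment is a simplified, hypothetical example in the style of a monolingual dictionary; the entries and paradigm names are made up for illustration and are not taken from any real language pair.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative fragment in the style of a monolingual dictionary.
# The entries and paradigm names are assumptions made for this example.
DIX = """<dictionary>
  <section id="main" type="standard">
    <e lm="house"><i>house</i><par n="house__n"/></e>
    <e lm="wolf"><i>wol</i><par n="wol/f__n"/></e>
  </section>
</dictionary>"""


def list_entries(xml_text):
    """Return a (lemma, stem, paradigm) tuple for every <e> entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for e in root.iter("e"):
        stem = e.findtext("i", default="")  # text of the <i> child, if any
        par = e.find("par")                 # reference to an inflection paradigm
        paradigm = par.get("n") if par is not None else None
        entries.append((e.get("lm"), stem, paradigm))
    return entries


for lemma, stem, paradigm in list_entries(DIX):
    print(lemma, stem, paradigm)
```

Because every item is carried by a named tag or attribute, the code needs no knowledge of column order or separators, which is what a flat text file would require.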

For a practical and theoretical overview of our format for storing dictionaries, see Morphological dictionaries.

Does Apertium support separable verbs?

Many languages, for example Hungarian and most Germanic languages (with the exception of English), have a phenomenon called "separable verbs", also referred to as "attached prepositions" among other names. In such verbs, part of the infinitive detaches and moves elsewhere in the clause when the verb is conjugated. For example, the Afrikaans verb for "to announce" is "aankondig". The "aan" part separates when the verb is conjugated:

Astronomers announce [the discovery].
Sterrekundiges kondig [die ontdekking] aan.

However, the past tense would be:

Astronomers have announced [the discovery].
Sterrekundiges het [die ontdekking] aangekondig.

On its own, "kondig" does not mean anything.


Essentially, no: at the moment we do not support separable verbs. The problem for Apertium comes when the unseparated part does mean something on its own. It is currently impossible to analyse a word in two parts when they are separated by something as nebulous as a noun phrase (NP). There are a number of hacks that can be tried to get around this deficiency, but none of them works properly. For more information, or if you have ideas about how this might be fixed or dealt with, please see our page on Separable verbs.
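
The difficulty can be sketched with a toy analyser that looks up one token at a time. This is not how Apertium's analyser is implemented; the lexicon, the tag names and the convention of marking unknown words with "*" are all assumptions made for this illustration.

```python
# Toy lexicon: surface form -> (lemma, tags). Hypothetical entries and tags,
# chosen only to cover the two example sentences.
LEXICON = {
    "sterrekundiges": ("sterrekundige", "n.pl"),
    "die": ("die", "det"),
    "ontdekking": ("ontdekking", "n.sg"),
    "het": ("hê", "vaux"),
    "aangekondig": ("aankondig", "vblex.pp"),
    "aan": ("aan", "pr"),
}


def analyse(sentence):
    """Look up each token independently; unknown tokens get a '*' prefix."""
    result = []
    for token in sentence.lower().rstrip(".").split():
        if token in LEXICON:
            lemma, tags = LEXICON[token]
            result.append(f"{lemma}<{tags}>")
        else:
            result.append(f"*{token}")
    return result


print(analyse("Sterrekundiges het die ontdekking aangekondig."))
print(analyse("Sterrekundiges kondig die ontdekking aan."))
```

In the first sentence the verb is a single token and is found. In the second, "kondig" is unknown and "aan" is read as an ordinary preposition, because nothing in a token-by-token lookup connects two words that are separated by a noun phrase of arbitrary length.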

Why are there so many less widespread languages in Apertium? Wouldn't it be logical to focus on the most-spoken languages first?

There are a number of reasons we focus on languages with less socio-economic power.

  • Big companies pay less attention to these languages, so their communities don't "automatically" have certain resources available.
  • Parallel corpora are often too small for statistical machine translation (SMT) to be a viable option.
  • There are about 7,000 languages spoken in the world, which gives up to roughly 49 million possible language combinations (7,000 × 7,000). Given current resources (of all types), no one can hope to cover all of these.
  • Language communities are often keen to contribute.
  • Most Big Languages already have very good support in commercial products, so it'll take a lot of work before Apertium gets close to competitive quality for such languages, whereas with languages that don't already have good support, our work becomes useful from the beginning. This is motivating.
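
The 49 million figure in the list above is the square of the rough language count. A short sketch of the arithmetic, assuming the approximate figure of 7,000 languages, also shows the slightly smaller counts obtained when a language is not paired with itself:

```python
def pair_counts(n_languages):
    """Return (squared, directed, undirected) counts of language pairs."""
    squared = n_languages * n_languages          # upper bound used in the text
    directed = n_languages * (n_languages - 1)   # A->B and B->A counted separately
    undirected = directed // 2                   # A-B counted once
    return squared, directed, undirected


squared, directed, undirected = pair_counts(7000)
print(f"{squared:,} {directed:,} {undirected:,}")
```

Whichever way the pairs are counted, the total is in the tens of millions.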