Unnamed simple dictionary insert
I am writing a small web-based tool for inserting words into Apertium dictionaries as a university project. It aims to provide a standard, simple way of managing dictionaries.
For now, it can only handle the simplest kind of Apertium entry:
<e> <par/> </e>
It works like pastebin. When a user first connects to the site, they are asked to upload an Apertium pair and, if they have them, the configuration files. The user gets a random identifier that can be used to return to the session, as long as it has not been closed.
Currently, any user with the id can insert words, download your dictionaries and configuration files, or close the session (no security policy is implemented), so keep sessions short. I will add some way to block closing sessions and inserting words.
The application runs on PHP and uses BaseX as its XML database. It also uses a bit of XSLT.
We should upload both the monolingual and the bilingual files (3), choosing the Apertium pair they belong to (2). The application provides configuration files for some pairs, but you can upload your own configuration file. If you want to use a pair that is unavailable, you can uncheck (1) and type the pair manually. You must provide your own configuration for unavailable languages.
Only 3 languages are installed: es, eu and ca (although eu is badly configured). If you want to use a pair that shares a language with these, like es-pt, only the pt configuration file will be required.
Follow the same procedure for the other language (5), provide the bidirectional (bilingual) dix (6), and press the upload button.
Now we can insert words into the dictionaries.
The words shown are just an example. Obviously, azúcar (sugar) does not translate as xec (check).
First, we can choose which translation directions should be generated (1). Usually we only want the ↔ one, but we may generate any combination of ↔, →, and ←.
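As an illustration of how the three directions map onto a bilingual entry, here is a minimal Python sketch (the tool itself is written in PHP, so this is not its code). It relies on the standard Apertium convention that an `<e>` element with `r="LR"` is only used left to right, `r="RL"` only right to left, and no `r` attribute means both directions. The function name `make_bidix_entry` and the arrow table are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Arrow shown in the form -> value of the r attribute (None means no restriction).
RESTRICTION = {"↔": None, "→": "LR", "←": "RL"}


def make_bidix_entry(left, right, kind, arrow="↔"):
    """Build one bilingual <e> node for the given direction (hypothetical helper)."""
    entry = ET.Element("e")
    restriction = RESTRICTION[arrow]
    if restriction is not None:
        entry.set("r", restriction)
    pair = ET.SubElement(entry, "p")
    for tag, lemma in (("l", left), ("r", right)):
        side = ET.SubElement(pair, tag)
        side.text = lemma
        # The lexical category is written as an <s n="..."/> symbol.
        ET.SubElement(side, "s", {"n": kind})
    return entry


entry = make_bidix_entry("azúcar", "xec", "n", "→")
```

Calling the helper once per selected arrow would give one node per direction.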
Words should be written in their representative form, which is defined in the configuration file (2). We then get a list of possible paradigms (3) for the current word. After choosing a paradigm we get some inflected forms (4) of the word (again, the ones listed in the configuration file).
As we can see, for the noun azúcar we select the abril__n paradigm. This gives us the inflected forms azúcar and azúcares.
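The paradigm list and the inflected forms can be pictured with a small Python sketch. It assumes that a paradigm is offered when the word ends with its idForm, that the root is the word without that ending, and that the displayed forms are the representative form followed by root plus each flex ending. The `abril__n` entry below is an assumption chosen to match the azúcar example; the real es configuration file is not shown here.

```python
from typing import NamedTuple, Tuple


class Paradigm(NamedTuple):
    name: str              # n attribute
    id_form: str           # idForm attribute (ending of the representative form)
    flex: Tuple[str, ...]  # form attributes of the <flex> children


def candidates(word, paradigms):
    """Paradigms whose idForm is an ending of the word (empty idForm matches all)."""
    return [p for p in paradigms if word.endswith(p.id_form)]


def inflected_forms(word, paradigm):
    """Representative form first, then root + each configured ending."""
    root = word[: len(word) - len(paradigm.id_form)]
    return [root + paradigm.id_form] + [root + ending for ending in paradigm.flex]


ABRIL = Paradigm("abril__n", "", ("es",))      # assumed es entry
ACCIO = Paradigm("acci/ó__n", "ó", ("ons",))   # taken from the ca example file

forms = inflected_forms("azúcar", ABRIL)
```

Because a paradigm with an empty idForm matches every word, the candidate list can be long, which is one reason the desc attribute matters.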
After writing both words, the user can generate (5) the XML nodes that will be inserted into the dictionaries. If one (or both) of the words is already defined in its monolingual dictionary, no node is generated for it. If the translation is already defined in the bilingual dictionary, nothing is inserted.
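The skip rules above can be sketched in Python as follows. This is only one reading of the rules, not the tool's actual logic: it assumes a word counts as "defined" when some `<e>` has a matching `lm` attribute, that a translation counts as defined when the text of `<l>` and `<r>` matches both words exactly, and that "nothing is inserted" means no node at all. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def mono_defined(monodix, lemma):
    """True if some entry in the monolingual dictionary has lm == lemma."""
    return any(e.get("lm") == lemma for e in monodix.iter("e"))


def translation_defined(bidix, left, right):
    """True if a bilingual entry pairs exactly these two lemmas.

    Only the text before the first child of <l>/<r> is compared, so
    multiword lemmas containing <b/> are not handled in this sketch.
    """
    for e in bidix.iter("e"):
        l, r = e.find("p/l"), e.find("p/r")
        if l is not None and r is not None and l.text == left and r.text == right:
            return True
    return False


def nodes_to_generate(left_dix, right_dix, bidix, left, right):
    """List which nodes would be generated for a word pair."""
    if translation_defined(bidix, left, right):
        return []
    wanted = []
    if not mono_defined(left_dix, left):
        wanted.append("left")
    if not mono_defined(right_dix, right):
        wanted.append("right")
    wanted.append("bilingual")
    return wanted
```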
The user should be aware that the logic of this step is really simple: it does not check whether the translation node should be tagged as alternative or directional. It only checks whether the translation exists in the dictionary. The user should be as cautious as when editing dictionaries directly. To help the user, the tool shows all the entries of the bilingual dictionary related to both words.
Optionally, we can check whether the insertions are correct (7). The user can review and manually modify the insertion queries in the panel that appears.
When all the queries are generated, we can insert (6) the words into the dictionaries. Nodes are appended to the first section of the dictionaries. The application alerts the user when the insertion is done.
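To illustrate "appended to the first section", here is a minimal Python sketch using ElementTree as a stand-in for the BaseX update the tool actually performs. It assumes the usual dix layout, where `<section>` elements are direct children of `<dictionary>`. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def append_to_first_section(dictionary, entry):
    """Append an <e> node to the first <section> of a parsed dix document."""
    section = dictionary.find("section")  # first direct <section> child
    if section is None:
        raise ValueError("dictionary has no <section> element")
    section.append(entry)
    return section


dix = ET.fromstring(
    '<dictionary>'
    '<section id="main" type="standard"/>'
    '<section id="final" type="inconditional"/>'
    '</dictionary>'
)
new_entry = ET.fromstring('<e lm="azúcar"><i>azúcar</i><par n="abril__n"/></e>')
append_to_first_section(dix, new_entry)
```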
We can clear the form (8) if we like, or keep working from the old data.
Finally, the user can export (9) the dictionaries. This will take a while. When this operation ends, the user will get download links for the updated dictionaries.
The user can close their session at any time (10). This frees the user identifier and deletes all the uploaded dictionaries and configuration files. The user is warned about this, but for now any user with the id can do it.
Now, you can select the lexical category of the words (11), and you will get some information about translation nodes containing the word you are trying to insert (12).
In this example, we can see all the translations whose right part is (or contains) xec, or whose left part is (or contains) azúcar. This gives us some context for both words.
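The lookup described above can be sketched in Python over a parsed bilingual dictionary. The substring test mirrors "is (or contains)". The sample entries are made up for illustration, and `related_entries` is a hypothetical name, not part of the tool.

```python
import xml.etree.ElementTree as ET


def side_text(entry, tag):
    """Concatenated text of the <l> or <r> side of a bilingual entry."""
    node = entry.find(f"p/{tag}")
    return "".join(node.itertext()) if node is not None else ""


def related_entries(bidix, left, right):
    """Entries whose left side contains `left` or whose right side contains `right`."""
    return [
        e for e in bidix.iter("e")
        if left in side_text(e, "l") or right in side_text(e, "r")
    ]


# Made-up sample data, only to exercise the function.
bidix = ET.fromstring(
    '<dictionary><section id="main" type="standard">'
    '<e><p><l>azúcar<s n="n"/></l><r>sucre<s n="n"/></r></p></e>'
    '<e><p><l>cheque<s n="n"/></l><r>xec<s n="n"/></r></p></e>'
    '<e><p><l>casa<s n="n"/></l><r>casa<s n="n"/></r></p></e>'
    '</section></dictionary>'
)
context = related_entries(bidix, "azúcar", "xec")
```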
In this screenshot we can see the node editing interface. The nodes are standard Apertium dix nodes (the paradigm of the word sucre was changed to the incorrect abismo__n). Any changes made to these nodes will be written directly to the dictionaries.
You can close it from the X button or from the same button that opened it (7).
Configuration file definition
The file should be named config.lang.xml, where lang is the language code (for example, config.ca.xml).
This is an old config.ca.xml (the current one can be downloaded from the tool's upload page). It's really simple, and only handles 7 paradigms.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE splitter SYSTEM "config.dtd">
<splitter>
  <paradigms>
    <paradigm n="abric__n" idForm="" desc="Nom, Masculí (Abric)" kind="n">
      <flex form="s"/>
    </paradigm>
    <paradigm n="ahir__adv" idForm="" desc="Adverbi (Ahir)" kind="adv"/>
    <paradigm n="abell/a__n" idForm="a" desc="Nom, Femení (Abella)" kind="n">
      <flex form="es"/>
    </paradigm>
    <paradigm n="abander/ar__lex" idForm="ar" desc="Verb regular, 1ª conjugació (Abanderar)" kind="vb">
      <flex form="àssiu"/>
      <flex form="àveu"/>
    </paradigm>
    <paradigm n="abdominal__adj" idForm="" desc="Adjectiu, masculí/femení (Abdominal)" kind="adj">
      <flex form="s"/>
    </paradigm>
    <paradigm n="acci/ó__n" idForm="ó" desc="Nom, Femení (Acció)" kind="n">
      <flex form="ons"/>
    </paradigm>
    <paradigm n="acadèmi/c__adj" idForm="c" desc="Adjectiu, masculí (Acadèmic)" kind="adj">
      <flex form="cs"/>
      <flex form="ca"/>
      <flex form="ques"/>
    </paradigm>
  </paradigms>
</splitter>
The <paradigm> node has the following attributes:
- n : name of the paradigm in the apertium dix
- idForm : representative form of the paradigm. It is used to build the paradigm list and to get the root of a word
- desc : short, simple description shown in the paradigm list. Useful for pairs with obscure paradigm names (like LAT_35 in eu)
- kind : lexical type of the paradigm.
All the attributes except idForm need a non-empty value.
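A configuration file can be checked against this rule with a few lines of Python. This is a sketch and not part of the tool: `config_problems` and its message format are made up for illustration, and only the non-empty rule for n, desc and kind is checked.

```python
import xml.etree.ElementTree as ET

REQUIRED = ("n", "desc", "kind")  # idForm is allowed to be empty


def config_problems(xml_text):
    """Return one message per required attribute that is missing or empty."""
    root = ET.fromstring(xml_text)
    problems = []
    for paradigm in root.iter("paradigm"):
        name = paradigm.get("n") or "(unnamed)"
        for attr in REQUIRED:
            if not paradigm.get(attr, "").strip():
                problems.append(f"{name}: empty {attr}")
    return problems


sample = (
    '<splitter><paradigms>'
    '<paradigm n="abric__n" idForm="" desc="Nom, Masculí (Abric)" kind="n">'
    '<flex form="s"/></paradigm>'
    '<paradigm n="ahir__adv" idForm="" desc="" kind="adv"/>'
    '</paradigms></splitter>'
)
problems = config_problems(sample)
```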
The <flex> node represents a significant form of a paradigm. The forms should be as representative as possible, so that the user can switch to another paradigm on seeing a wrong inflection. There should also be as few of them as possible, because users tend to skip a long list. It has only one attribute:
- form : the ending of the word
Making your own configuration file
Now, you can make your own configuration file with the scripts included in simpledix/simpledix/configHelp
- First, we need the .dix files of the dictionaries we are going to work with.
- We have 4 options for generating the configuration files:
  - Writing them by hand. It is a big job, but you only have to do it once.
  - The non-interactive script generates a configuration file with all the paradigms (including the auxiliary paradigms).
  - The interactive script asks us which paradigms to include, one by one.
    getConfigFileInteractive.sh apertium-xx-yy.xx.dix out.xml
  - The configurable script. We have to write a file that describes which kinds of paradigms should go into the simpledix configuration file.
    getConfigFile_2.sh apertium-xx-yy.xx.dix paradigms.xx-yy.txt
A possible description file:
vblex
.inf
.fts p2 sg
.ger
.imp p2 pl
.pis p2 sg
n
.m sg
.f sg
.m pl
.f pl
.mf sg
.mf pl
.m sp
.f sp
.mf sp
This means that, for example, if we find a noun (n) paradigm, the configuration file will contain the forms tagged n, m, sg; n, f, sg; and so on, but not n, m, t. The canonical form (idForm) of the paradigm in the simpledix configuration file is decided by the order of the entries in this description file.
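As a rough illustration, the description file could be read like this in Python. The layout is an assumption: a category (such as n or vblex) on its own line, followed by lines starting with "." that give the remaining tags of each wanted form. The real script may parse the file differently, and `parse_description` is a hypothetical name.

```python
def parse_description(text):
    """Map each category to the ordered list of tag sequences wanted for it."""
    wanted = {}
    current = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].startswith("."):
            if current is None:
                raise ValueError("tag line found before any category line")
            # ".m sg" under category "n" becomes ["n", "m", "sg"]
            wanted[current].append([current, tokens[0][1:], *tokens[1:]])
        else:
            current = tokens[0]
            wanted.setdefault(current, [])
    return wanted


description = "vblex\n.inf\n.fts p2 sg\nn\n.m sg\n.f sg\n"
wanted = parse_description(description)
```

Under this reading, the first sequence listed for a category (for example `wanted["n"][0]`) would be the one that decides the idForm.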
The non-interactive and interactive versions of the script extract the idForm and the kind if the paradigm name is in the acci/ó__n form. The user has to manually remove unwanted paradigms or inflections, and fill in empty idForm, desc and kind fields.
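The extraction from a paradigm name can be sketched in Python as below. It assumes that the part after "/" is the idForm and the part after the last "__" is the kind, and that names without "__" yield two empty fields to be filled in by hand. The shipped scripts are shell scripts and may differ in detail, and `split_paradigm_name` is a hypothetical name.

```python
def split_paradigm_name(name):
    """Return (idForm, kind) guessed from a name like 'acci/ó__n'."""
    stem, separator, kind = name.rpartition("__")
    if not separator:
        # No "__" in the name (e.g. LAT_35): nothing can be extracted.
        return "", ""
    _, _, ending = stem.partition("/")
    return ending, kind


id_form, kind = split_paradigm_name("acci/ó__n")
```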
- Insert all kinds of Apertium monolingual nodes.
- Sorted paradigm list for each word.
- Better awareness of conflicts in bilingual insertions.
- metadix support
The tool works with metadix, but won't handle the special attributes of the <par> node. The user can write the attributes manually, but there should be fields for them.
Please, report all the bugs you find.
username at alu dot ua dot es