User:Dtr5

From Apertium
Revision as of 09:31, 15 May 2012 by Dtr5 (talk | contribs) (Config file is updated in the tool, but I keep the old one for clarity. Also, document the new field.)

Unnamed simple dictionary insert

http://apertium.vm.bytemark.co.uk/simpledix

I am writing a small web-based tool for inserting words into Apertium dictionaries as a university project. It aims to offer a standard and simple way of managing dictionaries.

For now, it can only handle the simplest kind of Apertium entry:

 <e>
    <par/>
 </e>

It works like pastebin. When a user first connects to the site, they are asked to upload an Apertium pair and, if they have them, the configuration files. The user gets a random identifier that can be used to return to the session, as long as it has not been closed.

Currently, anyone who has the identifier can insert words, download your dictionaries and configuration files, or close the session (no security policy is implemented), so keep sessions short. I will add some way to restrict closing sessions and inserting words.

Initialization

Upload.png

We should upload both monolingual files and the bilingual file (3), and choose the Apertium pair they belong to (2). The application provides configuration files for some pairs, but you can upload your own. If you want to use a pair that is not available, uncheck (1) and type the pair manually. You must provide your own configuration files for unavailable languages.

Only 3 languages are installed: es, eu and ca (although eu is badly configured). If your pair shares a language with the installed ones, as es-pt does, only the pt configuration file is required.

Now we can insert words into the dictionaries.

Word insertion

Insert.png

First, we choose which translation directions should be generated (1). Usually we only want bidirectional entries, but any combination of bidirectional, left-to-right and right-to-left entries can be generated.
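In an Apertium bilingual dictionary, an entry without an r attribute applies in both directions, while r="LR" and r="RL" restrict it to left-to-right or right-to-left translation. A minimal sketch of how a direction choice could map onto that attribute follows. It is written in Python for illustration only; the function name and the direction labels are hypothetical and not taken from the tool, and lexical symbols are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical labels for the direction choice -> value of the "r" attribute.
RESTRICTION = {"both": None, "left-to-right": "LR", "right-to-left": "RL"}


def bilingual_entry(left, right, direction="both"):
    """Build <e [r="LR"|"RL"]><p><l>left</l><r>right</r></p></e>."""
    entry = ET.Element("e")
    restriction = RESTRICTION[direction]
    if restriction is not None:
        entry.set("r", restriction)  # unrestricted entries carry no "r" at all
    pair = ET.SubElement(entry, "p")
    ET.SubElement(pair, "l").text = left
    ET.SubElement(pair, "r").text = right
    return entry


if __name__ == "__main__":
    for direction in RESTRICTION:
        node = bilingual_entry("abella", "abeja", direction)
        print(direction, ET.tostring(node, encoding="unicode"))
```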

Words should be written in their representative form, which is defined in the configuration file (2). We then get a list of possible paradigms (3) for the current word. After choosing a paradigm, we get some inflected forms (4) of the word (again, the ones listed in the configuration file).
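The lookup in steps (2) to (4) can be sketched as follows. This is an illustration in Python, not the tool's own code: the Paradigm class and the function names are hypothetical, and the data is copied from the sample config.ca.xml further down this page. The sketch assumes that a paradigm is a candidate when the typed word ends with its idForm, and that the stem is the word minus that ending.

```python
from dataclasses import dataclass, field


@dataclass
class Paradigm:
    name: str                                  # value of the "n" attribute
    id_form: str                               # ending of the representative form ("idForm")
    flex: list = field(default_factory=list)   # endings from <flex form="..."/>


# Three paradigms copied from the sample config.ca.xml.
PARADIGMS = [
    Paradigm("abric__n", "", ["s"]),
    Paradigm("abell/a__n", "a", ["es"]),
    Paradigm("acadèmi/c__adj", "c", ["cs", "ca", "ques"]),
]


def candidates(word, paradigms):
    """Return the paradigms whose idForm is a suffix of the word (assumed rule)."""
    return [p for p in paradigms if word.endswith(p.id_form)]


def stem(word, paradigm):
    """Strip the idForm ending from the word to get the stem."""
    return word[: len(word) - len(paradigm.id_form)]


def preview(word, paradigm):
    """Build the inflected forms that would be shown to the user."""
    root = stem(word, paradigm)
    return [root + ending for ending in paradigm.flex]


if __name__ == "__main__":
    for p in candidates("abella", PARADIGMS):
        print(p.name, preview("abella", p))
    # abric__n ['abellas']
    # abell/a__n ['abelles']
```

For abella, both abric__n (empty idForm) and abell/a__n match. The previewed forms (abellas versus abelles) are what lets the user spot the wrong paradigm.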

After writing both words, the user can generate (5) the XML nodes that will be inserted into the dictionaries. If one (or both) of the words is already defined in its monolingual dictionary, no node is generated for it. If the translation is already defined in the bilingual dictionary, nothing is inserted.
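The checks described above can be illustrated like this. It is a sketch in Python rather than the tool's own code: the function names are hypothetical, the entry layout follows the usual Apertium dix conventions (lm attribute, stem in an i element, l and r inside p), and lexical symbols are left out for brevity. It also assumes that "nothing is inserted" means no node at all is produced when the translation already exists.

```python
import xml.etree.ElementTree as ET


def mono_entry(lemma, stem, paradigm):
    """Build <e lm="lemma"><i>stem</i><par n="paradigm"/></e>."""
    entry = ET.Element("e", {"lm": lemma})
    ET.SubElement(entry, "i").text = stem
    ET.SubElement(entry, "par", {"n": paradigm})
    return entry


def bi_entry(left, right):
    """Build <e><p><l>left</l><r>right</r></p></e> (no <s> tags in this sketch)."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    ET.SubElement(pair, "l").text = left
    ET.SubElement(pair, "r").text = right
    return entry


def has_lemma(monodix, lemma):
    """True if some <e> in the monolingual dictionary has this lm."""
    return any(e.get("lm") == lemma for e in monodix.iter("e"))


def has_translation(bidix, left, right):
    """True if some <p> pairs exactly these two lemmas."""
    return any(
        p.findtext("l") == left and p.findtext("r") == right
        for p in bidix.iter("p")
    )


def generate(left_dix, right_dix, bidix, left, right):
    """left and right are (lemma, stem, paradigm) tuples.

    Returns a list of (target, node) pairs to insert.
    """
    if has_translation(bidix, left[0], right[0]):
        return []  # assumption: an existing translation blocks everything
    nodes = []
    if not has_lemma(left_dix, left[0]):
        nodes.append(("left", mono_entry(*left)))
    if not has_lemma(right_dix, right[0]):
        nodes.append(("right", mono_entry(*right)))
    nodes.append(("bilingual", bi_entry(left[0], right[0])))
    return nodes
```

As in the tool, the only bilingual check is an exact match of the pair, so an entry that should be directional or alternative still has to be caught by the user.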

The user should be aware that the logic of this step is really simple: it does not check whether the translation node should be tagged as alternative or directional. It only checks whether the translation already exists in the dictionary. The user should be as cautious as when editing dictionaries directly. To help, the tool shows all the entries of the bilingual dictionary related to both words.

Optionally, we can check whether the insertions are correct (7). The insertion queries are shown in a panel, where the user can review them and modify them by hand.

When all the queries are generated, we can insert (6) the words into the dictionaries. Nodes are appended to the first section of each dictionary. The application alerts the user when the insertion is done.

We can clear the form (8) if we like, or keep working with the old data.
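Appending to the first section can be pictured as follows. The sketch uses Python's standard xml.etree.ElementTree on a tiny in-memory dictionary. The tool itself keeps the dictionaries in BaseX, so this only illustrates the effect, and append_to_first_section is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for a monolingual dictionary with two sections.
DIX = """<dictionary>
  <section id="main" type="standard">
    <e lm="abric"><i>abric</i><par n="abric__n"/></e>
  </section>
  <section id="final" type="inconditional"/>
</dictionary>"""


def append_to_first_section(root, entry):
    """Append the entry as the last child of the first <section>."""
    section = root.find("section")  # first <section> child in document order
    if section is None:
        raise ValueError("dictionary has no <section>")
    section.append(entry)


root = ET.fromstring(DIX)
new_entry = ET.fromstring('<e lm="abella"><i>abell</i><par n="abell/a__n"/></e>')
append_to_first_section(root, new_entry)

if __name__ == "__main__":
    print(ET.tostring(root, encoding="unicode"))
```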

Finally, the user can export (10) the dictionaries. This will take a while. When this operation ends, the user will get download links for the updated dictionaries.

The user can close the session at any time (11). This frees the session identifier and deletes all the uploaded dictionaries and configuration files. The user is warned about this, but for now anyone with the identifier can do it.

You can now also select the lexical category of the words (12), and you get some information about the translation nodes that contain the word you are trying to insert (13).

Configuration file definition

The file should be named config.lang.xml, where lang is the language code (as in config.ca.xml).

This is an old config.ca.xml (the current one can be downloaded from the tool's upload page). It's really simple, and only handles 7 paradigms.

 <?xml version="1.0" encoding="UTF-8"?> 
 <!DOCTYPE splitter SYSTEM "config.dtd">
 <splitter>
   <paradigms>
       <paradigm n = "abric__n" idForm="" desc="Nom, Masculí (Abric)" kind="noun">
           <flex form = "s"/>
       </paradigm>
       <paradigm n = "ahir__adv" idForm = "" desc = "Adverbi (Ahir)" kind="adverb"/>
       <paradigm n = "abell/a__n" idForm = "a" desc = "Nom, Femení (Abella)" kind="noun">
           <flex form = "es"/>
       </paradigm>
       <paradigm n = "abander/ar__lex" idForm = "ar" desc = "Verb regular, 1ª conjugació (Abanderar)" kind="verb">
           <flex form = "àssiu"/>
           <flex form = "àveu"/>
       </paradigm>
       <paradigm n = "abdominal__adj" idForm = "" desc = "Adjectiu, masculí/femení (Abdominal)" kind="adjective"> 
           <flex form = "s"/>
       </paradigm>
       <paradigm n = "acci/ó__n" idForm = "ó" desc = "Nom, Femení (Acció)" kind="noun">
           <flex form = "ons"/>
       </paradigm>
       <paradigm n = "acadèmi/c__adj" idForm = "c" desc = "Adjectiu, masculí (Acadèmic)" kind="adjective">
           <flex form = "cs"/>
           <flex form = "ca"/>
           <flex form = "ques"/>
       </paradigm>
   </paradigms>
 </splitter>

The <paradigm> node has the following attributes:

  • n : name of the paradigm in the Apertium dix
  • idForm : ending of the representative form of the paradigm. It is used to build the paradigm list and to get the root of a word
  • desc : short, simple description shown in the paradigm list. Useful for pairs with obscure paradigm names (like LAT_35 in eu)
  • kind : lexical category of the paradigm.

The <flex> node represents a significant form of a paradigm. These forms should be as representative as possible, so the user can change the paradigm on seeing a wrong inflection. There should also be as few of them as possible, because users tend to skip a long list. The node has only 1 attribute:

  • form : the ending of the word
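To make the structure concrete, here is a sketch that reads such a file with Python's standard library. It is not part of the tool: load_paradigms and by_kind are hypothetical helpers, and the embedded XML is a shortened copy of the sample above, without the XML declaration and DOCTYPE.

```python
import xml.etree.ElementTree as ET

CONFIG = """<splitter>
  <paradigms>
    <paradigm n="ahir__adv" idForm="" desc="Adverbi (Ahir)" kind="adverb"/>
    <paradigm n="abander/ar__lex" idForm="ar" desc="Verb regular, 1ª conjugació (Abanderar)" kind="verb">
      <flex form="àssiu"/>
      <flex form="àveu"/>
    </paradigm>
  </paradigms>
</splitter>"""


def load_paradigms(xml_text):
    """Return one dict per <paradigm>, with its <flex> endings in a list."""
    root = ET.fromstring(xml_text)
    result = []
    for par in root.iter("paradigm"):
        result.append({
            "n": par.get("n"),
            "idForm": par.get("idForm", ""),
            "desc": par.get("desc", ""),
            "kind": par.get("kind"),
            "flex": [f.get("form") for f in par.findall("flex")],
        })
    return result


def by_kind(paradigms, kind):
    """Keep only the paradigms of one lexical category."""
    return [p for p in paradigms if p["kind"] == kind]


if __name__ == "__main__":
    for p in by_kind(load_paradigms(CONFIG), "verb"):
        print(p["n"], p["idForm"], p["flex"])
```

A paradigm without <flex> children, such as ahir__adv, simply gets an empty list of endings.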


The application runs on PHP and uses BaseX as its XML database. It also uses a bit of XSLT.

TODO

  • Insert all kinds of Apertium monolingual nodes.
  • Sorted paradigm list for each word.
  • Better awareness of conflicts in bilingual insertions.
  • metadix support

The tool works with metadix, but won't handle the special attributes of the <par> node. The user can type the attributes manually, but the form should have fields for them.


BUGS

Please report any bugs you find.