Difference between revisions of "User:Dtr5"


Revision as of 08:38, 15 March 2012

Unnamed simple dictionary insert

http://apertium.vm.bytemark.co.uk/simpledix

I am programming a little tool for inserting words into Apertium dictionaries as a university project. The tool is web-based. It aims for a standard and simple way of managing dictionaries.

For now, it can only handle the simplest kind of Apertium entry:

 <e>
    <par/>
 </e>

It works like a pastebin. When a user first connects to the site, the user is asked to upload an Apertium pair and, if available, its configuration files. The user then receives a random identifier that can be used to return to the session later.

Upload.png

Upload both monolingual dictionaries and the bilingual dictionary (3), and choose the Apertium pair they belong to (2). The application provides configuration files for some pairs, but you can upload your own. To use a pair that is not listed, uncheck (1) and type the pair manually. You must provide your own configuration files for languages that are not available.

Only 3 pairs are installed: es-ca, eu-es and eu-ca (although eu is badly configured). If you want to use a pair that shares a language with an installed pair, like es-pt, only the pt configuration file is mandatory.
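The rule above can be sketched in a few lines of Python. This is only an illustration, not the tool's code (the tool itself runs on PHP); the function names are hypothetical, and the assumption is that a language counts as configured whenever it appears in any installed pair.

```python
# Pairs whose configuration ships with the tool, as listed above.
INSTALLED_PAIRS = ["es-ca", "eu-es", "eu-ca"]


def bundled_languages(pairs):
    """Collect every language code that appears in an installed pair."""
    langs = set()
    for pair in pairs:
        langs.update(pair.split("-"))
    return langs


def configs_to_upload(pair, pairs=INSTALLED_PAIRS):
    """Return the languages of `pair` that need a user-supplied config file."""
    bundled = bundled_languages(pairs)
    return [lang for lang in pair.split("-") if lang not in bundled]


print(configs_to_upload("es-pt"))
```

For es-pt this reports only pt, so the one file the user has to supply would be config.pt.xml.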


The user can now insert words into the dictionaries.

Insert.png

First, we can choose what translation directions should be generated (1). Usually, we want to generate only the ones, but we may generate any combination of , , and .

Each word should be written in its representative form, which is defined in the configuration file (2). The user then gets a list of possible paradigms (3) for the current word. After choosing a paradigm, we get some inflected forms (4) of the word (again, the ones listed in the configuration file).
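A minimal Python sketch of this lookup follows. It assumes that a paradigm is offered when the word ends with the paradigm's idForm, that the root is the word without that ending, and that each inflected form is the root plus a flex ending. That matching rule is an interpretation of the configuration file, not confirmed behaviour of the tool, and the function name is hypothetical. The data is a subset of config.ca.xml.

```python
# (paradigm name, idForm, flex endings), taken from config.ca.xml.
PARADIGMS = [
    ("abric__n", "", ["s"]),
    ("abell/a__n", "a", ["es"]),
    ("acci/ó__n", "ó", ["ons"]),
    ("acadèmi/c__adj", "c", ["cs", "ca", "ques"]),
]


def candidates(word, paradigms=PARADIGMS):
    """Return (paradigm name, root, inflected forms) for each paradigm that fits."""
    result = []
    for name, id_form, endings in paradigms:
        if not word.endswith(id_form):
            continue
        # Strip the idForm ending to get the root (an empty idForm strips nothing).
        root = word[: len(word) - len(id_form)]
        forms = [root + ending for ending in endings]
        result.append((name, root, forms))
    return result


for name, root, forms in candidates("abella"):
    print(name, root, forms)
```

Under these assumptions "abella" matches both abric__n (giving "abellas") and abell/a__n (giving "abelles"), which is why showing the inflected forms helps the user pick the right paradigm.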

After entering both words, the user can generate (5) the XML nodes that will be inserted into the dictionaries. If one or both words are already defined in their monolingual dictionaries, the corresponding nodes are not generated. If the translation is already defined in the bilingual dictionary, nothing is inserted.
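The decision logic described above might look like the following Python sketch. The function name and the set-based inputs are hypothetical, and reading "nothing is inserted" as skipping the whole word pair (monolingual nodes included) is an assumption.

```python
def plan_insertions(left, right, left_mono, right_mono, bilingual):
    """Decide which entries to generate for a (left, right) word pair.

    left_mono and right_mono are sets of lemmas already present in each
    monolingual dictionary; bilingual is a set of (left, right) translations
    already present in the bilingual dictionary.
    """
    if (left, right) in bilingual:
        return []  # translation already defined: nothing is inserted
    plan = []
    if left not in left_mono:
        plan.append(("left", left))
    if right not in right_mono:
        plan.append(("right", right))
    plan.append(("bilingual", (left, right)))
    return plan


print(plan_insertions("abric", "abrigo", {"abric"}, set(), set()))
```

Here "abric" already exists on the left side, so only the right-hand monolingual node and the bilingual node would be generated.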

The user should be aware that the logic of this step is really simple: it does not check whether the translation node should be tagged as alternative or directional, only whether the translation already exists in the dictionary. The user should be as cautious as when editing dictionaries directly. To help, the application shows all the entries of the bilingual dictionary related to both words.

Optionally, we can check whether the insertions are correct (7): the user can review and manually modify the insertion queries.

When all the queries are generated, we can insert (6) the words into the dictionaries. Nodes are appended to the first section of the dictionaries. The application alerts the user when the insertion is done.
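As an illustration of appending to the first section, here is a small Python sketch that uses the standard library's ElementTree as a stand-in for the BaseX queries the tool actually runs. The toy dictionary, the function name, and the shape of the entry (an lm attribute and an <i> child, as commonly seen in Apertium monolingual dictionaries) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Toy dictionary with two sections (hypothetical content).
DIX = """<dictionary>
  <pardefs/>
  <section id="main" type="standard"/>
  <section id="extra" type="standard"/>
</dictionary>"""


def append_to_first_section(root, entry_xml):
    """Parse entry_xml and append it to the first <section> of the dictionary."""
    section = root.find("section")  # first <section> child in document order
    if section is None:
        raise ValueError("dictionary has no <section>")
    section.append(ET.fromstring(entry_xml))
    return section


root = ET.fromstring(DIX)
append_to_first_section(root, '<e lm="abric"><i>abric</i><par n="abric__n"/></e>')
print(ET.tostring(root, encoding="unicode"))
```

The second section is left untouched, which matches the behaviour described above: new nodes always land in the first section.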

We can clear the form (9) if we like.

Finally, the user can export (10) the dictionaries. This will take a while. When this operation ends, the user will get download links for the updated dictionaries.

The user can close the session at any time (11). This frees the user identifier and deletes all the uploaded dictionaries and configuration files. The user will be warned.


Configuration file definition

The file should be named config.lang.xml, where lang is the language code (for example, config.ca.xml).

This is the current config.ca.xml. It's really simple, and handles only 7 paradigms.

 <?xml version="1.0" encoding="UTF-8"?> 
 <!DOCTYPE splitter SYSTEM "config.dtd">
 <splitter>
   <paradigms>
       <paradigm n = "abric__n" idForm="" desc="Nom, Masculí (Abric)">
           <flex form = "s"/>
       </paradigm>
       <paradigm n = "ahir__adv" idForm = "" desc = "Adverbi (Ahir)"/>
       <paradigm n = "abell/a__n" idForm = "a" desc = "Nom, Femení (Abella)">
           <flex form = "es"/>
       </paradigm>
       <paradigm n = "abander/ar__lex" idForm = "ar" desc = "Verb regular, 1ª conjugació (Abanderar)">
           <flex form = "àssiu"/>
           <flex form = "àveu"/>
       </paradigm>
       <paradigm n = "abdominal__adj" idForm = "" desc = "Adjectiu, masculí/femení (Abdominal)">
           <flex form = "s"/>
       </paradigm>
       <paradigm n = "acci/ó__n" idForm = "ó" desc = "Nom, Femení (Acció)">
           <flex form = "ons"/>
       </paradigm>
       <paradigm n = "acadèmi/c__adj" idForm = "c" desc = "Adjectiu, masculí (Acadèmic)">
           <flex form = "cs"/>
           <flex form = "ca"/>
           <flex form = "ques"/>
       </paradigm>
   </paradigms>
 </splitter>

The <paradigm> node has the following attributes:

  • n : name of the paradigm in the apertium dix
  • idForm : ending of the representative form of the paradigm. It is used for getting the paradigm list and the root of a word
  • desc : short, simple description shown in the paradigm list. Useful for pairs with obscure paradigm names (like LAT_35 in eu)

The <flex> node represents a significant inflected form of a paradigm. The forms should be as representative as possible, so the user can change the paradigm on seeing a wrong inflection. They should also be as few as possible, because a long list will lead users to skip it. The node has only 1 attribute:

  • form : the ending of the word
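To show how these nodes and attributes fit together, the following Python sketch reads a shortened copy of the configuration with the standard library's ElementTree. It is not the tool's own code (which runs on PHP with BaseX); the function name and the dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of config.ca.xml (declaration and DOCTYPE omitted).
CONFIG = """<splitter>
  <paradigms>
    <paradigm n="abric__n" idForm="" desc="Nom, Masculí (Abric)">
      <flex form="s"/>
    </paradigm>
    <paradigm n="ahir__adv" idForm="" desc="Adverbi (Ahir)"/>
    <paradigm n="abell/a__n" idForm="a" desc="Nom, Femení (Abella)">
      <flex form="es"/>
    </paradigm>
  </paradigms>
</splitter>"""


def load_paradigms(xml_text):
    """Return one dict per <paradigm>, with its attributes and flex endings."""
    root = ET.fromstring(xml_text)
    paradigms = []
    for node in root.iter("paradigm"):
        paradigms.append({
            "n": node.get("n"),
            "idForm": node.get("idForm", ""),
            "desc": node.get("desc", ""),
            "flex": [flex.get("form") for flex in node.findall("flex")],
        })
    return paradigms


for paradigm in load_paradigms(CONFIG):
    print(paradigm["n"], repr(paradigm["idForm"]), paradigm["flex"])
```

A paradigm without <flex> children, such as ahir__adv, simply ends up with an empty list of endings.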


The application runs on PHP and uses BaseX as its XML database. It also uses a bit of XSLT.


TODO
  • Insert all kinds of Apertium monolingual nodes.
  • Sorted paradigm list for each word.
  • Better awareness of conflicts in bilingual insertions.
  • metadix support

The tool works with metadix, but won't handle the special attributes of the <par> node. The user can write the attributes manually, but the tool should have a field for them.


BUGS

Please report any bugs you find.