Difference between revisions of "User:Dtr5"
Revision as of 06:53, 17 April 2012
Unnamed simple dictionary insert
http://apertium.vm.bytemark.co.uk/simpledix
I am writing a small web-based tool for inserting words into Apertium dictionaries as a university project. It aims to provide a standard, simple way of managing dictionaries.
For now, it can only handle the simplest kind of Apertium entry:
<e> <par/> </e>
It works like pastebin. When a user first connects to the site, they are asked to upload an Apertium pair and, if they have them, the configuration files. The user then gets a random identifier that can be used to return to the session, as long as it has not been closed.
Currently, anyone who has the identifier can insert words, download your dictionaries and configuration files, or close the session (no security policy is implemented), so keep sessions short. I will add some way to block closing sessions and inserting words.
Initialization
We should upload both monolingual dictionaries and the bilingual one (3), choosing the Apertium pair they belong to (2); the numbers refer to the callouts in the Upload.png screenshot. The application provides configuration files for some pairs, but you can upload your own. If you want to use a pair that is not available, uncheck (1) and type the pair manually. You must provide your own configuration files for languages that are not available.
Only three languages are installed: es, eu and ca (although eu is badly configured). If your pair shares a language with the installed ones, as es-pt does, only the configuration file for the other language (pt) is required.
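The rule above can be sketched in a few lines of Python. This is only an illustration, not the tool's code (the tool itself runs on PHP); the function name missing_configs is made up for the example, and the file names follow the config.lang.xml convention described later on this page.

```python
# Languages that already have a configuration file on the server, per the text above.
INSTALLED = {"es", "eu", "ca"}


def missing_configs(pair, installed=INSTALLED):
    """Return the configuration files the user must upload for a pair such as 'es-pt'."""
    langs = pair.split("-")
    return ["config.%s.xml" % lang for lang in langs if lang not in installed]


print(missing_configs("es-pt"))  # ['config.pt.xml']
print(missing_configs("es-ca"))  # []
```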
Now we can insert words into the dictionaries.
Word insertion
First, we can choose which translation directions should be generated (1); the numbers refer to the callouts in the Insert.png screenshot. Usually, we want to generate only the ↔ ones, but we may generate any combination of ↔ , → , and ←.
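As a rough sketch of what the three directions mean in a bilingual dictionary: Apertium marks one-way entries with the r attribute on the <e> node ("LR" for left to right, "RL" for right to left), and entries without it work both ways. The Python below is an assumption about how the arrows could map to that attribute, not the tool's actual output; the helper name bidix_entry is made up, and the <s> symbol tags that a real entry carries are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Arrow -> value of the bidix "r" attribute (None means no restriction).
DIRECTIONS = {"↔": None, "→": "LR", "←": "RL"}


def bidix_entry(left, right, arrow):
    """Build a minimal <e><p><l>..</l><r>..</r></p></e> node for one direction."""
    restriction = DIRECTIONS[arrow]
    attrs = {} if restriction is None else {"r": restriction}
    entry = ET.Element("e", attrs)
    pair = ET.SubElement(entry, "p")
    ET.SubElement(pair, "l").text = left
    ET.SubElement(pair, "r").text = right
    return entry


for arrow in ("↔", "→", "←"):
    print(arrow, ET.tostring(bidix_entry("abella", "abeja", arrow), encoding="unicode"))
```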
Each word should be written in its representative form, which is defined in the configuration file (2). We then get a list of possible paradigms (3) for the current word. After choosing a paradigm, we get some inflected forms (4) of the word (again, the ones listed in the configuration file).
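A minimal Python sketch of this step, assuming that a paradigm is offered whenever the word ends with its idForm, that the root is the word without that ending, and that each inflected form is the root plus a flex ending. The three paradigms are copied from config.ca.xml; the function name candidates is invented for the example and the real matching may be more elaborate.

```python
# Three paradigms copied from config.ca.xml: (name, idForm, flex endings).
PARADIGMS = [
    ("abric__n", "", ["s"]),
    ("abell/a__n", "a", ["es"]),
    ("acadèmi/c__adj", "c", ["cs", "ca", "ques"]),
]


def candidates(word):
    """Yield (paradigm name, root, inflected forms) for each paradigm whose idForm ends the word."""
    for name, id_form, flexes in PARADIGMS:
        if word.endswith(id_form):
            # Slice by length so that an empty idForm leaves the word intact.
            root = word[: len(word) - len(id_form)]
            yield name, root, [root + ending for ending in flexes]


for name, root, forms in candidates("abella"):
    print(name, root, forms)
# abric__n abella ['abellas']
# abell/a__n abell ['abelles']
```

Both paradigms match "abella", and only the inflected forms reveal that abell/a__n is the right one, which is why the list of forms is shown to the user.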
After writing both words, the user can generate (5) the XML nodes that will be inserted into the dictionaries. If either word (or both) is already defined in its monolingual dictionary, no node is generated for it. If the translation is already defined in the bilingual dictionary, nothing is inserted.
Be aware that the logic of this step is very simple: it does not check whether the translation node should be tagged as alternative or directional; it only checks whether the translation exists in the dictionary. Be as cautious as you would be when editing the dictionaries directly. To help, the tool shows all the entries of the bilingual dictionary related to the two words.
Optionally, we can check whether the insertions are correct (7), and review and manually modify the insertion queries (8).
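The monolingual part of this step can be sketched as follows. This is a Python illustration with the standard library, not the tool's BaseX queries. It assumes that "already defined" means an <e> node with the same lm attribute exists, and that the generated node has the usual <e lm="..."><i>root</i><par n="..."/></e> shape; the function name make_entry and the tiny dictionary are made up for the example.

```python
import xml.etree.ElementTree as ET

# A toy monolingual dictionary with a single entry (illustrative only).
MONODIX = """<dictionary>
  <section id="main" type="standard">
    <e lm="abric"><i>abric</i><par n="abric__n"/></e>
  </section>
</dictionary>"""


def make_entry(dix, lemma, root, paradigm):
    """Return a new <e> node, or None if the lemma is already in the dictionary."""
    if any(e.get("lm") == lemma for e in dix.iter("e")):
        return None
    entry = ET.Element("e", {"lm": lemma})
    ET.SubElement(entry, "i").text = root
    ET.SubElement(entry, "par", {"n": paradigm})
    return entry


dix = ET.fromstring(MONODIX)
new = make_entry(dix, "abella", "abell", "abell/a__n")
print(ET.tostring(new, encoding="unicode"))
print(make_entry(dix, "abric", "abric", "abric__n"))  # None: already defined
```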
When all the queries are generated, we can insert (6) the words into the dictionaries. Nodes are appended to the first section of the dictionaries. The application alerts the user when the insertion is done.
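Appending to the first section could look roughly like this. The tool does it with queries against BaseX; the Python below swaps in the standard xml.etree module only to show the idea. The helper name append_to_first_section and the sample dictionary are assumptions.

```python
import xml.etree.ElementTree as ET

# A toy dictionary with two sections (illustrative only).
DIX = """<dictionary>
  <section id="main" type="standard"><e lm="abric"/></section>
  <section id="final" type="inconditional"/>
</dictionary>"""


def append_to_first_section(dix, entry):
    """Append an <e> node to the first <section> of a parsed dictionary."""
    section = dix.find("section")  # find() returns the first matching child
    if section is None:
        raise ValueError("dictionary has no <section>")
    section.append(entry)


dix = ET.fromstring(DIX)
append_to_first_section(dix, ET.Element("e", {"lm": "abella"}))
print([e.get("lm") for e in dix.find("section")])  # ['abric', 'abella']
```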
We can clear the form (9) if we like, or keep working on the existing data.
Finally, the user can export (10) the dictionaries. This will take a while. When this operation ends, the user will get download links for the updated dictionaries.
The user can close the session at any time (11). This frees the user identifier and deletes all uploaded dictionaries and configuration files. The user is warned before this happens, but for now anyone who has the identifier can do it.
Configuration file definition
The file should be named config.lang.xml, where lang is the language code (for example, config.ca.xml).
This is the current config.ca.xml. It's really simple and only handles 7 paradigms.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE splitter SYSTEM "config.dtd">
<splitter>
  <paradigms>
    <paradigm n="abric__n" idForm="" desc="Nom, Masculí (Abric)">
      <flex form="s"/>
    </paradigm>
    <paradigm n="ahir__adv" idForm="" desc="Adverbi (Ahir)"/>
    <paradigm n="abell/a__n" idForm="a" desc="Nom, Femení (Abella)">
      <flex form="es"/>
    </paradigm>
    <paradigm n="abander/ar__lex" idForm="ar" desc="Verb regular, 1ª conjugació (Abanderar)">
      <flex form="àssiu"/>
      <flex form="àveu"/>
    </paradigm>
    <paradigm n="abdominal__adj" idForm="" desc="Adjectiu, masculí/femení (Abdominal)">
      <flex form="s"/>
    </paradigm>
    <paradigm n="acci/ó__n" idForm="ó" desc="Nom, Femení (Acció)">
      <flex form="ons"/>
    </paradigm>
    <paradigm n="acadèmi/c__adj" idForm="c" desc="Adjectiu, masculí (Acadèmic)">
      <flex form="cs"/>
      <flex form="ca"/>
      <flex form="ques"/>
    </paradigm>
  </paradigms>
</splitter>
The <paradigm> node has the following attributes:
- n : name of the paradigm in the Apertium dix
- idForm : ending of the representative form of the paradigm. It is used for building the paradigm list and for getting the root of a word
- desc : short, simple description shown in the paradigm list. Useful for pairs with obscure paradigm names (like LAT_35 in eu)
The <flex> node represents a significant form of a paradigm. The forms should be as representative as possible, so that the user can switch paradigms on seeing a wrong inflection. They should also be as few as possible, because users tend to skip a long list. The node has only one attribute:
- form : the ending of the word
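To make the attributes concrete, here is a small Python sketch that reads a configuration file into plain dictionaries. It is not part of the tool (which uses PHP, BaseX and XSLT); the function name load_paradigms is made up, and the embedded XML is an abridged copy of config.ca.xml without the prolog.

```python
import xml.etree.ElementTree as ET

# Abridged copy of config.ca.xml (two of the seven paradigms, no prolog).
CONFIG = """<splitter>
  <paradigms>
    <paradigm n="ahir__adv" idForm="" desc="Adverbi (Ahir)"/>
    <paradigm n="acci/ó__n" idForm="ó" desc="Nom, Femení (Acció)">
      <flex form="ons"/>
    </paradigm>
  </paradigms>
</splitter>"""


def load_paradigms(xml_text):
    """Return one dict per <paradigm>, with its three attributes and its flex endings."""
    root = ET.fromstring(xml_text)
    return [
        {
            "n": p.get("n"),
            "idForm": p.get("idForm", ""),
            "desc": p.get("desc", ""),
            "flex": [f.get("form") for f in p.findall("flex")],
        }
        for p in root.iter("paradigm")
    ]


for paradigm in load_paradigms(CONFIG):
    print(paradigm["n"], repr(paradigm["idForm"]), paradigm["flex"])
```

A paradigm without <flex> children, such as ahir__adv, simply gets an empty list of forms.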
The application runs on PHP and uses BaseX as its XML database. It also uses a bit of XSLT.
TODO
- Insert all kinds of Apertium monolingual nodes.
- Sorted paradigm list for each word.
- Better awareness of conflicts in bilingual insertions.
- metadix support
The tool works with metadix, but won't handle the special attributes of the <par> node. The user can write the attributes manually, but there should be dedicated fields for them.
BUGS
Please report any bugs you find.