Easy dictionary maintenance

From Apertium

[edit] Introduction

This space will report developments in the project. It is also a space to post comments and suggestions.

Original Ideas
http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code
http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Easy_dictionary_maintenance
Original GSOC2010 Application
http://wiki.apertium.org/wiki/User:Alessiojr/Easy_dictionary_-_Application-GSOC2010
Student Information
Student: Alessio Miranda Junior
E-mail: alessio@inf.ufpr.br or alessio@alessiojr.com
Msn: msn@juninho.com.br
IRC: AlessioJr
GTalk: alessiojunin@gmail.com

[edit] Description

[edit] Abstract:

The idea is to develop a GUI tool to manage Apertium monolingual and bilingual XML files, with the following objectives:
  • Create an alternative way to edit dix files using GUI resources.
  • Initially handle monolingual dictionaries, while keeping the particular format of each file.
  • Minimize the direct manipulation of XML files by providing features that reduce this need.
  • Make use of DixTools to promote code reuse.

[edit] Why?

The number of language pairs in development for Apertium is increasing, and so is the complexity of these pairs. This increased complexity makes the work harder, so the need for tools for the task is evident. The proposed tool aims to make this management easier and will probably increase the likelihood that new language pairs are developed. With better tools, more people will be able to develop language pairs.

[edit] Who can use it?

I believe that the whole Apertium community will benefit, directly or indirectly. Directly, the developers of language pairs will have their task made easier: with a good tool to help with the work, creating or maintaining a language becomes easier and will probably take less time to give better results. Indirectly, users will benefit from these better and more robust results.

[edit] What is the plan?

  • We're planning to create a GUI with features that facilitate the common tasks of a user who wishes to manipulate an existing language pair or dictionary. These features will also be of great value to users who need an intuitive tool to start new language pairs.
  • DixTools, a tool developed for Apertium, already solves half of the problem with one main feature: it loads XML into memory and does the reverse, writing the XML back in a suitable format.
  • We believe the main challenge of this task is to find a way to expand DixTools by adapting the existing classes into a persistence layer connected to a framework for GUI applications, integrating the elements and providing tools to search, filter, integrate and change them.
  • The application is being developed for manipulating monolingual dictionaries, but its architecture must support future extensions (web and collaborative use) and bilingual dictionaries.

[edit] Development Report

[edit] What are we trying to build?

In short, the idea is to build interfaces that facilitate the manipulation of Apertium dictionaries. A parallel requirement is to build an extensible platform so that other developers can build plugins and enhance the platform in an easy and orderly way.

[edit] What are we using?

[edit] Development Paradigm:

(thumbnail)
Model-View-Controller concept. The solid line represents a direct association, the dashed an indirect association via an observer (for example).
  • Model–View–Controller (MVC) is a software architecture, currently considered an architectural pattern in software engineering. The pattern isolates "domain logic" (the application logic for the user) from input and presentation (GUI), permitting independent development, testing and maintenance of each.
The model layer is used to manage information and notify observers when that information changes. The model is the domain-specific representation of the data upon which the application operates. Domain logic adds meaning to raw data (for example, calculating whether today is the user's birthday, or the totals, taxes, and shipping charges for shopping cart items). When a model changes its state, it notifies its associated views so they can be refreshed. Many applications use a persistent storage mechanism such as a database to store data; in that case the model knows how to persist itself.
The view layer renders the model into a form suitable for interaction, typically a user interface element. Multiple views can exist for a single model for different purposes. A viewport typically has a one to one correspondence with a display surface and knows how to render to it.
The controller layer receives input and initiates a response by making calls on model objects. A controller accepts input from the user and instructs the model and viewport to perform actions based on that input.
An MVC application may be a collection of model/view/controller triplets, each responsible for a different UI element.
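
To make the three roles concrete, here is a minimal sketch of one model/view/controller triplet for a list of symbols. It is only an illustration and not code from DixToolsSuite: the class names SymbolModel, SymbolView and SymbolController are assumptions, and the view builds a plain string instead of drawing a Swing component.

```java
import java.util.ArrayList;
import java.util.List;

// Demo entry point: wires one model/view/controller triplet together.
class MvcDemo {
    public static void main(String[] args) {
        SymbolModel model = new SymbolModel();
        SymbolView view = new SymbolView();
        model.addListener(view);
        SymbolController controller = new SymbolController(model);

        controller.userTypedSymbol(" n ");
        controller.userTypedSymbol("adj");
        System.out.println(view.rendered);
    }
}

// Model (hypothetical): holds symbol names and notifies observers on change.
class SymbolModel {
    interface Listener {
        void symbolsChanged(List<String> symbols);
    }

    private final List<String> symbols = new ArrayList<>();
    private final List<Listener> listeners = new ArrayList<>();

    void addListener(Listener listener) {
        listeners.add(listener);
    }

    void addSymbol(String name) {
        symbols.add(name);
        // Observers get a copy so they cannot modify the model's state.
        for (Listener listener : listeners) {
            listener.symbolsChanged(new ArrayList<>(symbols));
        }
    }
}

// View (hypothetical): re-renders itself whenever the model reports a change.
class SymbolView implements SymbolModel.Listener {
    String rendered = "";

    @Override
    public void symbolsChanged(List<String> symbols) {
        rendered = "Symbols: " + String.join(", ", symbols);
    }
}

// Controller (hypothetical): turns raw user input into calls on the model.
class SymbolController {
    private final SymbolModel model;

    SymbolController(SymbolModel model) {
        this.model = model;
    }

    void userTypedSymbol(String raw) {
        String name = raw.trim();
        if (!name.isEmpty()) {
            model.addSymbol(name);
        }
    }
}
```

The view never calls the controller and the model knows its views only through the Listener interface, which is the indirect (dashed) association in the figure.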

[edit] Programming Language:

  • Java

[edit] Persistence:

  • XML (Apertium XML Files)
  • Database: JavaDB or Postgres (now supported but disabled)

[edit] Frameworks, APIs:

  • DixTools
A package of Java console tools that help in the development of Apertium XML files.
(thumbnail)
Basic JPA class structure.
  • JPA (Java Persistence API)
JPA simplifies the entity persistence model and adds new capabilities. Now developers can directly map the persistence object (POJO classes) with the relational database. The Java Persistence API has standardized the object-relational mapping technique. You can use JPA in your swing applications or web based applications.
  • JPA supports pluggable, third party persistence providers such as Hibernate and Toplink
  • JPA application can run outside the container also. So, developers can use JPA capabilities in desktop applications also
  • No need to write deployment descriptors. Annotations based meta-data are supported in JPA applications
  • Annotations defaults can be used in model class, which saves a lot of development time
  • Provides cleaner, easier, standardized object-relational mapping
  • JPA supports inheritance, polymorphism, and polymorphic queries.
  • JPA also supports named (static) and dynamic queries.
  • JPQL (the Java Persistence Query Language, successor to EJB QL) is a powerful query language provided by JPA
  • JPA helps you build a persistence layer that is vendor neutral and any persistence provider can be used
  • Netbeans Platform
Netbeans Platform Reference
The NetBeans Platform is a generic framework for commercial and open source desktop Swing applications. It provides the “plumbing” that you would otherwise need to write yourself, such as the code for managing windows, connecting actions to menu items, and updating applications at runtime. The NetBeans Platform provides all of these out of the box on top of a reliable, flexible, and well-tested modular architecture. In this refcard, you are introduced to the key concerns of the NetBeans Platform, so that you can save years of work when developing robust and extensible applications.
The key benefits of the NetBeans Platform:
  • Open source
  • Multiplatform
  • Modular architecture.
  • Reliance on the Swing UI toolkit in combination with "Matisse" GUI Builder.
  • Designed with the idea that Software should be re-usable.
  • Generic Desktop framework
  • NetBeans platform provides the basic underpinning
  • NetBeans platform is a set of frameworks built into a single integrated software
    • Collection of libraries
    • Swing Extensions
    • NetBeans platform toolkit
  • Modules, modules and some more modules.
  • Modular architecture gives extensibility and helps to maintain the compatibility

[edit] How To?

[edit] Prototype 1 - Refactor and First Release

(thumbnail)
New DixTools Architecture
(thumbnail)
New Integrated Architecture
(thumbnail)
DixToolsSuite Components Architecture

[edit] Timeline

Week | Stage | Description
1, 2 | Analysis of technology for in-memory handling | Investigate and select an effective way to view and manipulate the Apertium XML files in memory using Java. Analyse the technologies that best complement the functionality of DixTools during XML manipulation: maybe a database integration, trying VTD-XML, or extending the DixTools classes. Test and choose the best alternative.
2, 3 | Development of first prototype | Develop an interface that offers a core set of features such as load, save, list, search and filter elements.
Prototype Milestone 1

[edit] Month Activities

  • Refactor Apertium-DixTools:
    • Separate Model Classes (Java Beans) from Control Classes into a new Jar Pack.
    • Integrate Java Beans With JPA Features
    • Write code to import/export XML to the database
    • First prototype
    • Test features
    • Integrate with the platform
    • First CRUD prototype with sdefs

[edit] Prototype 1

It is called DixToolsSuite and currently uses an embedded version of a Java database, so no database needs to be installed. For now it may be slow to import large dictionaries (I will fix this later; performance will be better with real database systems such as Postgres). On my PC, importing a 2 MB dictionary takes about 3 minutes.

[edit] How to Use
  • Operation:
    • Import at least one dix file. (For the first few times, test with small dictionaries.)
    • Open the Project window. There is a combo box from which a dix can be selected; then click Open.
    • The fields are filled in.
    • To delete all dictionaries, click on Reset / (Delete all).
[edit] Download Link

The prototype is available in SVN. It is an installer with versions for Windows and Linux (V0.3). The full project source, developed with NetBeans 6.9, is also available. Prototype Link

[edit] Features
  • Import/export dix files
  • Select an Imported Dix.
  • Show/Edit/Delete Symbols (Sdefs)
  • Show some statistics

[edit] TODO: Problems to Fix

  • (Hard) Implement management of multiple dictionaries in the database (left, right, bilingual)
  • (Hard) Automatic fix when the same dix file is imported again
  • (Hard) Improve the performance of the internal classes when saving a big dictionary
  • (Medium) Flexibility to choose among JavaDB, MySQL and Postgres
  • (Medium) Improve the bean classes with lazy loading from the database
  • (Medium) DicManager - improve the dictionary manager
  • (Medium) Use a test framework for development
  • (Easy) Improve user messages and progress dialogs
  • (Easy) Create a default interface configuration
  • (Easy) Fix GUI bugs
    • Auto-configure the interface when a dix file is opened

[edit] Prototype 2 - Implementing Real Functionalities

[edit] Timeline

Week | Stage | Description
5, 6 | Simple Structures | Implementation of symbols, alphabet and statistics features. Needs drawing experiments to design the user interface.
7 | Paradigms | First implementation of features with paradigms. Needs drawing experiments to design the user interface.
8 | Lemmas | First implementation of features with lemmas. Needs drawing experiments to design the user interface.
Prototype Milestone 2 | Version for testing with huge dictionaries and a complete editing test with basic features.

[edit] Month Objectives

  • Create Beta Interfaces
    • For Lemma Show/Edit/Remove
    • For Paradigm Show/Edit/Remove

[edit] Real Actions

  • The underlying support is already done.
  • Now the interface needs to be created.
  • Improve: filter, sort and interaction tools
  • It was very difficult to understand how to get the information needed to recreate words and paradigms.

[edit] TODO

  • Next month: make CRUD work.


[edit] Prototype 3 - Real Driver Test

[edit] Timeline

Week | Stage | Description
9 | Paradigms | With feedback from the community, adjust the interface and implementation, and probably add new features.
10 | Lemmas | With feedback from the community, adjust the interface and implementation, and probably add new features.
11 | Pre-Release | Buffer time to improve integration functionality.
Prototype Release Candidate
12 | Makeup | Fix remaining bugs, final adjustments and documentation on the wiki.
Final Release

[edit] Real Actions

[edit] TODO

[edit] Requirements

[edit] Install

  • Linux Installer (Version 0.7.4) can be found in:

https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2010/alessiojr/DicsToolsSuite/Installer/

[edit] To Compile

  • Netbeans 6.9
    • Open dicsElementBeans and Build
    • Open Project DicsToolsSuite and Build
    • Then just run DicsToolsSuite

[edit] To Run

  • About 700 MB of RAM is needed to load 3 big dictionaries

[edit] Real Result

[edit] About the refactoring of Dixtools

  • First, the original DixTools project was divided into two smaller projects. I decided on this separation so that independent projects that use the basic classes can reuse the code, and so that work can be planned across projects.
    • DicsElementsbeans Project - This project contains the model classes for the XML elements and the classes related to loading them into memory and persisting them.
    • DixTools Project - This project has the other features developed in DixTools and depends on the dicsElementsBeans project.
  • DicsElementsBeans received an update that took longer than I expected to become stable. In addition to loading the objects into memory, it can now save these objects in a database and retrieve them without loss of information using JPA.
    • Persisting an object to the database and loading it back follows the standard JPA conventions. In addition, a method was added to all objects that persists the object and everything it contains recursively:
      dictionary.persisteAll(em);
    • Validation tests can be done by importing an XML file, saving it, closing the application, loading it again and exporting the XML. The user can then compare the two files with the diff command.
      See PersistenceTest.java on DicsElementsbeans <test Packages>
    • Each bean class has been refactored to support this kind of treatment. Data types received small changes, taking care not to break functionality that already existed.
      • As a consequence, several DixTools classes received minor updates, such as:
        • Encapsulation of variables that were public
        • Changing variable types from "ArrayList" to "List", because JPA requires collection-valued fields to be declared with a collection interface type rather than a concrete class.
    • Persistence tests were made with Postgres and JavaDB. JPA is independent of the database, but some features of DixTools complicate this advantage. The code is prepared to use JavaDB; when using Postgres, minor changes are needed. I found no way to automate these changes.
      • In dics.elements.beans.DixElements.java the CLOB fields have to be changed to Text fields, replacing columnDefinition="CLOB" with columnDefinition="Text":
        @Lob @Column(name = "processingComments", columnDefinition="Text")
  • This step has been completed, but it came late in development and, because of the short time, there were problems using it in the application. The amount of information created from an XML file is huge, which prevents the use of an easily installed database. Postgres showed good performance but requires prior installation.
    • Database Advantages (Future)
      • In a collaborative web application, dictionaries can be stored and manipulated by multiple users simultaneously, with concurrent changes handled by transactions.
      • Using database features you can create views, consolidated tables that would make much of the processing easier. Each view can be modeled for one type of feature, such as:
        • Identifying the most appropriate paradigms, listing and cross-referencing words, and tracking changes in words. Done in memory, this has a considerable performance cost.
  • We ran many performance tests. The results of experiments using real dictionaries with JavaDB and Postgres follow, as advantages and disadvantages.
    • JavaDB
      • Advantages
        • Easy Installation, can be sent along with the installer.
        • It does not cause impact to the novice user.
        • Recommended for quick installation and handling of small dictionaries.
      • Disadvantages
        • The performance was not satisfactory for average-sized dictionaries.
    • Postgres
      • Advantages
        • Great performance for manipulating elements.
        • It had very good results even with large dictionaries.
        • I believe it is well suited to a web application, where the dictionary can be stored for manipulation.
      • Disadvantages
        • Installation is complicated for novice users.
        • The only process that is slower than manipulation in memory is to save the whole dictionary in the database.
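
The validation test described above (import an XML file, save, reload, export, compare) can be sketched as follows. This is a hedged illustration, not PersistenceTest.java: it swaps the JPA/database step for an in-memory serialize-and-reparse cycle using only the standard Java XML APIs, and the class name RoundTripCheck and the sample dictionary are assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: checks that a dictionary survives a save/load cycle.
class RoundTripCheck {

    // "Import": parse the XML text into an in-memory DOM tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // "Export": write the in-memory tree back out as XML text.
    static String serialize(Document doc) throws Exception {
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // True when the reloaded tree is structurally equal to the original.
    static boolean survivesRoundTrip(String xml) {
        try {
            Document original = parse(xml);
            Document reloaded = parse(serialize(original));
            return original.isEqualNode(reloaded);
        } catch (Exception ex) {
            throw new IllegalStateException("Round trip failed", ex);
        }
    }

    public static void main(String[] args) {
        String dix = "<dictionary><sdefs><sdef n=\"n\" c=\"Noun\"/></sdefs></dictionary>";
        System.out.println(survivesRoundTrip(dix));
    }
}
```

Comparing trees rather than text ignores differences in formatting, which a plain diff of the two files would report.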

[edit] About the structure of the database

  • The ER structure of the tables follows the DixTools class structure and represents all possible states of the XML.
  • One other detail is that there is a table DixElement from which all elements inherit attributes. Each table ("E", "Dictionary", "Alphabet", etc.) has a reference to DixElements with the generic attributes.
  • To handle the elements in a simple way, it is not necessary to know the structure of the database or to use SQL; just use the classes of dicsElementsBeans.

[edit] DicsToolsSuite Structure

  • The DicsToolsSuite project was built on the NetBeans Platform and is premised on the construction of independent modules (plugins) that communicate via the platform. A description of each module follows.
  • RSyntaxTextArea
    • Encapsulates the libraries related to RSyntaxTextArea
  • SwingxLib
    • Encapsulates the libraries related to SwingX
  • JavaDB Client Library
    • Encapsulates the libraries related to JavaDB and Postgres
  • DicsElementsBeansLibrary -
    • Encapsulates the libraries related to JavaDB and Postgres
    • This module also contains some utility classes.
      • DixLogger.java - Centralizes the logging features
      • Installer.jar - Contains the instructions executed first on each run.
      • JavaDBSupport - Setup for the embedded JavaDB.
  • DixDbController -
    • Actions related to opening and closing the dictionary database. (Disabled)
  • Services - Classes responsible for the basic control of the dictionaries. The main ones are the interface DixServicesEventListener, which lists the events for the controller, and the class DixController, which does the actual control of the dictionaries.
  • BilingualEditor - GUI module, functionally described below.
  • DixEditor - GUI module, functionally described below.
  • ProjectStatus - GUI module, functionally described below.
  • SdefEditor - GUI module, functionally described below.
  • SdefsViewer - GUI module, functionally described below.
  • WordEditor - GUI module, functionally described below.
  • WordList -

[edit] How it Works

[edit] In-Action Videos

  • TODO: I will put videos demonstrating the operation and how things work and how to do basic activities.

[edit] Start Screen

(thumbnail)
Skeleton of the interface and its areas

[edit] 1 - Selected Dictionaries:

  • This area is where the user loads the dictionaries to be handled.
  • This screen shows the status of the dictionaries being handled.
  • It is important that the user imports the bilingual and monolingual dictionaries into their respective slots.
  • To load a dictionary, click on Load.
  • After manipulating a dictionary, click on Export.

[edit] 2 - Dictionaries Status

  • This area contains three tabs. Each has the details of the corresponding loaded dictionary.

[edit] 3 - Components Area

  • This area contains the modules that are available for manipulating the loaded dictionaries.
  • Each module will be described in the following screenshots.

[edit] 4 - Communication area: Display area for error messages and user information

[edit] 5 - Toolbar and MemoryStatus: Displays the memory used by the application

[edit] 6 - Status Bar: Area where processing messages are displayed

[edit] 7 - Database Area: Disabled temporarily, all manipulation is occurring in memory.

[edit] Load/Export Panel

(thumbnail)
Load/Export Panel

[edit] Features

  • Show whether and which dictionary is loaded for manipulation and its respective slot.
  • The Load button loads the dictionary to memory.
  • The Export button saves the changes made in memory into a new text file.

[edit] Bugs

  • <No reports>

[edit] ToDo

  • DixTools has a lot of options for formatting a dictionary when saving it to a dix file.
  • I need to know which are useful and how to describe them to the user:
    • STD_NONALIGNED_XML =
    • STD_ALIGNED_BIDIX =
    • STD_ALIGNED_MONODIX =
    • STD_ALIGNED = STD_ALIGNED_BIDIX;
    • STD_COMPACT=
    • STD_1_LINE =
    • STD_NOW_1_LINE =

[edit] Details/Statistics of dictionaries

(thumbnail)
Details/Statistics of dictionaries

[edit] Features

  • Three tabs are displayed
  • Each of these tabs describes characteristics of the dictionaries loaded.
  • Simple statistics are presented.
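
As an illustration of the kind of simple statistics such a tab could show, the sketch below counts symbol definitions, paradigms and section entries in a small dix document. Which figures DixToolsSuite actually displays is not stated here, so the three counters and the class name DixStats are assumptions; the element names (sdef, pardef, section, e) are those of the Apertium dictionary format.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: derives a few counts from a dix document.
class DixStats {

    static Map<String, Integer> count(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Map<String, Integer> stats = new LinkedHashMap<>();
            stats.put("symbols", doc.getElementsByTagName("sdef").getLength());
            stats.put("paradigms", doc.getElementsByTagName("pardef").getLength());

            // Paradigms also contain <e> elements, so dictionary entries
            // are counted only inside <section> elements.
            int entries = 0;
            NodeList sections = doc.getElementsByTagName("section");
            for (int i = 0; i < sections.getLength(); i++) {
                Element section = (Element) sections.item(i);
                entries += section.getElementsByTagName("e").getLength();
            }
            stats.put("entries", entries);
            return stats;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not read dictionary", ex);
        }
    }

    public static void main(String[] args) {
        String dix =
            "<dictionary>"
          + "<sdefs><sdef n=\"n\"/><sdef n=\"sg\"/><sdef n=\"pl\"/></sdefs>"
          + "<pardefs><pardef n=\"house__n\">"
          + "<e><p><l/><r><s n=\"n\"/><s n=\"sg\"/></r></p></e>"
          + "<e><p><l>s</l><r><s n=\"n\"/><s n=\"pl\"/></r></p></e>"
          + "</pardef></pardefs>"
          + "<section id=\"main\" type=\"standard\">"
          + "<e lm=\"house\"><i>house</i><par n=\"house__n\"/></e>"
          + "<e lm=\"cat\"><i>cat</i><par n=\"house__n\"/></e>"
          + "</section>"
          + "</dictionary>";
        System.out.println(count(dix));
    }
}
```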

[edit] Bugs

  • <No Reports>

[edit] ToDo

  • Are there any suggestions for information to be displayed?
  • In the future, some statistics about the file in SVN could be displayed.

[edit] DixViewer Component

(thumbnail)
DixViewer Component

[edit] Features

Three types of visualization were developed for this module:

  • Simple Text (Basic):
    • Shows the XML without highlighting, as in a simple text editor.
  • Simple Syntax (Basic + Syntax)
    • Similar to Simple Text, but with XML syntax highlighting.
  • Viewer with extra features (see the figure):
    • The editor displays line numbers. (Some components show the line number related to the word or item.)
    • Search method.
    • Each of these editors has three tabs to choose the related dictionary.

[edit] Bugs

  • <No Reports>

[edit] ToDo

  • Three types of XML display were developed; the first two probably have little use.
  • One idea would be to use the NetBeans components for XML editing.
  • Autocomplete and other more advanced editing features are desired.
  • It is only a viewer for now; a live link between the XML and the data in memory, with a more effective display, would be possible.
  • The text could be synced with the data in memory more frequently, but several validations would have to be made.

[edit] Bilingual Editor

(thumbnail)
Bilingual Editor

[edit] Features

  • The main goal of Bilingual Explorer:
    • Be an interface for viewing the entries of bilingual translations
    • Provide a simplified form for editing entries and adding new ones.
  • When the bilingual dictionary is loaded, the corresponding list of words in the dictionary is displayed automatically.

The columns are:

  • Line Number - Displays the line number of this word in the XML.
  • Left lemma - Word associated to the left dictionary
  • Right lemma - Word associated to the right dictionary
  • Direction - If there is a restriction in Direction(must?)
  • Author - Author of that entry

[edit] Bugs

[edit] ToDo

  • I believe that showing some information about tags is important in the bilingual dictionary.
    • Following Jim's ideas, we could show some key tags in the form of their comment. If <s n="n"/> exists, we can show "Noun".
    • Viewing these details would be optional in the interface.
    • In the figure, where "íman" is written, it would say "íman<s n="n"/>" or "íman<Noun>".
      • "Noun" is the comment associated with the sdef "n"
  • Editing the Bilingual Entry
    • The left lemma and right lemma can be obtained through an autocomplete feature that searches for the lemma in the left and right dictionaries, which could keep them consistent. Should we search only lemmas, or surface forms too? It may be necessary to stop adding to the list of sdefs.
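
The tag display idea above can be sketched as a small text substitution: every <s n="..."/> tag is replaced by the comment of the matching sdef. This is only an illustration under assumptions: the class name TagLabels and the lookup map are hypothetical, and the map stands in for the comments read from the dictionary's sdef elements. The word "íman" is written as "\u00edman" to keep the source ASCII-only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: shows sdef comments instead of raw symbol tags.
class TagLabels {

    // Matches tags such as <s n="n"/> and captures the symbol name.
    private static final Pattern S_TAG =
            Pattern.compile("<s n=\"([^\"]+)\"\\s*/>");

    static String humanize(String entry, Map<String, String> sdefComments) {
        Matcher matcher = S_TAG.matcher(entry);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1);
            // Fall back to the symbol name when the sdef has no comment.
            String label = sdefComments.getOrDefault(name, name);
            matcher.appendReplacement(
                    out, Matcher.quoteReplacement("<" + label + ">"));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> comments = new HashMap<>();
        comments.put("n", "Noun");
        System.out.println(humanize("\u00edman<s n=\"n\"/>", comments));
    }
}
```

Falling back to the raw symbol name keeps entries readable even when a dictionary defines a symbol without a comment.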

[edit] WordList Viewer Editor

(thumbnail)
WordList Viewer Editor

[edit] Features

  • Its purpose is to allow the user to search all the forms recognized by a dictionary.
  • The current implementation displays the following columns:
    • Surface Form -
    • Lemma -
    • Paradigm -

[edit] Bugs

[edit] ToDo

  • I would like to add an informative column showing the tags generated when identifying the surface form, even if some users do not understand them. I think swapping the tags for their comments would be an interesting idea.
  • Methods to insert or edit a word must be provided via other interfaces.

[edit] Symbols Viewer/Editor

(thumbnail)
Symbols Viewer/Editor

[edit] Features

  • Show/Edit/Add Symbols Definitions

[edit] Bugs

[edit] ToDo

  • Change the Interfaces



[edit] Add / Edit Monolingual Words

(thumbnail)
Add / Edit Monolingual Words

[edit] A Lot To Do

[edit] Features

  • Allow the user to insert new words into a monolingual dictionary
  • Scenario 1:
    • Step 1: The user searches the WordList and does not find a given word.
    • Step 2: The user looks in the WordList for any word that is inflected similarly.
    • Step 3: In the interface for adding words, the user enters the root of the word and selects the paradigm that fits best.
    • Step 4: The user checks in the grid whether the word fits the paradigm and clicks Add new word.
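
The grid check in Step 4 amounts to expanding the chosen paradigm over the entered root, so the user can see every form the new word would produce. Below is a minimal sketch, assuming a paradigm is simply a list of endings with their tags; the names ParadigmPreview and Ending are hypothetical, and real paradigms may also modify the lemma, which this simplification ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: previews the forms a root gets from a paradigm.
class ParadigmPreview {

    // One line of a paradigm: the ending to append and the tags it yields.
    static class Ending {
        final String suffix;
        final String tags;

        Ending(String suffix, String tags) {
            this.suffix = suffix;
            this.tags = tags;
        }
    }

    // Builds one grid row per ending: "surface form = analysis".
    static List<String> preview(String root, List<Ending> paradigm) {
        List<String> rows = new ArrayList<>();
        for (Ending ending : paradigm) {
            String surface = root + ending.suffix;
            // Simplification: the lemma is assumed to equal the root.
            rows.add(surface + " = " + root + ending.tags);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Ending> nounParadigm = Arrays.asList(
                new Ending("", "<n><sg>"),
                new Ending("s", "<n><pl>"));
        for (String row : preview("cat", nounParadigm)) {
            System.out.println(row);
        }
    }
}
```

If a generated form looks wrong in the preview, the user can pick a different paradigm before anything is written to the dictionary.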

[edit] Bugs

  • ReDesign the Interface
  • Show more details of the Generated Surfaces Forms.

[edit] ToDo

  • Create template types for inserting new words in bilingual and monolingual dictionaries
  • Create new templates / scenarios by analyzing Jim's specifications more closely.

I will detail these scenarios later.

  • Scenario 2:
    • Step 1: First, we ask them what kind of word: noun (ball, cat), adjective (big, small), verb (eat, sleep), adverb (quickly, yesterday) [Other categories exist, but *should* have been taken care of by the linguist - they can be ignored]
    • Step 2: Next, once the user has chosen the type of word, we let them enter the words: Source: |________| Target: |________|
    • Step 3:

[edit] Conclusion and Plans

[edit] Results (My Opinion)

  • The final product delivered did not meet all of the project's initial expectations, including my own.
  • Looking at the software delivered, its functions are the basics of the software proposed. I believe it is possible to complete it with the desired features from this point, but I could not organize my time to do so.
  • I ran several tests to reduce bugs in the features delivered.
  • The new features detailed by e-mail were started but were not stable enough to be considered delivered.

[edit] Causes of Problems

  • I lost a lot of time on database support, which caused problems for on-time delivery. The feature was even removed, but I believe it will be very useful in the near future.
  • We also lost a lot of time laying the groundwork for using the NetBeans Platform, but this effort was worthwhile. It gives stability and control when creating new features. It also provides a range of facilities and APIs for creating better designs.
  • I lacked the ability to deal with conflicting opinions.
  • I often did work that I should not have done, causing rework and loss of time.
  • Communication by e-mail was not effective. I believe I did not ask the right questions and did not understand the answers, which caused serious communication problems.
  • Personal problems, which were complicated for me but do not justify the mistakes.

[edit] Post Job

  • The project was not completely finished and had failures, but I am willing to continue building and evolving the ideas.
  • I believe it is not disposable; it just was not finished.
  • Features that I'll finish:
    • Interfaces proposed by Jim to add words directly from the bilingual dictionary and check the correspondences in the monolingual dictionaries.
    • A simple interface to add paradigms.
    • Improved interfaces to browse the paradigms best suited to a root.

[edit] What's the GSOC for me

I agree that my performance was lacking, but I really struggled and wanted to do the best I could. I spent a great deal of time on the project; I had no vacation and was always involved, studying, reading and understanding code, APIs and techniques. Sometimes inspiration for developing and having good ideas is more important than anything. I think I was imperfect: I often got stuck trying to resolve things I could not, and did not share them.
Despite the problems, it was one of the most interesting experiences I have ever had, studying alone structures that I did not know so well. It was my first experience contributing directly to the development of free software, and one of the greatest difficulties was developing software alone. I have always been used to working in development teams, where the work does not stop. Even in academic research, discussions always push me forward, and this was not reflected in the project.
In the end, I am aware that I worked a lot, but I should have worked harder. The only flaw that I can share with the mentors is this:
I misunderstood the first conversations. I tried to focus mainly on conversations with Mikel, but it was hard to find him online, and when I could not find him I did not talk with the others, trying to solve everything myself. The problem was not Mikel; it was that I tried to solve things by myself. I only understood this in the end. Human things.
