Documentation of Matxin 1.0
General architecture
The objectives are to be open source, to be interoperable with various systems, and to stay in tune with the development of Apertium and FreeLing. To this end, FreeLing is used for the analysis of Spanish (as it gives a deeper analysis than Apertium's), and the transducers from Apertium are used in transfer and generation.
The design is based on the classic transfer architecture of machine translation, with three basic components: analysis of Spanish, transfer from Spanish to Basque, and generation of Basque. It builds on previous work of the IXA group on a prototype of Matxin and on the design of Apertium. Two modules are added on top of the basic architecture, de-formatting and re-formatting, which preserve the format of the texts being translated and allow surf-and-translate.
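The five-module arrangement described above can be pictured as a simple chain in which each module consumes the output of the previous one. The following is only a sketch: the stage functions are placeholders that record their own name, and `make_pipeline` and `tagging_stage` are hypothetical names, not part of Matxin.

```python
from functools import reduce
from typing import Callable, List

Stage = Callable[[list], list]

# Module names taken from the description above, in processing order.
STAGE_NAMES = ["de-formatter", "analysis", "transfer", "generation", "re-formatter"]


def tagging_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it was called."""
    def stage(trace: list) -> list:
        return trace + [name]
    return stage


def make_pipeline(stages: List[Stage]) -> Stage:
    """Chain the stages so that each one receives the previous one's output."""
    def run(data: list) -> list:
        return reduce(lambda current, stage: stage(current), stages, data)
    return run


pipeline = make_pipeline([tagging_stage(name) for name in STAGE_NAMES])
trace = pipeline([])
print(" -> ".join(trace))
```

In a real system each placeholder would be replaced by the corresponding module, with the interchange format as the data passed between them.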
According to the initial design, no semantic disambiguation is performed; however, the lexica include a number of named entities, collocations and other multiword terms, which makes this less important.
As the design is object-oriented, three main objects were defined: sentence, chunk and node. The chunk can be thought of as a phrase, but always as it is output by the analyser, and the node corresponds to a word, bearing in mind that a multiword unit is also treated as a single node.
There follows a short description of each stage.
De-formatter and re-formatter
Analyser
The dependency analyser has been developed by the UPC and has been added to the existing modules of FreeLing (tokeniser, morphological analysis, disambiguation and chunking).
The analyser is called Txala, and it annotates the dependency relations between nodes within a chunk and between chunks in a sentence. In the output format (see section 3) this information is given indirectly: rather than being stored in specific attributes, it is expressed implicitly by the hierarchy of the tags (for example, a node element nested inside another means that the inner node depends on the outer one).
As well as adding this functionality, the output of the analyser has been adapted to the interchange format which is described in section 3.
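The idea of reading dependencies off the nesting of tags can be sketched as follows. The element names (`SENTENCE`, `CHUNK`, `NODE`), the attributes (`type`, `lem`) and the attachments in the sample are assumptions made for illustration; the actual interchange format is the one described in section 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: tag names, attributes and attachments are illustrative only.
SAMPLE = """
<SENTENCE>
  <CHUNK type="verb_chain">
    <NODE lem="comer">
      <NODE lem="haber"/>
      <NODE lem="tener"/>
      <NODE lem="que"/>
    </NODE>
    <CHUNK type="subordinate_conjunction">
      <NODE lem="porque"/>
    </CHUNK>
    <CHUNK type="noun_chain">
      <NODE lem="patata"/>
    </CHUNK>
  </CHUNK>
</SENTENCE>
"""


def nested_pairs(root, tag, attr):
    """Return (dependent, head) pairs: an element directly inside another
    element with the same tag is taken to depend on it."""
    return [
        (child.get(attr), parent.get(attr))
        for parent in root.iter(tag)
        for child in parent.findall(tag)
    ]


root = ET.fromstring(SAMPLE.strip())
node_deps = nested_pairs(root, "NODE", "lem")
chunk_deps = nested_pairs(root, "CHUNK", "type")
print(node_deps)
print(chunk_deps)
```

No dependency attribute is stored anywhere in the sample; both lists are recovered purely from which element contains which.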
Information from the analysis
The result of the analysis is made up of three elements or objects (as previously described):
- Nodes: These tag words or multiwords and have the following information: lexical form, lemma, part-of-speech, and morphological inflection information.
- Chunks: These correspond to (pseudo-)phrases and carry the chunk type, syntactic information and the dependencies between their nodes.
- Sentence: This gives the type of sentence and the dependencies between its chunks.
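The three objects listed above could be modelled along the following lines. This is a minimal sketch, not Matxin's actual class design: the field names, the split of the tag into `pos` and `morph`, the sentence type and the attachments in the example are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A word or multiword unit."""
    form: str                      # lexical form
    lemma: str
    pos: str                       # part of speech
    morph: str = ""                # morphological inflection information
    children: List["Node"] = field(default_factory=list)   # dependent nodes


@dataclass
class Chunk:
    """A (pseudo-)phrase as produced by the analyser."""
    type: str
    head: Node
    syntactic_info: str = ""
    children: List["Chunk"] = field(default_factory=list)  # dependent chunks


@dataclass
class Sentence:
    type: str
    chunks: List[Chunk] = field(default_factory=list)      # top-level chunks


def lemmas(node: Node) -> List[str]:
    """Collect the lemma of a node followed by those of its dependents."""
    return [node.lemma] + [l for child in node.children for l in lemmas(child)]


# Illustrative instance; which node heads the verb chain is an assumption.
comer = Node("comer", "comer", "vm", "n", children=[
    Node("habré", "haber", "va", "if1s"),
    Node("tenido", "tener", "vm", "pp0sm"),
    Node("que", "que", "s"),
])
noun_chain = Chunk("noun_chain", Node("patatas", "patata", "nc", "fp"))
verb_chain = Chunk("verb_chain", comer, children=[noun_chain])
sentence = Sentence("subordinate", [verb_chain])  # placeholder sentence type
print(lemmas(sentence.chunks[0].head))
```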
Example
For the phrase,
- "porque habré tenido que comer patatas"
The output would be made up of the following chunks:
- subordinate_conjunction: porque[cs]
- verb_chain: haber[vaif1s]+tener[vmpp0sm]+que[s]+comer[vmn]
- noun_chain: patatas[ncfp]
Transfer
In the transfer stage, the same objects and interchange format are maintained. The transfer stages are as follows:
- Lexical transfer
- Structural transfer in the sentence
- Structural transfer within the chunk
Lexical transfer
Firstly, the lexical transfer is done using part of a bilingual dictionary provided by Elhuyar, which is compiled into a lexical transducer in the Apertium format.
Structural transfer within the sentence
Owing to the different syntactic structure of the phrases in each language, some information is transferred between chunks, and chunks can be created or removed.
In the previous example, during this stage, the information for person and number of the object (third person plural) and the type of subordination (causal) are introduced into the verb chain from the other chunks.
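The movement of information between chunks described for this example can be sketched with plain dictionaries. The dictionary layout and the feature names (`subject`, `object`, `subordination`, `role`) are hypothetical and chosen only for illustration; the function copies the object's person and number and the subordination type into the verb chain and drops the conjunction chunk.

```python
import copy

# Hypothetical representation of the analysed chunks of the example phrase.
CHUNKS = [
    {"type": "subordinate_conjunction", "lemma": "porque",
     "features": {"subordination": "causal"}},
    {"type": "verb_chain", "lemma": "comer",
     "features": {"subject": "1s"}},
    {"type": "noun_chain", "lemma": "patata",
     "features": {"role": "object", "person": "3", "number": "p"}},
]


def transfer_between_chunks(chunks):
    """Copy agreement and subordination information into the verb chain.

    Returns a new list; the conjunction chunk is removed once its
    information has been moved.
    """
    chunks = copy.deepcopy(chunks)
    verb = next(c for c in chunks if c["type"] == "verb_chain")
    kept = []
    for chunk in chunks:
        feats = chunk["features"]
        if chunk["type"] == "subordinate_conjunction":
            verb["features"]["subordination"] = feats["subordination"]
            continue  # the chunk itself is not kept
        if chunk["type"] == "noun_chain" and feats.get("role") == "object":
            verb["features"]["object"] = feats["person"] + feats["number"]
        kept.append(chunk)
    return kept


result = transfer_between_chunks(CHUNKS)
print(result[0]["features"])
```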
Structural transfer within the chunk
This process is complex for verb chains and simpler for noun chains. A finite-state grammar (see section 4) has been developed for verb transfer.
Initially, the plan was to compile the grammar with the Apertium dictionary tools or with the free-software FSA package. This turned out to be untenable, so the grammar was converted into a set of regular expressions that are read and processed by a standard program, which also handles the transfer of noun chains.
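A grammar expressed as regular expressions and applied by a generic program might look roughly like the sketch below. The single rule, the output template and the one-entry lexicon are invented for illustration and are not taken from the Matxin grammar; object agreement and the subordination marker, which come from the sentence-level transfer, are left out.

```python
import re

# One hypothetical rule: "haber (future) + tener (participle) + que + infinitive".
RULES = [
    (
        re.compile(
            r"^haber\[vaif(?P<pers>[123])(?P<num>[sp])\]"
            r"\+tener\[vmpp0sm\]\+que\[s\]\+(?P<verb>\w+)\[vmn\]$"
        ),
        "{verb}(main) [partPerf] / behar(per) [partPerf] / "
        "izan(dum) [partFut] / edun(aux) [indPres][subj{pers}{num}]",
    ),
]

# Toy stand-in for the result of lexical transfer.
LEXICON = {"comer": "jan"}


def transfer_verb_chain(chain):
    """Apply the first matching rule to a verb chain, or return None."""
    for pattern, template in RULES:
        match = pattern.match(chain)
        if match:
            fields = match.groupdict()
            fields["verb"] = LEXICON.get(fields["verb"], fields["verb"])
            return template.format(**fields)
    return None


print(transfer_verb_chain("haber[vaif1s]+tener[vmpp0sm]+que[s]+comer[vmn]"))
```

Keeping the rules as data and the matching loop as a small generic function mirrors the separation of grammar and program described above.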
Example
For the previously mentioned phrase:
- "porque habré tenido que comer patatas"
The output from the analysis module was:
- subordinate_conjunction: porque[cs]
- verb_chain: haber[vaif1s]+tener[vmpp0sm]+que[s]+comer[vmn]
- noun_chain: patatas[ncfp]
and the output from the transfer will be:
- verb_chain: jan(main) [partPerf] / behar(per) [partPerf] / izan(dum) [partFut] / edun(aux) [indPres][subj1s][obj3p]+lako[causal]
- noun_chain: patata[noun]+[abs][pl]
Generation
This stage also keeps the same objects and formats. The stages are as follows:
- Syntactic generation
- Morphological generation
Syntactic generation
The main job of syntactic generation is to re-order the words within a chunk as well as the chunks within a sentence.
The order inside the chunk is determined by a small grammar which specifies the element order inside Basque phrases and is expressed as a set of regular expressions.
The order of the chunks in the sentence is decided by a rule-based recursive process.
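A rule-based recursive ordering of chunks can be sketched as a traversal of the chunk dependency tree, where a rule table says on which side of its head each dependent chunk is placed. The rule table, the default side and the dictionary layout are assumptions for illustration. For readability the chunks carry their final surface strings, although in the process described here morphological generation comes after ordering.

```python
# Hypothetical rule table: (head chunk type, dependent chunk type) -> side.
ORDER_RULES = {("verb_chain", "noun_chain"): "before"}

TREE = {
    "type": "verb_chain",
    "text": "jan behar izango ditudalako",
    "children": [
        {"type": "noun_chain", "text": "patatak", "children": []},
    ],
}


def linearise(chunk, rules, default="before"):
    """Recursively order a chunk and its dependents into a flat list."""
    before, after = [], []
    for child in chunk.get("children", []):
        side = rules.get((chunk["type"], child["type"]), default)
        target = before if side == "before" else after
        # Each dependent is ordered internally before being placed.
        target.extend(linearise(child, rules, default))
    return before + [chunk["text"]] + after


ordered = linearise(TREE, ORDER_RULES)
print(" ".join(ordered))
```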
Morphological generation
Once the word order inside each chunk has been decided, generation proceeds from the last word in the chunk, using its own morphological information or that inherited from the transfer phase. This is because in Basque the inflectional information (case, number and other attributes) is normally assigned to the phrase as a whole and is added as a suffix to the last word of the phrase. In verb chains, morphological generation is also needed in other parts of the chain, in addition to the last word.
This generation is performed using a morphological dictionary generated by IXA from the EDBL database, which is compiled into a lexical transducer using the programs from Apertium and following their specifications and formats.
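The principle of inflecting only the last word of a chunk can be illustrated with a toy example. The system described here uses a compiled lexical transducer for this step; the suffix table and the vowel rule below are a deliberately tiny stand-in, and the function names are hypothetical.

```python
# Toy suffix table: (case, number) -> suffix. Not the real morphological dictionary.
SUFFIXES = {
    ("abs", "sg"): "a",
    ("abs", "pl"): "ak",
}


def attach(stem, suffix):
    """Join stem and suffix, merging a stem-final 'a' with a suffix-initial 'a'."""
    if stem.endswith("a") and suffix.startswith("a"):
        return stem + suffix[1:]
    return stem + suffix


def generate_chunk(words, case, number):
    """Inflect only the last word of the chunk; earlier words stay unchanged."""
    *initial, last = words
    return initial + [attach(last, SUFFIXES[(case, number)])]


print(generate_chunk(["patata"], "abs", "pl"))
print(generate_chunk(["etxe", "handi"], "abs", "pl"))
```

In the second call the case and number belong to the whole phrase, yet only "handi" receives the suffix, which is the behaviour described above.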
Example
For the previously given phrase, the stages are:
- "porque habré tenido que comer patatas"
The output from analysis is:
- subordinate_conjunction: porque[cs]
- verb_chain: haber[vaif1s]+tener[vmpp0sm]+que[s]+comer[vmn]
- noun_chain: patatas[ncfp]
The output from the transfer stage is:
- verb_chain: jan(main) [partPerf] / behar(per) [partPerf] / izan(dum) [partFut] / edun(aux) [indPres][subj1s][obj3p]+lako[causal]
- noun_chain: patata[noun]+[abs][pl]
After the generation phase, the final result will be:
- "patatak jan behar izango ditudalako"[1]
Although the details of the modules and the linguistic data are presented in section 4, it should be underlined that the design is modular: it is organised into the basic modules of analysis, transfer and generation, with a clear separation of data and algorithms. Within the data, the dictionaries and the grammars are also clearly separated.
Intercommunication between modules
[1] Note: The actual translation produced by the free version of Matxin is "Jan behar izango dut patata"