[[Création d'une nouvelle paire avec Matxin|En français]]

{{TOCD}}

This page intends to give a step-by-step walk-through of how to create a new translator in the [[Matxin]] platform.
This page does not give instructions on installing [[Matxin]], but presumes that the following packages are correctly installed.

* [[lttoolbox]] (from SVN)
* [[Matxin]] (from SVN)
* a text editor

And either:

* [[Freeling]] (from SVN) and the <code>fl-*</code> tools for Freeling (for the moment these can be found in <code>apertium-tools/freeling</code> in [[Using SVN|apertium SVN]]), if you're planning on using FreeLing for the analysis/tagging/dependency steps

or

* [[Apertium and Constraint Grammar|vislcg3]] (from SVN), if you're planning on using lttoolbox/apertium-tagger and [[Constraint Grammar]] for the analysis/tagging/dependency steps
==Overview==

As mentioned in the lead, this page intends to give a step-by-step guide to creating a new language pair with [[Matxin]] from scratch. No programming knowledge is required; all that needs to be defined are some dictionaries and grammars. The Matxin platform is described in detail in [[Documentation of Matxin]] and on the [http://matxin.sourceforge.net Matxin homepage]. This page will only focus on the creation of a new language pair, and will avoid theoretical and methodological issues.
The language pair for the tutorial will be Breton to English. This has been chosen as the two languages have fairly divergent word order (Breton is fairly free, allowing VSO, OVS and SVO, whereas English is fairly uniformly SVO), which can show some of the advantages which Matxin has over Apertium. The sentence that we're going to use for this purpose is,

:''Ur yezh indezeuropek eo ar brezhoneg.''

:[<sub>{{sc|compl}}</sub> A language indo-european] [<sub>{{sc|verb}}</sub> is] [<sub>{{sc|subj}}</sub> the breton].

:Breton is an Indo-European language.

There are two main issues with transfer in this case: the first is re-ordering, from <code>OVS → SVO</code>; the second is removing the definite article preceding a language name, <code>ar brezhoneg → Breton</code>.
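Before diving in, it can help to see these two operations in miniature. The following toy Python sketch is purely illustrative (the function names and chunk roles are invented for this example, and this is not Matxin code); the adjective order inside the noun phrase is a separate, intra-chunk matter dealt with later under generation.

```python
# Toy illustration of the two transfer problems (not Matxin code):
# re-ordering OVS -> SVO, and dropping the article before a language name.

def reorder_ovs_to_svo(chunks):
    """chunks: list of (role, words) pairs in source (OVS) order."""
    by_role = dict(chunks)
    return [(role, by_role[role]) for role in ("subj", "verb", "compl")]

def drop_article_before_language(words, language_names=("Breton",)):
    """Remove a definite article immediately preceding a language name."""
    out = []
    for i, word in enumerate(words):
        if (word == "the" and i + 1 < len(words)
                and words[i + 1] in language_names):
            continue
        out.append(word)
    return out

# The example sentence, chunked roughly as in the gloss above:
chunks = [("compl", ["a", "language", "Indo-European"]),
          ("verb", ["is"]),
          ("subj", ["the", "Breton"])]
svo = reorder_ovs_to_svo(chunks)
words = drop_article_before_language([w for _, ws in svo for w in ws])
print(" ".join(words))  # -> Breton is a language Indo-European
```

The noun-phrase-internal order is still wrong here; that is exactly the job of the intra-chunk generation rules described towards the end of this page.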
==Getting started==

This HOWTO expects the user to have a basic familiarity with the Apertium platform; if you don't, feel free to have a look over the pages: [[lttoolbox]], [[bilingual dictionary]] and [[Apertium New Language Pair HOWTO]].

===Analysis===

For the analysis/disambiguation/dependency steps, there are two main options open:
# use the Apertium tool [[lttoolbox]] for morphological analysis, [[Constraint Grammar]] for rule-based disambiguation and syntactic tagging, [[apertium-tagger]] for statistical disambiguation and finally Constraint Grammar for dependency analysis
# use [[FreeLing]] for all these steps

Typically the choice will rest on what resources you have available, and the strengths and weaknesses of each of these tools.

==Analysis with Apertium and Constraint Grammar==

(to be written...)

==Analysis with FreeLing==
The analysis process in Matxin is done by [[Freeling]], a free / open-source suite of language analysers. The analysis is done in four stages, requiring four (or more) sets of separate files. The first is the morphological dictionary, which is basically a full-form list (e.g. [[Speling format]]) compiled into a BerkeleyDB format. There are then files for word-category disambiguation and for specifying chunking and dependency rules. There are two more stages that come before morphological analysis, tokenisation and sentence splitting, but for the purposes of this tutorial they will be considered along with morphological analysis.

Normally a single program is used to do all the different stages of analysis, taking as input plain or deformatted text and outputting a dependency analysis; the behaviour of this program is controlled by a file called <code>config.cfg</code>. In Matxin this program is called <code>Analyzer</code>; however, in the following stages we'll be using separate tools, and will leave creating the config file until the last minute as it can get complicated.

As companion reading to this section the [http://garraf.epsevg.upc.es/freeling/index.php?option=com_content&task=view&id=18&Itemid=47 Freeling documentation] is highly recommended. This tutorial skips over features of Freeling which are not necessary for making a basic MT system with Matxin. Note that it is also possible to use other analysers as input into the chunking / dependency parsing parts of Freeling; for more information see [[Freeling|here]].
===Morphological===

In order to create your morphological analyser in Freeling you basically need to make a full-form list. If there is already an Apertium dictionary for the language, you can use the scripts in [[Using SVN|apertium SVN]] (module <code>apertium-tools/freeling</code>) to generate the Freeling dictionary from it; if not, then either build it from scratch, or build a dictionary in [[lttoolbox]] and then generate the list.

For the purposes of this exercise, you can just key in a small dictionary manually. We'll call the dictionary <code>matxin-br-en.br.dicc</code>, and it will contain
</pre>
Of course, other end-of-sentence punctuation such as '?' and '!' could also be put in there. And now for the word tokeniser, which we'll put in <code>matxin-br-en.tok.dat</code>

<pre>
</pre>
The <code>-1</code> after each analysis is the ''a priori'' probability of the analysis, and is calculated from a previously tagged corpus. As we don't have a previously tagged corpus, this is unset.
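For what it's worth, if a tagged corpus does become available later, these ''a priori'' probabilities are just relative frequencies of each tag for a given form. A minimal sketch, using an invented toy corpus (this is only an illustration of the idea, not a Freeling tool):

```python
# Sketch: estimating a priori tag probabilities P(tag | form) as relative
# frequencies over a small, invented tagged corpus of (form, tag) pairs.
from collections import Counter

corpus = [("yezh", "NCFSV0"), ("yezh", "NCFSV0"), ("ur", "DI0CN0"),
          ("eo", "VMIP3S0"), ("yezh", "VMIP3S0")]  # invented counts

form_tag = Counter(corpus)
form_total = Counter(form for form, _ in corpus)

def a_priori(form, tag):
    """Relative frequency of `tag` among all occurrences of `form`."""
    return form_tag[(form, tag)] / form_total[form]

print(round(a_priori("yezh", "NCFSV0"), 3))  # -> 0.667
```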
===Category disambiguation===

After we have working morphological analysis, the next stage is to create a part-of-speech tagger. Freeling offers various ways to do this; both HMM-based and Relax Constraint Grammar (RelaxCG) based taggers are supported. We're going to demonstrate how to create a RelaxCG tagger, as it is easier and does not require tagger training.
</pre>
The file (which we will call <code>matxin-br-en.br.relax</code>) is made up of two sections. The first, <code>SETS</code>, defines any sets of tags or lemmas, much like <code>LIST</code> and <code>SET</code> in VISL [[Constraint Grammar]] taggers. The second section defines a series of weighted constraints, in the format of a weight, followed by a space, the tag, another space, and then the context. The context is defined as a series of positions relative to the tag in question.

So, using this file we should be able to get disambiguated output:
<pre>
$ echo "Ur yezh eo ar brezhoneg." | fl-morph matxin-br-en.tok.dat matxin-br-en.spt.dat br-en.br.db | \
  fl-tagger matxin-br-en.br.relax | iconv -f latin1
Ur un DI0CN0 -1
yezh yezh NCFSV0 -1
</pre>
===Chunking===

So, after tagging the next stage is chunk parsing. This is somewhat like the chunking available in Apertium (see [[Chunking]]); however, no transfer takes place, it just groups words into chunks. The grammar is quite familiar: the left side shows the non-terminal, and the right side can be either terminal (in the case of a tag, e.g. <code>NCM*</code>) or non-terminal (in the case of <code>n-m</code>). This extremely simple grammar will chunk the tagged input into the constituents (noun phrases <code>sn</code> and <code>verb-eo</code>) for later use by the dependency parser. It should be fairly straightforward: <code>|</code> is an ''or'' statement, and <code>+</code> marks the governor, or head, of the chunk. The end of each rule is marked with a full stop <code>.</code> and comments are placed with <code>%</code>.
<pre>
n-m ==> NCM* .
n-f ==> NCF* .
adj ==> AQ* .
def ==> DA0CN0 .
indef ==> DI0CN0 .
verb-eo ==> VMIP3S0(eo). %% A specific chunk type for the 'eo' form of bezañ
verb ==> VL* .
punt ==> Fp .
sn ==> def, +n-f, adj | def, +n-f | +n-f, adj | +n-f .
sn ==> def, +n-m, adj | def, +n-m | +n-m, adj | +n-m .
sn ==> indef, +n-m, adj | indef, +n-m | +n-m, adj | +n-m .
sn ==> indef, +n-f, adj | indef, +n-f | +n-f, adj | +n-f .

@START S.
</pre>
The <code>@START</code> directive states that the start node of the sentence should be labelled <code>S</code>. So, the output of this grammar will be,

<pre>
$ echo "Ur yezh indezeuropek eo ar brezhoneg." | fl-morph matxin-br-en.tok.dat matxin-br-en.spt.dat br-en.br.db | \
  fl-tagger matxin-br-en.br.relax | fl-chunker matxin-br-en.br.gram | iconv -f latin1
S_[
  sn_[
    indef_[
      +(Ur un DI0CN0)
    ]
    +n-f_[
      +(yezh yezh NCFSV0)
    ]
    adj_[
      +(indezeuropek indezeuropek AQ0CN0)
    ]
  ]
  verb-eo_[
    +(eo bezañ VMIP3S0)
  ]
  sn_[
    def_[
      +(ar an DA0CN0)
    ]
    +n-m_[
      +(brezhoneg brezhoneg NCMSV0)
    ]
  ]
  punt_[
    +(. . Fp)
  ]
]
</pre>
Note the sentence is chunked into <code>sn verb-eo sn</code>. It might be worth playing around a bit with the grammar to get a better feel for it.

===Dependency parsing===

The next stage is to create a dependency grammar. The dependency grammar describes and labels dependencies between constituents. It is made up of two main sections: <code><GRPAR></code>, which fixes up the parse provided by the chunker (in the example, the verb is moved to the top of the sentence, above the complement), and <code><GRLAB></code>, which labels parts of the parse.

Note that in Breton, sentences with ''eo'' (a form of ''bezañ'' 'to be') always have the structure Object—Verb—Subject, so require special attention. We thus label the left side as the predicate complement and the right side as the subject.
<pre>
<GRPAR>
1 - - (sn,verb-eo) top_right RELABEL - % (Ur yezh keltiek (eo))
</GRPAR>

<GRLAB>
verb-eo attr-pred d.label=sn d.side=left p.label=verb-eo %% Label dependent on the left of the verb-eo as attr-pred
verb-eo ncsubj d.label=sn d.side=right p.label=verb-eo %% Label dependent on the right of verb-eo as ncsubj
</GRLAB>
</pre>
In the above grammar, <code>d.side</code> stands for the side of the dependent, and <code>p.label</code> stands for the label of the parent. This file is comprehensively documented in the section [http://garraf.epsevg.upc.es/freeling/doc/userman/html/node28.html Dependency parser rule file] in the Freeling documentation.

We'll put the file in <code>matxin-br-en.br.dep</code>, and the resulting output of the parse is,
<pre>
$ echo "Ur yezh indezeuropek eo ar brezhoneg." | fl-morph matxin-br-en.tok.dat matxin-br-en.spt.dat br-en.br.db | \
  fl-tagger matxin-br-en.br.relax | fl-parser matxin-br-en.br.gram matxin-br-en.br.dep | iconv -f latin1
verb-eo/top/(eo bezañ VMIP3S0) [
  sn/attr-pred/(yezh yezh NCFSV0) [
    indef/modnorule/(Ur un DI0CN0)
    adj/modnorule/(indezeuropek indezeuropek AQ0CN0)
  ]
  sn/ncsubj/(brezhoneg brezhoneg NCMSV0) [
    def/modnorule/(ar an DA0CN0)
  ]
  punt/modnomatch/(. . Fp)
]
</pre>
===Configuration file===

Now that we have a more or less working analysis stage, we need to get this analysis into a form that can be used by Matxin. This involves writing a configuration file that specifies all of the modules that we've used above in one place. Below is a minimal configuration file for those modules. All of the options are mandatory; if one is left out, cryptic error messages may occur, so it is best to just copy and paste this, change the paths, and then add to it as we go along.
<pre>
#### Language
Lang=br

## Valid input/output formats are: plain, token, splitted, morfo, tagged, parsed
InputFormat=plain
OutputFormat=dep

# Consider each newline as a sentence end
AlwaysFlush=no

# Tokeniser options
TokenizerFile="/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.tok.dat"

# Splitter options
SplitterFile="/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.spt.dat"

# Morphological analysis options
SuffixFile=""
SuffixAnalysis=no
MultiwordsDetection=no
LocutionsFile=""
DecimalPoint="."
ThousandPoint=","
QuantitiesFile=""
NumbersDetection=no
PunctuationDetection=no
PunctuationFile=""
DatesDetection=no
QuantitiesDetection=no
DictionarySearch=yes
DictionaryFile=/home/fran/MATXIN/source/matxin-br-en/br-en.br.db
ProbabilityAssignment=no
ProbabilityFile=""

# NER options
NERecognition=none
NPDataFile=""

# Tagger options
Tagger=relax
TaggerRelaxFile=/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.br.relax
TaggerRelaxMaxIter=500
TaggerRelaxScaleFactor=670.0
TaggerRelaxEpsilon=0.001
TaggerRetokenize=no
TaggerForceSelect=tagger

# Parser options
GrammarFile=/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.br.gram

# Dependency parser options
DepParser=txala
DepTxalaFile=/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.br.dep
</pre>
When we've set up the config file, we can use the <code>Analyzer</code> program from the Matxin package. The final output from the <code>Analyzer</code> will be:
<pre>
$ echo "Ur yezh indezeuropek eo ar brezhoneg." | Analyzer -f matxin-br-en.br-en.cfg | iconv -f latin1
<?xml version='1.0' encoding='UTF-8' ?>
<corpus>
  <SENTENCE ord='1' alloc='0'>
    <CHUNK ord='2' alloc='21' type='verb-eo' si='top'>
      <NODE ord='4' alloc='21' form='eo' lem='bezañ' mi='VMIP3S0'>
      </NODE>
      <CHUNK ord='1' alloc='0' type='sn' si='attr-pred'>
        <NODE ord='2' alloc='3' form='yezh' lem='yezh' mi='NCFSV0'>
          <NODE ord='1' alloc='0' form='Ur' lem='un' mi='DI0CN0'>
          </NODE>
          <NODE ord='3' alloc='8' form='indezeuropek' lem='indezeuropek' mi='AQ0CN0'>
          </NODE>
        </NODE>
      </CHUNK>
      <CHUNK ord='3' alloc='24' type='sn' si='ncsubj'>
        <NODE ord='6' alloc='27' form='brezhoneg' lem='brezhoneg' mi='NCMSV0'>
          <NODE ord='5' alloc='24' form='ar' lem='an' mi='DA0CN0'>
          </NODE>
        </NODE>
      </CHUNK>
      <CHUNK ord='4' alloc='36' type='punt' si='modnomatch'>
        <NODE ord='7' alloc='36' form='.' lem='.' mi='Fp'>
        </NODE>
      </CHUNK>
    </CHUNK>
  </SENTENCE>
</corpus>
</pre>
Which is an XML representation (see [[Documentation of Matxin]]) of the dependency analysis we saw earlier.
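Since all the later Matxin modules pass this XML along, it can be handy to poke at it programmatically. A small sketch using Python's standard library (the string below hard-codes a simplified fragment of the output above; this is just for inspection, not part of the pipeline):

```python
# Sketch: inspecting the Analyzer's dependency XML with Python's standard
# library, listing each chunk's type and syntactic function (si).
import xml.etree.ElementTree as ET

xml_output = """
<corpus>
  <SENTENCE ord='1' alloc='0'>
    <CHUNK ord='2' alloc='21' type='verb-eo' si='top'>
      <NODE ord='4' alloc='21' form='eo' lem='bezañ' mi='VMIP3S0'/>
      <CHUNK ord='1' alloc='0' type='sn' si='attr-pred'>
        <NODE ord='2' alloc='3' form='yezh' lem='yezh' mi='NCFSV0'/>
      </CHUNK>
      <CHUNK ord='3' alloc='24' type='sn' si='ncsubj'>
        <NODE ord='6' alloc='27' form='brezhoneg' lem='brezhoneg' mi='NCMSV0'/>
      </CHUNK>
    </CHUNK>
  </SENTENCE>
</corpus>
"""

root = ET.fromstring(xml_output)
# iter() walks the tree in document order, so nested chunks come out
# in the order they appear in the analysis.
chunks = [(c.get("type"), c.get("si")) for c in root.iter("CHUNK")]
print(chunks)  # -> [('verb-eo', 'top'), ('sn', 'attr-pred'), ('sn', 'ncsubj')]
```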
==Transfer==

===Lexical transfer===

The next stage in the process is lexical transfer, which takes source language lexical forms and returns target language lexical forms. There are three files involved in lexical transfer:

* The bilingual dictionary (<code>matxin-br-en.br-en.dix</code>), which uses the familiar [[lttoolbox]] dictionary format, although slightly differently as a result of the [[Parole tags|Parole-style tags]]
* The noun semantic dictionary (<code>matxin-br-en.br.sem_info</code>), a tab separated file
* The chunk-type dictionary (<code>matxin-br-en.br-en.chunk_type</code>), a tab separated file

The module which performs lexical transfer is called <code>LT</code>, and the format of the dictionaries described above will be explained below.
====Bilingual dictionary====

The basic format of the bilingual dictionary is the same as in Apertium:
<pre>
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="mi" c="Morphological information"/>
    <sdef n="parol" c="PAROLE style tag"/>
  </sdefs>
  <section id="main" type="standard">

  </section>
</dictionary>
</pre>
Unlike in Apertium, where the symbols are used to carry morphological information (e.g. <code><sdef n="adj"/></code> and <code><s n="adj"/></code> for Adjective), in Matxin the symbols are used to define attributes on the node element. For example, <code><s n="parol"/></code> makes an attribute in the node called <code>parol</code> with the value of the tags which follow it. The following tags are usually encased in square brackets <code>[]</code>.

So, in order to transfer the parole tag of a singular feminine noun (<code>NCFSV0</code>) in Breton to the appropriate representation in English for the rest of the transfer stages, e.g. a singular noun with no gender, <code>parol="NC" mi="[NUMS]"</code>, we can use the following entry in the bilingual dictionary,
<pre>
<e><p><l>yezh<s n="parol"/>NCFSV0</l><r>language<s n="parol"/>NC<s n="mi"/>[NUMS]</r></p></e>
</pre>
which will be output as
<pre>
<NODE ref="2" alloc="3" UpCase="none" lem="language" parol="NC" mi="[NUMS]">
</pre>
by the lexical transfer module. As these patterns are repeated frequently, typically they are put into paradigms; but instead of being morphological paradigms, as in Apertium, they are lexical transfer paradigms. So, for example, for nouns we might have,
<pre>
<pardef n="NC_STD">
  <e><p><l><s n="parol"/>NCFSV0</l><r><s n="parol"/>NC<s n="mi"/>[NUMS]</r></p></e>
  <e><p><l><s n="parol"/>NCFPV0</l><r><s n="parol"/>NC<s n="mi"/>[NUMP]</r></p></e>
  <e><p><l><s n="parol"/>NCMSV0</l><r><s n="parol"/>NC<s n="mi"/>[NUMS]</r></p></e>
  <e><p><l><s n="parol"/>NCMPV0</l><r><s n="parol"/>NC<s n="mi"/>[NUMP]</r></p></e>
</pardef>
</pre>
This transfers both feminine and masculine, singular and plural, nouns in Breton to their English equivalents. The corresponding entries for our two nouns in the main section would be,
<pre>
<e><p><l>yezh</l><r>language</r></p><par n="NC_STD"/></e>
<e><p><l>brezhoneg</l><r>Breton</r></p><par n="NC_STD"/></e>
</pre>
Using the information for nouns and the XML generated after lexical transfer below, you should be able to create a full bilingual dictionary for our test phrase.

====Noun semantic dictionary====
The noun semantic dictionary is a simple file which allows basic semantic tagging of lemmas. It works somewhat like lists in Apertium transfer files, and allows the categorisation of nouns into semantic classes, for example ''language'', ''animacy'', ''material'', ''communication medium'', etc. The file is a tab separated list of lemma and semantic tag; for example, for our phrase we might want to tag ''brezhoneg'' 'Breton' as <code>[HZK]</code> (an abbreviation of ''hizkuntzak'' 'languages').

So make the file <code>matxin-br-en.br.sem_info</code> with the following contents:
<pre>
##[HZK]: Languages / Yezhoù (hizkuntzak)
brezhoneg [HZK+]
</pre>
You can add other languages such as ''euskareg'' 'Basque' and ''kembraeg'' 'Welsh'. The symbol <code>+</code> means that this feature is positive; the feature can also be followed by <code>-</code> for negative or <code>?</code> for uncertain. The lexical transfer module doesn't seem to use this information directly, but it needs the file to be in place.
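To make the file format concrete, here is a small sketch of how such a file could be read (a hypothetical helper, not part of Matxin; it assumes the tab-separated layout and the <code>+</code>/<code>-</code>/<code>?</code> polarities described above):

```python
# Sketch: loading a sem_info-style file into a dictionary.  Each data
# line is a lemma, a tab, and a feature like [HZK+], where the final
# +, - or ? character is the polarity.
def load_sem_info(lines):
    sem = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        lemma, feature = line.split("\t")
        sem[lemma] = (feature[1:-2], feature[-2])  # e.g. ('HZK', '+')
    return sem

lines = ["##[HZK]: Languages / Yezhoù (hizkuntzak)",
         "brezhoneg\t[HZK+]",
         "euskareg\t[HZK+]"]
print(load_sem_info(lines)["brezhoneg"])  # -> ('HZK', '+')
```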
====Chunk-type transfer dictionary====

The final dictionary required for the lexical transfer stage is the chunk-type transfer dictionary. This transfers chunk types (e.g. <code>sn</code> ''sintagma nominal'' 'noun phrase' and <code>sp</code> ''sintagma preposicional'' 'prepositional phrase') into the target language chunk types. As we maintain the same chunk types between languages we can just have an empty file for this, although it is probably worth adding a comment with the format. For example, make a file called <code>matxin-br-en.br-en.chunk_type</code> with the following contents,
<pre>
#Category (SL) #Category (TL) #Description (SL) #Description (TL)
sn sn #Noun phrase Noun phrase
</pre>
====Configuration file====

Now we come to editing the configuration file: open the file, add the following options at the bottom, and save it.
<pre>
# Transfer options
TransDictFile=/home/fran/MATXIN/source/matxin-br-en/br-en.autobil.bin
ChunkTypeDict=/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.br-en.chunk_type
NounSemFile=/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.br.sem_info
</pre>
====Output of lexical transfer====
<pre>
$ echo "Ur yezh indezeuropek eo ar brezhoneg." | Analyzer -f matxin-br-en.br-en.cfg | iconv -f latin1 | \
  LT -f matxin-br-en.br-en.cfg
<?xml version="1.0" encoding="UTF-8"?>
<corpus>
  <SENTENCE ref="1" alloc="0">
    <CHUNK ref="2" type="verb-eo" alloc="21" si="top">
      <NODE ref="4" alloc="21" slem="bezañ" smi="VMIP3S0" UpCase="none" lem="be" parol="VM" mi="[IND][PRES][P3][NUMS]"/>
      <CHUNK ref="1" type="sn" alloc="0" si="attr-pred">
        <NODE ref="2" alloc="3" slem="yezh" smi="NCFSV0" UpCase="none" lem="language" parol="NC" mi="[NUMS]">
          <NODE ref="1" alloc="0" slem="un" smi="DI0CN0" UpCase="none" lem="a" parol="DI"/>
          <NODE ref="3" alloc="8" slem="indezeuropek" smi="AQ0CN0" UpCase="none" lem="Indo-European" parol="AQ" mi="[PST]"/>
        </NODE>
      </CHUNK>
      <CHUNK ref="3" type="sn" alloc="24" si="ncsubj">
        <NODE ref="6" alloc="27" slem="brezhoneg" smi="NCMSV0" UpCase="none" lem="Breton" parol="NC" mi="[NUMS]">
          <NODE ref="5" alloc="24" slem="an" smi="DA0CN0" UpCase="none" lem="the" parol="DA"/>
        </NODE>
      </CHUNK>
      <CHUNK ref="4" type="punt" alloc="36" si="modnomatch">
        <NODE ref="7" alloc="36" slem="." smi="Fp" UpCase="none" lem="." parol="Fp"/>
      </CHUNK>
    </CHUNK>
  </SENTENCE>
</corpus>
</pre>
Note: The above output has been post-processed by <code>xmllint --format -</code> to give more human readable formatting.
===Intra-chunk transfer===

The purpose of the intra-chunk transfer is to move information from nodes in a chunk to the chunk itself, for example to handle agreement between a subject noun phrase and the verb in a verb phrase. We're just going to copy the morphological information straight over, so make a file called <code>matxin-br-en.br-en.intra1</code> and put the following in.
<pre>
# 1/2 3/4 5
mi!=''/mi /mi no-overwrite
</pre>
The file is tab and forward-slash separated and has five columns:

# Node specification: defines which nodes to take information from, in this case <code>mi!=<nowiki>''</nowiki></code>, where the morphological information is non-null.
# Source attribute: which source attribute to copy, in this case the morphological information.
# Chunk condition: restricts the chunk to which the information can be moved. In this case there is no restriction, but, for example, you might want to only move the information when the subject is a common noun, in which case <code>si='ncsubj'</code> would do the trick.
# Destination attribute: the attribute in which the information should be put.
# Write mode: can be one of three: ''no-overwrite'' (do not overwrite previous information), ''overwrite'' (overwrite previous information) or ''concat'' (concatenate the information to any previously existing).
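As a sketch of how these five columns decompose, here is a hypothetical parser for a single rule line (it assumes the tab and forward-slash layout just described; the field names are invented for this example and are not Matxin's own):

```python
# Sketch: splitting one intra-chunk move rule into its five columns.
# Groups 1/2 and 3/4 are joined by '/'; groups are separated by tabs.
def parse_move_rule(line):
    node_part, chunk_part, mode = line.split("\t")
    node_spec, src_attr = node_part.split("/")     # columns 1 and 2
    chunk_cond, dst_attr = chunk_part.split("/")   # columns 3 and 4
    return {"node": node_spec, "source": src_attr,
            "chunk": chunk_cond, "dest": dst_attr, "mode": mode}

rule = parse_move_rule("mi!=''/mi\t/mi\tno-overwrite")
print(rule["source"], rule["dest"], rule["mode"])  # -> mi mi no-overwrite
```

Note how the empty chunk condition (column 3) comes out as an empty string, matching "no restriction" in the rule above.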
After you've made the file, add the details to the bottom of the configuration file as follows,
<pre>
IntraMoveFile=/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.br-en.intra1
</pre>
After this has been added, we're ready to run the intra-chunk syntactic transfer module. This is done with the <code>ST_intra</code> program, which can be called as follows:
<pre>
$ echo "Ur yezh indezeuropek eo ar brezhoneg." | Analyzer -f matxin-br-en.br-en.cfg | iconv -f latin1 | LT -f matxin-br-en.br-en.cfg | \
  ST_intra -f matxin-br-en.br-en.cfg
<?xml version="1.0" encoding="UTF-8"?>
<corpus>
  <SENTENCE ref="1" alloc="0">
    <CHUNK ref="2" type="verb-eo" alloc="21" si="top" mi="[IND][PRES][P3][NUMS]">
      <NODE ref="4" alloc="21" UpCase="none" lem="be" parol="VM" mi="[IND][PRES][P3][NUMS]"/>
      <CHUNK ref="1" type="sn" alloc="0" si="attr-pred" mi="[NUMS]">
        <NODE ref="2" alloc="3" UpCase="none" lem="language" parol="NC" mi="[NUMS]">
          <NODE ref="1" alloc="0" UpCase="none" lem="a" parol="DI"/>
          <NODE ref="3" alloc="8" UpCase="none" lem="Indo-European" parol="AQ" mi="[PST]"/>
        </NODE>
      </CHUNK>
      <CHUNK ref="3" type="sn" alloc="24" si="ncsubj" mi="[NUMS]">
        <NODE ref="6" alloc="27" UpCase="none" lem="Breton" parol="NC" mi="[NUMS]">
          <NODE ref="5" alloc="24" UpCase="none" lem="the" parol="DA"/>
        </NODE>
      </CHUNK>
      <CHUNK ref="4" type="punt" alloc="36" si="modnomatch">
        <NODE ref="7" alloc="36" UpCase="none" lem="." parol="Fp"/>
      </CHUNK>
    </CHUNK>
  </SENTENCE>
</corpus>
</pre>
The morphological information has been moved from the head node of the chunk to the chunk itself.
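The ''no-overwrite'' semantics of the rule can be illustrated in miniature (a hypothetical sketch using Python's standard XML library, not the actual <code>ST_intra</code> implementation):

```python
# Sketch: the effect of the rule above, in miniature -- copy the head
# node's 'mi' attribute up to its chunk, in no-overwrite mode.
import xml.etree.ElementTree as ET

chunk = ET.fromstring('<CHUNK type="sn" si="ncsubj">'
                      '<NODE lem="Breton" parol="NC" mi="[NUMS]"/>'
                      '</CHUNK>')
head = chunk.find("NODE")
# Condition mi!='' (non-null source) plus no-overwrite on the chunk:
if head.get("mi") and "mi" not in chunk.attrib:
    chunk.set("mi", head.get("mi"))
print(chunk.get("mi"))  # -> [NUMS]
```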
===Inter-chunk transfer===

The next stage is similar to the previous one, but deals with movement of information between the chunks themselves. In our example we don't need to use this.
==Generation==

===Intra-chunk===

As Breton and English differ in the internal structure of noun phrases, the next thing we want to do is transfer the Breton structure into the English one; this will involve changing {{sc|det nom adj}} to {{sc|det adj nom}}. Another intra-chunk process we want to take care of is the removal of the definite article before a noun which has the semantic tag for language (<code>[HZK+]</code>).
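A toy sketch of these two generation steps (illustrative only; the tag names and the <code>[HZK+]</code> lookup are simplified stand-ins for the formats described earlier, not Matxin's generation rules):

```python
# Sketch of the two intra-chunk generation steps on a toy noun phrase,
# represented as (tag, word) pairs.

def det_nom_adj_to_det_adj_nom(np):
    """Reorder DET NOM ADJ to DET ADJ NOM; leave other shapes alone."""
    if [t for t, _ in np] == ["det", "nom", "adj"]:
        det, nom, adj = np
        return [det, adj, nom]
    return np

def drop_def_article(np, sem):
    """Drop 'the' immediately before a noun tagged [HZK+] in `sem`."""
    out = []
    for i, (tag, word) in enumerate(np):
        nxt = np[i + 1][1] if i + 1 < len(np) else None
        if tag == "det" and word == "the" and sem.get(nxt) == "[HZK+]":
            continue
        out.append((tag, word))
    return out

np = [("det", "a"), ("nom", "language"), ("adj", "Indo-European")]
print([w for _, w in det_nom_adj_to_det_adj_nom(np)])
# -> ['a', 'Indo-European', 'language']
print([w for _, w in drop_def_article([("det", "the"), ("nom", "Breton")],
                                      {"Breton": "[HZK+]"})])
# -> ['Breton']
```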
====Dependency parsing====

[[Category:Matxin]]
[[Category:Documentation]]
[[Category:HOWTO]]
[[Category:Documentation in English]]
Latest revision as of 13:05, 11 May 2016
This page intends to give a step-by-step walk-through of how to create a new translator in the Matxin platform.
Prerequisites[edit]
- Main article: Matxin
This page does not give instructions on installing Matxin, but presumes that the following packages are correctly installed.
And either:
- Freeling (from SVN) and the
fl-*
tools for Freeling (for the moment these can be found inapertium-tools/freeling
in apertium SVN), if you're planning on using FreeLing for the analysis/tagging/dependency steps
or
- vislcg3 (from SVN), if you're planning on using lttoolbox/apertium-tagger and Constraint Grammar for the analysis/tagging/dependency steps
Overview[edit]
As mentioned in the lead, this page intends to give a step-by-step guide to creating a new language pair with Matxin from scratch. No programming knowledge is required, all that needs to be defined are some dictionaries and grammars. The Matxin platform is described in detail in Documentation of Matxin and on the Matxin homepage. This page will only focus on the creation of a new language pair, and will avoid theoretical and methodological issues.
The language pair for the tutorial will be Breton to English. This has been chosen as the two languages have fairy divergent word order (Breton is fairly free, allowing VSO, OVS and SVO, where English is fairly uniformly SVO) which can show some of the advantage which Matxin has over Apertium. The sentence that we're going to use for this purpose is,
- Ur yezh indezeuropek eo ar brezhoneg.
- [compl A language indo-european] [verb is] [subj the breton].
- Breton is an Indo-European language.
There are two main issues with transfer in this case: the first is re-ordering, from OVS → SVO; the second is removing the definite article preceding a language name, ar brezhoneg → Breton.
Getting started
This HOWTO expects the user to have a basic familiarity with the Apertium platform; if you don't, feel free to have a look over the pages: lttoolbox, bilingual dictionary and Apertium New Language Pair HOWTO.
For the analysis/disambiguation/dependency steps, there are two main options open:
- use the Apertium tool lttoolbox for morphological disambiguation, Constraint Grammar for rule-based disambiguation and syntactic tagging, apertium-tagger for statistical disambiguation and finally Constraint Grammar for dependency analysis
- use FreeLing for all these steps
Typically the choice will rest on what resources you have available, and the strengths and weaknesses of each of these tools.
Analysis with Apertium and Constraint Grammar
(to be written...)
Analysis with FreeLing
The analysis process in Matxin is done by Freeling, a free/open-source suite of language analysers. The analysis is done in four stages, requiring four (or more) separate sets of files. The first is the morphological dictionary, which is basically a full-form list (e.g. Speling format) compiled into BerkeleyDB format. There are then files for word-category disambiguation and for specifying chunking and dependency rules. There are two more stages that come before morphological analysis, tokenisation and sentence splitting, but for the purposes of this tutorial they will be considered along with morphological analysis.
Normally a single program is used to do all the different stages of analysis, taking as input plain or deformatted text, and outputting a dependency analysis; the behaviour of this program is controlled by a file called config.cfg. In Matxin this program is called Analyzer. However, in the following stages we'll be using separate tools, and will leave creating the config file until the last minute, as it can get complicated.
As companion reading to this section the Freeling documentation is highly recommended. This tutorial skips over features of Freeling which are not necessary for making a basic MT system with Matxin, and note that it is also possible to use other analysers as input into the chunking / dependency parsing parts of Freeling, for more information see here.
Morphological
In order to create your morphological analyser in Freeling you basically need to make a full-form list. If there is already an Apertium dictionary for the language, you can use the scripts in apertium SVN (module apertium-tools/freeling) to generate the dictionary; if not, then either build it from scratch, or build a dictionary in lttoolbox and then generate the list.
For the purposes of this exercise, you can just key in a small dictionary manually. We'll call the dictionary matxin-br-en.br.dicc, and it will contain
ul un DI0CN0
un un DI0CN0
ur un DI0CN0
yezhoù yezh NCFPV0
yezh yezh NCFSV0 yezh AQ0CN0
indezeuropek indezeuropek AQ0CN0
eo bezañ VMIP3S0
al an DA0CN0
ar an DA0CN0
an an DA0CN0
brezhoneg brezhoneg NCMSV0
prezhoneg brezhoneg NCMSV0
vrezhoneg brezhoneg NCMSV0
. . Fp
The file is space-separated, with three or more columns. The first is the surface form of the word; further columns give a list of lemmas and Parole-style analyses.
After we've keyed this in, we can compile it to BerkeleyDB format using the tool indexdict from the Freeling utilities. It is worth noting that Freeling currently only supports the latin1 character encoding, so if you're working in UTF-8, convert the dictionary to latin1 first.
$ cat matxin-br-en.br.dicc | iconv -f utf-8 -t latin1 | indexdict br-en.br.db
Now you should have two files: matxin-br-en.br.dicc, which is the dictionary source, and br-en.br.db, which is the dictionary in BerkeleyDB format. We cannot, however, use this analyser without specifying a tokeniser and splitter. These files define how words and sentences will be tokenised. For now we'll use a minimal configuration file for the splitter, so put the following in the file matxin-br-en.spt.dat
<SentenceEnd>
. 0
</SentenceEnd>
Of course, other end of sentence punctuation such as '?' and '!' could also be put in there. And now for the word tokeniser, which we'll put in matxin-br-en.tok.dat
<Macros>
ALPHANUM [^\]<>[(\.,";:?!'`)^@~|}{_/\\+=&$#*+%\s\-]
OTHERS [\]<>[(\.,";:?!'`)^@~|}{_/\\+=&$#*+%\-]
</Macros>
<RegExps>
WORD 0 {ALPHANUM}+
OTHERS_C 0 {OTHERS}+
</RegExps>
The macros define regular expressions which are used to tokenise the input into words and punctuation. The regular expression WORD is defined as a sequence of one or more ALPHANUM, which in turn is defined as anything except a punctuation character.
So now if we want to morphologically analyse a sentence, we just do:
$ echo "Ur yezh eo ar brezhoneg." | fl-morph matxin-br-en.tok.dat matxin-br-en.spt.dat br-en.br.db | iconv -f latin1
Ur un DI0CN0 -1 -1
yezh yezh NCFSV0 -1 yezh AQ0CN0 -1
eo bezañ VMIP3S0 -1 -1
ar an DA0CN0 -1 -1
brezhoneg brezhoneg NCMSV0 -1 -1
. . Fp -1
The -1 after each analysis is the a priori probability of the analysis, which is calculated from a previously tagged corpus. As we don't have a previously tagged corpus, this is unset.
Category disambiguation
After we have working morphological analysis, the next stage is to create a part-of-speech tagger. Freeling offers various ways to do this: both HMM-based and Relax Constraint Grammar (RelaxCG) based taggers are supported. We're going to demonstrate how to create a RelaxCG tagger, as it is easier and does not require tagger training.
Our tagger will be very simple as we only have one ambiguity, yezh 'language' can be a noun or an adjective. As adjectives come after the noun in Breton, we'll weight adjectives after determiners very low,
SETS
CONSTRAINTS
%% after a determiner down-weight adjective
-8.0 AQ* (-1 D*);
The file (which we will call matxin-br-en.br.relax) is made up of two sections. The first, SETS, defines any sets of tags or lemmas, much like LIST and SET in VISL Constraint Grammar taggers. The second section defines a series of weighted constraints, in the format: weight, followed by a space, then the tag, then another space, and then the context. The context is defined as a series of positions relative to the tag in question.
So, using this file we should be able to get disambiguated output:
$ echo "Ur yezh eo ar brezhoneg." | fl-morph matxin-br-en.tok.dat matxin-br-en.spt.dat br-en.br.db | \
  fl-tagger matxin-br-en.br.relax | iconv -f latin1
Ur un DI0CN0 -1
yezh yezh NCFSV0 -1
eo bezañ VMIP3S0 -1
ar an DA0CN0 -1
brezhoneg brezhoneg NCMSV0 -1
. . Fp -1
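The weight/tag/context format described above is general; as an illustration only (this is a hypothetical constraint, not part of our example file), a rule rewarding a noun immediately after a determiner would use a positive weight in exactly the same shape:

```
%% hypothetical: after a determiner, prefer nouns
2.0 NC* (-1 D*);
```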
Chunking
So, after tagging, the next stage is chunk parsing. This is somewhat like the chunking available in Apertium (see Chunking); however, no transfer takes place, it just groups words into chunks. The grammar is quite familiar: the left side shows the non-terminal, and the right side can be either terminal (in the case of a tag, e.g. NCM*) or non-terminal (in the case of n-m). This extremely simple grammar will chunk the tagged input into the constituents (noun phrases sn and verb-eo) for later use by the dependency parser. It should be fairly straightforward: | is an or statement, and + marks the governor, or head, of the chunk. The end of each rule is marked with a full stop . and comments are placed with %.
n-m ==> NCM* .
n-f ==> NCF* .
adj ==> AQ* .
def ==> DA0CN0 .
indef ==> DI0CN0 .
verb-eo ==> VMIP3S0(eo). %% A specific chunk type for the 'eo' form of bezañ
verb ==> VL* .
punt ==> Fp .
sn ==> def, +n-f, adj | def, +n-f | +n-f, adj | +n-f .
sn ==> def, +n-m, adj | def, +n-m | +n-m, adj | +n-m .
sn ==> indef, +n-m, adj | indef, +n-m | +n-m, adj | +n-m .
sn ==> indef, +n-f, adj | indef, +n-f | +n-f, adj | +n-f .
@START S.
The @START directive states that the start node of the sentence should be labelled S. So, the output of this grammar will be,
$ echo "Ur yezh indezeuropek eo ar brezhoneg." | fl-morph matxin-br-en.tok.dat matxin-br-en.spt.dat br-en.br.db | \
  fl-tagger matxin-br-en.br.relax | fl-chunker matxin-br-en.br.gram | iconv -f latin1
S_[
  sn_[
    indef_[ +(Ur un DI0CN0) ]
    +n-f_[ +(yezh yezh NCFSV0) ]
    adj_[ +(indezeuropek indezeuropek AQ0CN0) ]
  ]
  verb-eo_[ +(eo bezañ VMIP3S0) ]
  sn_[
    def_[ +(ar an DA0CN0) ]
    +n-m_[ +(brezhoneg brezhoneg NCMSV0) ]
  ]
  punt_[ +(. . Fp) ]
]
Note the sentence is chunked into sn verb-eo sn. It might be worth playing around a bit with the grammar to get a better feel for it.
Dependency parsing
The next stage is to create a dependency grammar. The dependency grammar describes and labels dependencies between constituents. It is made up of two main sections: <GRPAR>, which fixes up the parse provided by the chunker (in the example, the verb is moved to the top of the sentence, above the complement), and <GRLAB>, which labels parts of the parse.
Note that in Breton, sentences with eo (a form of bezañ 'to be') always have the structure Object—Verb—Subject, so require special attention. We thus label the left side as the predicate complement and the right side as the subject.
<GRPAR>
1 - - (sn,verb-eo) top_right RELABEL -   % (Ur yezh keltiek (eo))
</GRPAR>
<GRLAB>
verb-eo attr-pred d.label=sn d.side=left p.label=verb-eo   %% Label dependent on the left of the verb-eo as attr-pred
verb-eo ncsubj d.label=sn d.side=right p.label=verb-eo   %% Label dependent on the right of verb-eo as ncsubj
</GRLAB>
In the above grammar, d.side stands for the side of the dependent, and p.label stands for the label of the parent. This file is comprehensively documented in the section "Dependency parser rule file" in the Freeling documentation.
We'll put the file in matxin-br-en.br.dep, and the resulting output of the parse is,
$ echo "Ur yezh indezeuropek eo ar brezhoneg." | fl-morph matxin-br-en.tok.dat matxin-br-en.spt.dat br-en.br.db | \
  fl-tagger matxin-br-en.br.relax | fl-parser matxin-br-en.br.gram matxin-br-en.br.dep | iconv -f latin1
verb-eo/top/(eo bezañ VMIP3S0) [
  sn/attr-pred/(yezh yezh NCFSV0) [
    indef/modnorule/(Ur un DI0CN0)
    adj/modnorule/(indezeuropek indezeuropek AQ0CN0)
  ]
  sn/ncsubj/(brezhoneg brezhoneg NCMSV0) [
    def/modnorule/(ar an DA0CN0)
  ]
  punt/modnomatch/(. . Fp)
]
Configuration file
So now that we have a more or less working analysis stage, we need to get this analysis into a form that can be used by Matxin. This involves writing a configuration file that specifies all of the modules we've used above in one place. Below is a minimal configuration file for these modules. All of the options are mandatory; if one is left out, cryptic error messages may occur, so it is best to just copy and paste this, change the paths, and then add to it as we go along.
#### Language
Lang=br

## Valid input/output formats are: plain, token, splitted, morfo, tagged, parsed
InputFormat=plain
OutputFormat=dep

# Consider each newline as a sentence end
AlwaysFlush=no

# Tokeniser options
TokenizerFile="/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.tok.dat"

# Splitter options
SplitterFile="/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.spt.dat"

# Morphological analysis options
SuffixFile=""
SuffixAnalysis=no
MultiwordsDetection=no
LocutionsFile=""
DecimalPoint="."
ThousandPoint=","
QuantitiesFile=""
NumbersDetection=no
PunctuationDetection=no
PunctuationFile=""
DatesDetection=no
QuantitiesDetection=no
DictionarySearch=yes
DictionaryFile=/home/fran/MATXIN/source/matxin-br-en/br-en.br.db
ProbabilityAssignment=no
ProbabilityFile=""

# NER options
NERecognition=none
NPDataFile=""

# Tagger options
Tagger=relax
TaggerRelaxFile=/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.br.relax
TaggerRelaxMaxIter=500
TaggerRelaxScaleFactor=670.0
TaggerRelaxEpsilon=0.001
TaggerRetokenize=no
TaggerForceSelect=tagger

# Parser options
GrammarFile=/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.br.gram

# Dependency parser options
DepParser=txala
DepTxalaFile=/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.br.dep
When we've set up the config file we can use the Analyzer program from the Matxin package. The final output from the Analyzer will be:
$ echo "Ur yezh indezeuropek eo ar brezhoneg." | Analyzer -f matxin-br-en.br-en.cfg | iconv -f latin1
<?xml version='1.0' encoding='UTF-8' ?>
<corpus>
<SENTENCE ord='1' alloc='0'>
<CHUNK ord='2' alloc='21' type='verb-eo' si='top'>
  <NODE ord='4' alloc='21' form='eo' lem='bezañ' mi='VMIP3S0'>
  </NODE>
  <CHUNK ord='1' alloc='0' type='sn' si='attr-pred'>
    <NODE ord='2' alloc='3' form='yezh' lem='yezh' mi='NCFSV0'>
      <NODE ord='1' alloc='0' form='Ur' lem='un' mi='DI0CN0'>
      </NODE>
      <NODE ord='3' alloc='8' form='indezeuropek' lem='indezeuropek' mi='AQ0CN0'>
      </NODE>
    </NODE>
  </CHUNK>
  <CHUNK ord='3' alloc='24' type='sn' si='ncsubj'>
    <NODE ord='6' alloc='27' form='brezhoneg' lem='brezhoneg' mi='NCMSV0'>
      <NODE ord='5' alloc='24' form='ar' lem='an' mi='DA0CN0'>
      </NODE>
    </NODE>
  </CHUNK>
  <CHUNK ord='4' alloc='36' type='punt' si='modnomatch'>
    <NODE ord='7' alloc='36' form='.' lem='.' mi='Fp'>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
Which is an XML representation (see Documentation of Matxin) of the dependency analysis we saw earlier.
Transfer
Lexical transfer
The next stage in the process is lexical transfer, which takes source language lexical forms and returns target language lexical forms. There are three files involved in lexical transfer:
- The bilingual dictionary (matxin-br-en.br-en.dix), which uses the familiar lttoolbox dictionary format, although slightly differently as a result of the Parole-style tags
- The noun semantic dictionary (matxin-br-en.br.sem_info), a tab-separated file
- The chunk-type dictionary (matxin-br-en.br.chunk_type), a tab-separated file
The module which performs lexical transfer is called LT, and the format of the dictionaries described above will be explained below.
Bilingual dictionary
The basic format of the bilingual dictionary is the same as in Apertium,
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="mi" c="Morphological information"/>
    <sdef n="parol" c="PAROLE style tag"/>
  </sdefs>
  <section id="main" type="standard">
  </section>
</dictionary>
Unlike in Apertium, where the symbols are used to carry morphological information (e.g. <sdef n="adj"/> and <s n="adj"/> for adjective), in Matxin the symbols are used to define attributes of the node element. For example, <s n="parol"/> makes an attribute in the node called parol with the value of the tags which follow it. The following tags are usually encased in square brackets [].
So, in order to transfer the Parole tag of a singular feminine noun (NCFSV0) in Breton to the appropriate representation in English for the rest of the transfer stages, e.g. a singular noun with no gender (parol="NC" mi="[NUMS]"), we can use the following entry in the bilingual dictionary,
<e><p><l>yezh<s n="parol"/>NCFSV0</l><r>language<s n="parol"/>NC<s n="mi"/>[NUMS]</r></p></e>
which will be output as
<NODE ref="2" alloc="3" UpCase="none" lem="language" parol="NC" mi="[NUMS]">
by the lexical transfer module. As these patterns are repeated frequently, they are typically put into paradigms; but instead of being morphological paradigms, as in Apertium, they are lexical transfer paradigms. For example, for nouns we might have,
<pardef n="NC_STD">
  <e><p><l><s n="parol"/>NCFSV0</l><r><s n="parol"/>NC<s n="mi"/>[NUMS]</r></p></e>
  <e><p><l><s n="parol"/>NCFPV0</l><r><s n="parol"/>NC<s n="mi"/>[NUMP]</r></p></e>
  <e><p><l><s n="parol"/>NCMSV0</l><r><s n="parol"/>NC<s n="mi"/>[NUMS]</r></p></e>
  <e><p><l><s n="parol"/>NCMPV0</l><r><s n="parol"/>NC<s n="mi"/>[NUMP]</r></p></e>
</pardef>
This transfers both feminine and masculine, singular and plural, nouns in Breton to their English equivalents. The corresponding entries for our two nouns in the main section would be,
<e><p><l>yezh</l><r>language</r></p><par n="NC_STD"/></e>
<e><p><l>brezhoneg</l><r>Breton</r></p><par n="NC_STD"/></e>
Using the information for nouns and the XML generated after lexical transfer below you should be able to create a full bilingual dictionary for our test phrase.
Noun semantic dictionary
The noun semantic dictionary is a simple file which allows basic semantic tagging of lemmas. It works somewhat like lists in Apertium transfer files, and allows the categorisation of nouns into semantic classes, for example language, animacy, material, communication medium, etc. The file is a tab-separated list of lemma and semantic tag; for example, for our phrase we might want to tag brezhoneg 'Breton' as [HZK] (an abbreviation of hizkuntzak 'languages').
So make the file matxin-br-en.br.sem_info with the following contents:
##[HZK]: Languages / Yezhoù (hizkuntzak)
brezhoneg	[HZK+]
You can add other languages such as euskareg 'Basque' and kembraeg 'Welsh'. The symbol + means that this feature is positive; the feature can also be followed by - for negative or ? for uncertain. The lexical transfer module doesn't seem to use this information directly, but it needs the file to be in place.
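Following the suggestion above of adding other languages, an extended version of matxin-br-en.br.sem_info with euskareg and kembraeg added would look like this (lemma spellings as given above):

```
##[HZK]: Languages / Yezhoù (hizkuntzak)
brezhoneg	[HZK+]
euskareg	[HZK+]
kembraeg	[HZK+]
```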
Chunk-type transfer dictionary
The final dictionary required for the lexical transfer stage is the chunk-type transfer dictionary. This transfers chunk types (e.g. sn sintagma nominal 'noun phrase' and sp sintagma preposicional 'prepositional phrase') into the target language chunk types. As we maintain the same chunk types between the two languages, we can just have an empty file for this, although it is probably worth adding a comment with the format. So make a file called matxin-br-en.br-en.chunk_type with the following contents,
#Category (SL)	#Category (TL)	#Description (SL)	#Description (TL)
sn	sn	#Noun phrase	Noun phrase
Configuration file
Now we come to editing the configuration file: open the file, add the following options at the bottom, and save it.
# Transfer options
TransDictFile=/home/fran/MATXIN/source/matxin-br-en/br-en.autobil.bin
ChunkTypeDict=/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.br-en.chunk_type
NounSemFile=/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.br.sem_info
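Note that TransDictFile points at br-en.autobil.bin, a compiled binary, rather than at the matxin-br-en.br-en.dix source. Assuming the dictionary is compiled with lttoolbox's lt-comp in the left-to-right direction, as for an Apertium bilingual dictionary, the command would be:

```
$ lt-comp lr matxin-br-en.br-en.dix br-en.autobil.bin
```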
Output of lexical transfer
$ echo "Ur yezh indezeuropek eo ar brezhoneg." | Analyzer -f matxin-br-en.br-en.cfg | iconv -f latin1 | \
  LT -f matxin-br-en.br-en.cfg
<?xml version="1.0" encoding="UTF-8"?>
<corpus>
  <SENTENCE ref="1" alloc="0">
    <CHUNK ref="2" type="verb-eo" alloc="21" si="top">
      <NODE ref="4" alloc="21" slem="bezañ" smi="VMIP3S0" UpCase="none" lem="be" parol="VM" mi="[IND][PRES][P3][NUMS]"/>
      <CHUNK ref="1" type="sn" alloc="0" si="attr-pred">
        <NODE ref="2" alloc="3" slem="yezh" smi="NCFSV0" UpCase="none" lem="language" parol="NC" mi="[NUMS]">
          <NODE ref="1" alloc="0" slem="un" smi="DI0CN0" UpCase="none" lem="a" parol="DI"/>
          <NODE ref="3" alloc="8" slem="indezeuropek" smi="AQ0CN0" UpCase="none" lem="Indo-European" parol="AQ" mi="[PST]"/>
        </NODE>
      </CHUNK>
      <CHUNK ref="3" type="sn" alloc="24" si="ncsubj">
        <NODE ref="6" alloc="27" slem="brezhoneg" smi="NCMSV0" UpCase="none" lem="Breton" parol="NC" mi="[NUMS]">
          <NODE ref="5" alloc="24" slem="an" smi="DA0CN0" UpCase="none" lem="the" parol="DA"/>
        </NODE>
      </CHUNK>
      <CHUNK ref="4" type="punt" alloc="36" si="modnomatch">
        <NODE ref="7" alloc="36" slem="." smi="Fp" UpCase="none" lem="." parol="Fp"/>
      </CHUNK>
    </CHUNK>
  </SENTENCE>
</corpus>
Note: the above output has been post-processed by xmllint --format - to give more human-readable formatting.
Intra-chunk transfer
The purpose of intra-chunk transfer is to move information from the nodes in a chunk up to the chunk itself, for example to handle agreement between a subject noun phrase and the verb in a verb phrase. We're just going to copy the morphological information straight over, so make a file called matxin-br-en.br-en.intra1 and put the following in.
# 1/2	3/4	5
mi!=''/mi	/mi	no-overwrite
The file is tab and forward-slash separated and has five columns:
- Node specification: Defines which nodes to take information from, in this case mi!='', where the morphological information is non-null.
- Source attribute: Which source attribute to copy, in this case the morphological information.
- Chunk condition: Restricts the chunks to which the information can be moved. In this case there is no restriction, but, for example, you might want to only move the information when the subject is a common noun, in which case si='ncsubj' would do the trick.
- Destination attribute: The attribute in which the information should be put.
- Write mode: Can be one of three: no-overwrite (do not overwrite previous information), overwrite (overwrite previous information), or concat (concatenate information to any previously existing information).
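Combining the columns described above, a hypothetical rule that copies the morphological information only into subject chunks (using the si='ncsubj' condition mentioned above), overwriting anything already there, would read:

```
# 1/2	3/4	5
mi!=''/mi	si='ncsubj'/mi	overwrite
```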
After you've made the file, add the details to the bottom of the configuration file as follows,
IntraMoveFile=/home/fran/MATXIN/source/matxin-br-en/matxin-br-en.br-en.intra1
After this has been added, we're ready to run the intra-chunk syntactic transfer module. This is done with the ST_intra program, which can be called as follows:
$ echo "Ur yezh indezeuropek eo ar brezhoneg." | Analyzer -f matxin-br-en.br-en.cfg | iconv -f latin1 | LT -f matxin-br-en.br-en.cfg | \
  ST_intra -f matxin-br-en.br-en.cfg
<?xml version="1.0" encoding="UTF-8"?>
<corpus>
  <SENTENCE ref="1" alloc="0">
    <CHUNK ref="2" type="verb-eo" alloc="21" si="top" mi="[IND][PRES][P3][NUMS]">
      <NODE ref="4" alloc="21" UpCase="none" lem="be" parol="VM" mi="[IND][PRES][P3][NUMS]" />
      <CHUNK ref="1" type="sn" alloc="0" si="attr-pred" mi="[NUMS]">
        <NODE ref="2" alloc="3" UpCase="none" lem="language" parol="NC" mi="[NUMS]">
          <NODE ref="1" alloc="0" UpCase="none" lem="a" parol="DI"/>
          <NODE ref="3" alloc="8" UpCase="none" lem="Indo-European" parol="AQ" mi="[PST]"/>
        </NODE>
      </CHUNK>
      <CHUNK ref="3" type="sn" alloc="24" si="ncsubj" mi="[NUMS]">
        <NODE ref="6" alloc="27" UpCase="none" lem="Breton" parol="NC" mi="[NUMS]">
          <NODE ref="5" alloc="24" UpCase="none" lem="the" parol="DA"/>
        </NODE>
      </CHUNK>
      <CHUNK ref="4" type="punt" alloc="36" si="modnomatch">
        <NODE ref="7" alloc="36" UpCase="none" lem="." parol="Fp"/>
      </CHUNK>
    </CHUNK>
  </SENTENCE>
</corpus>
The morphological information has been moved from the node head of the chunk to the chunk itself.
Inter-chunk transfer
The next stage is similar to the previous stage, but deals with movement between chunks themselves. In our example we don't need to use this.
Generation
Intra-chunk
As Breton and English differ in the internal structure of noun phrases, the next thing we want to do is transfer the Breton structure into English; this will involve changing det nom adj to det adj nom. Another intra-chunk process we want to take care of is the removal of the definite article before a noun which has the semantic tag for language ([HZK+]).