User's guide and notes (Jacob)
These are my notes from making the English-Esperanto translator, but they might be useful to people like me who know next to nothing about linguistics.
I've installed the standard Ubuntu packages and they're working fine:
Using Apertium
$ echo "Jeg vil gå en tur" | apertium da-sv
Jag vill gå en tur
or
$ echo "Jeg vil gå en tur" | apertium -d apertium-sv-da da-sv
Jag vill gå en tur
Don't use the command apertium-translator; it's old and deprecated!
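If you want to call the translator from a script instead of typing the pipe by hand, the two invocations above could be wrapped roughly like this. It is only a sketch: build_command and translate are names made up for these notes, and translate assumes the apertium command and the language pair are installed.

```python
import subprocess

def build_command(pair, directory=None):
    """Return the argument list for the apertium command shown above."""
    cmd = ["apertium"]
    if directory is not None:
        # -d points apertium at a language-pair directory, as in the second example
        cmd += ["-d", directory]
    cmd.append(pair)
    return cmd

def translate(text, pair, directory=None):
    """Pipe text through apertium and return its output (needs apertium installed)."""
    result = subprocess.run(
        build_command(pair, directory),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

# Only builds the command lines; nothing is executed here.
installed_cmd = build_command("da-sv")
checkout_cmd = build_command("da-sv", "apertium-sv-da")
print(installed_cmd)
print(checkout_cmd)
```

Calling translate("Jeg vil gå en tur", "da-sv") would then run the first command from above.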
How to add a missing word
You will need to add the word to both the source-language monodix AND the translation dictionary.
Example: I want to add "treeview", which is an English noun.
First I check whether it's in the English monodix, apertium-eo-en.en.dix. If it isn't, we'll need to add it.
Next we need to find the regular noun paradigm in English. The paradigm is 'house__n'. Why 'house'? Just because it's a memorable example.
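The check-then-add step above can be sketched in Python. This is an illustration under assumptions, not the project's tooling: SAMPLE_SECTION is a tiny stand-in for the real apertium-eo-en.en.dix, has_lemma and make_regular_noun are hypothetical helpers, and the entry layout (lemma in lm, the stem in <i>, then a <par> reference) follows the usual monodix convention for a noun whose stem is the whole lemma.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the main section of an English monodix (not the real file).
SAMPLE_SECTION = """
<section id="main" type="standard">
  <e lm="house"><i>house</i><par n="house__n"/></e>
  <e lm="tree"><i>tree</i><par n="house__n"/></e>
</section>
"""

def has_lemma(section, lemma):
    """True if some entry in the section already has this lemma."""
    return any(e.get("lm") == lemma for e in section.iter("e"))

def make_regular_noun(lemma, paradigm="house__n"):
    """Build <e lm="..."><i>...</i><par n="..."/></e> for a regular noun."""
    entry = ET.Element("e", {"lm": lemma})
    stem = ET.SubElement(entry, "i")
    stem.text = lemma  # for house__n the stem is the whole word
    ET.SubElement(entry, "par", {"n": paradigm})
    return entry

section = ET.fromstring(SAMPLE_SECTION)
if not has_lemma(section, "treeview"):
    section.append(make_regular_noun("treeview"))

# Show the entry that was just appended.
print(ET.tostring(section[-1], encoding="unicode"))
```

The printed entry is what would be pasted into the monodix by hand; the bilingual dictionary still needs its own entry.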
Understanding the files
<e r="LR"><p><l>kataluno<s n="n"/><s n="f"/></l><r>Catalan<s n="n"/></r></p></e>
<e r="LR"><p><l>kataluno<s n="n"/><s n="m"/></l><r>Catalan<s n="n"/></r></p></e>
<e r="RL"><p><l>kataluno<s n="n"/><s n="GD"/></l><r>Catalan<s n="n"/></r></p></e>
<e r="LR"><p><l>katoliko<s n="n"/><s n="f"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="LR"><p><l>katoliko<s n="n"/><s n="m"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="RL"><p><l>katoliko<s n="n"/><s n="GD"/></l><r>Catholic<s n="n"/></r></p></e>
13.04 It's all the same!
francis.tyers: yep that says: translating left-to-right: katoliko<n><f> → Catholic<n> and katoliko<n><m> → Catholic<n>
13.05 mig: You could write "katalunino" to say a female Catalan person, but most people wouldn't care and would write "kataluno"
francis.tyers: translating from right-to-left, Catholic<n> → katoliko<n><GD> (GD = gender to be determined)
mig: ah LR = left-to-right the directions
13.06 francis.tyers: yeah
mig: I have understood
francis.tyers: left-to-right = esperanto to english
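To convince myself of how the r="LR" and r="RL" restrictions work, here is a small Python sketch that reads the katoliko entries from above and lists which mappings apply in each direction. The wrapping <section> element and the helper names render and mappings are my own; it assumes an entry without an r attribute works in both directions.

```python
import xml.etree.ElementTree as ET

BIDIX_SAMPLE = """
<section>
<e r="LR"><p><l>katoliko<s n="n"/><s n="f"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="LR"><p><l>katoliko<s n="n"/><s n="m"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="RL"><p><l>katoliko<s n="n"/><s n="GD"/></l><r>Catholic<s n="n"/></r></p></e>
</section>
"""

def render(side):
    """Turn <l>katoliko<s n="n"/><s n="f"/></l> into 'katoliko<n><f>'."""
    tags = "".join("<%s>" % s.get("n") for s in side.findall("s"))
    return (side.text or "") + tags

def mappings(section, direction):
    """List (source, target) pairs usable when translating 'LR' or 'RL'."""
    pairs = []
    for entry in section.findall("e"):
        restriction = entry.get("r")  # None is assumed to mean both directions
        if restriction not in (None, direction):
            continue
        left = render(entry.find("p/l"))
        right = render(entry.find("p/r"))
        pairs.append((left, right) if direction == "LR" else (right, left))
    return pairs

bidix = ET.fromstring(BIDIX_SAMPLE)
for source, target in mappings(bidix, "LR"):
    print("LR:", source, "->", target)
for source, target in mappings(bidix, "RL"):
    print("RL:", source, "->", target)
```

Left-to-right, both genders collapse onto the single English noun; right-to-left, only the GD entry is used.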
Why words also need to be in the monolingual dictionary
treeview is not in the english dictionary
mig: ah, couldn't it just suppose it to be a noun, then :-)
13.53 francis.tyers: nope
mig: or take it from the apertium-eo-en.eo.dix
francis.tyers: everything to be translated needs to be in the analyser
how would it know the number? how would it know treeview is singular and treeviews is plural?
13.54 it could guess, but then how would it be able to distinguish between "to treeview" and "he treeviews" (which don't exist)
mig: so I also need to add the word to apertium-eo-en.en.dix.
and it has the same declination as all other verbs ?
mig: infinitive no, it's quite skew :-) declination= ?
13.10 francis.tyers: conjugation
mig: I promise to learn the linguistic words within the week.
francis.tyers: haha :D an idea
mig: yes, all declinations (ways of conjugation) are the same in Esperanto
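The point about the analyser can be shown with a toy model in Python. This is not how Apertium is implemented; PARADIGMS, LEXICON and analyse are invented for the illustration, and the paradigm only mimics a regular English noun. A word that is not in the lexicon comes back with a leading *, the way unknown words show up in Apertium output.

```python
# Toy paradigm: suffix -> tags, modelled on a regular English noun.
PARADIGMS = {"house__n": {"": "<n><sg>", "s": "<n><pl>"}}

# Toy monolingual dictionary: stem -> paradigm name.
LEXICON = {"house": "house__n"}

def analyse(surface, lexicon):
    """Return every stem+tags reading of a surface form, or '*form' if none."""
    analyses = []
    for stem, paradigm in lexicon.items():
        for suffix, tags in PARADIGMS[paradigm].items():
            if surface == stem + suffix:
                analyses.append(stem + tags)
    return analyses or ["*" + surface]

# Without an entry the analyser cannot tell singular from plural.
print(analyse("treeviews", LEXICON))

# After adding the word with the regular noun paradigm it can.
extended = dict(LEXICON, treeview="house__n")
print(analyse("treeviews", extended))
print(analyse("treeview", extended))
```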
Why can't, for example, en-ca, en-es and en-eo all share the SAME English dictionary? Then we could all contribute to this giant dict for the advantage of all 3 projects?
For each project, if we want to add a word to one dictionary, we need to add it to all of them. For example, if you want to add a word to es-en, you need to add it to all three dictionaries (en, en-es, es), in the appropriate form. Otherwise you get the @ # * symbols.
Because of this, and because not everyone has the time to edit, or speaks all of the languages, we find it more convenient to work with them separately, as language pairs, and then merge when/where possible. You'll note that most of the paradigm names, for example, are shared.
Although the ideal is for each dictionary to be "isolated", it isn't always like that. For example, there are some things it makes sense to distinguish in some language pairs and not in others.
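The @ # * symbols mentioned above tell you which of the three dictionaries is missing the word. As I understand the Apertium documentation, * marks a word the source analyser does not know, @ a lemma missing from the bilingual dictionary, and # a form the target side could not generate. A small Python sketch for scanning output follows; diagnose is a made-up helper and the sample sentence is invented, not real translator output.

```python
# Marker -> which dictionary to look at (my reading of the Apertium docs).
MARKERS = {
    "*": "unknown to the source-language analyser (source monodix)",
    "@": "missing from the bilingual dictionary",
    "#": "could not be generated (target monodix)",
}

def diagnose(output):
    """Return (word, reason) for every marked token in translated text."""
    problems = []
    for token in output.split():
        marker = token[0]
        if marker in MARKERS and len(token) > 1:
            word = token[1:].strip(".,;:!?")
            problems.append((word, MARKERS[marker]))
    return problems

# Invented example output with one problem of each kind.
sample = "Mi vidas la *treeview kaj @gable #domo."
for word, reason in diagnose(sample):
    print(word, "->", reason)
```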
#include clause
Q: In general, is there a way to do something like an #include clause so that I could keep my additions separate from the rest?
A: See apertium-en-es/apertium-en-es.en.metadix.xml:
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet>·ÀÁÂÄÇÈÉÊËÌÍÎÏÑÒÓÔÖÙÚÛÜàáâäçèéêëìíîïñòóôöùúûüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
  <!-- symbols -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="apertium-en-es.symbols.xml"/>
  <!-- paradigms -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="apertium-en-es.en.pardefs.xml"/>
And then in apertium-en-es.symbols.xml:
<?xml version="1.0" encoding="UTF-8"?>
<sdefs>
  <sdef n="comp" />
  <sdef n="detnt" />
  <sdef n="predet" />
  <sdef n="past" />
  <sdef n="atn" />
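To see what the xi:include lines do, here is a Python sketch that resolves an include with the standard library's ElementInclude module. It only illustrates the XInclude mechanism; it is not how the Apertium build processes the metadix. The file name symbols.xml, the in-memory FILES table and the shortened dictionary are stand-ins invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Stand-in for files on disk: href -> file contents.
FILES = {
    "symbols.xml": '<sdefs><sdef n="comp"/><sdef n="past"/></sdefs>',
}

# Shortened stand-in for a metadix file.
METADIX = """
<dictionary xmlns:xi="http://www.w3.org/2001/XInclude">
  <alphabet>abc</alphabet>
  <xi:include href="symbols.xml"/>
</dictionary>
"""

def loader(href, parse, encoding=None):
    """Fetch an included resource from FILES instead of the file system."""
    text = FILES[href]
    return ET.fromstring(text) if parse == "xml" else text

root = ET.fromstring(METADIX)
# Replaces each xi:include element in place with the loaded content.
ElementInclude.include(root, loader=loader)

names = [sdef.get("n") for sdef in root.iter("sdef")]
print(names)
```

After the call, the tree looks as if the <sdefs> block had been written directly inside <dictionary>, which is what lets you keep your own additions in a separate file.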
TODO
- go through http://wiki.apertium.org/wiki/Monodix_basics and review the apertium-eo-en.eo.dix file
- add treeview (and others added to apertium-eo-en.eo-en.dix) to the English monodix
- make some wiki notes.
File from traduku.net
convert it into EN : EO
then tag the EO side and strip out the nouns and adjectives
those are most important to start with
then grab a corpus (wikipedia, or euro parl or something)
22.15 and order them by frequency of the english word
mig: why reorder?
francis.tyers@gmail.com: higher frequency words are more important
22.16 if you translate "the" correctly, you cover ~50% of the text, if you translate "gable" correctly you cover maybe 0.5%
22.17 mig: yes, yes, but why bother if all words get in?
francis.tyers@gmail.com: because someone has to add the inflection for the english side
the esperanto side is regular, but the english is not always regular
22.18 mig: OK, so reordering is important because we probably won't make all 110000.
francis.tyers@gmail.com: yep
but the good news is we don't need to make 110000
we have 93% coverage with ~7,000 words
so we can get 99% coverage with probably 20,000
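The frequency ordering and the coverage figures discussed above can be computed along these lines. It is a minimal Python sketch: the one-line corpus is made up, the tokeniser is deliberately crude, and frequency_list and coverage are names I chose. With a real corpus (Wikipedia, Europarl) you would feed in the file contents instead.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

def coverage(freqs, known):
    """Fraction of running words in the corpus that are in the known set."""
    total = sum(count for _, count in freqs)
    covered = sum(count for word, count in freqs if word in known)
    return covered / total if total else 0.0

# Made-up corpus, only to show the mechanics.
corpus = "the cat sat on the mat and the dog sat on the gable"
freqs = frequency_list(corpus)

# Coverage when only the top-N most frequent words are in the dictionary.
for n in (1, 3, len(freqs)):
    top = {word for word, _ in freqs[:n]}
    print(n, round(coverage(freqs, top), 2))
```

Adding words in the order of freqs means every new entry buys the largest possible jump in coverage, which is why the list gets reordered before anyone starts on the English inflections.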