Difference between revisions of "Users guide and notes Jacob"
Latest revision as of 05:45, 29 December 2011
These are my notes for making the English-Esperanto translator, but they might be useful to people like me who know next to nothing about linguistics.
I've installed the standard Ubuntu packages and they're working fine:
Using Apertium
$ echo "Jeg vil gå en tur" | apertium da-sv
Jag vill gå en tur
or
$ echo "Jeg vil gå en tur" | apertium -d apertium-sv-da da-sv
Jag vill gå en tur
Don't use the command apertium-translator; it's old and deprecated!
How to add a missing word
You will need to add the word both in the source language monodix AND in the translation dictionary.
Example: I want to add "treeview", which is an English noun.
First I check if it's in the English monodix apertium-eo-en.en.dix. If it isn't, we'll need to add it.
Next we need to find the regular noun paradigm in English. The paradigm is 'house__n'. Why 'house'? Just because it's a memorable example.
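The two entries for "treeview" can be sketched as follows. This is a minimal Python sketch, assuming the usual monodix entry shape (lemma, invariant part, paradigm reference) and the bilingual entry shape shown elsewhere in these notes. The helper names monodix_entry and bidix_entry, and the Esperanto lemma "arbovido", are placeholders made up for illustration; they are not taken from the real dictionaries.

```python
from xml.sax.saxutils import escape, quoteattr


def monodix_entry(lemma, paradigm):
    """Build a monolingual entry: the lemma plus the paradigm that inflects it."""
    return (
        f"<e lm={quoteattr(lemma)}>"
        f"<i>{escape(lemma)}</i>"
        f"<par n={quoteattr(paradigm)}/></e>"
    )


def bidix_entry(left_lemma, right_lemma, tag="n"):
    """Build a bilingual entry; in eo-en the left side is Esperanto."""
    symbol = f"<s n={quoteattr(tag)}/>"
    return (
        f"<e><p><l>{escape(left_lemma)}{symbol}</l>"
        f"<r>{escape(right_lemma)}{symbol}</r></p></e>"
    )


# Entry for apertium-eo-en.en.dix, reusing the regular noun paradigm.
print(monodix_entry("treeview", "house__n"))
# Entry for apertium-eo-en.eo-en.dix ("arbovido" is a hypothetical translation).
print(bidix_entry("arbovido", "treeview"))
```

Because 'house__n' has nothing after the stem, the whole word "treeview" goes into the invariant part and the paradigm supplies the plural ending.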
Understanding the files
<e r="LR"><p><l>kataluno<s n="n"/><s n="f"/></l><r>Catalan<s n="n"/></r></p></e>
<e r="LR"><p><l>kataluno<s n="n"/><s n="m"/></l><r>Catalan<s n="n"/></r></p></e>
<e r="RL"><p><l>kataluno<s n="n"/><s n="GD"/></l><r>Catalan<s n="n"/></r></p></e>
<e r="LR"><p><l>katoliko<s n="n"/><s n="f"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="LR"><p><l>katoliko<s n="n"/><s n="m"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="RL"><p><l>katoliko<s n="n"/><s n="GD"/></l><r>Catholic<s n="n"/></r></p></e>
13.04 It's all the same!
francis.tyers: yep
that says: translating left-to-right: katoliko<n><f> → Catholic<n> and katoliko<n><m> → Catholic<n>
13.05 mig: You could write "katalunino" to say a female Catalan person, but most people wouldn't care and would write "kataluno"
francis.tyers: translating from right-to-left, Catholic<n> → katoliko<n><GD> (GD = gender to be determined)
mig: ah LR = left-to-right
the directions
13.06 francis.tyers: yeah
mig: i have understood
francis.tyers: left-to-right = esperanto to english
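The direction restriction discussed in the chat can be sketched as follows. This is a minimal Python sketch, not how Apertium itself reads the file: it parses the katoliko entries with the standard library and lists which translations apply in each direction. The function names are made up for illustration, and the sketch assumes simple lemmas without nested elements inside them.

```python
import xml.etree.ElementTree as ET

BIDIX_SNIPPET = """
<section>
<e r="LR"><p><l>katoliko<s n="n"/><s n="f"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="LR"><p><l>katoliko<s n="n"/><s n="m"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="RL"><p><l>katoliko<s n="n"/><s n="GD"/></l><r>Catholic<s n="n"/></r></p></e>
</section>
"""


def side_to_string(side):
    """Render an <l> or <r> element as the lemma followed by its <tag> symbols."""
    lemma = side.text or ""
    tags = "".join(f"<{s.get('n')}>" for s in side.findall("s"))
    return lemma + tags


def translations(xml_text, direction):
    """List (source, target) pairs usable in 'LR' or 'RL' direction.

    An entry without an r attribute is treated as valid in both directions.
    """
    pairs = []
    for entry in ET.fromstring(xml_text).iter("e"):
        if entry.get("r") not in (None, direction):
            continue  # entry is restricted to the other direction
        pair = entry.find("p")
        if pair is None:
            continue  # not a plain left/right pair
        left = side_to_string(pair.find("l"))
        right = side_to_string(pair.find("r"))
        pairs.append((left, right) if direction == "LR" else (right, left))
    return pairs


for src, tgt in translations(BIDIX_SNIPPET, "LR"):
    print("eo -> en:", src, "->", tgt)
for src, tgt in translations(BIDIX_SNIPPET, "RL"):
    print("en -> eo:", src, "->", tgt)
```

Both gendered Esperanto analyses collapse to one English noun when going left-to-right, while the right-to-left entry hands back a single Esperanto noun with the GD tag.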
Why words also need to be in the monolingual dictionary
treeview is not in the english dictionary
mig: ah couldn't it just suppose it to be a noun, then :-)
13.53 francis.tyers: nope
mig: or take it from the apertium-eo-en.eo.dix
francis.tyers: everything to be translated needs to be in the analyser
how would it know the number ?
how would it know treeview is singular and treeviews is plural ?
13.54 it could guess, but then how would it be able to distinguish between "to treeview" and "he treeviews" (which don't exist)
mig: so I need also to add the word to apertium-eo-en.en.dix.
and it has the same declination as all other verbs ?
mig: infinitive no, it's quite skew :-) declination= ?
13.10 francis.tyers: conjugation
mig: I promise to learn the linguistic words within the week.
francis.tyers: haha :D an idea
mig: yes, all declinations (ways of conjugation) are all the same in Esperanto
Why do pairs with the same language (e.g. English) not share the English monodix?
Why can't for example en-ca, en-es and en-eo all share the SAME English dictionary?
> Then we could all contribute to this giant dict for the advantage of
> all 3 projects?
In each project, a word that is added to one dictionary has to be added to all of them. For example, if you want to add a word to es-en, you need to add it to all three dictionaries (en, en-es, es), in the appropriate form. Otherwise you get the @ # * symbols.
Because of this, and because not everyone has the time to edit, or speaks all of the languages, we find it more convenient to work with them separately, as language pairs, and then merge when/where possible. You'll note that most of the paradigm names, for example, are shared.
Although the ideal is for each dictionary to be "isolated", it isn't always like that. For example, there are some things it makes sense to distinguish in some language pairs and not in others.
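The @ # * symbols mentioned above can be picked out of a translation automatically. The following is a minimal Python sketch, assuming the usual Apertium convention that * marks a word unknown to the analyser, @ a word missing from the bilingual dictionary, and # a word the generator could not produce. The function name and the sample sentence are made up for illustration.

```python
import re

# Assumed meaning of each marker in Apertium output.
MARKERS = {
    "*": "not in the source-language monodix (analyser)",
    "@": "not in the bilingual dictionary",
    "#": "not generated by the target-language monodix (generator)",
}


def find_problem_words(translated_text):
    """Return (marker, word, meaning) for every flagged word in the output."""
    hits = []
    # A marker counts only at the start of a whitespace-separated token.
    for marker, word in re.findall(r"(?<!\S)([*@#])(\w+)", translated_text):
        hits.append((marker, word, MARKERS[marker]))
    return hits


sample = "Mi vidas *treeview kaj @house kaj #domo"  # invented sample output
for marker, word, meaning in find_problem_words(sample):
    print(f"{marker}{word}: {meaning}")
```

Each marker points at a different one of the three dictionaries, which is why a new word has to go into all of them.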
#include clause
Q: In general, is there a way to do something like an #include clause so that I could keep my additions separate from the rest?
Using <xi:include/>
A: You could use <xi:include/> as is done in apertium-en-es.
There are, however, a lot of limitations, which make it cumbersome:
- The Apertium tools don't support <xi:include/> directly, so instead of working on the files directly, you have to preprocess them and then feed the result to Apertium.
- You will therefore have to make significant changes to the Makefile.
- The included files need to have an enclosing tag, like <sdefs> below. If they don't, you'll have to invent one.
Here's how to do it. See apertium-en-es/apertium-en-es.en.metadix.xml:
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet>·ÀÁÂÄÇÈÉÊËÌÍÎÏÑÒÓÔÖÙÚÛÜàáâäçèéêëìíîïñòóôöùúûüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
        <!-- symbols -->
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"  href="apertium-en-es.symbols.xml"/>
        <!-- paradigms -->
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="apertium-en-es.en.pardefs.xml"/>
And then in apertium-en-es.symbols.xml:
<?xml version="1.0" encoding="UTF-8"?>
  <sdefs>
    <sdef n="comp" />
    <sdef n="detnt" />
    <sdef n="predet" />
    <sdef n="past" />
    <sdef n="atn" />
Using shell tools, like cat, head and tail
If you just want something simple like the #include in C/C++, it might be much easier to use the cat, head and tail Unix shell commands. Imagine that you want to add your file near the end of each dictionary file, at the third-to-last line:
#include your file here
</section>
</dictionary>
then you could just change your Makefile like this (the original lines have been commented out with #):
$(PREFIX1).automorf.bin: $(BASENAME).$(LANG1).dix tradukunet.$(LANG1).dix
        (head -n -3 $(BASENAME).$(LANG1).dix; cat tradukunet.$(LANG1).dix; tail -n -3 $(BASENAME).$(LANG1).dix) > tmp.dix
        apertium-validate-dictionary tmp.dix
        lt-comp lr tmp.dix $@
#       apertium-validate-dictionary $(BASENAME).$(LANG1).dix
#       lt-comp lr $(BASENAME).$(LANG1).dix $@
$(PREFIX1).autobil.bin: $(BASENAME).$(PREFIX1).dix tradukunet.$(PREFIX1).dix
        (head -n -3 $(BASENAME).$(PREFIX1).dix; cat tradukunet.$(PREFIX1).dix; tail -n -3 $(BASENAME).$(PREFIX1).dix) > tmp.dix
        apertium-validate-dictionary tmp.dix
        lt-comp lr tmp.dix $@
#       apertium-validate-dictionary $(BASENAME).$(PREFIX1).dix
#       lt-comp lr $(BASENAME).$(PREFIX1).dix $@
$(PREFIX1).autogen.bin: $(BASENAME).$(LANG2).dix tradukunet.$(LANG2).dix
        (head -n -3 $(BASENAME).$(LANG2).dix; cat tradukunet.$(LANG2).dix; tail -n -3 $(BASENAME).$(LANG2).dix) > tmp.dix
        apertium-validate-dictionary tmp.dix
        lt-comp rl tmp.dix $@ 
#       apertium-validate-dictionary $(BASENAME).$(LANG2).dix
#       lt-comp rl $(BASENAME).$(LANG2).dix $@
$(PREFIX1).autopgen.bin: $(BASENAME).post-$(LANG2).dix
        apertium-validate-dictionary $(BASENAME).post-$(LANG2).dix
        lt-comp lr $(BASENAME).post-$(LANG2).dix $@
$(PREFIX2).automorf.bin: $(BASENAME).$(LANG2).dix tradukunet.$(LANG2).dix
        (head -n -3 $(BASENAME).$(LANG2).dix; cat tradukunet.$(LANG2).dix; tail -n -3 $(BASENAME).$(LANG2).dix) > tmp.dix
        apertium-validate-dictionary tmp.dix
        lt-comp lr tmp.dix $@
#       apertium-validate-dictionary $(BASENAME).$(LANG2).dix
#       lt-comp lr $(BASENAME).$(LANG2).dix $@
$(PREFIX2).autobil.bin: $(BASENAME).$(PREFIX1).dix  tradukunet.$(PREFIX1).dix
        (head -n -3 $(BASENAME).$(PREFIX1).dix; cat tradukunet.$(PREFIX1).dix; tail -n -3 $(BASENAME).$(PREFIX1).dix) > tmp.dix
        apertium-validate-dictionary tmp.dix
        lt-comp rl tmp.dix $@
#       apertium-validate-dictionary $(BASENAME).$(PREFIX1).dix
#       lt-comp rl $(BASENAME).$(PREFIX1).dix $@
$(PREFIX2).autogen.bin: $(BASENAME).$(LANG1).dix tradukunet.$(LANG1).dix
        (head -n -3 $(BASENAME).$(LANG1).dix; cat tradukunet.$(LANG1).dix; tail -n -3 $(BASENAME).$(LANG1).dix) > tmp.dix
        apertium-validate-dictionary tmp.dix
        lt-comp rl tmp.dix $@
#       apertium-validate-dictionary $(BASENAME).$(LANG1).dix
#       lt-comp rl $(BASENAME).$(LANG1).dix $@
Here the included files are called tradukunet.*.dix.
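What the head, cat and tail pipeline does to the file can be sketched in a few lines. This is a minimal Python sketch of the same splice, not a replacement for the Makefile rules; the function name and the tiny sample files are made up for illustration.

```python
def splice_before_tail(base_lines, extra_lines, tail=3):
    """Insert extra_lines just before the last `tail` lines of base_lines.

    Mirrors: (head -n -3 base; cat extra; tail -n -3 base) > tmp.dix
    """
    if not 0 < tail <= len(base_lines):
        raise ValueError("tail must be between 1 and the number of base lines")
    return base_lines[:-tail] + extra_lines + base_lines[-tail:]


# Invented miniature dictionary: the last three lines close the file.
base = [
    "<dictionary>",
    '<section id="main" type="standard">',
    '<e lm="house"><i>house</i><par n="house__n"/></e>',
    "",
    "</section>",
    "</dictionary>",
]
extra = ['<e lm="treeview"><i>treeview</i><par n="house__n"/></e>']

for line in splice_before_tail(base, extra):
    print(line)
```

The trick relies on the closing </section> and </dictionary> tags falling within the last three lines of the base file; extra blank lines at the end would push the insertion point past </section>.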
TODO
 - go through http://wiki.apertium.org/wiki/Monodix_basics and review the file (the apertium-eo-en.eo.dix file)
 - add treeview (and others added to apertium-eo-en.eo-en.dix) to the English monodix
 - make some wiki notes.
<jacobn> Ok, I'll try the web doc translator more, find the systematics, report a bug and attach files etc.
File from traduku.net
convert it into EN : EO
then tag the EO side and strip out the nouns and adjectives
those are most important to start with
then grab a corpus (wikipedia, or euro parl or something)
22.15 and order them by frequency of the english word
mig: why reorder?
francis.tyers@gmail.com: higher frequency words are more important
22.16 if you translate "the" correctly, you cover ~50% of the text, if you translate "gable" correctly you cover maybe 0.5%
22.17 mig: yes, yes, but why bother if all words get in?
francis.tyers@gmail.com: because someone has to add the inflection for the english side
the esperanto side is regular, but the english is not always regular
22.18 mig: OK, so reordering is important because we probably won't make all 110000.
francis.tyers@gmail.com: yep
but the good news is we don't need to make 110000
we have 93% coverage with ~7,000 words
so we can get 99% coverage with probably 20,000
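The frequency argument from the chat can be tried on any corpus. Below is a minimal Python sketch that orders words by frequency and reports how many of the most frequent words are needed to reach a given coverage. The function names, the crude tokeniser and the one-sentence "corpus" are made up for illustration; a real run would read a Wikipedia or Europarl dump instead.

```python
import re
from collections import Counter


def coverage_curve(text):
    """Return [(word, cumulative coverage)] with the most frequent words first."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokeniser, English only
    counts = Counter(words)
    total = sum(counts.values())
    covered = 0
    curve = []
    for word, count in counts.most_common():
        covered += count
        curve.append((word, covered / total))
    return curve


def words_needed(curve, target):
    """Smallest number of top-frequency words whose coverage reaches target."""
    for rank, (_, coverage) in enumerate(curve, start=1):
        if coverage >= target:
            return rank
    return None


corpus = "the cat saw the dog and the bird"  # invented toy corpus
curve = coverage_curve(corpus)
print(curve[0])                 # most frequent word and its share of the text
print(words_needed(curve, 0.5))
```

Working down such a list from the top is what lets a few thousand dictionary entries cover most running text, so the irregular English inflections only need to be added for the words that matter most.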

